Dataset
#78
by hakim1510 - opened
The paper reads, "Both ModernBERT models are trained on 2 trillion tokens of primarily English data from a variety of data sources, including web documents, code, and scientific literature, following common modern data mixtures. We choose the final data mixture based on a series of ablations."
Is the dataset publicly released anywhere? Or is there any description of which prior datasets were incorporated?