Instructions to use answerdotai/ModernBERT-base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
  - Transformers
How to use answerdotai/ModernBERT-base with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
```

- Notebooks
  - Google Colab
  - Kaggle
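As a quick sanity check of the Transformers snippet above, the fill-mask pipeline can be called directly; the example sentence below is just an illustration, and ModernBERT uses the standard `[MASK]` token:

```python
from transformers import pipeline

# Fill-mask sanity check (the example sentence is arbitrary).
pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
predictions = pipe("The capital of France is [MASK].")

# Each prediction contains the candidate token, its score, and the filled-in sequence.
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```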
Is this model meant for full bfloat16, AMP bfloat16, or no bfloat16 at all?
The paper does not make this clear.
Bump
We trained ModernBERT with amp_bf16. We'll add that detail to our next arXiv preprint update. I imagine ModernBERT will work fine with fp32, amp_bf16, or bf16, although the latter might need additional fine-tuning depending on the use case.
@bwarner @umarbutler Does this mean we should load the model in fp32 and ignore the Flash Attention warning that it requires fp16 or bf16, or do you have a better suggestion? I just noticed that loading the model in bf16 and fine-tuning it on a small dataset gives worse results than loading it in fp32. Thanks!
@ymoslem after having used ModernBERT quite extensively, I can recommend:
- Always train with AMP (mixed precision) bfloat16.
- Training in full bfloat16 is not a good idea as you are likely to see instability.
- You will see little to no degradation in performance when running inference in full bfloat16 versus AMP bfloat16 (see the sketch below for how this maps onto Transformers settings).
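A minimal sketch of how these recommendations map onto the Transformers `Trainer`, assuming a classification-style fine-tune; the output path, label count, dataset, and hyperparameters are placeholders, not values from the paper or this thread:

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in fp32 and let the Trainer apply AMP bfloat16,
# per the recommendation above. num_labels=2 is a placeholder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

training_args = TrainingArguments(
    output_dir="modernbert-finetune",  # placeholder path
    bf16=True,                         # mixed-precision (AMP) bfloat16 training
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)

# train_dataset is assumed to be a dataset you tokenize yourself.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()

# For inference, casting the fine-tuned model to full bfloat16 is generally
# safe (little to no degradation, per the recommendation above).
model = model.to(dtype=torch.bfloat16).eval()
```

Passing `torch_dtype=torch.bfloat16` to `from_pretrained` instead would put the whole model, including training, in full bfloat16, which is the setup the instability warning above is about.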