Instructions to use answerdotai/ModernBERT-base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use answerdotai/ModernBERT-base with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
```
- Notebooks
- Google Colab
- Kaggle
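As a quick sanity check of the fill-mask snippet above, something along these lines should work (the example sentence is my own, not from the model card):

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
# ModernBERT uses the standard [MASK] token for masked-language modeling.
for pred in pipe("The capital of France is [MASK]."):
    print(pred["token_str"], pred["score"])
```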
Pretraining Using HF Tokenizers and Transformers
I am looking for an end-to-end example of pretraining a fresh ModernBERT model including the tokenizer (e.g. for a new language), or of fine-tuning an existing checkpoint (e.g. ModernBERT-base) with a custom tokenizer (to account for the different vocabulary of another language family).
A Hugging Face implementation is preferred (I saw this, but the current code is not working).
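Roughly, this is the first half of the flow I have in mind, just a sketch on my side; the vocab size, special tokens, and file names below are placeholders rather than anything from an official example:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a fresh byte-level BPE tokenizer on a new-language corpus.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32000,  # placeholder
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tok.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder corpus file

# Wrap it so Transformers models and data collators can consume it.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)
hf_tokenizer.save_pretrained("my-new-tokenizer")
```

The second half would be plugging this tokenizer into a fresh or existing ModernBERT checkpoint for MLM training.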
Hello,
The pre-training codebase should do the trick; that is its main purpose, and it is optimized for it. While it uses Composer, you should be able to leverage HF models and tokenizers.
For continued pre-training, someone reported an issue with loading the ModernBERT weights, so we will investigate and potentially release Composer checkpoints alongside the HF ones when we release all the pre-training checkpoints (which, as stated in the issue, should be better starting points than the post-decay ones).
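If you want to stay purely on the HF side for continued pre-training, something along these lines should work as a starting point. This is a sketch, not our official recipe; the corpus path and hyperparameters are placeholders, and the custom-tokenizer step is optional:

```python
# Continued MLM pre-training of ModernBERT-base with plain Transformers.
from transformers import (
    AutoTokenizer, AutoModelForMaskedLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# When swapping in a custom tokenizer for another language, the embedding
# matrix has to be resized (new token embeddings are randomly initialized):
# model.resize_token_embeddings(len(custom_tokenizer))

ds = load_dataset("text", data_files={"train": "my_corpus.txt"})  # placeholder corpus
ds = ds.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.3,  # 30% masking, following the ModernBERT setup
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modernbert-continued",
        per_device_train_batch_size=8,  # placeholder
    ),
    train_dataset=ds["train"],
    data_collator=collator,
)
trainer.train()
```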
Thanks. I had another look at the repo and noticed that FlexBert uses the old bert-base tokenizer. I guess I should wait a bit, as the HF way of doing it may require some additional tweaks (e.g. issue 163).
Update: I took inspiration from this discussion and trained a tiny model.
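In case it helps anyone, the from-scratch part boils down to something like this (the sizes below are just an illustrative "tiny" configuration, not the exact one I used, and it requires a transformers release that ships ModernBERT); the MLM training loop is the same Trainer setup sketched earlier in the thread:

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# A deliberately small ModernBERT configuration (illustrative values only).
config = ModernBertConfig(
    vocab_size=32000,        # should match the custom tokenizer
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
)
model = ModernBertForMaskedLM(config)  # fresh, randomly initialized weights
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```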