---
language:
- et
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- estonian
- transducer
- FastConformer
- TDT
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_asr_et_sl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-estonian
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (et) - Test
      type: mozilla-foundation/common_voice_17_0
      config: et
      split: test
    metrics:
    - type: wer
      value: 21.03
      name: WER (raw)
    - type: wer
      value: 18.51
      name: WER (normalized)
    - type: cer
      value: 4.64
      name: CER (raw)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: FLEURS (et) - Test
      type: google/fleurs
      config: et_ee
      split: test
    metrics:
    - type: wer
      value: 35.29
      name: WER (raw)
    - type: wer
      value: 12.36
      name: WER (normalized)
    - type: cer
      value: 7.06
      name: CER (raw)
---

# Parakeet-TDT-0.6B Estonian

A fine-tune of [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) for Estonian automatic speech recognition (ASR), trained on Common Voice 17.0 augmented with TTS-generated synthetic speech.

This model accompanies the paper *"Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper"* (Interspeech 2026). The paper is forthcoming.

## Model Description

- **Architecture:** FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
- **Parameters:** 0.6B
- **Tokenizer:** 8,192-token SentencePiece BPE
- **Base model:** `nvidia/parakeet-tdt-0.6b-v3`
- **Fine-tuning data:** Common Voice 17.0 Estonian plus about 5,850 synthetic sentences (LLM-generated text rendered with OpenAI TTS)
- **Training config:** CV + Synth All (Common Voice plus the full synthetic corpus, with quality filtering)

## Evaluation Results

### Raw WER/CER (no text normalization)

| Test Set | WER (%) | CER (%) |
|----------|---------|---------|
| Common Voice 17 Test | **21.03** | **4.64** |
| Common Voice 17 Val | **20.18** | **4.21** |
| FLEURS Test | **35.29** | **7.06** |

### Normalized WER/CER (lowercase + punctuation removal)

| Test Set | WER (%) | CER (%) |
|----------|---------|---------|
| Common Voice 17 Test | **18.51** | **4.13** |
| Common Voice 17 Val | **17.91** | **3.78** |
| FLEURS Test | **12.36** | **3.24** |
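
As a rough illustration of how the raw and normalized scores relate, the sketch below computes corpus-level WER with and without a lowercase-and-strip-punctuation normalizer. The exact normalizer and scoring tool behind the numbers in this card are not stated, so `normalize`, `edit_distance`, and `wer` are hypothetical helpers written for this example, not the authors' evaluation code.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop Unicode punctuation, and collapse whitespace (assumed normalizer)."""
    kept = "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference token deleted
                curr[j - 1] + 1,         # hypothesis token inserted
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(refs, hyps, normalized=False) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        if normalized:
            ref, hyp = normalize(ref), normalize(hyp)
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return 100.0 * errors / words
```

Scoring the reference `"Tere, maailm!"` against the hypothesis `"tere maailm"` with these helpers gives 100% raw WER and 0% normalized WER, which shows how much casing and punctuation alone can contribute to a raw score.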

### Improvement over baselines

| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|-----------|-----------------|-------------------|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |

Values are absolute WER differences in percentage points (pp); negative values mean this model has a lower WER than the baseline. All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
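
For readers unfamiliar with the test, a paired bootstrap over utterances can be sketched as follows. This is one common formulation, not necessarily the procedure used for the reported p-values: it assumes per-utterance word-error counts for two systems on the same test set, and `paired_bootstrap` with all of its parameters is a hypothetical helper.

```python
import random


def paired_bootstrap(errors_a, errors_b, ref_lens, n_resamples=1000, seed=0):
    """Compare system B against system A on the same utterances.

    errors_a / errors_b: per-utterance word-error counts for each system.
    ref_lens: per-utterance reference word counts.
    Returns (observed WER difference B - A in pp, one-sided p-value), where
    the p-value is the share of resamples in which B is not better than A.
    """
    rng = random.Random(seed)
    n = len(ref_lens)

    def wer_diff(indices):
        words = sum(ref_lens[i] for i in indices)
        err_a = sum(errors_a[i] for i in indices)
        err_b = sum(errors_b[i] for i in indices)
        return 100.0 * (err_b - err_a) / words

    observed = wer_diff(range(n))
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping A/B results paired
        sample = [rng.randrange(n) for _ in range(n)]
        if wer_diff(sample) >= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

Because both systems are scored on the same resampled utterances, utterance difficulty cancels out and only the difference between systems drives the p-value.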

## Usage

```python
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe a list of audio files ("audio.wav" is a placeholder path);
# one result is returned per input file
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```

## Training Details

- **Optimizer:** AdamW (lr=5e-5, betas=[0.9, 0.98], weight_decay=0.001)
- **Schedule:** Cosine annealing with 10% linear warmup
- **Batch size:** 32
- **Early stopping:** patience of 10 epochs on `val_wer`
- **Best epoch:** 74 (`val_wer` = 0.2002)
- **Precision:** bf16-mixed
- **Seed:** 42
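
The schedule listed above can be read as the learning-rate curve sketched below. This is a minimal sketch, assuming per-step updates, warmup measured as a fraction of the total step count, and a floor of zero; the total number of steps is not given in this card, and NeMo's own scheduler may differ in these details. `lr_at` and its parameters are hypothetical names.

```python
import math


def lr_at(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: inspect a few points of a hypothetical 1,000-step run
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at(s, total_steps=1000))
```

With 1,000 steps, the rate climbs to the 5e-5 peak over the first 100 steps and then decays along a half cosine toward zero.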

## Synthetic Data Augmentation

The synthetic training data was generated with a three-stage pipeline:

1. **Text generation:** GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
2. **LLM-as-judge validation:** Each sentence is validated for grammaticality, naturalness, and language purity
3. **Speech synthesis:** OpenAI gpt-4o-mini-tts renders the validated sentences, rotating across 11 voices

Dataset: [yuriyvnv/synthetic_asr_et_sl](https://huggingface.co/datasets/yuriyvnv/synthetic_asr_et_sl)
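
The three stages above can be pictured as a generate, filter, and synthesize loop. The sketch below is only an outline of that control flow under stated assumptions: the three callables stand in for the LLM generator, the LLM judge, and the TTS call, none of which are reproduced here, and the voice rotation is assumed to be simple round-robin. `build_synthetic_corpus` and every name in it are hypothetical.

```python
from itertools import cycle


def build_synthetic_corpus(generate_sentences, judge_sentence, synthesize, voices):
    """Run the three stages with caller-supplied callables.

    generate_sentences(): yields candidate sentences (stage 1).
    judge_sentence(text): returns True if the sentence passes validation (stage 2).
    synthesize(text, voice): returns audio for the sentence (stage 3).
    voices: voice identifiers to rotate through (round-robin is an assumption).
    """
    voice_cycle = cycle(voices)
    corpus = []
    for sentence in generate_sentences():
        if not judge_sentence(sentence):
            continue  # rejected sentences never reach the TTS stage
        voice = next(voice_cycle)
        corpus.append({
            "text": sentence,
            "voice": voice,
            "audio": synthesize(sentence, voice),
        })
    return corpus
```

Passing the stages in as callables keeps the sketch independent of any particular LLM or TTS client, so each stage can be swapped or mocked separately.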

## Acknowledgments

- Base model: [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- Training data: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
- Evaluation: [Google FLEURS](https://huggingface.co/datasets/google/fleurs)