---
language:
- et
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- estonian
- transducer
- FastConformer
- TDT
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_asr_et_sl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-estonian
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice 17.0 (et) - Test
      type: mozilla-foundation/common_voice_17_0
      config: et
      split: test
    metrics:
    - type: wer
      value: 21.03
      name: WER (raw)
    - type: wer
      value: 18.51
      name: WER (normalized)
    - type: cer
      value: 4.64
      name: CER (raw)
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: FLEURS (et) - Test
      type: google/fleurs
      config: et_ee
      split: test
    metrics:
    - type: wer
      value: 35.29
      name: WER (raw)
    - type: wer
      value: 12.36
      name: WER (normalized)
    - type: cer
      value: 7.06
      name: CER (raw)
---

# Parakeet-TDT-0.6B Estonian
Fine-tuned NVIDIA Parakeet-TDT-0.6B-v3 for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model accompanies the paper "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). The paper will be available soon.
## Model Description
- Architecture: FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
- Parameters: 0.6B
- Tokenizer: 8,192-token SentencePiece BPE
- Base model: nvidia/parakeet-tdt-0.6b-v3
- Fine-tuning data: Common Voice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
- Training config: CV + Synth All (full synthetic corpus with quality filtering)
## Evaluation Results
### Raw WER/CER (no text normalization)
| Test Set | WER (%) | CER (%) |
|---|---|---|
| CommonVoice 17 Test | 21.03 | 4.64 |
| CommonVoice 17 Val | 20.18 | 4.21 |
| FLEURS Test | 35.29 | 7.06 |
### Normalized WER/CER (lowercase + punctuation removal)
| Test Set | WER (%) | CER (%) |
|---|---|---|
| CommonVoice 17 Test | 18.51 | 4.13 |
| CommonVoice 17 Val | 17.91 | 3.78 |
| FLEURS Test | 12.36 | 3.24 |
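The normalization behind the "normalized" scores (lowercase + punctuation removal) can be sketched as below; the regex and the word-level edit-distance WER here are illustrative assumptions, not the exact evaluation code:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text, flags=re.UNICODE)
    return " ".join(text.split())

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / max(len(r), 1)

# Punctuation/case mismatches disappear after normalization
print(wer(normalize("Tere, maailm!"), normalize("Tere maailm")))  # 0.0
```

This explains why FLEURS improves so much under normalization: much of the raw-WER gap there comes from casing and punctuation rather than misrecognized words.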
### Improvement over baselines
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
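The paired bootstrap test can be sketched as below; the per-utterance error and word counts are toy values, and this generic resampling loop illustrates the idea rather than reproducing the paper's exact test:

```python
import random

def paired_bootstrap(errors_a, errors_b, words, n_resamples=100_000, seed=42):
    """Estimate a one-sided p-value: fraction of resamples where system A
    (e.g. fine-tuned) fails to beat system B (e.g. baseline) on WER.
    In a paired resample both systems share the same word total, so
    comparing summed error counts is equivalent to comparing WERs."""
    rng = random.Random(seed)
    n = len(words)
    losses = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample utterances with replacement
        ea = sum(errors_a[i] for i in idx)
        eb = sum(errors_b[i] for i in idx)
        if ea >= eb:  # A not strictly better on this resample
            losses += 1
    return losses / n_resamples

# Toy per-utterance word-error counts (hypothetical)
errors_a = [1, 0, 2, 0, 1, 0, 0, 1]   # fine-tuned system
errors_b = [2, 1, 3, 1, 2, 0, 1, 2]   # baseline system
words    = [10, 8, 12, 9, 11, 7, 10, 9]
print(paired_bootstrap(errors_a, errors_b, words, n_resamples=5000))
```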
## Usage
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe one or more 16 kHz mono WAV files
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
## Training Details
- Optimizer: AdamW (lr=5e-5, betas=[0.9, 0.98], weight_decay=0.001)
- Schedule: Cosine annealing with 10% linear warmup
- Batch size: 32
- Early stopping: patience 10 epochs on val_wer
- Best epoch: 74 (val_wer = 0.2002)
- Precision: bf16-mixed
- Seed: 42
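The learning-rate schedule above (10% linear warmup into cosine annealing from a base LR of 5e-5) can be sketched as a step-to-LR function; `total_steps` and `min_lr` are illustrative assumptions, and the actual NeMo scheduler config may differ:

```python
import math

def lr_at(step, total_steps, base_lr=5e-5, min_lr=0.0, warmup_frac=0.10):
    """LR at a given step: linear warmup for the first warmup_frac of
    training, then cosine decay from base_lr down to min_lr."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear ramp-up
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
print(lr_at(0, total))     # small LR at the very first step
print(lr_at(99, total))    # end of warmup, reaches base_lr
print(lr_at(1000, total))  # fully annealed down to min_lr
```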
## Synthetic Data Augmentation
The synthetic training data was generated using a three-stage pipeline:
1. Text generation: GPT-5-mini generates diverse sentences across paraphrase, domain-expansion, and morphological categories
2. LLM-as-judge validation: each sentence is validated for grammaticality, naturalness, and language purity
3. Speech synthesis: OpenAI gpt-4o-mini-tts with an 11-voice rotation
Dataset: yuriyvnv/synthetic_asr_et_sl
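The 11-voice rotation in the synthesis stage amounts to a round-robin assignment of sentences to voices, sketched below; the voice names are placeholders, since the card does not list the actual voices used:

```python
from itertools import cycle

# Hypothetical voice identifiers; only the count (11) comes from the card.
VOICES = [f"voice_{i}" for i in range(11)]

def assign_voices(sentences):
    """Round-robin pairing of each sentence with the next voice in rotation,
    so every voice covers roughly 1/11 of the synthetic corpus."""
    rotation = cycle(VOICES)
    return [(sentence, next(rotation)) for sentence in sentences]

pairs = assign_voices([f"lause {i}" for i in range(23)])
print(pairs[0])   # ('lause 0', 'voice_0')
print(pairs[11])  # ('lause 11', 'voice_0')  -- rotation wraps after 11
```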
## Acknowledgments
- Base model: NVIDIA Parakeet-TDT-0.6B-v3
- Training data: Mozilla Common Voice 17.0
- Evaluation: Google FLEURS