---
language:
- et
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- estonian
- transducer
- FastConformer
- TDT
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_asr_et_sl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-estonian
results:
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: Common Voice 17.0 (et) - Test
type: mozilla-foundation/common_voice_17_0
config: et
split: test
metrics:
- type: wer
value: 21.03
name: WER (raw)
- type: wer
value: 18.51
name: WER (normalized)
- type: cer
value: 4.64
name: CER (raw)
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: FLEURS (et) - Test
type: google/fleurs
config: et_ee
split: test
metrics:
- type: wer
value: 35.29
name: WER (raw)
- type: wer
value: 12.36
name: WER (normalized)
- type: cer
value: 7.06
name: CER (raw)
---
# Parakeet-TDT-0.6B Estonian
Fine-tuned [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model accompanies the paper *"Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper"* (Interspeech 2026). Paper coming soon.
## Model Description
- **Architecture:** FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
- **Parameters:** 0.6B
- **Tokenizer:** 8,192-token SentencePiece BPE
- **Base model:** `nvidia/parakeet-tdt-0.6b-v3`
- **Fine-tuning data:** Common Voice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
- **Training config:** CV + Synth All (full synthetic corpus with quality filtering)
## Evaluation Results
### Raw WER/CER (no text normalization)
| Test Set | WER | CER |
|----------|-----|-----|
| Common Voice 17 Test | **21.03** | **4.64** |
| Common Voice 17 Val | **20.18** | **4.21** |
| FLEURS Test | **35.29** | **7.06** |
### Normalized WER/CER (lowercase + punctuation removal)
| Test Set | WER | CER |
|----------|-----|-----|
| Common Voice 17 Test | **18.51** | **4.13** |
| Common Voice 17 Val | **17.91** | **3.78** |
| FLEURS Test | **12.36** | **3.24** |
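The normalization applied above is lowercasing plus punctuation removal. A minimal sketch of that step together with a plain Levenshtein-based WER is shown below; this approximates, but is not necessarily identical to, the exact normalizer used in the paper:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping Estonian letters intact."""
    text = text.lower()
    # Drop every Unicode punctuation character (category "P*")
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Punctuation and casing differences vanish after normalization
print(wer(normalize("Tere, maailm!"), normalize("Tere maailm")))  # 0.0
```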
### Improvement over baselines
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|-----------|-----------------|-------------------|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
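A paired bootstrap over utterance-level error counts can be sketched as follows. The per-utterance counts here are toy values for illustration, not the paper's data, and the paper used n = 100,000 resamples rather than the smaller count below:

```python
import random

def paired_bootstrap(errors_a, errors_b, words, n_resamples=10_000, seed=42):
    """Paired bootstrap test for the WER difference between two systems.

    errors_a / errors_b: per-utterance word-error counts for each system;
    words: per-utterance reference word counts. Returns the fraction of
    resamples in which system A is NOT better than system B (one-sided p).
    """
    rng = random.Random(seed)
    n = len(words)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping A/B pairing intact
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(words[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / total_words
        wer_b = sum(errors_b[i] for i in idx) / total_words
        if wer_a >= wer_b:
            not_better += 1
    return not_better / n_resamples

# Toy example: system A makes strictly fewer errors on every utterance
errors_a = [0, 1, 0, 2, 0, 1, 0, 0, 1, 0] * 20
errors_b = [1, 2, 1, 3, 1, 2, 1, 1, 2, 1] * 20
words    = [8, 9, 7, 10, 6, 8, 9, 7, 8, 10] * 20
p = paired_bootstrap(errors_a, errors_b, words, n_resamples=2000)
print(p)  # 0.0
```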
## Usage
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe a 16 kHz mono WAV file
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
## Training Details
- **Optimizer:** AdamW (lr=5e-5, betas=[0.9, 0.98], weight_decay=0.001)
- **Schedule:** Cosine annealing with 10% linear warmup
- **Batch size:** 32
- **Early stopping:** patience 10 epochs on val_wer
- **Best epoch:** 74 (val_wer = 0.2002)
- **Precision:** bf16-mixed
- **Seed:** 42
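In NeMo this schedule is usually configured through the built-in cosine-annealing scheduler with warmup. As an illustrative pure-Python sketch (not the actual training config), the per-step learning rate would behave as:

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1_000
print(lr_at(99, total))   # 5e-05 -- peak reached at the end of warmup
print(lr_at(549, total))  # roughly half the peak, mid-way through the decay
```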
## Synthetic Data Augmentation
The synthetic training data was generated using a three-stage pipeline:
1. **Text generation:** GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
2. **LLM-as-judge validation:** An LLM judge checks each sentence for grammaticality, naturalness, and language purity
3. **Speech synthesis:** OpenAI gpt-4o-mini-tts with 11-voice rotation
Dataset: [yuriyvnv/synthetic_asr_et_sl](https://huggingface.co/datasets/yuriyvnv/synthetic_asr_et_sl)
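The voice rotation in stage 3 can be sketched as a simple round-robin over the voice pool. The voice identifiers below are placeholders (the card does not name the 11 voices), and the actual TTS request is omitted:

```python
from itertools import cycle

# Placeholder identifiers -- the card states an 11-voice rotation
# but does not list the voices used.
VOICES = [f"voice_{i}" for i in range(11)]

def assign_voices(sentences):
    """Pair each validated sentence with the next voice in a round-robin."""
    rotation = cycle(VOICES)
    return [(sentence, next(rotation)) for sentence in sentences]

pairs = assign_voices([f"lause {i}" for i in range(23)])
print(pairs[0])   # ('lause 0', 'voice_0')
print(pairs[11])  # ('lause 11', 'voice_0') -- rotation wraps after 11 sentences
```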
## Acknowledgments
- Base model: [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- Training data: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
- Evaluation: [Google FLEURS](https://huggingface.co/datasets/google/fleurs)