Best Fine-Tuned ASR Models
A collection of the best models fine-tuned across several Automatic Speech Recognition experiments.
Fine-tuned NVIDIA Parakeet-TDT-0.6B-v3 for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.
Results for the base model, nvidia/parakeet-tdt-0.6b-v3:

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 21.03 | 4.64 |
| CommonVoice 17 Val | 20.18 | 4.21 |
| FLEURS Test | 35.29 | 7.06 |
Results after fine-tuning with synthetic data augmentation:

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 18.51 | 4.13 |
| CommonVoice 17 Val | 17.91 | 3.78 |
| FLEURS Test | 12.36 | 3.24 |
WER improvement of the fine-tuned model over the baselines:

| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
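The significance test above can be sketched as a paired bootstrap over per-utterance error counts. The paper's exact protocol is not given in this card, so the function name, parameters, and toy error counts below are assumptions, not the authors' implementation:

```python
import random

def paired_bootstrap_p(errors_a, errors_b, n_resamples=100_000, seed=0):
    """One-sided p-value that system B does not outperform system A.

    errors_a / errors_b hold per-utterance word-error counts for two
    systems on the same test set; because both share the references,
    comparing resampled error totals is equivalent to comparing WER.
    """
    assert len(errors_a) == len(errors_b)
    rng = random.Random(seed)
    n = len(errors_a)
    b_wins = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, keeping pairs aligned
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_b[i] for i in idx) < sum(errors_a[i] for i in idx):
            b_wins += 1
    return 1.0 - b_wins / n_resamples

# Hypothetical per-utterance error counts: B is strictly better everywhere
errors_a = [1, 2, 1, 3, 2] * 4
errors_b = [0, 1, 0, 1, 1] * 4
p = paired_bootstrap_p(errors_a, errors_b, n_resamples=10_000)
print(p)  # → 0.0
```

With real data, the per-utterance error counts would come from edit distances between each reference and the two systems' hypotheses.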
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe a list of audio files
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
The synthetic training data was generated using a three-stage TTS pipeline.
Dataset: yuriyvnv/synthetic_asr_et_sl
Base model: nvidia/parakeet-tdt-0.6b-v3