Best Fine-Tuned ASR Models
A collection of the best models fine-tuned across several Automatic Speech Recognition experiments.
Fine-tuned NVIDIA Parakeet-TDT-0.6B-v3 for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.
Results for the base model, nvidia/parakeet-tdt-0.6b-v3:

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 21.03 | 4.64 |
| CommonVoice 17 Val | 20.18 | 4.21 |
| FLEURS Test | 35.29 | 7.06 |
Results after fine-tuning with synthetic data augmentation:

| Test Set | WER | CER |
|---|---|---|
| CommonVoice 17 Test | 18.51 | 4.13 |
| CommonVoice 17 Val | 17.91 | 3.78 |
| FLEURS Test | 12.36 | 3.24 |
WER improvement of the fine-tuned model over the baselines:

| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
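The significance test above can be sketched as a paired bootstrap over per-utterance error counts. The paper's exact protocol is not given in this card, so the function name, parameters, and toy error counts below are assumptions, not the authors' implementation:

```python
import random

def paired_bootstrap_p(errors_a, errors_b, n_resamples=100_000, seed=0):
    """One-sided p-value that system B does not outperform system A.

    errors_a / errors_b hold per-utterance word-error counts for two
    systems on the same test set; because both share the references,
    comparing resampled error totals is equivalent to comparing WER.
    """
    assert len(errors_a) == len(errors_b)
    rng = random.Random(seed)
    n = len(errors_a)
    b_wins = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement, keeping pairs aligned
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_b[i] for i in idx) < sum(errors_a[i] for i in idx):
            b_wins += 1
    return 1.0 - b_wins / n_resamples

# Hypothetical per-utterance error counts: B is strictly better everywhere
errors_a = [1, 2, 1, 3, 2] * 4
errors_b = [0, 1, 0, 1, 1] * 4
p = paired_bootstrap_p(errors_a, errors_b, n_resamples=10_000)
print(p)  # → 0.0
```

With real data, the per-utterance error counts would come from edit distances between each reference and the two systems' hypotheses.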
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe a list of audio files
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
The synthetic training data was generated using a three-stage TTS pipeline.
Dataset: yuriyvnv/synthetic_asr_et_sl
Base model: nvidia/parakeet-tdt-0.6b-v3