---
language:
- et
license: cc-by-4.0
library_name: nemo
tags:
- automatic-speech-recognition
- speech
- nemo
- parakeet
- estonian
- transducer
- FastConformer
- TDT
datasets:
- mozilla-foundation/common_voice_17_0
- yuriyvnv/synthetic_asr_et_sl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
- name: parakeet-tdt-0.6b-estonian
results:
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: Common Voice 17.0 (et) - Test
type: mozilla-foundation/common_voice_17_0
config: et
split: test
metrics:
- type: wer
value: 21.03
name: WER (raw)
- type: wer
value: 18.51
name: WER (normalized)
- type: cer
value: 4.64
name: CER (raw)
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: FLEURS (et) - Test
type: google/fleurs
config: et_ee
split: test
metrics:
- type: wer
value: 35.29
name: WER (raw)
- type: wer
value: 12.36
name: WER (normalized)
- type: cer
value: 7.06
name: CER (raw)
---
# Parakeet-TDT-0.6B Estonian
Fine-tuned [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) for Estonian automatic speech recognition, augmented with TTS-generated synthetic data.
This model accompanies the paper *"Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper"* (Interspeech 2026). Paper coming soon.
## Model Description
- **Architecture:** FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
- **Parameters:** 0.6B
- **Tokenizer:** 8,192-token SentencePiece BPE
- **Base model:** `nvidia/parakeet-tdt-0.6b-v3`
- **Fine-tuning data:** Common Voice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
- **Training config:** CV + Synth All (full synthetic corpus with quality filtering)
## Evaluation Results
### Raw WER/CER (no text normalization)
| Test Set | WER | CER |
|----------|-----|-----|
| Common Voice 17 Test | **21.03** | **4.64** |
| Common Voice 17 Val | **20.18** | **4.21** |
| FLEURS Test | **35.29** | **7.06** |
### Normalized WER/CER (lowercase + punctuation removal)
| Test Set | WER | CER |
|----------|-----|-----|
| Common Voice 17 Test | **18.51** | **4.13** |
| Common Voice 17 Val | **17.91** | **3.78** |
| FLEURS Test | **12.36** | **3.24** |
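The normalization applied above is lowercasing plus punctuation removal. A minimal sketch of that step together with a plain Levenshtein-based WER is shown below; this approximates, but is not necessarily identical to, the exact normalizer used in the paper:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping Estonian letters intact."""
    text = text.lower()
    # Drop every Unicode punctuation character (category "P*")
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# Punctuation and casing differences vanish after normalization
print(wer(normalize("Tere, maailm!"), normalize("Tere maailm")))  # 0.0
```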
### Improvement over baselines
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|-----------|-----------------|-------------------|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
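A paired bootstrap over utterance-level error counts can be sketched as follows. The per-utterance counts here are toy values for illustration, not the paper's data, and the paper used n = 100,000 resamples rather than the smaller count below:

```python
import random

def paired_bootstrap(errors_a, errors_b, words, n_resamples=10_000, seed=42):
    """Paired bootstrap test for the WER difference between two systems.

    errors_a / errors_b: per-utterance word-error counts for each system;
    words: per-utterance reference word counts. Returns the fraction of
    resamples in which system A is NOT better than system B (one-sided p).
    """
    rng = random.Random(seed)
    n = len(words)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping A/B pairing intact
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(words[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / total_words
        wer_b = sum(errors_b[i] for i in idx) / total_words
        if wer_a >= wer_b:
            not_better += 1
    return not_better / n_resamples

# Toy example: system A makes strictly fewer errors on every utterance
errors_a = [0, 1, 0, 2, 0, 1, 0, 0, 1, 0] * 20
errors_b = [1, 2, 1, 3, 1, 2, 1, 1, 2, 1] * 20
words    = [8, 9, 7, 10, 6, 8, 9, 7, 8, 10] * 20
p = paired_bootstrap(errors_a, errors_b, words, n_resamples=2000)
print(p)  # 0.0
```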
## Usage
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe a 16 kHz mono WAV file
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
## Training Details
- **Optimizer:** AdamW (lr=5e-5, betas=[0.9, 0.98], weight_decay=0.001)
- **Schedule:** Cosine annealing with 10% linear warmup
- **Batch size:** 32
- **Early stopping:** patience 10 epochs on val_wer
- **Best epoch:** 74 (val_wer = 0.2002)
- **Precision:** bf16-mixed
- **Seed:** 42
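In NeMo this schedule is usually configured through the built-in cosine-annealing scheduler with warmup. As an illustrative pure-Python sketch (not the actual training config), the per-step learning rate would behave as:

```python
import math

def lr_at(step, total_steps, peak_lr=5e-5, warmup_frac=0.10, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1_000
print(lr_at(99, total))   # 5e-05 -- peak reached at the end of warmup
print(lr_at(549, total))  # roughly half the peak, mid-way through the decay
```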
## Synthetic Data Augmentation
The synthetic training data was generated using a three-stage pipeline:
1. **Text generation:** GPT-5-mini generates diverse sentences across paraphrase, domain expansion, and morphological categories
2. **LLM-as-judge validation:** An LLM judge checks each sentence for grammaticality, naturalness, and language purity
3. **Speech synthesis:** OpenAI gpt-4o-mini-tts with 11-voice rotation
Dataset: [yuriyvnv/synthetic_asr_et_sl](https://huggingface.co/datasets/yuriyvnv/synthetic_asr_et_sl)
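The voice rotation in stage 3 can be sketched as a simple round-robin over the voice pool. The voice identifiers below are placeholders (the card does not name the 11 voices), and the actual TTS request is omitted:

```python
from itertools import cycle

# Placeholder identifiers -- the card states an 11-voice rotation
# but does not list the voices used.
VOICES = [f"voice_{i}" for i in range(11)]

def assign_voices(sentences):
    """Pair each validated sentence with the next voice in a round-robin."""
    rotation = cycle(VOICES)
    return [(sentence, next(rotation)) for sentence in sentences]

pairs = assign_voices([f"lause {i}" for i in range(23)])
print(pairs[0])   # ('lause 0', 'voice_0')
print(pairs[11])  # ('lause 11', 'voice_0') -- rotation wraps after 11 sentences
```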
## Acknowledgments
- Base model: [NVIDIA Parakeet-TDT-0.6B-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- Training data: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
- Evaluation: [Google FLEURS](https://huggingface.co/datasets/google/fleurs)