---
language:
  - et
license: cc-by-4.0
library_name: nemo
tags:
  - automatic-speech-recognition
  - speech
  - nemo
  - parakeet
  - estonian
  - transducer
  - FastConformer
  - TDT
datasets:
  - mozilla-foundation/common_voice_17_0
  - yuriyvnv/synthetic_asr_et_sl
base_model: nvidia/parakeet-tdt-0.6b-v3
pipeline_tag: automatic-speech-recognition
model-index:
  - name: parakeet-tdt-0.6b-estonian
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Common Voice 17.0 (et) - Test
          type: mozilla-foundation/common_voice_17_0
          config: et
          split: test
        metrics:
          - type: wer
            value: 21.03
            name: WER (raw)
          - type: wer
            value: 18.51
            name: WER (normalized)
          - type: cer
            value: 4.64
            name: CER (raw)
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: FLEURS (et) - Test
          type: google/fleurs
          config: et_ee
          split: test
        metrics:
          - type: wer
            value: 35.29
            name: WER (raw)
          - type: wer
            value: 12.36
            name: WER (normalized)
          - type: cer
            value: 7.06
            name: CER (raw)
---

# Parakeet-TDT-0.6B Estonian

A fine-tuned version of NVIDIA Parakeet-TDT-0.6B-v3 for Estonian automatic speech recognition, trained with TTS-generated synthetic data augmentation.

This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.

## Model Description

- **Architecture:** FastConformer encoder + Token-and-Duration Transducer (TDT) decoder
- **Parameters:** 0.6B
- **Tokenizer:** 8,192-token SentencePiece BPE
- **Base model:** [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Fine-tuning data:** Common Voice 17.0 Estonian + ~5,850 synthetic sentences (LLM-generated text + OpenAI TTS)
- **Training config:** CV + Synth All (full synthetic corpus with quality filtering)

## Evaluation Results

### Raw WER/CER (no text normalization)

| Test Set | WER (%) | CER (%) |
|---|---|---|
| Common Voice 17 Test | 21.03 | 4.64 |
| Common Voice 17 Val | 20.18 | 4.21 |
| FLEURS Test | 35.29 | 7.06 |

### Normalized WER/CER (lowercase + punctuation removal)

| Test Set | WER (%) | CER (%) |
|---|---|---|
| Common Voice 17 Test | 18.51 | 4.13 |
| Common Voice 17 Val | 17.91 | 3.78 |
| FLEURS Test | 12.36 | 3.24 |
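The normalized scores apply lowercasing and punctuation removal to both references and hypotheses before scoring. The exact normalization script is not published in this card, so the sketch below is an assumption about what such a normalizer looks like:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation before WER/CER scoring (sketch).

    Estonian letters (õ, ä, ö, ü, š, ž) are untouched; only layout-level
    differences like casing, punctuation, and extra whitespace are removed.
    """
    text = text.lower()
    # Drop punctuation characters entirely
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Tere, maailm!"))  # → tere maailm
```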

### Improvement over baselines

| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -6.16 pp | -3.85 pp |
| vs. CV-only fine-tuning | -1.32 pp | -1.30 pp |

All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
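The paper's exact resampling script is not included here, but a paired bootstrap over per-utterance error counts can be sketched as follows (illustrative only; function name and interface are assumptions):

```python
import random

def paired_bootstrap_p(errs_a, errs_b, words, n_resamples=100_000, seed=42):
    """Paired bootstrap: fraction of resamples where system A's WER is
    NOT lower than system B's. A small value means A beats B significantly.

    errs_a / errs_b: per-utterance word-error counts for the two systems.
    words: per-utterance reference word counts (shared denominator, since
    both systems are scored on the same resampled utterances).
    """
    rng = random.Random(seed)
    n = len(words)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        denom = max(1, sum(words[i] for i in idx))
        wer_a = sum(errs_a[i] for i in idx) / denom
        wer_b = sum(errs_b[i] for i in idx) / denom
        if wer_a >= wer_b:
            not_better += 1
    return not_better / n_resamples
```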

## Usage

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-estonian")

# Transcribe one or more audio files
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
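To sanity-check transcriptions against the WER figures above, any standard scorer works; for reference, a minimal word-level Levenshtein WER (not the paper's exact scoring toolchain) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / max(1, len(ref))

print(wer("tere maailm", "tere maailma"))  # → 0.5
```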

## Training Details

- **Optimizer:** AdamW (lr=5e-5, betas=[0.9, 0.98], weight_decay=0.001)
- **Schedule:** Cosine annealing with 10% linear warmup
- **Batch size:** 32
- **Early stopping:** patience 10 epochs on `val_wer`
- **Best epoch:** 74 (val_wer = 0.2002)
- **Precision:** bf16-mixed
- **Seed:** 42
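The learning-rate schedule above can be written down explicitly. This is a sketch using the peak lr (5e-5) and 10% warmup fraction listed here; the total step count and minimum lr are assumptions, not values from the training config:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 5e-5,
          warmup_frac: float = 0.10, min_lr: float = 0.0) -> float:
    """Linear warmup over the first 10% of steps, cosine annealing after."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(100, 1000))  # peak reached right as warmup ends → 5e-05
```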

## Synthetic Data Augmentation

The synthetic training data was generated using a three-stage pipeline:

1. **Text generation:** GPT-5-mini generates diverse sentences across paraphrase, domain-expansion, and morphological categories
2. **LLM-as-judge validation:** each sentence is validated for grammaticality, naturalness, and language purity
3. **Speech synthesis:** OpenAI gpt-4o-mini-tts with an 11-voice rotation

**Dataset:** [yuriyvnv/synthetic_asr_et_sl](https://huggingface.co/datasets/yuriyvnv/synthetic_asr_et_sl)

## Acknowledgments