NbAiLab / nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim
Norwegian verbatim ASR checkpoint for the NB-ASR beta program
This repository hosts a Norwegian verbatim ASR checkpoint built from a Parakeet RNNT XXL 1.1B training run and packaged for NB-ASR beta evaluation.
Internal run reference: dgx-8gpu-eval4-20260404-1053
Attribution
This model is derived from NVIDIA Parakeet RNNT checkpoints and adapted by NbAiLab for Norwegian NB-ASR beta use.
- Base model family: `nvidia/parakeet-rnnt-1.1b`
- Original model provider: NVIDIA
- Modifications by: NbAiLab / NB-ASR project (fine-tuning, packaging, and evaluation setup)
- This repository license: CC-BY-4.0
When redistributing or referencing this model, keep attribution to both NVIDIA (base model source) and NbAiLab (derived checkpoint work).
Checkpoint source (local training artifact):
/nfs/datastore0/nb-asr-export/parakeet-runs/dgx-8gpu-eval4-20260404-1053/2026-04-04_08-53-58/checkpoints
Main files from that run:
dgx-8gpu-eval4-20260404-1053.nemo
What This Model Is For
- Norwegian speech-to-text (verbatim output)
- checkpoint validation and benchmarking
- timestamped transcription workflows
- downstream diarization + speaker-attributed transcript generation
Installation
pip install -U "nemo_toolkit[asr]" soundfile huggingface_hub
For GPU systems, install a CUDA-compatible PyTorch build first.
Environment Setup (Recommended)
Use a fresh environment to avoid mixed dependency stacks.
python -m venv .venv-nb-asr-beta
source .venv-nb-asr-beta/bin/activate
export PYTHONNOUSERSITE=1
python -m pip install -U pip setuptools wheel
python -m pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision torchaudio
python -m pip install "nemo_toolkit[asr]" soundfile huggingface_hub
python -m pip install --force-reinstall --no-deps "lightning==2.4.0" "pytorch-lightning==2.4.0"
If your machine only supports CUDA 12.4 drivers, use the cu124 index URL instead of cu128.
Verify GPU is active:
python - << 'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
PY
Canonical Load Method For This Repo
Use hf_hub_download(...) + EncDecRNNTBPEModel.restore_from(...) for this model.
Do not use:
nemo_asr.models.ASRModel.from_pretrained("NbAiLab/nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim")
That path can trigger a NeMo cache-resolution issue for this repository layout and fail with:
FileNotFoundError: .../hf_hub_cache/.../model_config.yaml
cuDNN Initialization Troubleshooting
If you see:
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
use this GPU-safe command:
python run_demo.py --audio audio/audio.wav --disable-cudnn
Quick Start: Transcription
Two GPU paths are supported:
- Path A (default): standard cuDNN behavior.
- Path B (optional fallback): disable cuDNN with `--disable-cudnn` if needed.
Path A: Default GPU Path
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import EncDecRNNTBPEModel
MODEL_ID = "NbAiLab/nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim"
MODEL_FILE = "dgx-8gpu-eval4-20260404-1053.nemo"
AUDIO = "audio/audio.wav"
nemo_file = hf_hub_download(repo_id=MODEL_ID, filename=MODEL_FILE)
asr_model = EncDecRNNTBPEModel.restore_from(restore_path=nemo_file)
results = asr_model.transcribe([AUDIO])
print(results[0].text)
Path B: GPU With cuDNN Disabled (Optional Fallback)
import torch
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import EncDecRNNTBPEModel
MODEL_ID = "NbAiLab/nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim"
MODEL_FILE = "dgx-8gpu-eval4-20260404-1053.nemo"
AUDIO = "audio/audio.wav"
torch.backends.cudnn.enabled = False
nemo_file = hf_hub_download(repo_id=MODEL_ID, filename=MODEL_FILE)
asr_model = EncDecRNNTBPEModel.restore_from(restore_path=nemo_file, map_location="cuda:0")
results = asr_model.transcribe([AUDIO])
print(results[0].text)
Timestamping (Word + Segment)
NeMo supports timestamps for Parakeet models, including RNNT.
import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import EncDecRNNTBPEModel
MODEL_ID = "NbAiLab/nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim"
MODEL_FILE = "dgx-8gpu-eval4-20260404-1053.nemo"
AUDIO = "audio/audio.wav"
nemo_file = hf_hub_download(repo_id=MODEL_ID, filename=MODEL_FILE)
asr_model = EncDecRNNTBPEModel.restore_from(restore_path=nemo_file)
# return_hypotheses=True gives direct access to timestamp metadata
hyp = asr_model.transcribe([AUDIO], timestamps=True, return_hypotheses=True)[0]
print("TEXT:", hyp.text)
time_stride = 8 * asr_model.cfg.preprocessor.window_stride
for w in hyp.timestamp.get("word", []):
    # Some models return seconds directly; others return frame offsets.
    if "start" in w and "end" in w:
        start = w["start"]
        end = w["end"]
    else:
        start = w.get("start_offset", 0.0) * time_stride
        end = w.get("end_offset", 0.0) * time_stride
    token = w.get("word", w.get("char", ""))
    print(f"{start} -> {end} : {token}")
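The word stamps above can be turned into subtitle-style output. Below is a minimal sketch of that conversion; `srt_time` and `words_to_srt` are hypothetical helpers (not part of this repo) that assume word dicts with `start`/`end` already in seconds:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timecode (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words) -> str:
    """Build numbered SRT cues from {start, end, word} dicts (seconds)."""
    cues = []
    for i, w in enumerate(words, start=1):
        cues.append(f"{i}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{w['word']}\n")
    return "\n".join(cues)

# Example input shaped like the loop above produces
words = [{"start": 0.08, "end": 0.56, "word": "hei"},
         {"start": 0.64, "end": 1.12, "word": "verden"}]
print(words_to_srt(words))
```

In practice you would usually merge several words per cue; per-word cues are shown only to keep the sketch short.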
Speaker Diarization + ASR Merge
Recommended diarization model
For new projects, prefer:
nvidia/diar_streaming_sortformer_4spk-v2.1
It is the newer NVIDIA diarizer and supports direct Python usage with NeMo.
NVIDIA also provides nvidia/diar_sortformer_4spk-v1 as an alternative.
Important licensing note
`nvidia/parakeet-rnnt-1.1b` and `nvidia/parakeet-tdt-0.6b-v3` are published under CC-BY-4.0. `nvidia/diar_sortformer_4spk-v1` is CC-BY-NC-4.0 (non-commercial). `nvidia/diar_streaming_sortformer_4spk-v2.1` is under the NVIDIA Open Model License.
If you redistribute a combined workflow or bundle model artifacts, verify that your intended usage complies with each model's license.
License Compatibility
This repository is part of the open nb-asr-beta group. Licensing is fixed as follows:
- This ASR model repo: CC-BY-4.0
- Diarization companions referenced in this README:
  - `nvidia/diar_streaming_sortformer_4spk-v2.1` (NVIDIA Open Model License)
  - `nvidia/diar_sortformer_4spk-v1` (available from NVIDIA under CC-BY-NC-4.0, non-commercial)
What this means for users
- This repo's model artifacts are distributed under CC-BY-4.0.
- Diarization usage follows the diarizer model's own license terms.
- `diar_streaming_sortformer_4spk-v2.1` avoids the non-commercial restriction of `diar_sortformer_4spk-v1`, but users must follow the NVIDIA Open Model License obligations.
Why license this repo as CC-BY-4.0
This checkpoint is derived from NVIDIA Parakeet models published as CC-BY-4.0. The nb-asr-beta decision is to keep this repository under CC-BY-4.0 and provide explicit attribution to NVIDIA and NbAiLab.
This keeps distribution terms straightforward for downstream users and avoids NC restrictions at the ASR-model level.
End-to-end example (ASR words + diarization segments)
from dataclasses import dataclass
from typing import List, Dict, Any
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import SortformerEncLabelModel
from huggingface_hub import hf_hub_download
from nemo.collections.asr.models import EncDecRNNTBPEModel
ASR_MODEL_ID = "NbAiLab/nb-asr-beta-Parakeet-RNNT-XXL-1.1b-verbatim"
ASR_MODEL_FILE = "dgx-8gpu-eval4-20260404-1053.nemo"
DIAR_MODEL_ID = "nvidia/diar_streaming_sortformer_4spk-v2.1"
AUDIO = "audio/audio_all.wav"
@dataclass
class Segment:
    start: float
    end: float
    speaker: str

def to_float(x, default=0.0):
    try:
        return float(x)
    except Exception:
        return default

def parse_diar_segments(raw_segments: List[Any]) -> List[Segment]:
    out = []
    for s in raw_segments:
        # Covers common NeMo outputs:
        # - "start end speaker" strings
        # - tuple/list [start, end, speaker]
        # - object-style fields
        if isinstance(s, str):
            parts = s.strip().split()
            if len(parts) >= 3:
                start, end, speaker = parts[0], parts[1], parts[2]
            else:
                start, end, speaker = 0.0, 0.0, "speaker_unknown"
        elif isinstance(s, (tuple, list)) and len(s) >= 3:
            start, end, speaker = s[0], s[1], s[2]
        else:
            start = getattr(s, "start", 0.0)
            end = getattr(s, "end", 0.0)
            speaker = getattr(s, "speaker", "speaker_unknown")
        out.append(Segment(to_float(start), to_float(end), str(speaker)))
    return out

def overlap(a0, a1, b0, b1):
    return max(0.0, min(a1, b1) - max(a0, b0))

def attach_speakers(word_stamps: List[Dict[str, Any]], diar_segments: List[Segment]):
    enriched = []
    time_stride = 8 * asr_model.cfg.preprocessor.window_stride
    for w in word_stamps:
        if "start" in w and "end" in w:
            ws = to_float(w.get("start", 0.0))
            we = to_float(w.get("end", ws))
        else:
            ws = to_float(w.get("start_offset", 0.0)) * time_stride
            we = to_float(w.get("end_offset", 0.0)) * time_stride
        word = w.get("word", w.get("char", "")).strip()
        best_spk = "speaker_unknown"
        best_ov = -1.0
        for seg in diar_segments:
            ov = overlap(ws, we, seg.start, seg.end)
            if ov > best_ov:
                best_ov = ov
                best_spk = seg.speaker
        enriched.append({"start": ws, "end": we, "word": word, "speaker": best_spk})
    return enriched

# 1) ASR with timestamps
asr_nemo_file = hf_hub_download(repo_id=ASR_MODEL_ID, filename=ASR_MODEL_FILE)
asr_model = EncDecRNNTBPEModel.restore_from(restore_path=asr_nemo_file)
hyp = asr_model.transcribe([AUDIO], timestamps=True, return_hypotheses=True)[0]
word_stamps = hyp.timestamp.get("word", [])

# 2) Diarization
diar_model = SortformerEncLabelModel.from_pretrained(DIAR_MODEL_ID)
diar_model.eval()
raw = diar_model.diarize(audio=[AUDIO], batch_size=1)[0]
segments = parse_diar_segments(raw)

# 3) Merge words with speaker labels
speaker_words = attach_speakers(word_stamps, segments)
for row in speaker_words:
    print(f"[{row['start']:.2f}-{row['end']:.2f}] {row['speaker']}: {row['word']}")
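The per-word rows can also be grouped into speaker turns (consecutive words from the same speaker). A hedged sketch follows; `group_into_turns` and its `max_gap` threshold are illustrative, not part of this repo's API:

```python
def group_into_turns(speaker_words, max_gap=1.0):
    """Merge consecutive same-speaker words into turns.

    A new turn starts when the speaker changes or the silence between
    words exceeds max_gap seconds (illustrative threshold).
    """
    turns = []
    for w in speaker_words:
        if (turns
                and turns[-1]["speaker"] == w["speaker"]
                and w["start"] - turns[-1]["end"] <= max_gap):
            turns[-1]["end"] = w["end"]
            turns[-1]["text"] += " " + w["word"]
        else:
            turns.append({"start": w["start"], "end": w["end"],
                          "speaker": w["speaker"], "text": w["word"]})
    return turns

# Example input shaped like attach_speakers() output
words = [
    {"start": 0.0, "end": 0.4, "word": "hei", "speaker": "speaker_0"},
    {"start": 0.5, "end": 0.9, "word": "der", "speaker": "speaker_0"},
    {"start": 1.2, "end": 1.6, "word": "hallo", "speaker": "speaker_1"},
]
for t in group_into_turns(words):
    print(f"[{t['start']:.2f}-{t['end']:.2f}] {t['speaker']}: {t['text']}")
```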
Included Files
- `audio/audio.wav` (single-utterance example file)
- `audio/audio2.mp3` and `audio/audio3.mp3` (source clips)
- `audio/audio2.wav` and `audio/audio3.wav` (wav16-mono conversions)
- `audio/audio_all.wav` (concatenated wav16-mono file for diarization demos)
- `audio/aimilliarden.wav` (long-form example file)
- `dgx-8gpu-eval4-20260404-1053.nemo` (inference package file)
- `run_demo.py` (CLI demo for ASR + timestamps + optional diarization)
Included audio files used in the examples are sourced from https://huggingface.co/datasets/NbAiLab/NST and follow this repository's license.
Training-state checkpoint files (*.ckpt, optimizer/scheduler states, etc.) are not distributed in this HF repository.
One-Command Demo Script
Run ASR + timestamps with the included audio:
python run_demo.py --audio audio/audio.wav
Optional fallback (with the flag):
python run_demo.py --audio audio/audio.wav --disable-cudnn
Run ASR + timestamps + diarization and save JSON:
python run_demo.py \
--audio audio/audio_all.wav \
--disable-cudnn \
--with-diarization \
--output demo_output.json
Run long-form transcription example:
python run_demo.py \
--audio audio/aimilliarden.wav \
--disable-cudnn \
--output demo_output_long.json
Run from local .nemo artifact (instead of HF pull):
python run_demo.py \
--audio audio/audio_all.wav \
--disable-cudnn \
--asr-local-nemo dgx-8gpu-eval4-20260404-1053.nemo \
--with-diarization
Readable output examples:
jq '.speaker_turns' demo_output.json
jq -r '.speaker_turns[] | "[\(.start)-\(.end)] \(.speaker): \(.text)"' demo_output.json
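If `jq` is not available, the same readable view can be produced in Python. This sketch assumes the saved JSON contains a `speaker_turns` list with `start`, `end`, `speaker`, and `text` fields (an inline sample is parsed here in place of the real file):

```python
import json

def format_turns(doc):
    """Render speaker turns like the jq one-liner above."""
    return [f"[{t['start']}-{t['end']}] {t['speaker']}: {t['text']}"
            for t in doc.get("speaker_turns", [])]

# In practice: doc = json.load(open("demo_output.json"))
doc = json.loads('{"speaker_turns": [{"start": 0.0, "end": 1.9, '
                 '"speaker": "speaker_0", "text": "hei der"}]}')
for line in format_turns(doc):
    print(line)
```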
Intended Scope
This is a beta checkpoint for controlled NB-ASR evaluation and integration. It is not yet a final public production release.
Acknowledgements
This model is based on the NVIDIA NeMo Parakeet RNNT family and adapted by the NB-ASR project at the National Library of Norway.