arxiv:2605.09167

WorldSpeech: A Multilingual Speech Corpus from Around the World

Published on May 9

Abstract

WorldSpeech, a large-scale multilingual speech corpus with 65k hours of aligned audio-transcript data across 76 languages, significantly improves ASR performance when used to fine-tune existing models.

AI-generated summary

Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most other languages, where publicly available aligned data is scarce. To address this, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech yields an average relative word error rate (WER) reduction of 63.5% across 11 typologically diverse languages.
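
As a concrete illustration of the headline metric, here is a minimal sketch of how a relative WER reduction can be computed, assuming the jiwer Python library; the transcripts below are toy placeholders, not data from the paper.

import jiwer

# Toy reference transcripts and ASR outputs (placeholders, not paper data).
references = ["the quick brown fox jumps over the lazy dog"]
baseline_hypotheses = ["the quick brown fox jump over lazy dog"]
finetuned_hypotheses = ["the quick brown fox jumps over the lazy dog"]

# Word error rate before and after fine-tuning.
wer_baseline = jiwer.wer(references, baseline_hypotheses)
wer_finetuned = jiwer.wer(references, finetuned_hypotheses)

# Relative reduction: the quantity the abstract reports as 63.5% on average.
relative_reduction = (wer_baseline - wer_finetuned) / wer_baseline
print(f"baseline WER: {wer_baseline:.3f}")
print(f"fine-tuned WER: {wer_finetuned:.3f}")
print(f"relative WER reduction: {relative_reduction:.1%}")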


Get this paper in your agent:

hf papers read 2605.09167
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 1