Karakalpak Speech Corpus
Abstract
The Karakalpak Speech Corpus is the first large-scale, publicly available speech-to-text dataset for the Karakalpak language, designed to support the development, evaluation, and benchmarking of automatic speech recognition (ASR) systems for this low-resource Turkic language.

Research hypothesis
The core hypothesis behind this dataset is that high-quality, carefully curated speech–text pairs, even at moderate scale, can enable state-of-the-art self-supervised models (such as Wav2Vec 2.0) to achieve strong recognition performance for low-resource languages. By providing sufficient phonetic, lexical, and speaker diversity, the corpus aims to bridge the data gap that has historically limited Karakalpak speech technology.

What the data contains
The dataset consists of:
- Speech recordings in WAV format (16 kHz, 16-bit PCM)
- Manually verified transcriptions in standard Karakalpak Latin orthography
- Speaker-independent splits for training, validation, and testing
Each audio file corresponds to a single utterance, making the corpus suitable for end-to-end ASR, forced alignment, pronunciation modeling, and acoustic analysis.
The recordings include:
- Read speech
- Conversational and narrative sentences
- Phonetically rich word sequences
- Numbers, commands, and daily expressions
This ensures broad coverage of Karakalpak phonology, morphology, and vocabulary.

How the data was gathered
The corpus was collected from native Karakalpak speakers under controlled recording conditions. All recordings were made in quiet indoor environments using consumer-grade microphones and laptops, at a sampling rate of 16 kHz. Speakers were instructed to read predefined texts clearly and naturally. All transcriptions were manually checked and normalized to remove spelling inconsistencies, Unicode artifacts, and non-Karakalpak characters. This results in a clean and reproducible linguistic representation of spoken Karakalpak.
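The transcription clean-up described above could be approximated along the following lines. This is only a sketch under stated assumptions, not the procedure the corpus authors used: the function names `normalize` and `unexpected_chars` are hypothetical, and the allowed character set (basic Latin letters plus the special letters named in this description, the space, and the apostrophe) is an assumption that may not match the corpus's actual alphabet.

```python
import string
import unicodedata

# Assumption: allowed characters are basic Latin letters plus the special
# letters named in this description; the real corpus alphabet may differ.
SPECIAL = "áóúıńśǵ"
ALLOWED = set(string.ascii_letters + SPECIAL + SPECIAL.upper() + " '")


def normalize(text: str) -> str:
    """Compose combining marks (Unicode NFC) and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def unexpected_chars(text: str) -> set[str]:
    """Return characters outside the assumed alphabet, after normalization."""
    return {ch for ch in normalize(text) if ch not in ALLOWED}
```

NFC composition matters here because a letter such as "á" can be stored either as one code point or as "a" followed by a combining acute accent; without normalization the two forms would count as different vocabulary symbols during ASR training.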

What the data shows
The dataset demonstrates that:
- Karakalpak phonemes and special letters (á, ó, ú, ı, ń, ś, ǵ) can be reliably captured and modeled
- A consistent orthography and vocabulary can be established for ASR training
- Speaker-independent evaluation is feasible
When used to fine-tune Wav2Vec 2.0 models, the corpus produces low word error rates (WER) and character error rates (CER), confirming that the dataset contains sufficient acoustic and linguistic information for high-quality speech recognition.
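
For readers unfamiliar with the two metrics, WER and CER are conventionally defined as the Levenshtein edit distance between reference and hypothesis (over words or characters, respectively) divided by the reference length. The sketch below illustrates that standard definition; the function names are illustrative, and it is not necessarily the scoring code used to evaluate models on this corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because Karakalpak is agglutinative, a single wrong suffix counts as a full word error, so CER is often reported alongside WER to show how close the character sequence actually is.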