Other

UzbekPOS: Multi-domain Part-Of-Speech Dataset for the Uzbek Language

Maksud Sharipov · Elmurod Kuriyozov · Al Xorazmiy nomidagi Urganch davlat universiteti

Mendeley Data repository · 2026

ABI

Abstract

UzbekPOS is a manually annotated, multi-domain Part-of-Speech (POS) tagging dataset for the Uzbek language, created to support research and development in Natural Language Processing (NLP), computational linguistics, and corpus linguistics. Uzbek is a morphologically rich and under-resourced Turkic language, and this dataset addresses the lack of large-scale, high-quality annotated resources for fundamental linguistic tasks. The dataset contains 4,412 sentences and 53,113 token–tag pairs, collected from 25 diverse domains, including literature, news, science, education, law, medicine, technology, social interaction, and public discourse. This domain coverage provides linguistic, stylistic, and topical diversity, making the corpus suitable for both academic research and applied NLP systems. All sentences were manually tokenized and POS-tagged by expert annotators using a tagset based on the Universal Dependencies (UD) UPOS framework, with adaptations for Uzbek-specific grammatical features. The standard DET (Determiner) tag was omitted because Uzbek has no articles, and a language-specific MOD (Modal) tag was introduced to better capture Uzbek functional grammar. The final tagset consists of 16 POS tags. To guarantee high annotation quality, each sentence was processed through a three-stage validation pipeline: initial annotation, independent cross-verification by a second expert, and final adjudication by a senior linguist in cases of disagreement. This process is intended to make the dataset a gold-standard POS resource.
The UzbekPOS dataset is distributed in multiple widely used formats to maximize accessibility and reuse: raw annotated text (.txt) with / structure; Tab-Separated Values (.tsv) for easy inspection and processing; JSON Lines (.jsonl) for scalable programmatic use; and CoNLL-U (.conllu) format, fully compatible with UD-based NLP tools. In addition, predefined train, development, and test splits are included to support standardized benchmarking and reproducible experiments.
UzbekPOS can be used for: training and evaluating POS taggers for Uzbek; morphological and syntactic analysis; cross-lingual and typological studies of Turkic languages; transfer learning and low-resource NLP research; and educational purposes in NLP and corpus linguistics.
This dataset is one of the largest openly available POS-tagged corpora for Uzbek and provides a solid foundation for future Uzbek and Turkic language technology development.
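As a rough illustration of working with the CoNLL-U distribution, the following Python sketch reads token and UPOS pairs and counts tag frequencies. It assumes only the standard ten-column, tab-separated CoNLL-U layout (FORM in column 2, UPOS in column 4). The sample sentence and its tags are made up for illustration and are not taken from the dataset, and the record does not state the actual file names, so none are used here.

```python
from collections import Counter

# Hypothetical sample in CoNLL-U layout; NOT taken from the UzbekPOS dataset.
SAMPLE = """\
# text = Men kitob o'qidim.
1\tMen\t_\tPRON\t_\t_\t_\t_\t_\t_
2\tkitob\t_\tNOUN\t_\t_\t_\t_\t_\t_
3\to'qidim\t_\tVERB\t_\t_\t_\t_\t_\t_
4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_
"""


def read_conllu(text):
    """Yield each sentence as a list of (form, upos) pairs."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((form, upos))
    if sentence:
        yield sentence  # last sentence when there is no trailing blank line


sentences = list(read_conllu(SAMPLE))
tag_counts = Counter(upos for sent in sentences for _, upos in sent)
print(len(sentences), dict(tag_counts))
```

To apply the same function to a file from the dataset, the file contents can be read with open(path, encoding="utf-8").read(), where path is a placeholder for whichever .conllu file the distribution provides.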

Identifiers

DOI: 10.17632/55f889ncnx

Citations and references

Citations: 0 · References used: 0
