UzbekPOS: A multi-domain dataset for Uzbek part-of-speech tagging
Abstract
In this paper, we introduce UzbekPOS, a manually annotated part-of-speech (POS) tagged dataset for the Uzbek language, designed for natural language processing, artificial intelligence models, and corpus linguistics applications. It is currently the largest publicly available POS-tagged corpus for Uzbek. The dataset comprises sentences drawn from a diverse range of Uzbek text sources, including literature, news outlets, science, education, and public speaking, to reflect linguistic and topical diversity. The sentences were tokenized and annotated by professional annotators using a fine-grained tagset of 16 tags, which combines the standard Universal Dependencies tags with additional labels specific to the morphological and syntactic features of Uzbek. UzbekPOS contains almost 4.5K sentences and more than 53K token/tag pairs, and each annotation was cross-verified by at least two annotators to improve reliability. The dataset is distributed as raw text (TXT), in the widely used TSV and JSON formats, and in CoNLL-U, the standard Universal Dependencies format. It is one of the first openly published POS-tagged datasets for Uzbek, an under-resourced and morphologically complex Turkic language. The dataset can serve as training data for POS taggers, as a test set for machine learning models, and as a source for linguistic studies. It can also be reused for related tasks such as morphological analysis, syntactic parsing, and transfer learning across Turkic languages. Furthermore, it can serve as seed material for building similar POS-tagged corpora for other Turkic languages and can support cross-linguistic analysis and tool development.
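As an illustration of how the CoNLL-U distribution might be consumed, the following minimal Python sketch extracts token/tag pairs from CoNLL-U text. It assumes the standard ten-column CoNLL-U layout with the tag in the fourth (UPOS) column. The function name `read_conllu_pairs` and the sample sentence are hypothetical: they are made up for illustration and are not taken from UzbekPOS, whose exact file layout and tag labels may differ.

```python
from typing import Iterator, List, Tuple

# Hypothetical sample in standard CoNLL-U layout (10 tab-separated columns).
# The sentence and its tags are illustrative only, not drawn from UzbekPOS.
SAMPLE = "\n".join([
    "# text = Men kitob o'qidim.",
    "1\tMen\t_\tPRON\t_\t_\t_\t_\t_\t_",
    "2\tkitob\t_\tNOUN\t_\t_\t_\t_\t_\t_",
    "3\to'qidim\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_",
    "",
])


def read_conllu_pairs(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (token, tag) pairs per sentence."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata, not tokens.
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            # Skip malformed lines that lack a tag column.
            continue
        token_id, form, tag = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append((form, tag))
    if sentence:
        # Emit the last sentence if the text does not end with a blank line.
        yield sentence


if __name__ == "__main__":
    for pairs in read_conllu_pairs(SAMPLE):
        print(pairs)
```

The resulting per-sentence lists of pairs can be passed to a tagger's training routine or compared against model predictions when the dataset is used as a test set.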