Article

Annotated universal dependencies dataset for literary and educational Uzbek texts

Sanatbek Matlatipov, National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street 4, Olmazor District, 100174, Tashkent City, Uzbekistan

Mersaid Aripov, National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street 4, Olmazor District, 100174, Tashkent City, Uzbekistan

Makhmud Bobokandov, National University of Uzbekistan named after Mirzo Ulugbek, Universitet Street 4, Olmazor District, 100174, Tashkent City, Uzbekistan

Gayrat Matlatipov, Urgench State University named after Abu Rayhan Biruni, Khamid Alimdjan 14, 220100, Urgench City, Uzbekistan

Data in Brief (journal), 2026, English

ABI

Abstract

This data article describes an Uzbek Universal Dependencies (UD) treebank released as a manually curated gold-standard dataset. The resource contains 681 sentences (7542 tokens) drawn from literary and educational Uzbek texts, providing a domain-specific complement to previously available web-based or news-oriented materials. Annotation was carried out in the INCEpTION environment by a five-member team comprising three linguists and two NLP engineers. The workflow followed the UD v2 framework and included calibration-stage agreement assessment, full-corpus double annotation, and adjudication to improve annotation consistency. Agreement measured on the shared calibration material was high across lemmatization, universal part-of-speech annotation, and complete morphological feature-value bundles. The released dataset contains final adjudicated gold-standard annotations, including lemmas, UPOS tags, morphological features, and basic dependency relations in standard CoNLL-U format, and has been validated for compatibility with the Universal Dependencies ecosystem. As an openly reusable Uzbek syntactic resource, it can support the development and evaluation of POS taggers, morphological analyzers, and dependency parsers, while also enabling comparative and cross-lingual studies for low-resource languages.
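Because the release is in standard CoNLL-U format, headline figures such as the sentence and token counts can be checked with a few lines of code. The following is a minimal sketch, not the authors' tooling: the function name `conllu_stats` and the two sample sentences are invented for illustration and are not taken from the dataset. It assumes that "tokens" means syntactic words (lines with a plain integer ID), so multiword-token ranges and empty nodes are skipped; the article's own counting convention may differ.

```python
from collections import Counter
from typing import Iterable


def conllu_stats(lines: Iterable[str]) -> dict:
    """Count sentences, syntactic words, and UPOS tags in CoNLL-U text."""
    sentences = 0
    tokens = 0
    upos = Counter()
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line terminates the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated fields, got {len(cols)}")
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        upos[cols[3]] += 1  # the fourth column holds the UPOS tag
    return {"sentences": sentences, "tokens": tokens, "upos": upos}


# Invented illustration only; these sentences are NOT from the released treebank.
SAMPLE = "\n".join([
    "# text = Men kitob o'qidim.",
    "1\tMen\tmen\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tkitob\tkitob\tNOUN\t_\t_\t3\tobj\t_\t_",
    "3\to'qidim\to'qi\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
    "# text = U keldi.",
    "1\tU\tu\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tkeldi\tkel\tVERB\t_\t_\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

stats = conllu_stats(SAMPLE.splitlines())
print(stats["sentences"], stats["tokens"])
```

To apply the same check to the released file, pass an open file handle (opened with `encoding="utf-8"`) instead of `SAMPLE.splitlines()`. Under the assumptions above, the result could then be compared with the 681 sentences and 7542 tokens reported in the abstract.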

Topics

Text and Document Classification Technologies; Text Readability and Simplification; Topic Modeling

Identifiers

DOI: 10.1016/j.dib.2026.112857

Citations and references

0 citations, 4 references
