Article

An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches

Nilufar Abdurakhmonova, National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan
Raima Shirinova, National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan
Rano Sayfullayeva, National University of Uzbekistan named after Mirzo Ulugbek, 4, University str., Tashkent city 100174, Uzbekistan
Davlatyor Mengliev, Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, 110, al-Khwarizmi str., Urgench city 220100, Uzbekistan
Bahodir Ibragimov, Urgench State University, 14, Kh.Alimdjan str., Urgench city 220100, Uzbekistan
Manzura Ernazarova, Navoi State University, 45, Ibn Sino str., Navoi city 210100, Uzbekistan

Data in Brief (journal), 2025, English


Abstract

This paper presents a morphologically annotated dataset for the Uzbek language, designed for use with morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each labeled with its root, affixes, and part of speech. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a machine learning model based on conditional random fields (CRF). In addition, genre-specific testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a baseline resource for further research in Uzbek natural language processing, including applications in text generation, semantic analysis, and grammar correction.
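To make the rule-based side of this comparison concrete, the following is a minimal Python sketch of suffix-stripping stemming for an agglutinative language such as Uzbek. It is not the authors' algorithm: the suffix inventory, the minimum root length, and the function name `split_word` are assumptions chosen for illustration, and a real rule set would be far larger and would handle exceptions.

```python
# Illustrative suffix inventory (an assumption, not the authors' rule set):
# plural -lar, possessive -imiz, genitive -ning, ablative -dan,
# accusative -ni, dative -ga, locative -da.
SUFFIXES = ["lar", "imiz", "ning", "dan", "ni", "ga", "da"]

# Hypothetical guard against stripping a word down to nothing.
MIN_ROOT_LEN = 3


def split_word(word):
    """Strip known suffixes from the right, trying longer suffixes first.

    Returns (root, affixes), with affixes listed in surface order.
    """
    root = word.lower()
    affixes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if root.endswith(suffix) and len(root) - len(suffix) >= MIN_ROOT_LEN:
                root = root[: -len(suffix)]
                affixes.insert(0, suffix)  # keep left-to-right order
                stripped = True
                break
    return root, affixes


if __name__ == "__main__":
    print(split_word("kitoblarimizdan"))  # "from our books"
    print(split_word("maktabda"))         # "at school"
```

A stripper of this kind will over-stem any root that happens to end in a suffix-like string, which is one reason manually annotated root and affix labels are useful both for evaluating rules and for training a sequence model such as a CRF.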

Topics

Natural Language Processing Techniques

Identifiers

DOI: 10.1016/j.dib.2025.111702

Citations and references

Cited by: 5
References: 21
