Article

A Dataset of Homonymous Affixes in Uzbek for Improving Lemmatization and Question Answering Systems

Ogabek Sobirov (Tashkent State University of Economics, Department of Specialized, Social-Humanitarian and Exact Sciences, Tashkent, Uzbekistan)
Bobur Shermatov (Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan)
Shaxodat Qodirberganova (Uzbekistan State World Languages University, Department of Modern Information Technologies, Tashkent, Uzbekistan)
Dilshodbek Mansurovich Allashkurov (Tashkent State University of Economics, Department of Specialized, Social-Humanitarian and Exact Sciences, Tashkent, Uzbekistan)
Umarbek Rakhmatullayev (Tashkent State University of Economics, Department of Specialized, Social-Humanitarian and Exact Sciences, Tashkent, Uzbekistan)

2025


Abstract

This paper presents a manually compiled dataset of Uzbek homonymous affixes intended to support natural language processing (NLP) research on morphologically rich languages. The dataset, titled Uzbek Homonym Affixes, contains 56 entries, each representing a unique affix and its occurrences across multiple parts of speech, including nouns, adjectives, verbs, adverbs, and auxiliary words. The resource captures both derivational (so‘z yasovchi) and inflectional (shakl yasovchi) uses of affixes, enabling detailed morphological and lexical analysis. It was collected manually from lexical sources and structured into six columns, which allows straightforward use in NLP tasks such as morphological analysis, part-of-speech tagging, and word sense disambiguation. Beyond descriptive linguistics, the dataset provides input for computational applications such as lemmatization and question-answering systems, where handling homonymy and morphological ambiguity is critical. By offering a structured linguistic foundation, the dataset contributes to the development of Uzbek language technologies and provides a basis for future research in morphology-based natural language understanding. The dataset is publicly accessible on the Hugging Face platform at https://huggingface.co/datasets/dasturbek/uzbek_homonym_affixes.
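To illustrate why homonymous affixes matter for lemmatization, the following is a minimal sketch of how a table of affix readings could be used to enumerate competing analyses of a word. The abstract does not list the dataset's six columns, so the field names (`affix`, `pos`, `example`) are hypothetical, and the four entries are illustrative examples chosen for this sketch, not rows copied from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffixReading:
    # Hypothetical fields; the real dataset's column names may differ.
    affix: str    # surface form of the affix, without the leading hyphen
    pos: str      # part of speech associated with this reading
    example: str  # example word or phrase showing the reading


# Illustrative readings only (assumption: not taken from the dataset).
READINGS = [
    AffixReading("chi", "noun", "ishchi"),      # noun-forming: ish + chi
    AffixReading("chi", "auxiliary", "sen-chi"),  # particle use of -chi
    AffixReading("ma", "noun", "qovurma"),      # noun-forming: qovur + ma
    AffixReading("ma", "verb", "kelma"),        # verbal negation: kel + ma
]


def candidate_analyses(word, readings=READINGS):
    """Return every (stem, affix, pos) split where the word ends in a listed affix."""
    analyses = []
    for reading in readings:
        # Require a non-empty stem so the affix alone is not treated as a word.
        if word.endswith(reading.affix) and len(word) > len(reading.affix):
            stem = word[: -len(reading.affix)]
            analyses.append((stem, reading.affix, reading.pos))
    return analyses


if __name__ == "__main__":
    for word in ("ishchi", "kelma", "kitob"):
        print(word, candidate_analyses(word))
```

A word that returns more than one analysis is exactly the case a lemmatizer or question-answering system would need to disambiguate from context; the sketch only surfaces the candidates and makes no attempt to choose among them.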

Topics

Natural Language Processing Techniques; Topic Modeling; Second Language Acquisition and Learning

Identifiers

DOI: 10.1109/apeie66761.2025.11289308

Citations and references

Cited by: 0; References: 12
