Article

Development of a Lemmatization Algorithm: Interpreting Open Compound Words

Abdusobir Saidov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Department of Convergence of Digital Technologies, Tashkent, Uzbekistan
Maksud Sharipov, Urgench State University named after Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan
2025
ABI

Abstract

Open compound words pose significant challenges in the lemmatization process for morphologically rich and low-resource languages such as Uzbek, where multi-word expressions often carry unified meanings but appear as separate tokens. This research aims to address this issue by developing a lemmatization algorithm tailored specifically for Uzbek open compound words, which are underrepresented in existing NLP resources. The method relies on a dictionary-based approach supported by a finite-state machine (FSM) to recognize and normalize compound words without requiring prior tokenization. The compiled lexicon includes over 900 compound units across various parts of speech, including verbs, adverbs, pronouns, interjections, and auxiliary verb constructions. Evaluation of the algorithm on structured textual data demonstrates its ability to accurately identify compound forms and map them to their base lemmas with high precision. The approach does not require large annotated corpora or complex machine learning models, making it transparent, reproducible, and adaptable for resource-scarce languages. These findings highlight the potential of rule-based lexical methods combined with finite-state modeling in improving lemmatization performance and suggest promising directions for further development of NLP tools for the Uzbek language and other agglutinative languages with similar structural characteristics.
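To make the dictionary-plus-FSM idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the lexicon can be compiled into a character-level state machine (a trie) that scans the raw string, so no separate tokenization pass is needed, and that the longest match ending at a word boundary wins. The names `LEXICON`, `State`, `build_fsm`, and `find_compounds` are hypothetical, and the four entries are illustrative Uzbek examples rather than items taken from the paper's 900-unit lexicon.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mini-lexicon: surface form -> lemma (illustrative entries only).
LEXICON: Dict[str, str] = {
    "sotib oldi": "sotib olmoq",   # inflected compound verb -> base form
    "sotib olmoq": "sotib olmoq",
    "hech kim": "hech kim",        # compound pronoun
    "har doim": "har doim",        # compound adverb
}


class State:
    """One FSM state: outgoing character transitions plus an optional lemma."""

    def __init__(self) -> None:
        self.next: Dict[str, "State"] = {}
        self.lemma: Optional[str] = None  # non-None marks an accepting state


def build_fsm(lexicon: Dict[str, str]) -> State:
    """Compile the lexicon into a character-level trie (a deterministic FSM)."""
    start = State()
    for surface, lemma in lexicon.items():
        node = start
        for ch in surface.lower():
            node = node.next.setdefault(ch, State())
        node.lemma = lemma
    return start


def find_compounds(text: str, fsm: State) -> List[Tuple[int, int, str]]:
    """Scan raw text and return (start, end, lemma) for each longest match.

    Assumption: a match must begin and end at a word boundary, approximated
    here as "the neighbouring character is not alphanumeric".
    """
    lowered = text.lower()
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(lowered):
        best: Optional[Tuple[int, int, str]] = None
        if i == 0 or not lowered[i - 1].isalnum():
            node, j = fsm, i
            while j < len(lowered) and lowered[j] in node.next:
                node = node.next[lowered[j]]
                j += 1
                at_end = j == len(lowered) or not lowered[j].isalnum()
                if node.lemma is not None and at_end:
                    best = (i, j, node.lemma)  # keep extending: longest wins
        if best is not None:
            matches.append(best)
            i = best[1]  # resume scanning after the matched compound
        else:
            i += 1
    return matches


if __name__ == "__main__":
    fsm = build_fsm(LEXICON)
    for start, end, lemma in find_compounds("U kitob sotib oldi va har doim o'qiydi.", fsm):
        print(start, end, lemma)
```

Listing inflected surface forms directly in the lexicon keeps the sketch short; a fuller system would presumably handle suffixation by rule rather than by enumeration, which this example does not attempt.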
