Development of a Lemmatization Algorithm: Interpreting Open Compound Words
Abstract
Open compound words are a persistent obstacle to lemmatization in morphologically rich, low-resource languages such as Uzbek, where a multi-word expression often carries a single meaning but is written as separate tokens. This research addresses the problem by developing a lemmatization algorithm tailored to Uzbek open compound words, which are underrepresented in existing NLP resources. The method is dictionary-based: a finite-state machine (FSM) recognizes compound words and normalizes them without requiring prior tokenization. The compiled lexicon contains over 900 compound units across several parts of speech, including verbs, adverbs, pronouns, and interjections, as well as auxiliary verb constructions. Evaluation on structured textual data shows that the algorithm identifies compound forms and maps them to their base lemmas with high precision. Because the approach requires neither large annotated corpora nor complex machine learning models, it is transparent, reproducible, and adaptable to resource-scarce languages. These findings highlight the potential of rule-based lexical methods combined with finite-state modeling for improving lemmatization, and they suggest directions for further development of NLP tools for Uzbek and for other agglutinative languages with similar structural characteristics.
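To make the dictionary-plus-FSM idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the lexicon maps surface forms of open compounds to lemmas and compiles them into a character-level trie that acts as the finite-state machine, so raw text can be scanned without a separate tokenization step. The names `TOY_LEXICON`, `build_fsm`, `longest_match`, and `lemmatize_compounds` are hypothetical, the four entries are illustrative stand-ins for the 900-unit lexicon, and the word-boundary test is a simplification that ignores apostrophes inside Uzbek words.

```python
from typing import Dict, Optional, Tuple

# Illustrative entries only (assumption): surface form -> lemma.
TOY_LEXICON: Dict[str, str] = {
    "sotib oldi": "sotib olmoq",
    "sotib olmoq": "sotib olmoq",
    "hech kim": "hech kim",
    "hech kimga": "hech kim",
}

LEMMA = "__lemma__"  # marks an accepting state; cannot clash with a 1-char key


def build_fsm(lexicon: Dict[str, str]) -> dict:
    """Compile the lexicon into a trie: states are dicts, transitions are characters."""
    root: dict = {}
    for surface, lemma in lexicon.items():
        state = root
        for ch in surface.lower():
            state = state.setdefault(ch, {})
        state[LEMMA] = lemma
    return root


def longest_match(fsm: dict, text: str, start: int) -> Optional[Tuple[int, str]]:
    """Run the FSM from `start`; return (end, lemma) for the longest accepted span."""
    state, best, i = fsm, None, start
    while i < len(text) and text[i].lower() in state:
        state = state[text[i].lower()]
        i += 1
        # Accept only if the span ends at a word boundary (simplified test).
        if LEMMA in state and (i == len(text) or not text[i].isalnum()):
            best = (i, state[LEMMA])
    return best


def lemmatize_compounds(text: str, fsm: dict) -> str:
    """Scan raw text and replace each recognized open compound with its lemma."""
    out, i = [], 0
    while i < len(text):
        at_word_start = text[i].isalnum() and (i == 0 or not text[i - 1].isalnum())
        match = longest_match(fsm, text, i) if at_word_start else None
        if match:
            end, lemma = match
            out.append(lemma)
            i = end
        else:
            out.append(text[i])  # copy unmatched characters through unchanged
            i += 1
    return "".join(out)


fsm = build_fsm(TOY_LEXICON)
print(lemmatize_compounds("U kitob sotib oldi.", fsm))  # U kitob sotib olmoq.
```

Preferring the longest accepted span lets an inflected entry such as "hech kimga" win over its shorter prefix "hech kim", and the boundary check keeps the machine from firing inside a longer word that is not in the lexicon.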