Article

Developing a Dictionary-Based Algorithm for Finding Lemmas of Uzbek Words

Ogabek Sobirov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan
Raykhona Kurbonova, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan
Elmurod Kuriyozov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan

2025

ABI

Abstract

Lemmatization reduces each word to its base form, the lemma: the conventionally agreed form under which the word is listed in a dictionary. Lemmatization plays a crucial role in natural language processing (NLP) and supports many of its tasks, such as text normalization and information retrieval. In this work, we developed a dictionary-based lemmatization algorithm. The dictionary covers the Uzbek parts of speech (POS) and contains 68,921 lemmas. To keep the dictionary small, similar words are merged into a single entry separated by slashes, which gives 67,427 entries. The algorithm was originally developed for verbs only, and those results were published in another work. This paper presents the complete algorithm, which covers all words, together with its results. A Python program based on the algorithm was developed and tested on 23,000 different words. Experts analyzed the output, and the algorithm found the correct lemma with 98% accuracy. The dictionary, the results, and the program are available at github.com/ddasturbek/UzbekLemma.
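The abstract does not spell out the lookup procedure, so the following is only a minimal sketch of how a dictionary-based lemmatizer of this kind might work. It assumes that slash-separated entries can be expanded into individual lemmas and that a lemma can be found as the longest dictionary entry that is a prefix of the inflected word (a common heuristic for suffixing languages such as Uzbek). The function names `expand_entries` and `find_lemma` and the toy entries are hypothetical and are not taken from the UzbekLemma repository.

```python
def expand_entries(entries):
    """Split slash-merged entries such as 'ota/ona' into individual lemmas.

    Assumption: the slash simply separates alternative lemmas in one entry.
    """
    lemmas = set()
    for entry in entries:
        for variant in entry.split("/"):
            variant = variant.strip().lower()
            if variant:
                lemmas.add(variant)
    return lemmas


def find_lemma(word, lemmas):
    """Return the longest prefix of `word` that is in `lemmas`, or None.

    Assumption: longest-prefix matching stands in for the paper's algorithm,
    which may apply additional rules not described in the abstract.
    """
    word = word.lower()
    # Try the whole word first, then progressively shorter prefixes.
    for end in range(len(word), 0, -1):
        candidate = word[:end]
        if candidate in lemmas:
            return candidate
    return None


# Toy dictionary for illustration only (not the real 67,427-entry file).
entries = ["kitob", "maktab", "bola", "ota/ona"]
lemmas = expand_entries(entries)

for word in ["kitoblarimizdan", "maktabda", "onalar"]:
    print(word, "->", find_lemma(word, lemmas))
```

Storing the expanded lemmas in a set keeps each lookup constant-time on average, so the cost per word is bounded by its length. A pure prefix heuristic would fail wherever suffixation changes the stem, which is one reason the actual algorithm is likely to be more involved than this sketch.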

Topics

Natural Language Processing Techniques
Algorithms and Data Compression
Text and Document Classification Technologies

Identifiers

DOI: 10.1109/apeie66761.2025.11289405

Citations and references

0 citations, 8 references
