Developing a Dictionary-Based Algorithm for Finding Lemmas of Uzbek Words
Abstract
Lemmatization reduces each word to its base form, the lemma, which is the conventionally agreed form of the word found in a dictionary. It plays a crucial role in natural language processing (NLP) and supports many of its tasks, such as text normalization and information retrieval. In this work, we developed a dictionary-based lemmatization algorithm. The dictionary covers the Uzbek parts of speech (POS) and contains 68,921 lemmas. To keep the dictionary small, similar words are written as a single entry separated by slashes, which gives 67,427 entries. The algorithm was originally developed for verbs only, and those results were published in a separate work. This paper presents the complete algorithm for all words, together with its results. Based on the algorithm, a Python program was developed and tested on 23,000 different words. Experts analyzed the output, and the algorithm found the correct lemma with 98% accuracy. The dictionary, results, and program are available at github.com/ddasturbek/UzbekLemma.
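To make the idea of a slash-compressed dictionary concrete, the following minimal Python sketch shows one way such a lookup could work. It is not the published algorithm: the function names `expand_entries` and `find_lemma`, the longest-prefix matching strategy, and the sample entries are all assumptions made for illustration, and the real dictionary format may differ.

```python
def expand_entries(lines):
    """Expand slash-joined entries into a set of individual lemmas.

    Assumes each entry lists full word variants separated by "/".
    """
    lemmas = set()
    for line in lines:
        for variant in line.split("/"):
            variant = variant.strip().lower()
            if variant:
                lemmas.add(variant)
    return lemmas


def find_lemma(word, lemmas):
    """Return the longest dictionary lemma that is a prefix of `word`.

    Returns None when no prefix of the word is in the dictionary.
    """
    word = word.lower()
    # Try the whole word first, then drop one trailing character at a time.
    for end in range(len(word), 0, -1):
        candidate = word[:end]
        if candidate in lemmas:
            return candidate
    return None


# Illustrative entries only; they are not taken from the published dictionary.
sample_entries = ["kitob", "maktab", "bola", "ota/ona"]
lemmas = expand_entries(sample_entries)

for word in ["kitoblarimiz", "maktabda", "onalar"]:
    print(word, "->", find_lemma(word, lemmas))
```

Longest-prefix matching suits a mostly suffixing language, but on its own it cannot handle stems that change when suffixes are attached, so a practical implementation would likely need additional rules.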