Advancing Oceanology Studies in Karakalpak: A Named Entity Recognition Algorithmic Framework
Abstract
This paper presents an algorithm for recognizing named entities in Karakalpak-language texts in the field of oceanology. The algorithm follows a dictionary-based approach and relies on a database of dictionaries of marked words. The database contains a total of 10,500 words, of which 1,000 are named entities related to oceanology. The paper also describes a method, built into the algorithm, for the morphological analysis of words that the dictionary lookup does not detect, using affixes. The algorithm was tested on three text corpora comprising 300 sentences in total. In these tests, accuracy and recall ranged from 91% to 100%. In addition, the paper reviews related work and examines alternative or similar solutions that could fully or partially solve the problem. To give the reader the necessary context for the material and the problem under consideration, it also includes background information on the Karakalpak language.
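The general idea of dictionary lookup with an affix-based fallback can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon entries, the label names (`OCEAN_TERM`, `O`, `UNK`), the suffix list, and the function names `lookup` and `tag` are all hypothetical placeholders chosen for illustration, and the real database and affix inventory are assumed to be far larger.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon (word -> label); NOT taken from the paper's database.
LEXICON: Dict[str, str] = {
    "okean": "OCEAN_TERM",
    "teńiz": "OCEAN_TERM",
    "suw": "O",
}

# Hypothetical suffix list used only to illustrate affix stripping.
SUFFIXES: List[str] = ["lar", "ler", "da", "de"]


def lookup(word: str, lexicon: Dict[str, str], suffixes: List[str],
           max_strips: int = 2) -> Optional[str]:
    """Return the label of `word`, stripping up to `max_strips` suffixes."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    if max_strips == 0:
        return None
    # Try longer suffixes first so a short suffix does not shadow a long one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            label = lookup(word[:-len(suffix)], lexicon, suffixes, max_strips - 1)
            if label is not None:
                return label
    return None


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token; unknown tokens get 'UNK'."""
    result = []
    for token in sentence.split():
        cleaned = token.strip(".,;:!?")
        label = lookup(cleaned, LEXICON, SUFFIXES)
        result.append((token, label if label is not None else "UNK"))
    return result


if __name__ == "__main__":
    print(tag("Okeanlarda suw"))
```

In this sketch, a token that is missing from the lexicon is reduced by removing one suffix at a time, and the remaining stem is looked up again. The `max_strips` limit is an assumed safeguard against over-stripping short stems.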