Automated Recognition of Named Entities and Dialect Standardization in Uzbek Legal Texts
Abstract
This study presents a tool for identifying named entities in Uzbek legal texts. In addition to detecting named entities, the authors developed an algorithm that standardizes word forms by replacing detected dialect words (Karluk, Kypchak, and Oghuz) with their formal equivalents. This helps correct grammatical mistakes that are common among native speakers from different regions of Uzbekistan. The proposed hybrid approach combines two components: a traditional dictionary-based method, used in preprocessing to standardize word forms with a dictionary of more than 10,000 marked words, and a custom language model for named entity detection, trained on 2,000 legal sentences. In testing, the language model detected named entities with an accuracy of 90% and a recall of 94%. The algorithm for standardizing dialect word forms performed even better, with rates ranging from 90% to 100% depending on the dialect.
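The dictionary-based preprocessing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `DIALECT_LEXICON` table, its two entries, and the `standardize` function are hypothetical names and placeholder data chosen for illustration, and the real dictionary is said to hold more than 10,000 marked words. The sketch assumes a simple lookup of whole tokens, with no morphological analysis.

```python
import re

# Hypothetical mini-lexicon: dialect form -> (standard form, dialect label).
# These entries are illustrative assumptions, not taken from the paper's dictionary.
DIALECT_LEXICON = {
    "jo'q": ("yo'q", "Kypchak"),
    "galdi": ("keldi", "Oghuz"),
}

# Split text into word-like runs (ASCII letters and apostrophes),
# other non-space characters, and whitespace, so that joining the
# tokens reproduces the input exactly.
TOKEN_RE = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]+|\s+")


def standardize(text):
    """Replace known dialect tokens with their standard forms.

    Returns the rewritten text and a list of
    (original token, standard form, dialect label) tuples.
    """
    out = []
    replaced = []
    for token in TOKEN_RE.findall(text):
        entry = DIALECT_LEXICON.get(token.lower())
        if entry is None:
            out.append(token)  # unknown token: keep unchanged
            continue
        standard, dialect = entry
        if token[0].isupper():
            # Preserve sentence-initial capitalization.
            standard = standard.capitalize()
        out.append(standard)
        replaced.append((token, standard, dialect))
    return "".join(out), replaced


if __name__ == "__main__":
    fixed, log = standardize("U galdi, pul jo'q.")
    print(fixed)
    for original, standard, dialect in log:
        print(f"{original} -> {standard} ({dialect})")
```

Under these assumptions, the standardized text would then be passed to the named entity model. Keeping a log of replacements with their dialect labels is one way the per-dialect rates mentioned above could be computed.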