Identification of Named Entities from Uzbek Historical Texts: A Multilingual BERT Approach
Abstract
This paper presents an algorithm for recognizing named entities in Uzbek historical texts dating from 1928–1940. To accomplish the task, we trained the Multilingual BERT deep learning model on a custom dataset of 5,500 sentences, each annotated using the BIOES scheme. We chose this scheme because it is one of the most widely used annotation schemes for named entity recognition. Organizations, persons, and locations were selected as the categories of named entities. The model was trained with early stopping, which allowed us to select the weights with the best metric value, obtained at the 11th training epoch. For an objective assessment, the model was tested on historical texts covering various topics and on modern Uzbek texts. The tests confirmed the model's high performance on historical data and revealed a significant decrease in accuracy on modern texts.
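To make the BIOES scheme concrete, the following minimal sketch converts entity spans into per-token tags: B (begin), I (inside), E (end), S (single-token entity), and O (outside). The function name `spans_to_bioes`, the label names `PER`, `LOC`, and `ORG`, and the example sentence are illustrative assumptions. They are not taken from the paper's dataset or code.

```python
from typing import List, Tuple

def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to BIOES tags.

    Each span is (start, end_exclusive, label) over token indices.
    Label names such as PER/LOC/ORG are assumed here for illustration.
    """
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"S-{label}"
        else:
            # Multi-token entity: begin, zero or more inside, end
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags

# Hypothetical example sentence (not from the paper's dataset)
tokens = ["Abdulla", "Qodiriy", "Toshkent", "shahrida", "yashagan"]
spans = [(0, 2, "PER"), (2, 3, "LOC")]
tags = spans_to_bioes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
# tags == ["B-PER", "E-PER", "S-LOC", "O", "O"]
```

Compared with the simpler BIO scheme, the explicit E and S tags mark entity boundaries on both sides, at the cost of a larger tag set for the classifier to predict.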