Article

A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language

Davlatyor Mengliev, Novosibirsk State University, 2, Pirogova str., Novosibirsk city 630090, Russia
Vladimir Barakhnin, Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk 630090, Russia
Mukhriddin Eshkulov, Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city 130100, Uzbekistan
Bahodir Ibragimov, Urgench State University, 14, Kh.Alimdjan str., Urgench city 220100, Uzbekistan
Shohrux Madirimov, Tashkent institute of textile and light industry, 5, Shoxdjaxon str., Tashkent city 100100, Uzbekistan

Data in Brief, journal, 2024, en


Abstract

This study presents a dataset for named entity recognition (NER) in the Uzbek language. The dataset consists of 2000 sentences and 25,865 words, drawn from legal documents and hand-crafted sentences and annotated using the BIOES scheme. To demonstrate its use, the authors trained a model with a CNN + LSTM architecture, which achieves an F1 score of 90.8 %, precision of 93.9 %, and recall of 88.0 % on the test set. The dataset and trained model contribute to the development of natural language processing for Uzbek. The authors also review existing work and provide a comparative analysis that highlights the distinctive features and novelty of the proposed work. In conclusion, they outline directions for future work: further scaling of the dataset and the use of other neural network architectures.
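As a minimal illustration of the BIOES scheme and of the metrics named in the abstract, the Python sketch below converts entity spans into BIOES tags (B = begin, I = inside, O = outside, E = end, S = single-token entity) and computes F1 as the harmonic mean of precision and recall. The function names, the example sentence, and the `PER`/`LOC` labels are assumptions made for illustration only; the abstract does not specify the dataset's label set or the authors' evaluation code.

```python
from typing import List, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans (start, end inclusive, label) into BIOES tags."""
    tags = ["O"] * n_tokens  # tokens outside any entity stay "O"
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"  # tokens strictly inside
            tags[end] = f"E-{label}"  # last token of the entity
    return tags


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence; the labels PER and LOC are assumed.
tokens = ["Alisher", "Navoiy", "Hirotda", "tug'ilgan"]
spans = [(0, 1, "PER"), (2, 2, "LOC")]
tags = spans_to_bioes(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

# Precision and recall as reported in the abstract.
f1 = f1_score(0.939, 0.880)
print(f"F1 from reported precision and recall: {f1:.3f}")
```

The value computed from the rounded precision and recall is close to the reported 90.8 %; a small gap would be consistent with rounding of the inputs or with a different averaging scheme, which the abstract does not state.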

Topics

Topic Modeling; Natural Language Processing Techniques; Text and Document Classification Technologies

Identifiers

DOI: 10.1016/j.dib.2024.111249

Citations and references

Cited by: 15
References: 18
