Article

Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

Davlatyor Mengliev, Novosibirsk State University, 2, Pirogova str., Novosibirsk city, 630090, Russia
Vladimir Barakhnin, Federal Research Center for Information and Computational Technologies, 6, Academician M.A. Lavrentiev avenue, Novosibirsk, 630090, Russia
Nilufar Abdurakhmonova, National University of Uzbekistan named after Mirzo-Ulugbek, 4, Universitet str., Olmazor distr., Tashkent city, 100174, Uzbekistan
Mukhriddin Eshkulov, Jizzakh polytechnic institute, 4, Islom Karimov str., Jizzakh city, 130100, Uzbekistan
Data in Brief (journal), 2024, English

Abstract

This paper presents a dataset and approaches to named entity recognition (NER) for the Uzbek language, a low-resource language. Despite the growth of NLP applications, Uzbek remains underrepresented, which underscores the importance of this work. The dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors have also developed two algorithms that draw on this annotated vocabulary to identify named entities in Uzbek texts. The paper describes the methodology for creating the dataset, the design of the algorithms, and their application to Uzbek. This study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for vocabulary-based or machine-learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining tools, and other analytical tools for the Uzbek language, supporting the development of these areas in the region.
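To make the idea of vocabulary-based NER concrete, the following is a minimal sketch of a greedy longest-match dictionary lookup. It is not the authors' algorithm: the `GAZETTEER` entries, the `PER`/`LOC` labels, and the `tag_entities` function are hypothetical names chosen for illustration, and a real system would load its vocabulary from the annotated dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; entries and labels are illustrative only
# and are not taken from the published dataset.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("toshkent",): "LOC",
    ("samarqand",): "LOC",
    ("alisher", "navoiy"): "PER",
}
# Longest entry length, so the scan never tries windows that cannot match.
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (entity text, label) pairs using greedy longest-match lookup."""
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate window first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n  # skip past the matched entity
                matched = True
                break
        if not matched:
            i += 1
    return found


print(tag_entities(["Alisher", "Navoiy", "Samarqand", "va", "Toshkent"]))
```

Because Uzbek is agglutinative, an exact-match lookup like this misses inflected forms such as `Toshkentda`; a practical vocabulary-based tagger would need stemming or suffix handling before the lookup.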
