Article

Leveraging mBERT with Rule-Driven Pre-Processing for Domain-Specific Karakalpak Text Classification

Nilufar Abdurakhmonova, National University of Uzbekistan Named After Mirzo Ulugbek, Tashkent, Uzbekistan
Rano Sayfullaeva, National University of Uzbekistan Named After Mirzo Ulugbek, Tashkent, Uzbekistan
Raima Shirinova, National University of Uzbekistan Named After Mirzo Ulugbek, Tashkent, Uzbekistan

2025

ABI

Abstract

The article presents a hybrid approach to classifying Karakalpak-language texts into four subject areas: housing and communal services, the electric power industry, law, and economics. The method combines rule-based preprocessing with fine-tuning of the multilingual mBERT model on a dataset of 20,000 sentences. The work shows how the rule-based algorithm and the neural network model operate together to achieve the best performance. Testing of the language model revealed that the subject area of a text plays an important role in the analysis and affects the model's final result: legal and economic texts were classified worst, while texts on housing and communal services and on the electric power industry were classified best. The authors also survey existing work on the same or similar problems for Karakalpak texts. In addition, the article includes a section on the Karakalpak language, which helps readers understand the nature of the language and the limitations of the algorithms.

Topics

Natural Language Processing Techniques; Text and Document Classification Technologies; Linguistics and Cultural Studies

Identifiers

DOI: 10.1109/ubmk67458.2025.11207028

Citations and references

0 citations · 16 references
