Leveraging mBERT with Rule-Driven Pre-Processing for Domain-Specific Karakalpak Text Classification
Abstract
This article presents a hybrid approach to classifying Karakalpak-language texts into four subject areas: housing and communal services, the electric power industry, law, and economics. The method combines rule-based preprocessing with fine-tuning of the multilingual mBERT model on a dataset of 20,000 sentences. The work shows how the rule-based algorithm and the neural network model operate together to achieve the best performance. Testing of the language model revealed that the subject area of a text plays an important role in the analysis and affects the model's final result: legal and economic texts were classified worst, while texts on housing and communal services and on the electric power industry were classified best. In addition, the authors review existing work that addresses the same or similar problems for Karakalpak texts. The article also includes a section on the Karakalpak language, which helps readers better understand the nature of the language and the limitations of the algorithms.