Hybrid Approach to Genre Classification of Karakalpak Texts in Telecommunications and Energy Domains
Abstract
This paper proposes a hybrid approach to classifying Karakalpak-language texts that combines rule-based preprocessing with a neural network classifier. Telecommunications and energy were selected as the target domains. At the preprocessing stage, texts are converted from Cyrillic to Latin script so that the entire corpus uses a single script, which simplifies the subsequent processing steps. The texts are then classified with a neural network model built on a blank spaCy pipeline and trained on a dataset of 10,000 sentences. In testing, the model achieved an average F1 score of about 87%; after postprocessing was applied, the score rose to 95.1%, which supports the effectiveness of the proposed approach. The paper also reviews existing related research to show the relevance of the work, and describes the nature and morphology of the Karakalpak language to help readers understand the difficulties of processing it.
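The script-unification step mentioned in the abstract can be illustrated with a minimal sketch. The mapping table below is an assumption: it is a partial, illustrative table modeled on the current Karakalpak Latin alphabet, not the table used in this paper, and it should be checked against the official alphabet before any real use. The names `CYR_TO_LAT` and `to_latin` are hypothetical.

```python
# Illustrative, partial Cyrillic-to-Latin table (an assumption, not the paper's table).
# Letters without an entry here (for example "э", "ю", "я") are left unchanged.
CYR_TO_LAT = {
    "а": "a", "ә": "á", "б": "b", "в": "v", "г": "g", "ғ": "ǵ",
    "д": "d", "е": "e", "ж": "j", "з": "z", "и": "i", "й": "y",
    "к": "k", "қ": "q", "л": "l", "м": "m", "н": "n", "ң": "ń",
    "о": "o", "ө": "ó", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ү": "ú", "ў": "w", "ф": "f", "х": "x", "ҳ": "h",
    "ц": "c", "ч": "ch", "ш": "sh", "ы": "ı",
}


def to_latin(text: str) -> str:
    """Convert mapped Cyrillic letters to Latin, one character at a time."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            # Not in the table: Latin text, digits, punctuation, unmapped letters.
            out.append(ch)
        elif ch.isupper():
            # Keep the capitalization of the source letter ("Ш" becomes "Sh").
            out.append(lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("Қарақалпақ тили"))
```

A character-by-character table like this ignores context-dependent spelling rules, so it is only a starting point for the rule-based preprocessing described above.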