Detecting Allusions in the Karakalpak Language Using mBERT
Abstract
This paper investigates allusion detection in Karakalpak texts using a neural language model, multilingual BERT (mBERT). Although the problem has been studied fairly thoroughly for widely used languages such as English, Russian, and Chinese, there are almost no studies for low-resource languages. The proposed solution covers both the preparation of a language model and the construction of a corpus of literary texts containing more than 5,000 sentences. To prevent overfitting, early stopping was used, which allowed us to identify the model's best performance metrics. Empirical results on two different test sets show that the model performs reliably on literary texts but suffers a noticeable drop in performance on texts covering a variety of topics. A comparative analysis of existing solutions is also carried out to underline the relevance of this work. In future work, we plan to expand the dataset with a wider range of literary topics as well as informal-genre texts, so that the developed solution becomes better suited to applied use.
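As an illustration of the early stopping idea mentioned above, the following is a minimal sketch rather than the authors' implementation. The abstract does not state which metric was monitored, the patience value, or the improvement threshold, so the `EarlyStopping` class, the `run_with_early_stopping` helper, the `patience` and `min_delta` parameters, and the per-epoch validation scores are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


class EarlyStopping:
    """Track a validation score and signal when it has stopped improving.

    Assumption: higher scores are better (e.g. a validation F1 score).
    """

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best_score: Optional[float] = None
        self.best_epoch: Optional[int] = None
        self.bad_epochs = 0

    def step(self, epoch: int, score: float) -> bool:
        """Record one epoch's score; return True if training should stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            # New best: remember it and reset the patience counter.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_with_early_stopping(
    scores: Iterable[float], patience: int = 3
) -> Tuple[Optional[int], Optional[float], int]:
    """Replay per-epoch validation scores through the stopper.

    Returns (best_epoch, best_score, last_epoch_run). In real training,
    each score would come from evaluating the model after an epoch.
    """
    stopper = EarlyStopping(patience=patience)
    last_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        last_epoch = epoch
        if stopper.step(epoch, score):
            break
    return stopper.best_epoch, stopper.best_score, last_epoch


# Hypothetical validation scores, for illustration only.
best_epoch, best_score, last_epoch = run_with_early_stopping(
    [0.60, 0.70, 0.75, 0.74, 0.73, 0.72, 0.80], patience=3
)
```

With these illustrative scores, training would halt after the sixth epoch and the checkpoint from the third epoch would be kept. The later, higher score is never reached, which shows the trade-off that the choice of patience controls.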