Article

Hybrid Machine Learning Model for Dialect Identification in Multilingual Educational Settings

Jahongir Abdupattohov, Tashkent State University of Oriental Studies, Higher School of Arabic Studies
Daniyor Rajabov, Khorezm Mamun Academy, Khiva, Uzbekistan, 220900
Elyor Toshtemirov, Andijan State Institute of Foreign Languages, Andijan, Uzbekistan
Mohira Tursunova, Namangan State Institute of Foreign Languages, Namangan, Uzbekistan, 160123
Shahlo Khalimova, Alfraganus University, Faculty of Philology, Department of Oriental Philology
Yulduz Karimova, Jizzakh State Pedagogical University, Uzbekistan

2025, en

ABI

Abstract

Accurate dialect identification is essential for multilingual educational institutions because it enables adaptive language support and personalized information. This research develops a Hybrid Machine Learning Model for Dialect Identification (HyML-DI) that accurately classifies closely related dialects and linguistic variants in multilingual educational environments. The approach extracts a range of linguistic features from the WiLI-2018 dataset, which contains written texts in 235 languages and dialects, all sourced from Wikipedia. The model combines statistical methods with deep learning: TF-IDF and n-gram classifiers are used alongside Transformer-based models such as BERT and XLM-RoBERTa to capture contextual language information. In a multi-level classification pipeline, once macro-language families have been separated out, dialects are classified by a bidirectional LSTM ensemble enhanced with attention. This process is repeated until the desired classification is reached. In experiments, the hybrid model achieved a macro F1-score of 92.1% when distinguishing related languages and dialects, outperforming baseline systems. The findings should encourage multilingual educational systems to become more inclusive and language-sensitive. They also indicate how interpretable machine learning and deep contextual awareness may help identify dialects.
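As an illustration of the statistical branch only, the following minimal sketch builds character n-gram TF-IDF vectors and assigns a text to the nearest class centroid by cosine similarity. It is not the authors' implementation: the centroid classifier, the class name `NgramTfidfClassifier`, the n-gram range, and the four toy sentences are all assumptions made for this example, and the Transformer and bidirectional LSTM stages are omitted entirely.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=3):
    """Return character n-grams of length n_min..n_max (range is an assumption)."""
    text = f" {text.lower()} "  # pad so word boundaries become features
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramTfidfClassifier:
    """Hypothetical nearest-centroid classifier over char n-gram TF-IDF vectors."""

    def fit(self, texts, labels):
        docs = [Counter(char_ngrams(t)) for t in texts]
        # Document frequency: iterating a Counter yields each distinct n-gram once.
        df = Counter(g for d in docs for g in d)
        n_docs = len(docs)
        # Smoothed inverse document frequency.
        self.idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0
                    for g, c in df.items()}
        sums = defaultdict(Counter)
        for d, y in zip(docs, labels):
            for g, w in self._vectorize(d).items():
                sums[y][g] += w
        # One L2-normalized centroid per label.
        self.centroids = {y: self._normalize(v) for y, v in sums.items()}
        return self

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def _vectorize(self, counts):
        # N-grams never seen during training are ignored.
        return self._normalize({g: c * self.idf[g]
                                for g, c in counts.items() if g in self.idf})

    def predict(self, text):
        vec = self._vectorize(Counter(char_ngrams(text)))
        # Cosine similarity reduces to a dot product of normalized vectors.
        scores = {y: sum(w * c.get(g, 0.0) for g, w in vec.items())
                  for y, c in self.centroids.items()}
        return max(scores, key=scores.get)


# Toy training data (illustrative only, not taken from WiLI-2018).
train = [
    ("the cat is on the table", "eng"),
    ("where is the school library", "eng"),
    ("el gato está sobre la mesa", "spa"),
    ("dónde está la biblioteca de la escuela", "spa"),
]
clf = NgramTfidfClassifier().fit([t for t, _ in train], [y for _, y in train])
print(clf.predict("the teacher is in the library"))
print(clf.predict("la maestra está en la escuela"))
```

Character n-grams are a common choice for language identification because they capture spelling patterns without requiring tokenization; closely related dialects would need far more data and the contextual models described in the abstract.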


Topics

Natural Language Processing Techniques

Identifiers

DOI: 10.1109/icngcs64900.2025.11183438

Citations and references

Citations: 0
References used: 4
