Article

Enhancing Bangla Spam SMS Detection with BERT: A Deep Learning Perspective

Jobaida Abedin, International Islamic University Chittagong, Dept. of CSE, Chittagong, Bangladesh, 4318
Yasha Fatema, International Islamic University Chittagong, Dept. of CSE, Chittagong, Bangladesh, 4318
Farzana Tasnim, International Islamic University Chittagong, Dept. of CSE, Chittagong, Bangladesh, 4318
Fariba Tasnia Khan, International Islamic University Chittagong, Dept. of CSE, Chittagong, Bangladesh, 4318
Tanjim Mahmud, Rangamati Science and Technology University, Dept. of CSE, Rangamati, Bangladesh, 4500
Rakhimjon Rajapboyevich Rakhimov, Urgench State University, Urgench, Uzbekistan
Valisher Sapayev Odilbek Uglu, Mamun University, Uzbekistan
Abubokor Hanip
MA Hossain, University of Chittagong, Dept. of CSE, Chittagong, Bangladesh, 4331

2025

ABI

Abstract

Spam messages pose a persistent problem in digital communication, particularly for low-resource languages like Bangla. The complex grammatical structure of Bangla, frequent code-mixing with English, and informal writing styles make spam detection far more challenging than in high-resource languages such as English. This study presents a comparative deep learning approach for classifying Bangla spam SMS messages. The models evaluated are a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), Gated Recurrent Unit (GRU), Bidirectional Encoder Representations from Transformers (BERT), BanglaBERT, XLM-BERT, and a hybrid Bi-LSTM + CNN model. A balanced dataset of 2,004 Bangla SMS messages (1,030 spam and 971 ham) was prepared for training and evaluation. Among the tested models, BanglaBERT achieved the highest accuracy, 98.5%. This result demonstrates the strength of language-specific transformer models in capturing the contextual and structural nuances of Bangla text. Explainable AI (XAI) methods, including LIME and SHAP, were applied to interpret the model's decision-making process and ensure transparency. The analysis revealed that the words “ফ্রি”, “অফার”, and “জয়” are the most influential indicators for identifying spam messages. The proposed framework achieves an effective balance between performance and interpretability, setting a new benchmark for Bangla spam SMS detection. This work contributes to the growing field of Bangla Natural Language Processing (NLP) and illustrates that combining sequential learning with contextual modeling can effectively address the challenges of spam detection in low-resource languages.

Topics

Spam and Phishing Detection; Misinformation and Its Impacts; Mental Health via Writing

Identifiers

DOI: 10.1109/wiecon-ece69386.2025.11525933

Citations and references

Citations: 0
References used: 6
