Article

An Automatic Approach for the Identification of Offensive Language in Perso-Arabic Urdu Language: Dataset Creation and Evaluation

Salah Ud Din, Department of Computer Science, University of Peshawar, Peshawar, Pakistan
Shah Khusro, Department of Computer Science, University of Peshawar, Peshawar, Pakistan
Farman Ali Khan, Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
Munir Ahmad, Department of Computer Sciences, National College of Business Administration and Economics, Lahore, Pakistan
Oualid Ali, College of Arts and Science, Applied Science University, Manama, Kingdom of Bahrain
Taher M. Ghazal, Department of Networks and Cybersecurity, Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, Amman, Jordan

2025, English

ABI

Abstract

Offensive language is unacceptable, impolite language directed at individuals, specific community groups, or society as a whole. With the advent of social media platforms, the use of offensive language has been widely reported, creating a toxic online environment that poses real-life dangers to society. A prompt response is therefore needed to combat offensive content and foster a culture of respect and acceptance. Identifying offensive language remains a challenging task, however, particularly in low-resource languages such as Urdu. Urdu text poses challenges because of its unique features, complex script, and rich morphology, so methods that work in other languages are difficult to apply directly. The task also requires exploring new linguistic features and computational techniques on a relatively large dataset to ensure that the results generalize. Unfortunately, Urdu has received very limited attention from the research community because of the scarcity of language resources and the lack of high-quality datasets and models. This study addresses these challenges in three steps. First, a dataset of 12,020 Urdu tweets was collected and annotated using the OLID taxonomy as a benchmark. Second, character-level and word-level features were extracted based on bag-of-words, n-gram, and TF-IDF representations. Finally, an extensive series of experiments was conducted on the extracted features using seven machine learning classifiers to identify the most effective features and classifiers. The experimental findings indicate that word unigrams, character trigrams, and word TF-IDF are the most effective features. Among the classifiers, logistic regression and support vector machine attained the highest accuracy of 86% and F1-score of 75%.
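The feature families named in the abstract (word n-grams, character n-grams, and TF-IDF weighting) can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline: the abstract does not state the tokenizer or the exact TF-IDF variant, so whitespace tokenization and an unsmoothed `tf * log(N / df)` weighting are assumptions, and the function names `word_ngrams`, `char_ngrams`, and `tfidf` as well as the three sample sentences are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Word-level n-grams; whitespace tokenization is an assumption."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Character-level n-grams over Unicode code points (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    # Made-up, neutral example sentences; not taken from the paper's dataset.
    tweets = ["یہ اچھا دن ہے", "یہ برا دن ہے", "آج موسم اچھا ہے"]

    bag_of_words = [Counter(word_ngrams(t, 1)) for t in tweets]  # word unigram counts
    trigrams = [Counter(char_ngrams(t, 3)) for t in tweets]      # character trigram counts
    weights = tfidf([word_ngrams(t, 1) for t in tweets])         # word-level TF-IDF

    print(bag_of_words[0])
    print(trigrams[0].most_common(3))
    print(weights[1])
```

Under this weighting, a word that occurs in every document gets weight zero, while a word unique to one document gets the largest weight. The resulting sparse vectors are the kind of input a linear classifier such as logistic regression or a support vector machine would be trained on.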

Identifiers

DOI: 10.1109/access.2025.3534662

Citations and references

10 citations, 0 references used
