Integrating Three Machine Learning Algorithms in Ensemble Learning Model for Improving Content-based Spam Email Recognition
Abstract
Email spam refers to junk files, images, or data sent through email that might contain links leading to phishing websites. Such email is often sent repeatedly to random users and can be dangerous. The objective of this study is to predict and recognize whether emails sent to users are spam by using machine learning classification algorithms. The Email Spam Classification (ESC) datasets, which contain 5172 rows and 3002 columns of spam and non-spam features, are used for the spam detection tests. The study follows the CRISP-DM methodology to guide the performance evaluation of three machine learning algorithms: Naive Bayes (NB), Logistic Regression (LR), and Random Forest (RF). Subsequently, an ensemble model that integrates the three algorithms is proposed to improve the performance of spam email recognition. The selected evaluation metrics are accuracy, precision, recall, and F1-score. Based on the results, RF has the highest accuracy of the three individual algorithms in classifying spam emails at 97.3%, with an F1-score of 96.8%, precision of 96.2%, and recall of 96.0%. NB achieves the second-best results, which differ only slightly from those of RF, while LR achieves considerably lower results than the other two algorithms. The ensemble model that integrates the three algorithms performs best in classifying spam emails, with 98.9% accuracy, 97.6% precision, 97.4% recall, and 96.7% F1-score.
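The abstract does not state how the three classifiers are combined, so the following is only a minimal sketch that assumes hard (majority) voting over the NB, LR, and RF predictions, together with the four evaluation metrics named above. The function names `majority_vote` and `binary_metrics` and the toy labels are hypothetical illustrations, not code or data from the study.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by hard (majority) voting.

    Assumption: the study's ensemble may use a different combination rule;
    majority voting is used here only for illustration.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels (1 = spam, 0 = not spam); NOT results from the study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
nb_pred = [1, 0, 1, 0, 0, 1, 1, 0]
lr_pred = [1, 1, 0, 1, 0, 0, 1, 0]
rf_pred = [1, 0, 1, 1, 0, 0, 0, 0]

ensemble_pred = majority_vote(nb_pred, lr_pred, rf_pred)

for name, pred in [("NB", nb_pred), ("LR", lr_pred),
                   ("RF", rf_pred), ("Ensemble", ensemble_pred)]:
    print(name, binary_metrics(y_true, pred))
```

In this toy case each model misclassifies different emails, so the vote corrects every individual error. This shows one way an ensemble can outperform its members, although real gains depend on how correlated the models' errors are.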