Student Dropout Risk Classification Using CatBoost Algorithm in Higher Education Retention Systems
Abstract
A substantial share of college students leave without completing their degrees, which harms both the students and their institutions. Identifying at-risk students early is essential so that they can receive support and resources in time. This study proposes a dropout risk classification model based on CatBoost, a gradient boosting algorithm that handles categorical features natively and performs well on imbalanced datasets. The dataset covers the demographics, engagement levels, and academic performance of college students. After feature selection and data cleaning, the CatBoost model was trained and evaluated against three standard classifiers: Random Forest, SVM, and Logistic Regression. CatBoost performed best of the four methods, with an F1-score of 93.6%, compared to 91.2% for the baseline classifiers. Using SHAP (SHapley Additive exPlanations) values for feature interpretation, we found that attendance and academic performance were among the strongest predictors of dropout. The proposed model supports data-driven decision-making by institutions seeking to improve student retention. The study demonstrates the value of gradient boosting methods for building systems that help students remain enrolled.
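Because the results are reported as F1-scores on an imbalanced dataset, it may help to see why F1 is more informative than plain accuracy in this setting. The following is a minimal sketch in plain Python. The cohort size and confusion-matrix counts are hypothetical illustration values, not figures from this study, and the function names are our own.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all students classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (dropout) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cohort: 1000 students, 100 of whom drop out (10% positive class).

# A trivial classifier that predicts "no dropout" for everyone:
# it misses all 100 dropouts but is right about the 900 who stay.
trivial_acc = accuracy(tp=0, fp=0, fn=100, tn=900)
_, _, trivial_f1 = precision_recall_f1(tp=0, fp=0, fn=100)

# A hypothetical model that flags 100 students, 80 of them correctly.
model_acc = accuracy(tp=80, fp=20, fn=20, tn=880)
model_p, model_r, model_f1 = precision_recall_f1(tp=80, fp=20, fn=20)

print(f"trivial: accuracy={trivial_acc:.2f}, F1={trivial_f1:.2f}")
print(f"model:   accuracy={model_acc:.2f}, F1={model_f1:.2f}")
```

In this toy case the trivial classifier reaches 90% accuracy while identifying no at-risk students at all, so its F1-score is zero. F1 rewards a model only when it actually finds members of the minority class.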