Machine Learning-Based Predictive Modelling for Early Diagnosis of Type 2 Diabetes Mellitus: A Comparative Analysis of Supervised Classification Algorithms
Abstract
Background: Type 2 diabetes mellitus (T2DM) is one of the fastest-growing metabolic disorders worldwide, with projections indicating that the affected population will surpass 783 million individuals by 2045. Timely and accurate prediction of T2DM at the pre-diabetic or early symptomatic stage is essential for reducing morbidity, limiting healthcare expenditure, and enabling targeted preventive interventions. Conventional clinical risk stratification tools often lack sufficient discriminatory power, which motivates the use of robust computational approaches. Objective: This study aimed to develop, train, and validate multiple supervised machine learning (ML) classification models to predict T2DM incidence from a curated dataset of clinical, biochemical, and lifestyle parameters, and to identify the best-performing algorithm for clinical deployment. Methods: A retrospective dataset of 10,892 patient records was assembled from the Pima Indians Diabetes Database supplemented with clinical registry data. Features included fasting plasma glucose, glycated haemoglobin (HbA1c), body mass index (BMI), blood pressure, age, physical activity index, dietary quality score, family history, and socioeconomic indicators. After preprocessing (missing-value imputation, z-score normalization, and one-hot encoding), seven individual ML classifiers were trained: Logistic Regression, K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Decision Tree, Random Forest, eXtreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP), as well as a Stacking Ensemble. Stratified 10-fold cross-validation was applied, and models were evaluated on Accuracy, Precision, Recall, F1-score, Specificity, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC).
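The workflow described in the Methods (imputation, z-score scaling, one-hot encoding, a stacking ensemble, and stratified 10-fold cross-validation) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation. The data are synthetic, the column layout is hypothetical, and the ensemble uses only three scikit-learn base learners as stand-ins for the full set of classifiers named above (XGBoost, SVM, Decision Tree, and MLP are omitted to keep the example small). The choice of Logistic Regression as the meta-learner is also an assumption.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study dataset): 5 numeric features
# plus 1 hypothetical categorical feature with 3 levels.
X_num, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_num[rng.random(X_num.shape) < 0.05] = np.nan  # inject ~5% missing values
X_cat = rng.integers(0, 3, size=(400, 1)).astype(float)
X = np.hstack([X_num, X_cat])

numeric_cols = [0, 1, 2, 3, 4]
categorical_cols = [5]

# Preprocessing: median imputation + z-score scaling for numeric columns,
# one-hot encoding for the categorical column.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Stacking ensemble: base learners feed a logistic-regression meta-learner
# (assumed configuration, chosen for illustration only).
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
model = Pipeline([("preprocess", preprocess), ("clf", stack)])

# Stratified 10-fold cross-validation with the six reported metrics.
# Specificity is the recall of the negative class (label 0).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "specificity": make_scorer(recall_score, pos_label=0),
    "roc_auc": "roc_auc",
}
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
mean_scores = {name: scores[f"test_{name}"].mean() for name in scoring}

for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

Placing the preprocessing steps inside the `Pipeline` means the imputer, scaler, and encoder are refitted on the training portion of each fold, which avoids leaking information from the held-out fold into the model.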
Results: The Stacking Ensemble achieved the highest overall performance, with an accuracy of 91.6%, precision of 90.8%, recall of 89.4%, F1-score of 90.1%, specificity of 92.3%, and AUC-ROC of 0.948. XGBoost ranked second (AUC-ROC = 0.931; accuracy = 89.3%), followed by the MLP (AUC-ROC = 0.922). SHAP-based importance analysis identified glucose concentration and HbA1c as the most predictive features. Conclusion: The proposed ensemble framework demonstrates superior discriminatory capability for early T2DM prediction and offers a scalable, non-invasive adjunct to conventional diagnostic protocols. Integrating such models into electronic health record systems and wearable health-monitoring platforms holds significant promise for population-level diabetes prevention.
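For readers less familiar with the reported metrics, the short sketch below shows how they follow from confusion-matrix counts, and checks that the reported F1-score agrees with the reported precision and recall. The function names and the example counts are hypothetical and chosen only for illustration. The abstract does not report the underlying confusion matrix.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # also called sensitivity
    specificity = tn / (tn + fp)  # recall of the negative class
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = metrics_from_counts(tp=90, fp=10, fn=10, tn=90)

# Consistency check using the precision and recall reported for the Stacking Ensemble.
reported_f1 = f1_from_precision_recall(0.908, 0.894)
print(f"F1 = {reported_f1:.3f}")  # F1 = 0.901
```

The recomputed value rounds to 90.1%, which matches the F1-score reported for the Stacking Ensemble.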