Importance of Data Preprocessing for Accurate and Effective Prediction of Breast Cancer: Evaluation of Model Performance in Novel Data
Abstract
Breast cancer is one of the leading causes of mortality among women globally, and early, accurate diagnosis is essential for effective treatment and improved survival rates. Traditional diagnostic techniques often struggle to differentiate between benign and malignant tumors because their visual characteristics overlap, which results in false positives or delayed detection. Efficient breast cancer detection with machine learning depends on identifying the most significant features, since these features play the most important roles in the treatment process. This study addresses that challenge by evaluating and comparing ten state-of-the-art machine learning classifiers for breast cancer detection using image-derived features. Initially, 30 features were extracted from a novel tertiary hospital dataset, and the models were evaluated on accuracy, precision, recall, and F-measure. To improve model performance and reduce dimensionality, the Correlation-based Feature Selection (CFS) method was then applied, which identified 11 highly informative features. The experimental results show that, while models such as SVM and Logistic Regression achieved the highest accuracy (97.7%) on the full feature set, the Neural Network achieved the best performance (97.2%) on the reduced feature set, with a substantial reduction in training time. Most classifiers maintained comparable or improved accuracy with fewer features, indicating that the dimensionality reduction was effective. Furthermore, pairwise statistical significance testing confirmed that ensemble and kernel-based classifiers performed significantly better than simpler models. These findings highlight the importance of effective feature selection in developing accurate, efficient, and scalable breast cancer prediction systems.
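The CFS step mentioned in the abstract can be illustrated with a minimal sketch. This is not the study's implementation, and several choices here are assumptions: it uses absolute Pearson correlation as the association measure (common CFS implementations often use symmetrical uncertainty on discretized features instead), a plain greedy forward search, and small synthetic data in place of the hospital dataset. The function names `cfs_merit` and `cfs_forward_select` are hypothetical. The merit of a subset of k features is k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation, so a subset scores well when its features correlate with the class but not with each other.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(subset)
    # Mean absolute feature-class correlation (assumption: Pearson).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean absolute pairwise feature-feature correlation.
    # The matrix diagonal holds k ones, which are subtracted out.
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    r_ff = (corr.sum() - k) / (k * (k - 1))
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_select(X, y):
    """Greedy forward search: add the feature that raises the merit most,
    and stop as soon as no remaining feature improves it."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best

# Synthetic stand-in data (assumption): two informative but mutually
# redundant features and one pure-noise feature.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
y = (signal > 0).astype(float)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=500),  # informative
    signal + 0.1 * rng.normal(size=500),  # informative, redundant with column 0
    rng.normal(size=500),                 # noise
])

selected, merit = cfs_forward_select(X, y)
print("selected feature indices:", selected)
print("subset merit:", round(float(merit), 3))
```

In this toy setting the noise column lowers the merit of any subset it joins, so the search leaves it out. Note that the sketch would produce NaN correlations for constant features, which would need to be dropped beforehand.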