The Impact of Normalization on Regression-Based Crop Yield Prediction: Accuracy and Efficiency Analysis
Abstract
Normalization is a fundamental step in machine learning data preprocessing and often has a strong influence on model performance. This research examines the effects of different normalization methods on regression problems, considering both predictive accuracy and computational efficiency. As case studies, it uses two publicly available crop yield prediction datasets from Kaggle. Four widely used regression algorithms, Linear Regression, Multilayer Perceptron (MLP), Random Forest (RF), and Gradient Boosted Regression Trees (GBRT), were trained and tested in two experimental settings: with and without data normalization. Predictive performance was measured using standard regression metrics (RMSE, MAE, and R²). Computational efficiency was assessed through training time, memory usage, and convergence behavior. The experimental results indicate that for the gradient-based algorithms, namely Linear Regression and MLP, normalization was often critical, improving both accuracy and convergence speed on datasets with heterogeneous feature scales. In contrast, its effect on the tree-based models (RF and GBRT) was negligible because of their intrinsic scale invariance. In addition, this study shows that poor preprocessing choices may lead to longer training times or unstable learning dynamics in neural networks. These findings emphasize the importance of aligning the normalization strategy with the nature of the dataset and the chosen algorithm. The results provide practical guidelines for optimizing regression pipelines for agricultural yield prediction and are expected to generalize to other data-driven decision-making problems in science and industry.
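The with/without-normalization comparison for a gradient-based learner can be illustrated with a minimal sketch. It is not the study's experimental code: the data are synthetic, the feature names (rainfall in mm, temperature in °C), the coefficients, the step sizes, and the helper names `standardize` and `gd_mse` are all assumptions chosen for illustration, and z-score standardization stands in for whichever normalization methods the study evaluated. The sketch fits a linear model by batch gradient descent for a fixed iteration budget on raw and on standardized features and compares the resulting training error.

```python
import random
import statistics

# Hypothetical synthetic data (NOT the Kaggle datasets used in the study):
# rainfall in mm and temperature in deg C sit on very different scales.
rng = random.Random(0)
X_raw = [[rng.uniform(300, 1500), rng.uniform(15, 35)] for _ in range(200)]
y = [0.004 * rain + 0.1 * temp + rng.gauss(0, 0.2) for rain, temp in X_raw]


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def gd_mse(X, y, lr, n_iter=500):
    """Fit y ~ w.x + b by batch gradient descent; return the final training MSE."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - t for row, t in zip(X, y)]
        grad_w = [2 / n * sum(e * row[j] for e, row in zip(errs, X)) for j in range(d)]
        grad_b = 2 / n * sum(errs)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - t for row, t in zip(X, y)]
    return sum(e * e for e in errs) / n


# The raw features need a tiny step size to avoid divergence (it is dictated by
# the large rainfall scale), so the other parameters barely move in 500 steps.
mse_raw = gd_mse(X_raw, y, lr=5e-7)
# After standardization a much larger step size is stable.
mse_scaled = gd_mse(standardize(X_raw), y, lr=0.1)

print(f"MSE without normalization: {mse_raw:.4f}")
print(f"MSE with z-score normalization: {mse_scaled:.4f}")
```

Under these assumptions, the standardized run reaches a much lower training error within the same iteration budget, which mirrors the convergence effect described for gradient-based models. Tree-based models are omitted from the sketch because their split decisions depend only on the ordering of feature values, which rescaling does not change.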