Exploring the Impact of Dietary Habits and Physical Activity on Obesity Rates Using Apache Spark
Abstract
Obesity is a major public health issue in the United States, driven by the interplay of lifestyle, socio-economic, and demographic factors. This paper presents a large-scale analysis, conducted with Apache Spark, of the Nutrition, Physical Activity, and Obesity survey from the National Institute of Diabetes and Digestive and Kidney Diseases. The data cover dietary habits, exercise intensity, and self-reported obesity indicators across population groups. Relationships between variables were analyzed using Spark's distributed computing and its MLlib framework, with linear regression and random forest regression models. Results indicate that higher fruit and vegetable intake is not strongly correlated with lower obesity prevalence, whereas socio-economic variables such as income and education level were strongly correlated with physical activity and, in turn, with obesity prevalence. These findings underscore the multifactorial nature of obesity and suggest that effective public health interventions should extend beyond dietary recommendations to address socio-economic disparities. By demonstrating the usefulness of Apache Spark for processing and modeling large health datasets, this study shows how complex trends in epidemiological data can be analyzed to support evidence-based policy-making.