Diabetes Prediction Using Traditional Machine Learning Techniques
Keywords:
Logistic Regression, Stratified Sampling, Pima Indians Diabetes Dataset, Diabetes Prediction.
Abstract
This study examines the capability of traditional machine learning (ML) algorithms to predict
the onset of diabetes using the Pima Indians Diabetes Dataset. Decision tree, Naive
Bayes, k-Nearest Neighbors (kNN), and logistic regression classifiers were evaluated using
accuracy, precision, recall, F1 score, and ROC AUC. The data was preprocessed
to correct implausible values, and stratified sampling was used to preserve the class
proportions when splitting the data. Naive Bayes achieved the best accuracy
(72.7%), while logistic regression obtained the best class separability (ROC AUC of 0.813). The
study shows that interpretable models can provide actionable insights for early identification of diabetes,
supporting Sustainable Development Goal 3 (Good Health and Well-Being), particularly by
promoting preventive healthcare and informed decision-making in resource-constrained
environments.
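
As a rough illustration of two steps named in the abstract, the sketch below shows one way a stratified split and the threshold-based metrics (accuracy, precision, recall, F1) could be computed. It is not the study's code: the function names `stratified_split` and `binary_metrics`, the 80/20 split, the seed, and the toy labels are all assumptions made for illustration, and ROC AUC is omitted because it needs predicted scores rather than hard labels.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same
    share in both parts. Hypothetical helper, not taken from the study."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class for the test part.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (assumed for illustration): 65 negatives and 35 positives.
labels = [0] * 65 + [1] * 35
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)
print("positive share, train:", sum(labels[i] for i in train_idx) / len(train_idx))
print("positive share, test: ", sum(labels[i] for i in test_idx) / len(test_idx))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Note that stratifying keeps the class ratio the same in the training and test parts; it does not rebalance the classes themselves, so an imbalanced dataset stays imbalanced in both parts.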