https://github.com/gliuck/diabetesprediction

Machine learning exam project, focused on predicting diabetes based on health and demographic data. The project uses models like Logistic Regression, KNN, SVM and NN to analyze and predict the likelihood of diabetes in individuals.

# MACHINE LEARNING DIABETES PREDICTION 📊

## Introduction 📝
This project focuses on predicting diabetes using machine learning techniques. The dataset includes various health and demographic attributes, which are used to predict whether an individual is diabetic.

## Task 🎯
The primary task is to build a predictive model that determines if an individual has diabetes based on features such as age, sex, cholesterol level, BMI, smoking habits, and more.

## Dataset 📁
The dataset contains the following columns:
- **Age**: Age of the individual.
- **Sex**: Sex of the individual.
- **HighChol**: Indicator if the individual has high cholesterol.
- **CholCheck**: Indicator if the individual had a cholesterol check in the past five years.
- **BMI**: Body Mass Index.
- **Smoker**: Indicator if the individual is a smoker.
- **HeartDiseaseorAttack**: Indicator if the individual has had heart disease or a heart attack.
- **PhysActivity**: Indicator if the individual engages in physical activity.
- **Fruits**: Indicator if the individual consumes fruits regularly.
- **Veggies**: Indicator if the individual consumes vegetables regularly.
- **HvyAlcoholConsump**: Indicator if the individual is a heavy alcohol consumer.
- **GenHlth**: General health indicator.
- **MentHlth**: Mental health indicator.
- **PhysHlth**: Physical health indicator.
- **DiffWalk**: Indicator if the individual has difficulty walking.
- **Stroke**: Indicator if the individual has had a stroke.
- **HighBP**: Indicator if the individual has high blood pressure.
- **Diabetes**: The target variable indicating if the individual has diabetes.
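
As a minimal sketch of how these columns might be separated into features and target, the snippet below builds a random stand-in frame with the same column names and splits it. The synthetic values, the 80/20 split, the stratification, and the seeds are all assumptions made for illustration; with the real data, the stand-in would be replaced by something like `pd.read_csv("diabetes_data.csv")`, the file name that appears in the Kaggle link under Credits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names as listed in the Dataset section.
FEATURES = [
    "Age", "Sex", "HighChol", "CholCheck", "BMI", "Smoker",
    "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
    "HvyAlcoholConsump", "GenHlth", "MentHlth", "PhysHlth",
    "DiffWalk", "Stroke", "HighBP",
]
TARGET = "Diabetes"


def make_synthetic_frame(n_rows=200, seed=0):
    """Hypothetical stand-in for the real CSV; every value is random."""
    rng = np.random.default_rng(seed)
    data = {name: rng.integers(0, 2, size=n_rows) for name in FEATURES}
    data["BMI"] = rng.normal(28, 5, size=n_rows).round(1)
    data[TARGET] = rng.integers(0, 2, size=n_rows)
    return pd.DataFrame(data)


df = make_synthetic_frame()
X = df[FEATURES]  # 17 feature columns
y = df[TARGET]    # binary target

# Assumed split: 80% train, 20% test, class proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying on the target keeps the share of diabetic and non-diabetic rows similar in both splits, which matters when the classes are imbalanced.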

## Models Used 🧠
In this project, four different models were implemented:
1. **Logistic Regression**
2. **K-Nearest Neighbors (KNN)**
3. **Support Vector Machine (SVM)**
4. **Neural Network (NN)**
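
The four model families above can be set up with `scikit-learn` roughly as follows. This is a sketch under stated assumptions, not the project's actual configuration: the data is synthetic, the hyperparameters are arbitrary defaults, the scaling step is added because KNN, SVM, and neural networks are sensitive to feature scale, and `MLPClassifier` is assumed for the neural network since the README does not name the library used for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with 17 features (stand-in for the real dataset).
X, y = make_classification(
    n_samples=400, n_features=17, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters below are illustrative assumptions, not the project's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0
    ),
}

test_accuracy = {}
for name, estimator in models.items():
    # Scale features inside the pipeline so the scaler is fit on training data only.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_train, y_train)
    test_accuracy[name] = pipeline.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```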

## Exploratory Data Analysis (EDA) 📊
EDA was conducted to understand the distribution of the features and identify any correlations or patterns. Visualizations were created using `Matplotlib` and `Seaborn`.
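
A minimal sketch of the numeric side of such an analysis is shown below, using `pandas` only. The data is synthetic, and the way the target is tied to `BMI` and `HighBP` is invented purely so the correlations are not all near zero. The resulting correlation matrix is the kind of table that could then be passed to `seaborn.heatmap` for plotting.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic columns; the relationship to the target is an illustrative assumption.
bmi = rng.normal(28, 5, n)
high_bp = rng.integers(0, 2, n)
smoker = rng.integers(0, 2, n)
score = 0.15 * (bmi - 28) + 1.0 * high_bp + rng.normal(0, 1, n)

df = pd.DataFrame({
    "BMI": bmi,
    "HighBP": high_bp,
    "Smoker": smoker,
    "Diabetes": (score > 0.5).astype(int),
})

# Per-feature distribution summary.
print(df.describe().T[["mean", "std", "min", "max"]])

# Class balance of the target.
print(df["Diabetes"].value_counts(normalize=True))

# Pearson correlation of every feature with the target, strongest first.
corr = df.corr()
target_corr = corr["Diabetes"].drop("Diabetes").sort_values(ascending=False)
print(target_corr)
```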

## Performance Metrics 📈
The following metrics were used to evaluate the performance of the models:
- **Accuracy Score**
- **F1 Score**
- **Confusion Matrix**
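
To make the three metrics concrete, here is a small worked example on hand-written labels (the labels are invented for illustration and are not project results). It uses the standard `sklearn.metrics` functions, with class `1` treated as the positive (diabetic) class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Eight hypothetical individuals: 1 = diabetic, 0 = not diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # 6 of 8 correct -> 0.75
f1 = f1_score(y_true, y_pred)              # precision 3/4, recall 3/4 -> 0.75
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 score: {f1:.2f}")
print(cm)
# [[3 1]   <- true 0: 3 true negatives, 1 false positive
#  [1 3]]  <- true 1: 1 false negative, 3 true positives
```

Accuracy alone can look good on an imbalanced dataset even when few positive cases are found, which is why the F1 score and the confusion matrix are reported alongside it.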

## Conclusion ✅
The four models reached different accuracy and F1 scores. There is still room for improvement, particularly in tuning the models' hyperparameters.

## Libraries Used 📚
- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `pickle`
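
The README does not say how `pickle` is used; a common use in projects like this is saving a trained model so it can be reloaded without retraining. The sketch below assumes that use and shows a round trip in memory on synthetic data. Writing to disk would use `pickle.dump` with a file opened in binary mode instead.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data (stand-in for a real trained model).
X, y = make_classification(n_samples=200, n_features=17, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize to bytes and restore.
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions preserved:", same_predictions)
```

Pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.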

## Future Work
Potential improvements could include experimenting with different models such as Random Forests or Gradient Boosting, and applying techniques like cross-validation and hyperparameter optimization.
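
As one possible shape for that future work, the sketch below combines stratified 5-fold cross-validation with a small grid search over a Random Forest. The data is synthetic, and the parameter grid, fold count, and F1 scoring choice are assumptions picked for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the diabetes dataset (17 features, binary target).
X, y = make_classification(
    n_samples=300, n_features=17, n_informative=6, random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated F1 for a fixed baseline configuration.
baseline_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="f1",
)
print(f"Baseline F1: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")

# Small illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=cv, scoring="f1",
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```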

## Credits
The dataset used in this project was obtained from [Kaggle](https://www.kaggle.com/datasets/prosperchuks/health-dataset/data?select=diabetes_data.csv).