https://github.com/ksganni/diabetes-prediction-using-machine-learning
🔬 A machine learning project that predicts diabetes likelihood using patient health metrics and multiple classification algorithms.
https://github.com/ksganni/diabetes-prediction-using-machine-learning
jupyter-notebook matplotlib numpy pandas python scikitlearn-machine-learning seaborn
Last synced: about 2 months ago
JSON representation
🔬 A machine learning project that predicts diabetes likelihood using patient health metrics and multiple classification algorithms.
- Host: GitHub
- URL: https://github.com/ksganni/diabetes-prediction-using-machine-learning
- Owner: ksganni
- License: mit
- Created: 2025-08-09T01:03:18.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-09T03:46:58.000Z (11 months ago)
- Last Synced: 2025-08-19T14:40:04.522Z (10 months ago)
- Topics: jupyter-notebook, matplotlib, numpy, pandas, python, scikitlearn-machine-learning, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 485 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
README
# 🩺 Diabetes Prediction Project
A beginner-friendly machine learning project that predicts the likelihood of diabetes using the Pima Indians Diabetes Database. This project demonstrates a complete ML workflow from data exploration to model deployment.
---
## 📋 Project Overview
This project aims to predict diabetes occurrence using various patient health metrics. The workflow includes data preprocessing, model training, evaluation, and comparison of multiple classification algorithms.
---
## 🎯 Goals
- Demonstrate end-to-end machine learning workflow
- Compare performance of multiple classification algorithms
- Provide clear visualizations and interpretable results
- Create a foundation for more advanced diabetes prediction models
---
## 🛠️ Tech Stack
- **Python** 3.8+
- **Data Manipulation**: pandas, numpy
- **Machine Learning**: scikit-learn
- **Visualization**: matplotlib, seaborn
- **Environment**: Jupyter Notebook
---
## 📁 Project Structure
```
diabetes-prediction/
├── diabetes_prediction.ipynb # Main notebook with complete workflow
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── diabetes.csv # Dataset file
└── results/ # Generated outputs
├── feature_histograms.png
├── correlation_heatmap.png
├── model_accuracy_comparison.png
├── roc_curves.png
└── best_model.pkl
```
---
## 📊 Dataset
**Source**: Pima Indians Diabetes Database from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
**Features**:
- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skin fold thickness (mm)
- `Insulin`: 2-Hour serum insulin (mu U/ml)
- `BMI`: Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age in years
**Target**: `Outcome` (0 = No diabetes, 1 = Diabetes)
---
## 🚀 Getting Started
---
### Installation
1. **Clone the repository**
```bash
git clone https://github.com/yourusername/diabetes-prediction.git
cd diabetes-prediction
```
2. **Create a virtual environment**
```bash
python -m venv venv
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
4. **Download the dataset**
- Download `diabetes.csv` from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
- Place it in the project root directory
---
### Running the Project
1. **Launch Jupyter Notebook**
```bash
jupyter notebook
```
2. **Open and run the notebook**
- Open `diabetes_prediction.ipynb`
- Run cells sequentially from top to bottom
- Results will be saved in the `results/` directory
---
## 📈 Model Performance
The project implements and compares three machine learning algorithms:
| Model | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|-------|----------|-----------|--------|----------|---------|
| **Random Forest** | 77.9% | 0.717 | 0.611 | 0.66 | 0.818 |
| **K-Nearest Neighbors** | 75.3% | 0.660 | 0.611 | 0.635 | - |
| **Logistic Regression** | 70.8% | 0.600 | 0.500 | 0.545 | 0.813 |
---
## 📊 Key Visualizations
The notebook generates several insightful visualizations:
- **Feature Distribution Histograms**: Understanding data distributions
- **Correlation Heatmap**: Feature relationships and multicollinearity
- **Model Accuracy Comparison**: Side-by-side performance metrics
- **ROC Curves**: Model discrimination capability
- **Feature Importance Plot**: Most influential predictors
---