https://github.com/krish57-bit/diabetes-prediction-
A comprehensive machine learning pipeline to predict the onset of diabetes using the PIMA Indian Diabetes dataset. This includes data cleaning, visualization, outlier detection, standardization, SMOTE-based imbalance handling, and multiple classification algorithms (Logistic Regression, Naive Bayes, and KNN).
classification data-science diabetes healthcare jupyter-notebook machine-learning python scikit-learn smote
- Host: GitHub
- URL: https://github.com/krish57-bit/diabetes-prediction-
- Owner: krish57-bit
- Created: 2025-06-17T14:11:42.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-17T14:24:22.000Z (about 1 year ago)
- Last Synced: 2025-06-17T15:28:36.567Z (about 1 year ago)
- Topics: classification, data-science, diabetes, healthcare, jupyter-notebook, machine-learning, python, scikit-learn, smote
- Language: Jupyter Notebook
- Size: 407 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Diabetes Prediction using Machine Learning
This project demonstrates a full machine learning workflow to predict diabetes using the **PIMA Indian Diabetes Dataset**. It includes detailed data preprocessing, visualization, outlier handling, and classification using various ML models.
---
## Dataset
- **Source**: [PIMA Indian Diabetes Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)
- **Size**: 768 samples, 8 features + 1 binary target (`Outcome`)
- **Target**: `Outcome` (1 = Diabetic, 0 = Non-Diabetic)
---
## Workflow Overview
### 1. Data Loading & Exploration
- Checked for missing/zero values
- Descriptive statistics using `.describe()`
- Correlation heatmap (`sns.heatmap`)
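The exploration step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is a made-up stand-in for the real CSV, and the choice of which columns treat 0 as "missing" is an assumption.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: in these columns a 0 is implausible and really means "not measured".
zero_as_missing = ["Glucose", "Insulin", "BMI"]

missing_counts = df.isnull().sum()              # true NaNs per column
zero_counts = (df[zero_as_missing] == 0).sum()  # hidden missing values per column
summary = df.describe()                         # descriptive statistics
corr = df.corr()                                # the matrix sns.heatmap would plot

print(zero_counts)
```

Counting zeros separately matters here because `.isnull()` alone reports nothing when missing values are encoded as 0.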
### 2. Data Imputation
- Replaced zeros in critical columns (Insulin, BMI, etc.), where a zero indicates a missing measurement, with the median or mean, depending on each column's distribution
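A minimal sketch of this kind of imputation, assuming pandas and using the median for both columns (which statistic the notebook uses for which column is not stated here, so that choice is hypothetical). The median is the usual pick for skewed columns, the mean for roughly symmetric ones.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Insulin": [0, 94, 168, 0, 88],
    "BMI": [33.6, 0.0, 23.3, 28.1, 43.1],
})

# Treat zeros as missing, then fill with the median of the remaining values.
for col in ["Insulin", "BMI"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())

print(df)
```

Converting zeros to `NaN` first keeps them from dragging the median down when it is computed.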
### 3. Outlier Detection
- Used IQR method
- Boxplot visualization for identifying outliers
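As an illustration of the IQR method, the sketch below flags values outside the usual fences. The helper name `iqr_bounds`, the multiplier `k=1.5`, and the sample values are assumptions; whether the notebook removes or caps flagged values is not specified here.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up insulin readings with one extreme value.
insulin = np.array([80, 85, 90, 95, 100, 105, 110, 600])

lower, upper = iqr_bounds(insulin)
outlier_mask = (insulin < lower) | (insulin > upper)
outliers = insulin[outlier_mask]

print(lower, upper, outliers)
```

These fences are the same ones a boxplot draws its whiskers against, which is why the two techniques pair naturally.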
### 4. Feature Scaling
- Applied `StandardScaler` to standardize all features (zero mean, unit variance)
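A short sketch of what `StandardScaler` does, using a made-up two-column array in place of the real feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([[148.0, 33.6], [85.0, 26.6], [183.0, 23.3], [89.0, 28.1]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per column: (x - mean) / std

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Fitting the scaler on the full dataset before splitting lets test-set statistics influence the transform; a common alternative is to call `fit` on the training split only and `transform` on both.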
### 5. Train-Test Split
- Used 67% training and 33% testing with `train_test_split()`
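The split can be sketched as below. The placeholder arrays only mimic the dataset's shape and class counts, and `random_state=42` and `stratify=y` are assumptions added for reproducibility and class balance, not settings confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's shape: 768 rows, 8 features.
X = np.arange(768 * 8).reshape(768, 8)
y = np.array([0] * 500 + [1] * 268)  # 500 non-diabetic, 268 diabetic

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,    # 33% held out for testing
    random_state=42,   # assumption: fixed seed for repeatable splits
    stratify=y,        # assumption: keep the class ratio in both splits
)

print(len(X_train), len(X_test))
```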
### 6. Imbalanced Data Handling
- Applied **SMOTE (Synthetic Minority Oversampling Technique)** to balance the target classes
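In practice SMOTE is usually applied through the `SMOTE` class in imbalanced-learn (`imblearn.over_sampling.SMOTE` with `fit_resample`), typically on the training split only. To show the underlying idea, here is a simplified NumPy sketch of the interpolation step. It is not that library's implementation, and the function name `smote_sketch` and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy imbalanced data: 20 majority points, 8 minority points.
X_majority = np.random.default_rng(1).normal(0.0, 1.0, size=(20, 2))
X_minority = np.random.default_rng(2).normal(3.0, 1.0, size=(8, 2))

synthetic = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])

print(X_minority_balanced.shape)
```

Because each synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies instead of duplicating rows exactly.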
### 7. Model Training
#### Logistic Regression
#### Gaussian Naive Bayes
#### K-Nearest Neighbors (KNN)
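Training the three classifiers might look roughly like this with scikit-learn. The synthetic data from `make_classification` stands in for the scaled, resampled training set, and the hyperparameters (`max_iter=1000`, `n_neighbors=5`) are assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed training data (8 features).
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# fit() returns the estimator itself, so the dict holds the trained models.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}

for name, model in fitted.items():
    print(name, model.score(X_train, y_train))
```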
### 8. Model Evaluation
- Accuracy, Confusion Matrix, and Classification Report (Precision, Recall, F1-score)
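The metrics listed above map onto three scikit-learn functions. The labels below are made up so the numbers are easy to check by hand; they are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels: 1 = Diabetic, 0 = Non-Diabetic.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class
report = classification_report(
    y_true, y_pred, target_names=["Non-Diabetic", "Diabetic"]
)

print(acc)
print(cm)
print(report)
```

For a screening task like this one, recall on the Diabetic class is often the number to watch, since a false negative is a missed case.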
---
## Visualizations
- Correlation Matrix Heatmap
- Feature Distribution using `sns.distplot`
- Boxplots before and after standardization
All saved under `images/`.
---
## Models
| Model | Evaluation Metric | Balanced with SMOTE |
|---------------------|------------------------|----------------------|
| Logistic Regression | Accuracy + Recall | ✅ |
| Gaussian NB | Confusion Matrix + F1 | ✅ |
| KNN Classifier | Accuracy + Classification Report | ✅ |
---
## Requirements
Install with:
```bash
pip install -r requirements.txt