https://github.com/krish57-bit/diabetes-prediction-

A comprehensive machine learning pipeline to predict the onset of diabetes using the PIMA Indian Diabetes dataset. This includes data cleaning, visualization, outlier detection, standardization, SMOTE-based imbalance handling, and multiple classification algorithms (Logistic Regression, Naive Bayes, and KNN).

# Diabetes Prediction using Machine Learning

This project demonstrates a full machine learning workflow to predict diabetes using the **PIMA Indian Diabetes Dataset**. It includes detailed data preprocessing, visualization, outlier handling, and classification using various ML models.

---

## Dataset

- **Source**: [PIMA Indian Diabetes Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)
- **Size**: 768 samples, 8 features + 1 binary target (`Outcome`)
- **Target**: `Outcome` (1 = Diabetic, 0 = Non-Diabetic)

---

## Workflow Overview

### 1. Data Loading & Exploration
- Checked for missing/zero values
- Descriptive statistics using `.describe()`
- Correlation heatmap (`sns.heatmap`)

### 2. Data Imputation
- Replaced zeros in columns where zero is not a valid measurement (Insulin, BMI, etc.) with the column median or mean, chosen according to each column's distribution
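The imputation step can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the README does not say which columns use the median and which the mean, so the sketch uses the median only, and the helper name `impute_zeros_with_median` and the sample readings are made up.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Replace zeros with the median of the non-zero entries."""
    observed = [v for v in values if v != 0]
    if not observed:
        # Every entry is zero, so there is nothing to estimate a median from.
        return list(values)
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


# Hypothetical insulin readings; 0 stands for "not recorded".
insulin = [0, 94, 168, 0, 88, 543]
imputed = impute_zeros_with_median(insulin)
print(imputed)
```

The median is computed from the non-zero entries only, so the placeholder zeros do not pull the fill value down.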

### 3. Outlier Detection
- Used the interquartile range (IQR) method to flag outliers
- Visualized each feature with boxplots to inspect the flagged values
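A rough sketch of the IQR rule, assuming the conventional 1.5 x IQR fences. The README does not state the multiplier or whether outliers were removed or capped, so `k=1.5`, the helper name `iqr_bounds`, and the sample values are illustrative assumptions.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used to flag outliers."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical BMI values with one extreme reading.
bmi = [22.0, 24.5, 26.1, 27.3, 28.0, 29.4, 31.2, 33.6, 67.1]
low, high = iqr_bounds(bmi)
outliers = [v for v in bmi if v < low or v > high]
print(outliers)
```

Points outside the fences are the ones a boxplot draws as individual markers beyond the whiskers.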

### 4. Feature Scaling
- Applied `StandardScaler` to standardize all features to zero mean and unit variance
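For reference, the arithmetic behind standardization is a z-score per feature. The sketch below reproduces it with the standard library for a single column; `standardize` and the glucose values are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale one feature column to zero mean and unit variance."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (ddof = 0)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical glucose readings.
glucose = [85, 89, 116, 137, 148, 183]
scaled = standardize(glucose)
print([round(z, 3) for z in scaled])
```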

### 5. Train-Test Split
- Split the data into 67% training and 33% testing with `train_test_split()`
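To make the 67/33 proportions concrete for 768 rows, here is a small shuffle-and-slice sketch. It is not `train_test_split()` itself: the seed, the absence of stratification, and rounding the test share up are assumptions of this sketch, and `split_indices` is a made-up helper.

```python
import math
import random


def split_indices(n_samples, test_size=0.33, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(n_samples * test_size)  # round the test share up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(768)
print(len(train_idx), len(test_idx))  # 514 254
```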

### 6. Imbalanced Data Handling
- Applied **SMOTE (Synthetic Minority Oversampling Technique)** to balance the target classes
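The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step; it is a simplified illustration, not the implementation used in the notebook (the README does not name the library), and `smote_sketch`, `k=2`, and the sample points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> target
        synthetic.append(
            tuple(b + lam * (t - b) for b, t in zip(base, target))
        )
    return synthetic


# Hypothetical standardized minority-class samples (two features).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.4), (1.2, 2.9)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.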

### 7. Model Training
#### Logistic Regression
#### Gaussian Naive Bayes
#### K-Nearest Neighbors (KNN)
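As an illustration of how a KNN classifier reaches a prediction, the following sketch takes a majority vote among the `k` closest training points by Euclidean distance. The README does not state the value of `k` or the distance metric used, so `k=3`, the helper `knn_predict`, and the toy data are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest points."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical standardized samples: label 0 = non-diabetic, 1 = diabetic.
train_X = [(-1.0, -0.8), (-0.6, -1.1), (-0.9, -0.2),
           (0.9, 1.0), (1.2, 0.7), (0.7, 1.3)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.0, 0.9)))    # 1
print(knn_predict(train_X, train_y, (-0.8, -0.7)))  # 0
```

Distance-based voting is the reason feature scaling matters for KNN: an unscaled feature with a large range would dominate the distance.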

### 8. Model Evaluation
- Accuracy, Confusion Matrix, and Classification Report (Precision, Recall, F1-score)
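The metrics listed above all derive from the four cells of a binary confusion matrix. The sketch below computes them by hand for the positive class (`Outcome` = 1); `binary_report` and the label vectors are made up for illustration, and the matrix layout (rows = true class, columns = predicted class) follows the scikit-learn convention.

```python
def binary_report(y_true, y_pred):
    """Compute confusion-matrix cells and derived metrics for class 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_report(y_true, y_pred)
print(report)
```

For a screening task such as this one, recall on the diabetic class shows how many true cases the model misses, which accuracy alone can hide.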

---

## Visualizations

- Correlation matrix heatmap
- Feature distributions using `sns.distplot` (deprecated since seaborn 0.11 in favor of `sns.histplot` and `sns.displot`)
- Boxplots before and after standardization

All saved under `images/`.

---

## Models

| Model               | Evaluation Metric                | Balanced with SMOTE |
|---------------------|----------------------------------|---------------------|
| Logistic Regression | Accuracy + Recall                | Yes                 |
| Gaussian NB         | Confusion Matrix + F1            | Yes                 |
| KNN Classifier      | Accuracy + Classification Report | Yes                 |

---

## Requirements

Install with:

```bash
pip install -r requirements.txt