https://github.com/steffin12-git/logistic-regression-daibetics
Built an interpretable Logistic Regression model to predict diabetes from clinical features; produced reproducible EDA, model validation, and visual diagnostics.
https://github.com/steffin12-git/logistic-regression-daibetics
insights matplotlib-pyplot model-evaluation pandas python seaborn sklearn
Last synced: about 1 month ago
JSON representation
Built an interpretable Logistic Regression model to predict diabetes from clinical features; produced reproducible EDA, model validation, and visual diagnostics.
- Host: GitHub
- URL: https://github.com/steffin12-git/logistic-regression-daibetics
- Owner: Steffin12-git
- Created: 2025-08-16T03:50:11.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-08-16T04:16:26.000Z (3 months ago)
- Last Synced: 2025-08-16T06:24:21.418Z (3 months ago)
- Topics: insights, matplotlib-pyplot, model-evaluation, pandas, python, seaborn, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 104 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 🩺 Logistic Regression — Diabetes Prediction
**Repository:** *Logistic Regression for Diabetics*
**Notebook:** `Logistic regression for diabetics.ipynb`
---
## 🚀 Tech Stack






---
## 📌 Project Summary
This project builds a clear and interpretable **Logistic Regression** model to predict the likelihood of diabetes using clinical features.
The notebook walks through an **end-to-end ML workflow**:
- Data preprocessing & feature scaling
- Exploratory data analysis (EDA)
- Model training & evaluation
- Visual diagnostics (confusion matrix, ROC curve)
- Interpretation of coefficients as odds ratios
✨ The goal is to **balance predictive performance with interpretability**, a crucial requirement for healthcare decision-making.
---
## 📋 Dataset
- **Source:** Pima Indians Diabetes Database (UCI Repository)
- **Target Variable:** `Outcome` → (1 = diabetic, 0 = non-diabetic)
- **Features Used:**
`Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age`
**Preprocessing applied:**
- Replaced biologically implausible zero values (for Glucose, BP, BMI, etc.)
- Applied `StandardScaler` for normalization
- Split dataset using `train_test_split(test_size=0.3, random_state=42)`
---
## 🔎 Exploratory Data Analysis (EDA)
Key checks performed:
- Class balance between diabetic vs. non-diabetic patients
- Feature distributions & correlations
- Relationship of top predictors (Glucose, BMI, Age) with the outcome
*(EDA plots can be extended in future iterations)*
---
## 🧠 Model Training
- **Algorithm:** Logistic Regression (`sklearn.linear_model.LogisticRegression`)
- **Handling imbalance:** (optionally `class_weight='balanced'`)
- **Outputs:** Predictions, probabilities, coefficients
---
## ✅ Model Performance
**Confusion Matrix**

**ROC Curve**

**Evaluation Metrics (Test Set):**
- **Accuracy:** `0.77`
- **Precision:** `0.75`
- **Recall (Sensitivity):** `0.75`
- **F1-score:** `0.82`
---
## 🔑 Model Interpretation
Coefficients were mapped to odds ratios for clinical interpretability. Example:
| Feature | Coefficient (β) | Odds Ratio (exp(β)) | Interpretation |
|--------------------------|----------------:|--------------------:|------------------------------------------|
| Glucose | `β_glucose` | `OR_glucose` | Higher glucose → higher odds of diabetes |
| BMI | `β_bmi` | `OR_bmi` | Elevated BMI increases odds |
| Age | `β_age` | `OR_age` | Older patients have higher risk |
| DiabetesPedigreeFunction | `β_dpf` | `OR_dpf` | Strong family history raises odds |
---
## 🩺 Clinical & Business Insights
- **Glucose** and **BMI** are the strongest indicators of diabetes risk.
- **Age** and **family history (DPF)** further amplify predicted risk.
- The model can serve as a **screening tool** for healthcare professionals: prioritizing high-risk patients for testing.
- In practice, **higher recall (sensitivity)** is preferred to minimize false negatives (undiagnosed diabetics).
---
## ⚠️ Limitations
- Logistic Regression assumes linear log-odds → nonlinear patterns may be missed.
- Missing/imputed data can bias the model.
- Dataset is relatively small; external validation required.
- Clinical deployment requires regulatory approval & real-world testing.
---
## 📂 Repository Structure
```
📁 Logistic-Regression-diabetics
│── Logistic regression for diabetics.ipynb # Jupyter notebook
│── diabetes.csv # Dataset
│── images/
│ ├── confusion metrics.png
│ ├── Roc curve.png
│── README.md # Project documentation
```
---
## ✨ Recruiter Pitch
Developed a **Logistic Regression model** for diabetes prediction with strong interpretability and healthcare relevance.
- Demonstrated full ML workflow (EDA → modeling → evaluation → insights).
- Produced actionable clinical interpretations via coefficients & odds ratios.
- Delivered professional, reproducible documentation & visual diagnostics.
- Tools: **Python, Pandas, Scikit-learn, Matplotlib, Seaborn, Jupyter**