https://github.com/kianaabrisham/stroke-prediction-ml-pipeline
Clinical ML pipeline with ROC/PR and interpretability
class-imbalance clinical-data healthcare interpretability machine-learning pandas pipeline precision-recall roc-auc scikit-learn
- Host: GitHub
- URL: https://github.com/kianaabrisham/stroke-prediction-ml-pipeline
- Owner: KianaAbrisham
- License: mit
- Created: 2025-09-30T13:51:50.000Z (7 days ago)
- Default Branch: main
- Last Pushed: 2025-09-30T13:53:06.000Z (7 days ago)
- Last Synced: 2025-09-30T21:27:54.889Z (7 days ago)
- Topics: class-imbalance, clinical-data, healthcare, interpretability, machine-learning, pandas, pipeline, precision-recall, roc-auc, scikit-learn
- Language: Jupyter Notebook
- Homepage:
- Size: 7.81 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Stroke Prediction — ML Pipeline (Portfolio Sample)
A **clean, reproducible** scikit-learn pipeline for binary classification on a stroke dataset.
This repo is designed as a professional portfolio example: clear preprocessing, class imbalance handling,
model comparison, and publication-quality evaluation figures.

## What this demonstrates
- Reproducible **ML pipeline** with `ColumnTransformer` (numeric/one-hot)
- **Imputation** (median/most_frequent) and **scaling**
- Class imbalance handling via `class_weight='balanced'`
- **Modeling:** Logistic Regression, Random Forest
- **Evaluation:** ROC-AUC, PR-AUC, confusion matrix, classification report
- **Interpretability:** feature importances (RF) and permutation importance

## Repo Structure
```
.
├── notebooks
│ └── stroke_pipeline.ipynb
├── data
│ └── sample.csv # Tiny demo CSV (columns similar to popular stroke datasets)
├── README.md
├── requirements.txt
├── LICENSE
└── .gitignore
```

## Using your dataset
1. Place your full dataset CSV as `data/stroke.csv` with columns similar to:
   - `id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke`
2. Open `notebooks/stroke_pipeline.ipynb` and set `use_demo = False` to load `data/stroke.csv`.
3. Run all cells.

> Note: The repo includes a tiny synthetic `sample.csv` so the notebook runs instantly. For real results, use your dataset.
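
For orientation, the steps above can be sketched outside the notebook as follows. This is a minimal sketch, not the notebook's actual code: the helper names (`load_data`, `build_pipeline`, `evaluate`), the split of columns into numeric and categorical, the 25% test split, and the choice of Logistic Regression are all assumptions made for illustration. Only the column names, the `use_demo` switch, the file paths, and `class_weight='balanced'` come from this README.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names follow the list above; which ones count as numeric
# versus categorical is an assumption of this sketch.
NUMERIC = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi"]
CATEGORICAL = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
TARGET = "stroke"


def load_data(use_demo: bool = True, data_dir: str = "data") -> pd.DataFrame:
    """Load the demo CSV or the full dataset and check the expected columns."""
    path = Path(data_dir) / ("sample.csv" if use_demo else "stroke.csv")
    df = pd.read_csv(path)
    missing = set(NUMERIC + CATEGORICAL + [TARGET]) - set(df.columns)
    if missing:
        raise ValueError(f"{path} is missing columns: {sorted(missing)}")
    # The row identifier carries no predictive signal, so drop it if present.
    return df.drop(columns=["id"], errors="ignore")


def build_pipeline() -> Pipeline:
    """Impute, scale, and one-hot encode, then fit a class-weighted model."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


def evaluate(df: pd.DataFrame, seed: int = 42) -> dict:
    """Fit on a stratified train split and report ROC-AUC and PR-AUC on the rest."""
    X, y = df[NUMERIC + CATEGORICAL], df[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = build_pipeline().fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "roc_auc": roc_auc_score(y_test, scores),
        "pr_auc": average_precision_score(y_test, scores),
    }
```

With a real dataset in place, `evaluate(load_data(use_demo=False))` would return both metrics as a dictionary. Keeping imputation and scaling inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into preprocessing.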
## Quickstart
```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
# Linux/Mac: source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook notebooks/stroke_pipeline.ipynb
```

## License
MIT