https://github.com/sumith25-dev/customer-churn-prediction
Last synced: 15 days ago
- Host: GitHub
- URL: https://github.com/sumith25-dev/customer-churn-prediction
- Owner: sumith25-dev
- Created: 2026-05-28T07:55:56.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-28T08:09:39.000Z (about 1 month ago)
- Last Synced: 2026-05-28T10:06:18.619Z (30 days ago)
- Language: Jupyter Notebook
- Size: 219 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Customer Churn Prediction System
> **End-to-end ML pipeline** for telecom customer churn prediction using XGBoost, SMOTE class balancing, SHAP explainability, and a production-ready Streamlit dashboard.
[Python](https://python.org) · [XGBoost](https://xgboost.readthedocs.io) · [Streamlit](https://streamlit.io)
---
## Results
| Metric | Score |
|--------|-------|
| **Accuracy** | **92%** |
| **AUC-ROC** | **0.89** |
| **Recall** | **88%** |
| Inference Time | < 2 seconds |
| Baseline (Logistic Regression) | 81% accuracy |
> Outperforms the logistic regression baseline by **11 percentage points** of accuracy on the IBM Telco dataset (7,043 records).
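
For readers less familiar with the metrics in the table, accuracy and recall both fall out of the confusion matrix. The sketch below shows only the arithmetic: the counts are made up for illustration and are not this project's results, and the function name is hypothetical.

```python
def accuracy_and_recall(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, recall) for the positive (churn) class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total  # share of all predictions that are correct
    recall = tp / (tp + fn)       # share of actual churners that were caught
    return accuracy, recall


# Hypothetical counts for illustration only, NOT the repository's confusion matrix.
acc, rec = accuracy_and_recall(tp=80, fp=10, tn=100, fn=10)
print(f"accuracy={acc:.2f} recall={rec:.2f}")  # accuracy=0.90 recall=0.89
```
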
---
## Architecture
```
churn-prediction/
├── app.py                  # Streamlit dashboard (4 pages)
├── src/
│   ├── train.py            # Training pipeline (XGBoost + SMOTE + GridSearch)
│   ├── predict.py          # Inference helpers + SHAP explanations
│   └── utils.py            # Sample CSV generator & shared utilities
├── models/                 # Saved artifacts (after training)
│   ├── xgb_model.pkl
│   ├── scaler.pkl
│   ├── feature_cols.pkl
│   └── shap_explainer.pkl
├── data/                   # Place dataset CSV here
├── requirements.txt
└── README.md
```
---
## Quick Start
### 1. Clone & install
```bash
git clone https://github.com/sumith25-dev/customer-churn-prediction.git
cd customer-churn-prediction
pip install -r requirements.txt
```
### 2. Download the dataset
Get the IBM Telco Customer Churn dataset from Kaggle:
```
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
```
Place `WA_Fn-UseC_-Telco-Customer-Churn.csv` inside the `data/` folder.
### 3. Train the model
```bash
python src/train.py
```
This will:
- Load and clean the 7,043-record dataset
- Engineer 20+ features (tenure buckets, service count, charge ratios)
- Apply SMOTE to balance the 26% churn minority class
- Run 5-fold cross-validated grid search over XGBoost hyperparameters
- Evaluate on a 20% held-out test set
- Save SHAP explainability artifacts
- Output: `models/*.pkl`, `models/confusion_matrix.png`, `models/roc_curve.png`, `models/shap_summary.png`
### 4. Launch the dashboard
```bash
streamlit run app.py
```
---
## Key Churn Drivers (SHAP Analysis)
1. **Contract Type**: Month-to-month contracts show a 3× higher churn rate
2. **Tenure**: New customers (0–12 months) churn most frequently
3. **Monthly Charges**: Higher bills correlate with churn risk
4. **Internet Service**: Fiber optic users churn more than DSL
5. **Tech Support**: Absence of tech support increases churn risk
---
## Dashboard Features
| Page | Description |
|------|-------------|
| Dashboard | KPI cards, key churn drivers, model overview |
| Single Prediction | Real-time inference with SHAP waterfall chart |
| Bulk Prediction | CSV upload → predictions → downloadable results |
| Model Insights | Feature importance, ROC curve, confusion matrix |
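
The bulk-prediction flow (CSV in, predictions appended, CSV out) can be sketched with the standard library alone. Everything here is an assumption for illustration: `score_row` is a placeholder rule standing in for the trained model, and the output column names are hypothetical. In a Streamlit app the input would typically come from `st.file_uploader` and the result would be offered through `st.download_button`.

```python
import csv
import io


def score_row(row: dict) -> float:
    """Placeholder scorer (NOT the trained model): a fixed rule so the sketch runs."""
    return 0.8 if row["Contract"] == "Month-to-month" else 0.2


def predict_csv(csv_text: str, threshold: float = 0.5) -> str:
    """Read uploaded CSV text, append prediction columns, return CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + ["churn_probability", "churn_prediction"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        prob = score_row(row)
        row["churn_probability"] = f"{prob:.2f}"
        row["churn_prediction"] = "Yes" if prob >= threshold else "No"
        writer.writerow(row)
    return out.getvalue()


uploaded = "customerID,Contract\nA1,Month-to-month\nB2,Two year\n"
result = predict_csv(uploaded)
print(result)
```
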
---
## Pipeline Details
### Data Processing
- Handle missing `TotalCharges` (11 records) with median imputation
- Encode categorical features via one-hot encoding
- Create derived features: `tenure_group`, `service_count`, `charges_per_month`
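
A minimal pandas sketch of these three steps follows. It runs on a tiny stand-in frame rather than the real CSV, and the exact definitions of `tenure_group`, `service_count`, and `charges_per_month` (bin edges, which service columns are counted, how zero tenure is handled) are assumptions; `src/train.py` may define them differently.

```python
import pandas as pd

# Tiny stand-in frame; the real input is the Telco CSV placed in data/.
df = pd.DataFrame({
    "tenure": [1, 24, 60, 0],
    "MonthlyCharges": [70.0, 50.0, 100.0, 20.0],
    "TotalCharges": ["70.0", "1200.0", "6000.0", " "],  # blank string = missing
    "Contract": ["Month-to-month", "One year", "Two year", "Two year"],
    "TechSupport": ["No", "Yes", "Yes", "No"],
    "OnlineSecurity": ["No", "No", "Yes", "No"],
})

# 1. Blank TotalCharges -> NaN -> median imputation.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# 2. Derived features (definitions assumed for this sketch).
df["tenure_group"] = pd.cut(
    df["tenure"], bins=[-1, 12, 24, 48, 72], labels=["0-12", "13-24", "25-48", "49-72"]
)
df["service_count"] = (df[["TechSupport", "OnlineSecurity"]] == "Yes").sum(axis=1)
df["charges_per_month"] = df["TotalCharges"] / df["tenure"].clip(lower=1)  # avoid /0

# 3. One-hot encode the categorical columns.
encoded = pd.get_dummies(
    df, columns=["Contract", "TechSupport", "OnlineSecurity", "tenure_group"]
)
print(encoded.columns.tolist())
```

Computing the median on the training split only, and reusing it at inference time, would keep test-set information out of the imputation step.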
### Class Imbalance
- Dataset: 73.5% No Churn / 26.5% Churn
- Strategy: **SMOTE** (Synthetic Minority Over-sampling Technique) on training split only
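
To show what SMOTE does, here is a simplified pure-Python sketch of its core idea: each synthetic row is an interpolation between a minority sample and one of its nearest minority neighbours. It is not the imbalanced-learn implementation (a real pipeline would more likely call `SMOTE(...).fit_resample(X_train, y_train)` from `imblearn.over_sampling`), and the function name and toy data are hypothetical. Only the training split is passed in, so the held-out test set keeps its original class ratio.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy *training* split: 6 majority rows, 3 minority rows (features only).
train_majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.1)]
train_minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]

new_points = smote_like_oversample(
    train_minority, n_new=len(train_majority) - len(train_minority), k=2
)
balanced_minority = train_minority + new_points
print(len(train_majority), len(balanced_minority))  # both classes now have 6 rows
```
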
### Model Selection
- Algorithm: **XGBoost** (gradient-boosted trees)
- Validation: **StratifiedKFold (k=5)** cross-validation
- Tuning: **GridSearchCV** over `n_estimators`, `max_depth`, `learning_rate`, `subsample`
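
The cost of this search is the number of hyperparameter combinations times the number of folds. The sketch below illustrates both pieces with the standard library: a simplified stratified fold assignment (scikit-learn's `StratifiedKFold` is the real tool and may split differently) and a grid enumeration. The grid values are hypothetical, since only the parameter names are given above.

```python
from collections import defaultdict
from itertools import product


def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, dealing indices out
    class by class so every fold keeps roughly the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical grid values; only the parameter names come from this README.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

labels = [1] * 25 + [0] * 75  # toy labels with 25% positives
folds = stratified_folds(labels, k=5)
total_fits = len(candidates) * len(folds)
print(len(candidates), "candidates x", len(folds), "folds =", total_fits, "cross-validation fits")
```

When oversampling is combined with cross-validation, applying SMOTE inside each training fold (for example through an imbalanced-learn `Pipeline`) rather than before the split keeps synthetic rows out of the validation folds.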
### Explainability
- **SHAP TreeExplainer** for global feature importance and local per-prediction explanations
- Surfaces top positive/negative drivers for each customer prediction
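
Picking the top drivers for one customer amounts to sorting that prediction's SHAP values by sign and magnitude. The sketch below assumes churn is the positive class, so positive values push the prediction toward churn. The feature names and values are hypothetical, and `top_drivers` is an illustrative helper, not a function from `src/predict.py`.

```python
def top_drivers(feature_names, shap_values, n=3):
    """Split one prediction's SHAP values into the strongest positive
    (pushing toward churn) and negative (pushing away) contributions."""
    pairs = list(zip(feature_names, shap_values))
    positive = sorted((p for p in pairs if p[1] > 0), key=lambda p: p[1], reverse=True)[:n]
    negative = sorted((p for p in pairs if p[1] < 0), key=lambda p: p[1])[:n]
    return positive, negative


# Hypothetical feature names and SHAP values for a single customer.
names = ["Contract_Month-to-month", "tenure", "MonthlyCharges", "TechSupport_Yes", "service_count"]
values = [0.42, 0.15, 0.08, -0.30, -0.05]
up, down = top_drivers(names, values, n=2)
print("raises churn risk:", up)
print("lowers churn risk:", down)
```
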
---
## Author
**Sumith B R**, Junior AI Engineer
[LinkedIn](https://www.linkedin.com/in/sumith-b-r-548534200/) · [GitHub](https://github.com/sumith25-dev)