https://github.com/tszon/end-to-end_ds_ml_project
I built an end-to-end customer churn segregation and prediction project.
- Host: GitHub
- URL: https://github.com/tszon/end-to-end_ds_ml_project
- Owner: Tszon
- Created: 2024-09-10T22:29:43.000Z
- Default Branch: main
- Last Pushed: 2025-08-26T11:44:44.000Z
- Last Synced: 2025-09-03T17:52:18.032Z
- Topics: containerisation, data-science, docker, explianable-ai, exploratory-data-analysis, feature-engineering, hdbscan-clustering, kmeans-clustering, machine-learning, mlflow, preprocessing-data, scikit-learn, shap, statistical-test, statistical-tests, streamlit, supervised-learning, visualisation, vscode
- Language: Jupyter Notebook
- Homepage: https://www.datacamp.com/portfolio/TszonTseng
- Size: 16.2 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# End-to-End ML Deployment: Telco Customer Churn Project
## Live Demo
Click to open the [live Streamlit app](https://tszontseng-telco-end2end-customer-churn-project.streamlit.app/)
---
## Project Overview
Customer churn is a major challenge for telecom companies: retaining customers is often more cost-effective than acquiring new ones.
This project builds an **end-to-end machine learning pipeline** to predict churn, explain drivers of churn, and segment customers into actionable groups for better retention strategies.
The project includes:
* **EDA**: Explore churn patterns, tenure, contracts, and charges.
* **Customer Segmentation**: KMeans (baseline) vs HDBSCAN (tuned).
* **Churn Prediction**: Logistic Regression baseline vs advanced ensemble models (Random Forest, XGBoost, Voting Classifier).
* **Explainability**: SHAP summary and waterfall plots.
* **Interactive App**: Built with **Streamlit**, deployed on Streamlit Cloud.
---
## Dataset
Dataset: [WA_Fn-UseC_-Telco-Customer-Churn.csv](https://www.kaggle.com/blastchar/telco-customer-churn)
* **Target**: `"Churn"` (Yes/No)
* **Features include**:
  * **Demographics**: Gender, Senior Citizen, Dependents, Partner
  * **Services**: Phone, Internet, Tech Support, Streaming, Security
  * **Account Info**: Tenure, Contract, Billing, Payment Method
  * **Charges**: Monthly & Total Charges
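A minimal loading sketch for the dataset described above. It assumes the column names of the public Kaggle file (`TotalCharges`, `Churn`, and so on); the inline rows are made up and only mimic the file's layout, and `load_telco` is a hypothetical helper name. In the public file, `TotalCharges` typically arrives as text because a few rows are blank, so it is coerced to numeric here. Filling the blanks with 0 is an assumption, not necessarily what the project does.

```python
import io

import pandas as pd

# Made-up rows that mimic the layout of the Kaggle CSV (illustrative only).
SAMPLE_CSV = """customerID,tenure,Contract,MonthlyCharges,TotalCharges,Churn
0001-AAAA,1,Month-to-month,70.35,70.35,Yes
0002-BBBB,0,Two year,20.25, ,No
0003-CCCC,34,One year,56.95,1889.5,No
"""


def load_telco(source) -> pd.DataFrame:
    """Read the churn CSV and return numeric charges plus a 0/1 target."""
    df = pd.read_csv(source)
    # Blank strings become NaN instead of raising an error.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Assumption: a blank total belongs to a brand-new customer, so use 0.
    df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
    # Encode the target: Yes -> 1, No -> 0.
    df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
    return df


df = load_telco(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

To run it against the real file, pass the path to the downloaded CSV instead of the `io.StringIO` sample.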
---
## Methods & Models
### Exploratory Data Analysis (EDA)
* Customers with **fiber optic internet churn the most**, possibly because of pricing or service quality issues.
* **DSL customers churn less**, possibly due to stable pricing or loyalty.
* **High churn in the first 5 months**: a critical onboarding phase.
* Long-tenure customers (>24 months) show **significantly lower churn rates**.
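The comparisons above come down to grouped churn rates. The sketch below shows one way to compute them with pandas on made-up toy rows, assuming the Kaggle column names and a target already encoded as 0/1. The tenure bands mirror the findings listed above, and the upper bound of 72 months is an assumption about the data range.

```python
import pandas as pd

# Toy data (made up) with the Kaggle column names; Churn is already 0/1.
df = pd.DataFrame({
    "tenure": [1, 3, 4, 12, 30, 48, 60, 2],
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "DSL",
                        "Fiber optic", "DSL", "DSL", "Fiber optic"],
    "Churn": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Tenure bands: first 5 months, 6 to 24 months, more than 24 months.
bands = pd.cut(df["tenure"], bins=[0, 5, 24, 72],
               labels=["0-5", "6-24", ">24"], include_lowest=True)

# The mean of a 0/1 target within each group is that group's churn rate.
churn_by_band = df.groupby(bands, observed=True)["Churn"].mean()
churn_by_service = df.groupby("InternetService")["Churn"].mean()

print(churn_by_band)
print(churn_by_service)
```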
### Customer Segmentation (Unsupervised Learning)
* **Cluster 0 (Budget Loyalists)**: Minimal services, mailed check payments, stable.
* **Cluster 1 (At-Risk Premiums)**: Fiber optic, month-to-month, electronic check, highest churn risk.
* **Cluster 2 (Balanced Mainstream)**: Moderate DSL usage, mixed services, mid-spenders.
* **Cluster -1 (Drifters)**: DSL, no phone, low commitment.
### Churn Prediction Models
* Logistic Regression (baseline)
* Random Forest (ensemble)
* XGBoost (boosted trees)
* Voting Classifier (combined)
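The model lineup above can be sketched with scikit-learn alone. This is an illustration on synthetic data, not the project's training code: `GradientBoostingClassifier` stands in for XGBoost so the example has no extra dependency, and the class weights, hyperparameters, and ROC AUC metric are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the encoded churn features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Stand-in for XGBoost so the sketch needs only scikit-learn.
boosted = GradientBoostingClassifier(random_state=0)

# Soft voting averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[("lr", baseline), ("rf", forest), ("gb", boosted)],
    voting="soft")

scores = {}
for name, model in [("logistic_regression", baseline), ("voting", ensemble)]:
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(scores)
```

Comparing every candidate against the logistic regression baseline on the same held-out split shows whether the extra complexity of the ensemble earns its keep.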
### Explainability (SHAP)
* Feature importance ranking.
* SHAP summary plots + waterfall plots for individual predictions.
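The README does not show the project's `shap` calls, so the sketch below does not guess at them. Instead it illustrates the additive idea a waterfall plot displays, using a logistic regression on synthetic data: for a linear model, treating features as independent, each feature's contribution to the log-odds is its coefficient times the feature's deviation from the background mean. The feature names are hypothetical labels attached to synthetic columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
# Hypothetical labels for the synthetic columns.
feature_names = ["tenure", "MonthlyCharges", "TotalCharges", "SeniorCitizen"]

model = LogisticRegression(max_iter=1000).fit(X, y)

background_mean = X.mean(axis=0)
# Base value: the model's log-odds output at the average customer.
base_value = model.intercept_[0] + model.coef_[0] @ background_mean

x = X[0]
# Per-feature contribution to the log-odds for this one customer.
contributions = model.coef_[0] * (x - background_mean)

# Additivity: base value plus contributions reproduces the model output.
log_odds = model.decision_function(x.reshape(1, -1))[0]
assert np.isclose(base_value + contributions.sum(), log_odds)

# Waterfall-style listing, largest absolute contribution first.
for i in np.argsort(-np.abs(contributions)):
    print(f"{feature_names[i]:>15}: {contributions[i]:+.3f}")
```

Tree ensembles need a dedicated explainer to compute such contributions, but the reading of the plot is the same: a base value plus per-feature pushes that sum to the model's output for one customer.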
---
## Deployment
* **Streamlit App** for interactive visualization and prediction.
* **Dockerized** for reproducibility.
* **Deployed on Streamlit Cloud** with a public link.
---
## Installation & Usage
### 1. Clone Repo
```bash
git clone https://github.com/<username>/Customer_Churn_Prediction.git
cd Customer_Churn_Prediction
```
### 2. Install Requirements
```bash
pip install -r requirements.txt
```
### 3. Run Locally
```bash
streamlit run scripts/app.py
```
App runs at: [http://localhost:8501](http://localhost:8501)
### 4. Run with Docker
```bash
docker build -t churn-app .
docker run -p 8501:8501 churn-app
```
---
## Project Structure
```
Customer_Churn_Prediction/
│
├── data/               # feature store JSON (not raw data)
├── models/             # saved ML models (.joblib)
├── reports_app/        # plots & visualizations
├── scripts/            # Streamlit app (app.py) & utilities
├── src/                # preprocessing, feature engineering, utils
├── config.json         # config settings
├── requirements.txt    # dependencies
├── Dockerfile          # container setup
└── README.md           # this file
```
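The `models/` folder holds `.joblib` files. As a hedged sketch of that save and load round trip, the example below persists a small scikit-learn pipeline and restores it. The file name `churn_pipeline.joblib` is hypothetical, a temporary directory stands in for `models/`, and the pipeline itself is a placeholder rather than the project's trained model.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder pipeline trained on synthetic data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
pipeline = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000)).fit(X, y)

# A temporary folder stands in for models/; the file name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "churn_pipeline.joblib"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should score rows exactly like the original.
original_proba = pipeline.predict_proba(X[:5])[:, 1]
restored_proba = restored.predict_proba(X[:5])[:, 1]
print(restored_proba)
```

Saving the preprocessing and the estimator together in one pipeline means the app only has to load a single file before it can score raw inputs.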
---
## Tech Stack
* **Python**: `pandas`, `numpy`, `scikit-learn`, `xgboost`, `shap`, `hdbscan`, `umap`
* **Visualization**: `matplotlib`, `seaborn`, `streamlit`
* **MLOps Tools**: `Docker`, `GitHub`, `MLflow` (experiment tracking)
* **Deployment**: `Streamlit Cloud`
---
## Next Steps
* Extend segmentation with deep embeddings.
* Add hyperparameter search with Optuna.
* Deploy with a custom domain using Render or Railway.
---
## Author
Developed by **[Tszon Tseng](https://github.com/Tszontseng)**
* Passionate about Data Science & AI
* Building end-to-end ML pipelines
* [LinkedIn Profile](https://www.linkedin.com/in/tszon-tseng-a381aa297/)
---
With this app, telecom providers can **predict churn, understand why customers leave, and design better retention strategies.**