https://github.com/pydevcasts/churn_modeling_article
Customer churn prediction system for banking institutions using advanced feature engineering and ensemble learning techniques. The model addresses highly imbalanced datasets (10:1 ratio) by combining SMOTE oversampling with a Soft Voting Classifier (Random Forest, Gradient Boosting, and XGBoost).
customer-churn-prediction ensemble-learning imbalanced-data smote
Last synced: 29 days ago
- Host: GitHub
- URL: https://github.com/pydevcasts/churn_modeling_article
- Owner: pydevcasts
- Created: 2025-06-27T13:28:37.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2026-05-26T23:00:43.000Z (about 1 month ago)
- Last Synced: 2026-05-27T00:20:31.905Z (about 1 month ago)
- Topics: customer-churn-prediction, ensemble-learning, imbalanced-data, smote
- Language: Jupyter Notebook
- Size: 41.5 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Customer Churn Prediction in Banking
## Advanced Feature Engineering & Ensemble Learning for Imbalanced Banking Data
[Python](https://www.python.org/)
[scikit-learn](https://scikit-learn.org/)
[XGBoost](https://xgboost.ai/)
[License](LICENSE)
---
## 📋 Overview
This project presents a robust machine learning solution for **customer churn prediction** in the banking sector. The model addresses the challenge of highly imbalanced datasets (10:1 ratio) through advanced feature engineering and ensemble learning techniques.
### Key Highlights
- **Accuracy**: 92%
- **AUC Score**: 0.96
- **Precision**: 0.96
- **Recall**: 0.87
- **Outperforms the previous best model**, which reached 91% accuracy
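
For readers less familiar with these metrics: accuracy, precision, and recall all derive from the four confusion-matrix counts, whereas AUC is computed from ranked prediction scores rather than from these counts. The sketch below shows the standard definitions only. The function name `classification_metrics` and the counts passed to it are hypothetical illustration values, not this project's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / total,
        # Of the customers flagged as churners, how many really churned.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the customers who really churned, how many were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration only (not taken from the project).
metrics = classification_metrics(tp=80, fp=5, fn=20, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a churn setting, high precision means few loyal customers are wrongly targeted, while recall measures how many actual churners are caught.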
---
## 🎯 Problem Statement
Customer churn is a critical challenge in the banking industry. Identifying customers likely to leave enables proactive retention strategies. The main challenges addressed:
- **Highly imbalanced dataset** (roughly 90% non-churn vs 10% churn)
- **Complex feature interactions** affecting customer behavior
- **Need for interpretable yet powerful predictions**
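
To make the imbalance problem concrete: at a 10:1 ratio, a model that always predicts "no churn" is right about 91% of the time while identifying no churners at all. The following is a minimal sketch of that arithmetic using synthetic labels. The counts are assumptions chosen only to match the ratio and are not drawn from the dataset.

```python
# Synthetic labels at a 10:1 ratio: 0 = stayed, 1 = churned (illustrative counts).
labels = [0] * 1000 + [1] * 100

# A trivial baseline that always predicts the majority class.
predictions = [0] * len(labels)

# Fraction of predictions that match the true label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of actual churners that the baseline flags.
churners_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = churners_caught / sum(labels)

print(f"accuracy: {accuracy:.3f}")  # high despite learning nothing
print(f"recall:   {recall:.3f}")    # no churners identified
```

Accuracy alone can therefore look strong on such data, so metrics like recall and AUC are worth reading alongside it.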
---
## 📊 Dataset
The dataset contains banking customer information with the following features:
| Feature | Description |
|---------|-------------|
| CreditScore | Customer's credit score |
| Geography | Country (France, Germany, Spain) |
| Gender | Male/Female |
| Age | Customer age |
| Tenure | Years with the bank |
| Balance | Account balance |
| NumOfProducts | Number of bank products used |
| HasCrCard | Credit card ownership (0/1) |
| IsActiveMember | Active member status (0/1) |
| EstimatedSalary | Estimated annual salary |
| Exited | **Target**: Churn status (1 = exited) |
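
As a rough illustration of this schema, the sketch below parses CSV text with these columns into typed records and separates the `Exited` target from the features. The column types are assumptions inferred from the descriptions above, the helper name `parse_customers` is hypothetical, and the two sample rows are invented for demonstration.

```python
import csv
import io

# Assumed Python type for each column, inferred from the feature table.
COLUMN_TYPES = {
    "CreditScore": int, "Geography": str, "Gender": str, "Age": int,
    "Tenure": int, "Balance": float, "NumOfProducts": int,
    "HasCrCard": int, "IsActiveMember": int, "EstimatedSalary": float,
    "Exited": int,
}


def parse_customers(text: str) -> list[dict]:
    """Parse CSV text whose header matches the feature table."""
    reader = csv.DictReader(io.StringIO(text))
    # Convert every cell from a string to its assumed column type.
    return [
        {name: cast(row[name]) for name, cast in COLUMN_TYPES.items()}
        for row in reader
    ]


# Invented sample rows for illustration only.
sample = """CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
650,France,Female,40,3,0.0,1,1,1,50000.0,0
580,Germany,Male,52,7,120000.5,2,0,0,75000.0,1
"""

customers = parse_customers(sample)

# Separate the target column from the input features.
targets = [c["Exited"] for c in customers]
features = [{k: v for k, v in c.items() if k != "Exited"} for c in customers]
print(targets)
```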
---
## 🛠️ Methodology
### 1. Data Preprocessing
**Outlier Removal:**
```python
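# Rows matching any condition below are treated as outliers and dropped
# (sketch; assumes the data is in a pandas DataFrame named `df`).
outliers = (df["CreditScore"] <= 359) | (df["Age"] >= 71) | (df["NumOfProducts"] >= 4)
df = df[~outliers]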