https://github.com/akwardhan/loan-default-prediction-xgboost-streamlit
Full-scale loan default prediction system using XGBoost, trained on 1.3M LendingClub loans. Includes feature-rich preprocessing, class imbalance handling, recall-focused ML pipeline, and Streamlit web deployment for real-time borrower risk scoring.
https://github.com/akwardhan/loan-default-prediction-xgboost-streamlit
credit-risk data-science google-colab loan-default-prediction machine-learning python real-world-project scikit-learn streamlit xgboost
Last synced: about 2 months ago
JSON representation
Full-scale loan default prediction system using XGBoost, trained on 1.3M LendingClub loans. Includes feature-rich preprocessing, class imbalance handling, recall-focused ML pipeline, and Streamlit web deployment for real-time borrower risk scoring.
- Host: GitHub
- URL: https://github.com/akwardhan/loan-default-prediction-xgboost-streamlit
- Owner: Akwardhan
- Created: 2025-06-14T21:02:46.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-07-06T05:25:54.000Z (4 months ago)
- Last Synced: 2025-07-27T22:48:47.553Z (3 months ago)
- Topics: credit-risk, data-science, google-colab, loan-default-prediction, machine-learning, python, real-world-project, scikit-learn, streamlit, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 821 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Loan Default Prediction – XGBoost Model + Streamlit App
[](https://www.python.org/) [](https://loan-default-prediction-xgboost-app-lgccyltgxl9suauhmcuhqp.streamlit.app/) []()Built a high-scale loan default prediction system on 1.3M+ LendingClub records. Includes feature-rich preprocessing, class imbalance handling, model benchmarking, and XGBoost-based deployment. A production-ready Streamlit web app allows real-time borrower risk assessment to support financial decision-making.
---
## 📌 Objective
To predict whether a borrower will **fully repay** or **default** on a loan, using key financial, credit, and application data. The goal is to reduce loan loss risk by detecting defaulters early and supporting data-backed approval workflows.
---
## 📦 Dataset Summary
- **Source:** LendingClub public dataset
- **Final Size:** ~1.3M records × 86 features
- **Target Classes:**
- Fully Paid → `0`
- Charged Off → `1`
- Dropped other statuses like "Current", "Late", "In Grace Period"---
## 📊 Exploratory Data Analysis (EDA)
| Feature | Key Insight |
|----------------|------------------------------------------------------------------------------|
| 💰 Loan Amount | Defaults skewed higher (median ₹12L vs ₹9.5L for paid) |
| 💳 Income | Lower-income borrowers showed higher default probability |
| 📈 DTI | DTI > 35% strongly correlated with default |
| 🔎 FICO Score | FICO < 660 had 4× higher default risk |
| 🛒 Purpose | Small business/vacation loans showed 3× default rates |
| ⏱️ Term | 60-month loans defaulted 2× more than 36-month terms |
| 📉 Interest | Higher interest rates aligned with higher default likelihood |---
## 🛠️ Data Cleaning & Preprocessing
- Removed leakage and ID-like fields (e.g., `recoveries`, `desc`, `member_id`)
- Handled nulls using domain logic (median for `dti`, mode for `term`, etc.)
- Dropped low-signal or high-missing-value columns
- One-hot encoded key categorical features: `purpose`, `home_ownership`, `term`, etc.
- Scaled numeric columns (e.g., `loan_amnt`, `int_rate`, `dti`)
- Used `scale_pos_weight` in XGBoost to handle class imbalance (default ~20%)---
## 🤖 Model Training & Comparison
Trained and benchmarked three models to maximize recall and F1-score on **default cases (class 1)**.
| Metric | Logistic Regression | Random Forest | XGBoost ✅ |
|--------|---------------------|----------------|------------|
| **Accuracy** | 80.1% | 79.7% | 79.7% |
| **Precision (Class 1)** | 0.51 | 0.46 | 0.46 |
| **Recall (Class 1)** | 0.07 ❌ | 0.09 ✅ | 0.09 ✅ |
| **F1-Score (Class 1)** | 0.12 | 0.15 ✅ | 0.15 ✅ |
| **True Positives** | 3,782 | 4,637 ✅ | 4,637 ✅ |
| **False Negatives** | 49,930 ❌ | 49,075 ✅ | 49,075 ✅ |### ✅ Final Model: `XGBoostClassifier`
Reasons:
- Best balance between recall and precision for identifying defaulters
- Fewer false negatives (TP=4,637) → better business risk control
- Tuned with `scale_pos_weight` to handle class imbalance directly
- Scalable, fast, and robust with large datasets (~1.3M records)---
## 🚀 Streamlit App – Real-Time Risk Scoring
Built a clean, lightweight Streamlit app to simulate loan applications and output default risk in real time.
### 🎯 Inputs:
- Loan Amount, Term, Annual Income
- FICO Score, DTI, Interest Rate
- Purpose, Home Ownership (via checkbox/input)### 📈 Outputs:
- Prediction: **Default** / **Fully Paid**
- Default Probability: e.g., **76.43%**## 🚀 Live Streamlit App
🔗 **Try the live app here:** [Loan Default Risk Predictor – Streamlit](https://loan-default-prediction-xgboost-app-lgccyltgxl9suauhmcuhqp.streamlit.app/)
📸 
📸 ---
---
## 🧠 Tools & Technologies
- Python: pandas, numpy, scikit-learn, XGBoost
- Streamlit: real-time web app
- Jupyter & Google Colab: experimentation
- Matplotlib, Seaborn: visualization
- Git, GitHub: version control, collaboration---
## 💼 Business Impact
This ML solution enables:
✅ Early detection of **high-risk borrowers**
✅ Safer loan approval workflows
✅ Reduced **default-related revenue loss**
✅ Interpretable dashboard for **non-technical stakeholders**---
## 🔗 Dataset Reference
- 📂 [LendingClub Loan Data – Kaggle](https://www.kaggle.com/datasets/wordsforthewise/lending-club)
---
## 📌 Author
**Anmol Kirtiwardhan**
🌐 Explore all my projects & live apps: [akwardhan.github.io](https://akwardhan.github.io)