https://github.com/utkarsh-284/credit-default-risk
This repository deals with credit risk evaluation and default prediction.
- Host: GitHub
- URL: https://github.com/utkarsh-284/credit-default-risk
- Owner: utkarsh-284
- Created: 2025-02-28T12:21:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-28T13:28:24.000Z (over 1 year ago)
- Last Synced: 2025-02-28T19:10:02.990Z (over 1 year ago)
- Topics: credit-card, credit-risk, default, econometrics, essemblelearning, exploratory-data-analysis, jupyter-notebook, logistic-regression, machine-learning, neural-network, python, support-vector-machines
- Language: Jupyter Notebook
- Homepage:
- Size: 2.13 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Credit Default Risk Analysis
## Overview
This project analyzes credit card default risk using machine learning techniques on a dataset of 30,000 Taiwanese credit card clients. The analysis includes comprehensive exploratory data analysis, statistical modeling, and machine learning approaches to predict credit card defaults and identify key risk factors.
## Key Findings
### 🎯 **Best Model Performance**
- **Support Vector Machine (SVM)**: 80.5% accuracy, 0.70 AUC-ROC score
- **Ensemble Methods**: Voting Classifier achieved 80.5% accuracy
- **Neural Network**: Deep learning model achieved 80.0% accuracy
### 📊 **Critical Risk Factors**
1. **Recent Payment Status (PAY_0)**: Most predictive feature (coefficient = 0.515)
2. **Payment History (PAY_2)**: Secondary predictor (coefficient = 0.111)
3. **Bill Amount (BILL_AMT1)**: Financial capacity indicator
4. **Credit Limit (LIMIT_BAL)**: Risk tolerance measure
5. **Age Demographics**: U-shaped risk curve (highest for young adults and seniors)
### 🔍 **Demographic Insights**
- **Gender**: Females take more loans but have lower default rates
- **Education**: University graduates borrow most, but all education levels show similar default proportions
- **Age**: Peak borrowing at 25-30 years, with highest default risk for young adults and seniors
- **Marital Status**: Singles borrow more but have higher default rates than married individuals
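The group-level comparisons above come down to default rates per category. A minimal sketch of that calculation, assuming a DataFrame with `AGE`, `SEX`, and `DEFAULT` columns; the toy rows and the age band edges are illustrative and are not taken from the notebook:

```python
import pandas as pd

# Toy rows for illustration only; the real dataset has 30,000 clients.
df = pd.DataFrame({
    "AGE":     [23, 27, 29, 34, 41, 47, 58, 63],
    "SEX":     [1, 2, 2, 1, 2, 1, 2, 1],   # UCI coding: 1 = male, 2 = female
    "DEFAULT": [1, 0, 0, 0, 0, 1, 0, 1],
})

# Hypothetical age bands (left-inclusive); the notebook may bin differently.
bins = [20, 30, 40, 50, 60, 80]
df["AGE_BAND"] = pd.cut(df["AGE"], bins=bins, right=False)

# The mean of a 0/1 target is the default rate within each group.
rate_by_age = df.groupby("AGE_BAND", observed=True)["DEFAULT"].mean()
rate_by_sex = df.groupby("SEX")["DEFAULT"].mean()

print(rate_by_age)
print(rate_by_sex)
```

The same `groupby(...).mean()` pattern applies to `EDUCATION` and `MARRIAGE`.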
## Dataset Details
**Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients)
### Variables:
- **Client Profile**: Credit limit (`LIMIT_BAL`) and demographics: gender (`SEX`), education (`EDUCATION`), marital status (`MARRIAGE`), age (`AGE`)
- **Payment History**: Repayment status for past 6 months (`PAY_0`, then `PAY_2` to `PAY_6`; the dataset has no `PAY_1` column)
- **Billing Statements**: Bill amounts for past 6 months (`BILL_AMT1` to `BILL_AMT6`)
- **Payment Amounts**: Amounts paid in past 6 months (`PAY_AMT1` to `PAY_AMT6`)
- **Target Variable**: `DEFAULT` (binary indicator: 1 = default, 0 = non-default)
## Methodology
### Data Preprocessing
- **Missing Values**: No missing values detected
- **Categorical Variables**:
  - EDUCATION: Categories 0, 5, 6 merged into "Others"
  - MARRIAGE: Category 0 merged into "Others"
- **Feature Engineering**:
  - Created `TOTAL_BILL_AMT` (sum of all bill amounts)
  - Created `TOTAL_PAY_AMT` (sum of all payment amounts)
- **Outlier Removal**: Applied IQR method, reducing dataset from 30,000 to 25,174 observations
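The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, "Others" is assumed to be code 4 for `EDUCATION` and code 3 for `MARRIAGE` (the UCI coding), and the columns passed to the IQR filter and the 1.5 multiplier are guesses at a conventional setup.

```python
import pandas as pd

BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Merge undocumented category codes and add the two total columns."""
    out = df.copy()
    # Assumption: "Others" is code 4 for EDUCATION and code 3 for MARRIAGE.
    out["EDUCATION"] = out["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})
    # Row-wise sums over the six monthly columns.
    out["TOTAL_BILL_AMT"] = out[BILL_COLS].sum(axis=1)
    out["TOTAL_PAY_AMT"] = out[PAY_COLS].sum(axis=1)
    return out


def drop_iqr_outliers(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values in `cols` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    within = ((df[cols] >= q1 - k * iqr) & (df[cols] <= q3 + k * iqr)).all(axis=1)
    return df[within]
```

A call such as `drop_iqr_outliers(preprocess(raw), ["LIMIT_BAL", "TOTAL_BILL_AMT"])` would apply both steps, where `raw` and the column choice are placeholders. How many rows survive depends on which columns are filtered, so this sketch is not expected to reproduce the 25,174 figure.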
### Exploratory Data Analysis (EDA)
- Comprehensive demographic analysis revealing risk patterns
- Financial behavior correlation analysis
- Age-based risk segmentation
- Payment history impact assessment
### Statistical Analysis
- **Logistic Regression**: Pseudo R-squared = 0.121 (a likelihood-based goodness-of-fit measure, not a share of variance explained)
- **Key Statistical Findings**:
  - PAY_0 highly significant (p < 0.001)
  - PAY_2 significant (p < 0.001)
  - BILL_AMT1 significant (p < 0.001)
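If the logistic regression was fit with statsmodels (an assumption; the dependency list suggests it), the pseudo R-squared it reports is McFadden's: one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. The sketch below computes it with NumPy on made-up labels and probabilities; the helper names are hypothetical.

```python
import numpy as np


def bernoulli_loglik(y, p):
    """Log-likelihood of 0/1 labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def mcfadden_r2(y, p_model):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only)."""
    y = np.asarray(y, dtype=float)
    # An intercept-only model predicts the base rate for every observation.
    p_null = np.full_like(y, y.mean())
    return 1.0 - bernoulli_loglik(y, p_model) / bernoulli_loglik(y, p_null)


# Made-up labels and predicted default probabilities, for illustration only.
y = [0, 0, 0, 1, 1, 0, 1, 0]
p = [0.1, 0.2, 0.3, 0.7, 0.6, 0.2, 0.8, 0.4]
r2 = mcfadden_r2(y, p)
print(f"McFadden pseudo R-squared: {r2:.3f}")
```

A value of 0 means the model does no better than predicting the base rate for everyone, which is why the figure should not be read as a percentage of variance explained.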
### Machine Learning Models Tested
| Model | Accuracy | AUC-ROC | Performance Rank |
|-------|----------|---------|------------------|
| **Support Vector Machine** | **80.5%** | **0.70** | **🥇 Best** |
| Voting Classifier | 80.5% | - | 🥈 Second |
| Random Forest | 80.0% | - | 🥉 Third |
| Deep Neural Network | 80.0% | - | 🥉 Third |
| Logistic Regression | 79.3% | - | 4th |
| Elastic Net | 76.3% | - | 5th |
| Lasso Regression | 76.2% | - | 6th |
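A comparison like the table above can be reproduced in outline with scikit-learn. This is a minimal sketch on synthetic data, not the notebook's code: the model settings, the 78/22 class split, and the `results` dictionary are assumptions for illustration. It also computes the majority-class baseline, which gives the accuracy figures a reference point on imbalanced data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data, with an imbalance chosen for illustration.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    # SVC needs probability=True to expose predict_proba for AUC-ROC.
    "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Soft voting averages the predicted probabilities of the base models.
models["voting"] = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()], voting="soft")

# Accuracy obtained by always predicting the more frequent class.
baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"majority-class baseline accuracy: {baseline:.3f}")

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auc_roc": roc_auc_score(y_test, proba),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"auc_roc={results[name]['auc_roc']:.3f}")
```

Computing AUC-ROC for every model in the same loop would also fill in the `-` cells of the table.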
## Business Impact
### Risk Management Applications
1. **Early Warning System**: Monitor recent payment status for immediate intervention
2. **Credit Limit Optimization**: Adjust limits based on risk scores
3. **Customer Segmentation**: Target high-risk demographics proactively
4. **Collection Strategy**: Prioritize collection efforts based on risk assessment
### Operational Benefits
- **Proactive Risk Mitigation**: Identify high-risk clients before default
- **Reduced Financial Losses**: Targeted interventions based on risk scores
- **Improved Customer Retention**: Personalized risk-based strategies
- **Optimized Credit Policies**: Data-driven lending decisions
## Model Performance Analysis
### Support Vector Machine (Best Model)
- **Accuracy**: 80.5%
- **AUC-ROC**: 0.70
- **Advantages**:
  - Excellent for binary classification
  - Handles non-linear relationships
  - Robust to outliers
- **Business Value**: Highest accuracy among the models tested for default risk identification
### Feature Importance Ranking
1. **PAY_0** (Most recent payment status) - Primary predictor
2. **PAY_2** (2nd most recent payment status) - Secondary predictor
3. **BILL_AMT1** (Most recent bill amount) - Financial capacity indicator
4. **LIMIT_BAL** (Credit limit) - Risk tolerance indicator
5. **Age-related features** - Life stage risk factors
## Technical Implementation
### Dependencies
```
Python 3.7+
pandas, numpy, matplotlib, seaborn
scikit-learn, statsmodels
tensorflow (for neural networks)
```
### Key Libraries Used
- **Data Processing**: Pandas, NumPy
- **Visualization**: Matplotlib, Seaborn
- **Statistical Analysis**: Statsmodels
- **Machine Learning**: Scikit-learn
- **Deep Learning**: TensorFlow
## Future Work
### Model Enhancements
- **Advanced Algorithms**: XGBoost, LightGBM implementation
- **Class Imbalance**: SMOTE or cost-sensitive learning
- **Feature Engineering**: Payment-to-bill ratios, temporal trends
- **Cross-validation**: Robust model validation
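Two of the enhancements above, cost-sensitive learning and cross-validation, can be combined in a few lines of scikit-learn. The sketch below is one possible setup on synthetic data, not a planned implementation: the choice of logistic regression, five folds, and AUC-ROC scoring are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.78, 0.22], random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# a simple form of cost-sensitive learning that needs no resampling.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate roughly constant across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC-ROC per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into the model.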
### Data Improvements
- **Additional Features**: Macroeconomic indicators, employment data
- **Real-time Data**: Live transactional data integration
- **Geographic Expansion**: Multi-market validation
- **Temporal Analysis**: Time series modeling
### Deployment Considerations
- **API Development**: Real-time prediction endpoints
- **Dashboard Creation**: Risk visualization interface
- **Model Monitoring**: Performance tracking and drift detection
- **Scalability**: Cloud deployment for production use
## File Structure
```
Credit Risk Analysis/
├── README.md # Project overview and documentation
├── MODEL_REPORT.md # Comprehensive model analysis report
├── Credit_Default_Prediction.ipynb # Complete Jupyter notebook analysis
└── content/
    ├── default of credit card clients.xls # Original dataset
    └── default+of+credit+card+clients.zip # Compressed dataset
```
## Quick Start
1. **Clone the repository**
2. **Install dependencies**: `pip install pandas numpy matplotlib seaborn scikit-learn statsmodels tensorflow`
3. **Open the Jupyter notebook**: `Credit_Default_Prediction.ipynb`
4. **Run the analysis**: Execute cells sequentially for complete analysis
## Results Summary
### Model Performance
- **Best Model**: Support Vector Machine (80.5% accuracy)
- **Key Predictor**: Recent payment status (PAY_0)
- **Business Impact**: Proactive risk identification and mitigation
### Key Insights
- Recent payment behavior is the strongest default predictor
- Age shows U-shaped risk pattern (highest for young adults and seniors)
- Higher credit limits correlate with lower default rates
- Gender and education show nuanced risk relationships
## Contributor
**Utkarsh Bhardwaj**
**Publish Date**: February 28, 2025
**Contact**: ubhardwaj284@gmail.com
[LinkedIn](https://www.linkedin.com/in/utkarsh284/) | [GitHub](https://github.com/utkarsh-284)
---
**Note**: This analysis provides a foundation for credit risk assessment. The SVM model (80.5% accuracy, 0.70 AUC-ROC) is a reasonable starting point for default prediction; the class-imbalance handling and cross-validation listed under Future Work should come before any real-time deployment.