https://github.com/utkarsh-284/credit-default-risk

This repository covers credit risk evaluation and the prediction of default.

credit-card credit-risk default econometrics essemblelearning exploratory-data-analysis jupyter-notebook logistic-regression machine-learning neural-network python support-vector-machines


# Credit Default Risk Analysis

## Overview
This project analyzes credit card default risk using machine learning techniques on a dataset of 30,000 Taiwanese credit card clients. The analysis includes comprehensive exploratory data analysis, statistical modeling, and machine learning approaches to predict credit card defaults and identify key risk factors.

## Key Findings

### 🎯 **Best Model Performance**
- **Support Vector Machine (SVM)**: 80.5% accuracy, 0.70 AUC-ROC score
- **Ensemble Methods**: Voting Classifier achieved 80.5% accuracy
- **Neural Network**: Deep learning model achieved 80.0% accuracy

### 📊 **Critical Risk Factors**
1. **Recent Payment Status (PAY_0)**: Most predictive feature (coefficient = 0.515)
2. **Payment History (PAY_2)**: Secondary predictor (coefficient = 0.111)
3. **Bill Amount (BILL_AMT1)**: Financial capacity indicator
4. **Credit Limit (LIMIT_BAL)**: Risk tolerance measure
5. **Age Demographics**: U-shaped risk curve (highest for young adults and seniors)
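
One way to read the two coefficients above is as odds ratios. The sketch below assumes they are logit (log-odds) coefficients from the logistic regression described later in this README; the README does not say whether the features were scaled before fitting, so the "one unit" is whatever scale the model saw.

```python
import math

# Coefficients quoted in this README.
# Assumption: they are log-odds coefficients from a logistic regression.
coefficients = {"PAY_0": 0.515, "PAY_2": 0.111}


def odds_ratio(beta: float) -> float:
    """Multiplicative change in the odds of default per one-unit increase."""
    return math.exp(beta)


for name, beta in coefficients.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.2f}")
# PAY_0: odds ratio = 1.67
# PAY_2: odds ratio = 1.12
```

Under that assumption, a one-unit increase in `PAY_0` multiplies the odds of default by about 1.67, holding the other predictors fixed.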

### 🔍 **Demographic Insights**
- **Gender**: Females take more loans but have lower default rates
- **Education**: University graduates borrow most, but all education levels show similar default proportions
- **Age**: Peak borrowing at 25-30 years, with highest default risk for young adults and seniors
- **Marital Status**: Singles borrow more but have higher default rates than married individuals

## Dataset Details

**Source**: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients)

### Variables:
- **Client Profile**: Credit limit (`LIMIT_BAL`), gender (`SEX`), education (`EDUCATION`), marital status (`MARRIAGE`), age (`AGE`)
- **Payment History**: Repayment status for past 6 months (`PAY_0`, then `PAY_2` to `PAY_6`; the dataset has no `PAY_1` column)
- **Billing Statements**: Bill amounts for past 6 months (`BILL_AMT1` to `BILL_AMT6`)
- **Payment Amounts**: Amounts paid in past 6 months (`PAY_AMT1` to `PAY_AMT6`)
- **Target Variable**: DEFAULT (binary indicator: 1 = default, 0 = non-default)
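
The column groups above can be written down as a small schema check. This is a minimal sketch, not code from the notebook: the helper `missing_columns` is hypothetical, and `DEFAULT` is the target name used in this README (the raw UCI file labels that column differently, so a rename is assumed).

```python
# Feature columns as named in the UCI file.
# The repayment-status columns skip PAY_1: PAY_0 is followed by PAY_2.
CLIENT_COLS = ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
PAY_STATUS_COLS = ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]
BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_AMT_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]
TARGET_COL = "DEFAULT"  # assumed to be renamed from the raw column label

FEATURE_COLS = CLIENT_COLS + PAY_STATUS_COLS + BILL_COLS + PAY_AMT_COLS


def missing_columns(columns):
    """Return the expected columns that are absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [c for c in FEATURE_COLS + [TARGET_COL] if c not in present]


print(len(FEATURE_COLS))  # 23
```

Calling `missing_columns(df.columns)` right after loading the data is a cheap way to catch a renamed or dropped column before any modeling starts.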

## Methodology

### Data Preprocessing
- **Missing Values**: No missing values detected
- **Categorical Variables**:
  - EDUCATION: Categories 0, 5, 6 merged into "Others"
  - MARRIAGE: Category 0 merged into "Others"
- **Feature Engineering**:
  - Created `TOTAL_BILL_AMT` (sum of all bill amounts)
  - Created `TOTAL_PAY_AMT` (sum of all payment amounts)
- **Outlier Removal**: Applied IQR method, reducing dataset from 30,000 to 25,174 observations
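
The preprocessing steps above could look roughly like the following pandas sketch. Several details are assumptions rather than facts from the notebook: the function name `preprocess`, the use of code 4 for EDUCATION "Others" and code 3 for MARRIAGE "Others", the 1.5 × IQR fence, and the choice of columns passed as `iqr_cols`. It is not expected to reproduce the 25,174-row result exactly.

```python
import pandas as pd

BILL_COLS = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_AMT_COLS = [f"PAY_AMT{i}" for i in range(1, 7)]


def preprocess(df: pd.DataFrame, iqr_cols, k: float = 1.5) -> pd.DataFrame:
    """Merge rare category codes, add totals, and drop IQR outliers (illustrative)."""
    out = df.copy()

    # Fold the merged codes into the "Others" category.
    # Assumption: "Others" is code 4 for EDUCATION and code 3 for MARRIAGE.
    out["EDUCATION"] = out["EDUCATION"].replace({0: 4, 5: 4, 6: 4})
    out["MARRIAGE"] = out["MARRIAGE"].replace({0: 3})

    # Engineered totals across the six monthly columns.
    out["TOTAL_BILL_AMT"] = out[BILL_COLS].sum(axis=1)
    out["TOTAL_PAY_AMT"] = out[PAY_AMT_COLS].sum(axis=1)

    # Keep rows that fall inside [Q1 - k*IQR, Q3 + k*IQR] for every chosen column.
    q1 = out[iqr_cols].quantile(0.25)
    q3 = out[iqr_cols].quantile(0.75)
    iqr = q3 - q1
    inside = (out[iqr_cols] >= q1 - k * iqr) & (out[iqr_cols] <= q3 + k * iqr)
    return out[inside.all(axis=1)]
```

Which columns go into `iqr_cols` matters a great deal: filtering on many skewed monetary columns at once removes far more rows than filtering on one.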

### Exploratory Data Analysis (EDA)
- Comprehensive demographic analysis revealing risk patterns
- Financial behavior correlation analysis
- Age-based risk segmentation
- Payment history impact assessment

### Statistical Analysis
- **Logistic Regression**: Pseudo R-squared = 0.121. This is a likelihood-based measure of fit; unlike the R-squared of a linear regression, it is not a share of variance explained.
- **Key Statistical Findings**:
  - PAY_0 highly significant (p < 0.001)
  - PAY_2 significant (p < 0.001)
  - BILL_AMT1 significant (p < 0.001)
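
For readers unfamiliar with pseudo R-squared, here is a minimal sketch of how McFadden's version is computed. It assumes the figure reported above is McFadden's, which is the one statsmodels reports for a `Logit` fit; the function names are illustrative.

```python
import math


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def mcfadden_r2(y_true, p_pred):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_pred)
    return 1.0 - ll_model / ll_null
```

A value of 0 means the model is no better than always predicting the base rate, and values approach 1 only when the predicted probabilities are nearly perfect.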

### Machine Learning Models Tested

| Model | Accuracy | AUC-ROC | Performance Rank |
|-------|----------|---------|------------------|
| **Support Vector Machine** | **80.5%** | **0.70** | **🥇 Best** |
| Voting Classifier | 80.5% | - | 🥈 Second |
| Random Forest | 80.0% | - | 🥉 Third |
| Deep Neural Network | 80.0% | - | 🥉 Third |
| Logistic Regression | 79.3% | - | 4th |
| Elastic Net | 76.3% | - | 5th |
| Lasso Regression | 76.2% | - | 6th |
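
Accuracy and AUC-ROC measure different things, which is worth keeping in mind when reading the table. The sketch below uses toy labels (not project data) and plain Python; for real work, scikit-learn's `accuracy_score` and `roc_auc_score` are the usual choice.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one defaulter among five clients.
labels = [0, 0, 0, 0, 1]
print(accuracy(labels, [0, 0, 0, 0, 0]))           # 0.8
print(auc_roc(labels, [0.1, 0.1, 0.1, 0.1, 0.1]))  # 0.5
```

In the toy example, a model that never predicts default still scores 80% accuracy, while its AUC of 0.5 shows it cannot rank clients by risk. That is why reporting AUC-ROC for every model, not only the SVM, would make the comparison more informative.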

## Business Impact

### Risk Management Applications
1. **Early Warning System**: Monitor recent payment status for immediate intervention
2. **Credit Limit Optimization**: Adjust limits based on risk scores
3. **Customer Segmentation**: Target high-risk demographics proactively
4. **Collection Strategy**: Prioritize collection efforts based on risk assessment
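
As an illustration of how risk scores might drive the applications above, the sketch below maps a predicted default probability to a tier and orders clients for follow-up. The function names and the 0.2 and 0.5 cut-offs are hypothetical placeholders, not values from this project; real thresholds would come from cost and capacity considerations.

```python
def risk_tier(p_default: float, watch: float = 0.2, high: float = 0.5) -> str:
    """Map a default probability to a tier (cut-offs are illustrative only)."""
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability in [0, 1]")
    if p_default >= high:
        return "high"
    if p_default >= watch:
        return "watch"
    return "low"


def prioritize(clients):
    """Order (client_id, p_default) pairs with the highest risk first."""
    return sorted(clients, key=lambda c: c[1], reverse=True)


queue = prioritize([("A", 0.10), ("B", 0.70), ("C", 0.35)])
print([(cid, risk_tier(p)) for cid, p in queue])
# [('B', 'high'), ('C', 'watch'), ('A', 'low')]
```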

### Operational Benefits
- **Proactive Risk Mitigation**: Identify high-risk clients before default
- **Reduced Financial Losses**: Targeted interventions based on risk scores
- **Improved Customer Retention**: Personalized risk-based strategies
- **Optimized Credit Policies**: Data-driven lending decisions

## Model Performance Analysis

### Support Vector Machine (Best Model)
- **Accuracy**: 80.5%
- **AUC-ROC**: 0.70
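- **Advantages**:
  - Well suited to binary classification
  - Handles non-linear relationships through kernel functions
  - Soft margin limits the influence of individual outliers
- **Business Value**: Highest accuracy among the models tested (tied with the Voting Classifier) for default risk identification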
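
A minimal scikit-learn sketch of an SVM classifier evaluated on accuracy and AUC-ROC follows. It runs on synthetic data from `make_classification`, not the credit card dataset, and the kernel, `C`, `gamma`, class balance, and split size are assumptions; the README does not state the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real data (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardize inside the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
# decision_function gives a continuous score, which is what AUC-ROC needs.
auc = roc_auc_score(y_test, model.decision_function(X_test))
print(f"accuracy={acc:.3f}  auc_roc={auc:.3f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which matters once cross-validation is added.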

### Feature Importance Ranking
1. **PAY_0** (Most recent payment status) - Primary predictor
2. **PAY_2** (2nd most recent payment status) - Secondary predictor
3. **BILL_AMT1** (Most recent bill amount) - Financial capacity indicator
4. **LIMIT_BAL** (Credit limit) - Risk tolerance indicator
5. **Age-related features** - Life stage risk factors

## Technical Implementation

### Dependencies
```
Python 3.7+
pandas, numpy, matplotlib, seaborn
scikit-learn, statsmodels
tensorflow (for neural networks)
```

### Key Libraries Used
- **Data Processing**: Pandas, NumPy
- **Visualization**: Matplotlib, Seaborn
- **Statistical Analysis**: Statsmodels
- **Machine Learning**: Scikit-learn
- **Deep Learning**: TensorFlow

## Future Work

### Model Enhancements
- **Advanced Algorithms**: XGBoost, LightGBM implementation
- **Class Imbalance**: SMOTE or cost-sensitive learning
- **Feature Engineering**: Payment-to-bill ratios, temporal trends
- **Cross-validation**: Robust model validation
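
Two of the items above, cost-sensitive learning and cross-validation, can be combined in a few lines of scikit-learn. This is a sketch on synthetic data under assumed settings (5 stratified folds, `class_weight="balanced"`, AUC-ROC scoring), not a description of planned project code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency,
# a simple form of cost-sensitive learning.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the default rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"AUC-ROC per fold: {scores.round(3)}  mean={scores.mean():.3f}")
```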

### Data Improvements
- **Additional Features**: Macroeconomic indicators, employment data
- **Real-time Data**: Live transactional data integration
- **Geographic Expansion**: Multi-market validation
- **Temporal Analysis**: Time series modeling

### Deployment Considerations
- **API Development**: Real-time prediction endpoints
- **Dashboard Creation**: Risk visualization interface
- **Model Monitoring**: Performance tracking and drift detection
- **Scalability**: Cloud deployment for production use
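
For the drift-detection item, one commonly used statistic in credit scoring is the Population Stability Index (PSI), which compares the distribution of a score or feature between a reference sample and a recent sample. The sketch below is one possible implementation with equal-width bins; the bin count and the `eps` floor are assumptions, and quantile bins are a frequent alternative.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` reference."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]
shifted = [min(x + 0.3, 0.99) for x in reference]
print(psi(reference, reference))  # 0.0
print(psi(reference, shifted) > 0)  # True
```

PSI is zero when the two binned distributions match and grows as they diverge, so it can be tracked over time per feature or on the model's output scores.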

## File Structure
```
Credit Risk Analysis/
├── README.md                                  # Project overview and documentation
├── MODEL_REPORT.md                            # Comprehensive model analysis report
├── Credit_Default_Prediction.ipynb            # Complete Jupyter notebook analysis
└── content/
    ├── default of credit card clients.xls     # Original dataset
    └── default+of+credit+card+clients.zip     # Compressed dataset
```

## Quick Start

1. **Clone the repository**: `git clone https://github.com/utkarsh-284/credit-default-risk.git`
2. **Install dependencies**: `pip install pandas numpy matplotlib seaborn scikit-learn statsmodels tensorflow`
3. **Open the Jupyter notebook**: `Credit_Default_Prediction.ipynb`
4. **Run the analysis**: Execute cells sequentially for complete analysis
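
Before running the notebook, it can help to confirm that the packages from step 2 are importable. The helper below is a hypothetical convenience, not part of the repository. One further assumption: if the notebook loads the `.xls` file with pandas, an Excel reader such as `xlrd` is typically needed as well, and it is not in the install list above.

```python
from importlib.util import find_spec

# pip package name -> import name (they differ for scikit-learn).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "tensorflow": "tensorflow",
}


def missing_packages(required=REQUIRED):
    """Return pip names of packages whose import name cannot be found."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All dependencies found.")
```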

## Results Summary

### Model Performance
- **Best Model**: Support Vector Machine (80.5% accuracy)
- **Key Predictor**: Recent payment status (PAY_0)
- **Business Impact**: Proactive risk identification and mitigation

### Key Insights
- Recent payment behavior is the strongest default predictor
- Age shows U-shaped risk pattern (highest for young adults and seniors)
- Higher credit limits correlate with lower default rates
- Gender and education show nuanced risk relationships

## Contributor
**Utkarsh Bhardwaj**
**Publish Date**: February 28, 2025
**Contact**: ubhardwaj284@gmail.com

[LinkedIn](https://www.linkedin.com/in/utkarsh284/) | [GitHub](https://github.com/utkarsh-284)

---

**Note**: This analysis provides a solid foundation for credit risk assessment. The SVM model (80.5% accuracy, 0.70 AUC-ROC) is the strongest candidate among the models tested for real-time default prediction and risk management applications.