Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sarahm44/credit-risk-predictor
Uses several machine learning models to predict credit risk.
https://github.com/sarahm44/credit-risk-predictor
balanced-random-forest cluster-centroids cluster-centroids-undersampling credit-risk easy-ensemble-classifier ensemble-learning fintech logistic-regression machine-learning naive-random-oversampler random-forest resampling smote smote-oversampler smoteenn-combination
Last synced: 3 days ago
JSON representation
Uses several machine learning models to predict credit risk.
- Host: GitHub
- URL: https://github.com/sarahm44/credit-risk-predictor
- Owner: sarahm44
- Created: 2022-05-04T02:04:10.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-08-08T12:04:37.000Z (over 2 years ago)
- Last Synced: 2025-01-26T05:18:42.326Z (18 days ago)
- Topics: balanced-random-forest, cluster-centroids, cluster-centroids-undersampling, credit-risk, easy-ensemble-classifier, ensemble-learning, fintech, logistic-regression, machine-learning, naive-random-oversampler, random-forest, resampling, smote, smote-oversampler, smoteenn-combination
- Language: Jupyter Notebook
- Homepage:
- Size: 9.54 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Credit Risk Predictor
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/credit-risk.jpg)## Table of Contents
- [Overview](#overview)
- [Resampling](#resampling)
* [Logistic Regression](#logistic-regression)
* [Oversampling](#oversampling)
* [Undersampling](#undersampling)
* [Combination Sampling](#combination-sampling)
* [Conclusions](#conclusions)
- [Ensemble Learning](#ensemble-learning)
* [Balanced Random Forest](#balanced-random-forest)
* [Easy Ensemble Classifier](#easy-ensemble-classifier)
* [Conclusions](#conclusions-1)## Overview
This task required that I build and evaluate several machine learning models to predict credit risk using data you'd typically see from peer-to-peer lending services.
Credit risk is an inherently imbalanced classification problem (the number of good loans is much larger than the number of at-risk loans), so I employed different techniques for training and evaluating models with imbalanced classes.
I used the [imbalanced-learn](https://imbalanced-learn.org/stable/) and [Scikit-learn](https://scikit-learn.org/stable/) libraries to build and evaluate models using the two following techniques, each contained in their own code file:
1. [Resampling](https://github.com/sarahm44/credit-risk-predictor/blob/main/credit_risk_resampling.ipynb)
2. [Ensemble Learning](https://github.com/sarahm44/credit-risk-predictor/blob/main/credit_risk_ensemble.ipynb)## Resampling
### Logistic Regression
I completed a simple linear regression. I then generated the balanced accuracy score, confusion matrix and classification report, as below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/logistic_regression.png)
### Oversampling
In this section, I compared two oversampling algorithms to determine which algorithm results in the best performance.These algorithms were:
* naive random oversampling algorithm
* SMOTE algorithm.I then generated the balanced accuracy score, confusion matrix and classification report for each.
See naive random oversampling algorithm results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/nro.png)
See SMOTE algorithm results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/smote.png)
### Undersampling
In this section, I tested an undersampling algorithm to determine which algorithm results in the best performance compared to the oversampling algorithms above.
I undersampled the data using the Cluster Centroids algorithm.
I then generated the balanced accuracy score, confusion matrix and classification report.
See Cluster Centroids algorithm results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/cc.png)
### Combination Sampling
In this section, I tested a combination over- and under-sampling algorithm to determine if the algorithm results in the best performance compared to the other sampling algorithms above.
I resampled the data using the SMOTEENN algorithm.
I then generated the balanced accuracy score, confusion matrix and classification report.
See the SMOTEENN algorithm results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/smoteenn.png)
### Conclusions
#### Best Balanced Accuracy Score
| Model | Balanced Accuracy Score |
| --- | --- |
| Simple logistic regression | 0.9892813049736127 |
| Naive random oversampling | 0.9946414201183431 |
| SMOTE oversampling | 0.9946680739911509 |
| Cluster Centroids undersampling | 0.9932813049736127 |
| SMOTEENN combination sampling | 0.9946680739911509 |
SMOTE oversampling and combination sampling with SMOTEENN had the best balanced accuracy scores.#### Best Recall Score
| Model | High Risk | Low Risk | Average |
| --- | --- | --- | --- |
| Simple logistic regression | 0.98 | 0.99 | 0.99 |
| Naive random oversampling | 1.00 | 0.99 | 0.99 |
| SMOTE oversampling | 1.00 | 0.99 | 0.99 |
| Cluster Centroids undersampling | 0.99 | 0.99 | 0.99 |
| SMOTEENN combination sampling | 1.00 | 0.99 | 0.99 |
Naive random oversampling, SMOTE oversampling and combination sampling with SMOTEENN have the best recall scores.#### Best Geometric Mean Score
| Model | High Risk | Low Risk | Average |
| --- | --- | --- | --- |
| Simple logistic regression | 0.99 | 0.99 | 0.99 |
| Naive random oversampling | 0.99 | 0.99 | 0.99 |
| SMOTE oversampling | 0.99 | 0.99 | 0.99 |
| Cluster Centroids undersampling | 0.99, | 0.99 | 0.99 |
| SMOTEENN combination sampling | 0.99 | 0.99 | 0.99 |
Every model had the same geometric mean score.## Ensemble Learning
In this section, I trained and compared two different ensemble classifiers to predict loan risk and evaluate each model.
These were:
* Balanced Random Forest Classifier
* Easy Ensemble ClassifierI generated the balanced accuracy score, confusion matrix and classification report for each.
### Balanced Random Forest
See the Balanced Random Forest Classifier results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/random_forest.png)
### Easy Ensemble Classifier
See the Easy Ensemble Classifier results below:
![](https://github.com/sarahm44/credit-risk-predictor/blob/main/images/easy_ensemble.png)
### Conclusions
#### Best Balanced Accuracy Score
| Model | Balanced Accuracy Score |
| --- | --- |
| Balanced Random Forest Classifier | 0.795829959187949 |
|Easy Ensemble Classifier | 0.9263912558266958 |
Easy Ensemble Classifier had the best balanced accuracy score.#### Best Recall Score
| Model | High Risk | Low Risk | Average |
| --- | --- | --- | --- |
| Balanced Random Forest Classifier | 0.71 | 0.88 | 0.88 |
| Easy Ensemble Classifier | 0.91 | 0.94 | 0.94 |
Easy Ensemble Classifier had the best recall score.#### Best Geometric Mean Score
| Model | High Risk | Low Risk | Average |
| --- | --- | --- | --- |
| Balanced Random Forest Classifier | 0.79 | 0.79 | 0.79 |
| Easy Ensemble Classifier | 0.93 | 0.93 | 0.93 |
Easy Ensemble Classifier had the best geometric mean score.#### Top three features
The top three features are:
* total_rec_prncp
* total_rec_int
* total_pymnt_in