https://github.com/derak-isaack/dsf-phase-03-finale
Loan default prediction
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/derak-isaack/dsf-phase-03-finale
- Owner: derak-isaack
- Created: 2023-12-01T11:24:27.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-12-03T21:01:14.000Z (about 1 year ago)
- Last Synced: 2023-12-03T22:19:51.769Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 15 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
Credit default analysis
Loan default rates tend to rise when economic conditions are harsh. However, some people with poor credit scores still end up qualifying for more loans. Loan policies sometimes change, and these people get more leeway to pass through the system checks.
Seeing how loan recovery agents and staff of microfinance institutions struggle to track down defaulters motivated me to develop a classification model that predicts the likelihood of loan default.
The goal of this project was to develop a classification model which predicts whether a client will default or not, based on the lifestyle of the specific client.
One thing to note is that this data is already cleaned and has no missing values. However, there are categorical variables with **very high cardinality (number of unique values)**. Using the `OneHotEncoder` is impractical here because it creates one column per unique value and would lead to memory issues. I explored an encoding method known as count encoding, where each category is replaced by the number of times it occurs. This helps preserve the weight of each class, which label encoding would lose because it assigns arbitrary integers.
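As an illustration of count encoding, here is a minimal sketch in plain Python. The column values and the helper names `fit_count_encoding` and `apply_count_encoding` are made up for this example and are not taken from the project notebook. The sketch assumes that counts are learned on the training split only and that categories unseen during training are mapped to 0.

```python
from collections import Counter


def fit_count_encoding(values):
    """Learn how often each category occurs in the training data."""
    return Counter(values)


def apply_count_encoding(values, counts):
    """Replace each category by its learned count (0 if never seen)."""
    return [counts.get(value, 0) for value in values]


# Hypothetical high-cardinality column, shortened for readability.
train_city = ["Nairobi", "Mombasa", "Nairobi", "Kisumu", "Nairobi", "Mombasa"]
test_city = ["Nairobi", "Eldoret"]

counts = fit_count_encoding(train_city)
train_encoded = apply_count_encoding(train_city, counts)
test_encoded = apply_count_encoding(test_city, counts)

print(train_encoded)  # [3, 2, 3, 1, 3, 2]
print(test_encoded)   # [3, 0]
```

The column stays a single numeric feature regardless of how many unique categories it has, which is what keeps memory usage low compared with one-hot encoding.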
The modelling process uses a series of pipelines to ensure that no data is leaked to the model during feature engineering or encoding. Pipelines also automate the whole machine learning process, because the data passes through a series of transformers before finally being fit on the final model. The pipelines are built with scikit-learn, which also offers several options for mitigating the effects of class imbalance, where predictions normally lean towards the majority class.
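The pipeline idea can be sketched as follows. This is not the project's actual pipeline: the data is synthetic, `CountEncoder` is a hypothetical transformer written for this example, and `LogisticRegression` with `class_weight="balanced"` is a stand-in for the final model. The point is only that the encoder is fit on the training split inside the pipeline, so no information from the test split leaks into the counts.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class CountEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical helper: replace each category by its training-set count."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # One {category: count} mapping per column, learned from X only.
        self.counts_ = [Counter(X[:, j]) for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        out = np.zeros(X.shape, dtype=float)
        for j, mapping in enumerate(self.counts_):
            # Categories not seen during fit are encoded as 0.
            out[:, j] = [mapping.get(value, 0) for value in X[:, j]]
        return out


# Synthetic, imbalanced toy data (assumption: two categorical columns).
rng = np.random.default_rng(0)
n = 400
city = rng.choice(["A", "B", "C", "D"], size=n, p=[0.55, 0.25, 0.15, 0.05])
job = rng.choice(["x", "y", "z"], size=n)
X = np.column_stack([city, job])
y = (rng.random(n) < np.where(city == "D", 0.6, 0.1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

pipeline = Pipeline(
    steps=[
        ("encode", CountEncoder()),
        # class_weight="balanced" reweights classes inversely to their frequency.
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
```

Because the encoder lives inside the pipeline, calling `pipeline.predict` on new data reuses the counts learned during `fit` rather than recomputing them.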
The main aim is to reduce the number of false negatives, because a false negative means a defaulter is predicted as a non-defaulter and becomes qualified for new loans, which is not good. Any tuning of the final model should seek to minimize the false negative occurrences as much as possible. The final model was CatBoost, which is optimized for classification problems and handles class imbalance pretty well.
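To make the false negative objective concrete, the sketch below counts the confusion matrix cells for a set of made-up labels and predicted default probabilities at two decision thresholds. The numbers and the helper names are hypothetical and do not come from the project's results. Lowering the threshold is one possible way to trade false negatives for false positives; it is shown here only as an example of the kind of tuning described above.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn), treating label 1 as 'defaults'."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # a defaulter predicted as a non-defaulter
    return tp, fp, tn, fn


def recall(tp, fn):
    """Share of actual defaulters that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels and predicted default probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.45, 0.35, 0.6, 0.1, 0.7, 0.4]

for threshold in (0.5, 0.3):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    print(threshold, "FN:", fn, "FP:", fp, "recall:", recall(tp, fn))
# 0.5 FN: 2 FP: 1 recall: 0.5
# 0.3 FN: 0 FP: 2 recall: 1.0
```

In this toy case the lower threshold removes both false negatives at the cost of one extra false positive, which matches the priority of catching defaulters first.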