https://github.com/bits-bytes-nn/loan-repayment-prediction
Automated approach from feature engineering to modeling on the Kaggle Home Credit Default Risk competition dataset
autogluon featuretools mxnet
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/bits-bytes-nn/loan-repayment-prediction
- Owner: bits-bytes-nn
- Created: 2021-02-23T04:59:02.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2021-03-03T03:01:46.000Z (about 5 years ago)
- Last Synced: 2025-01-19T12:28:20.774Z (over 1 year ago)
- Topics: autogluon, featuretools, mxnet
- Language: Jupyter Notebook
- Homepage:
- Size: 161 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Automated Feature Engineering with *FeatureTools*
* The goal is to automate feature engineering with [FeatureTools](https://www.featuretools.com).
* The dataset from the Kaggle competition [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk/overview/description) was downloaded for testing. It consists of 4 tables, and the relationship diagram is as follows. *FeatureTools* made it possible to create a variety of derived variables automatically.
* I drew on [Will Koehrsen's post](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219) and [his code on Kaggle](https://www.kaggle.com/willkoehrsen/feature-engineering-using-feature-tools).

* The final dataset has 2,221 features for 356,255 customers. Saved as a CSV file, it is about 4GB.
* The whole process took three and a half hours on my iMac with 6 cores and 16GB of memory.
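
To make concrete what "derived variables" means here, the following is a minimal pure-Python sketch of the aggregation step that automated feature engineering performs across a parent table and a child table. It does not use the *FeatureTools* API; the table names, columns, values, and the `aggregate_children` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: one parent row per customer, many child rows per customer.
customers = [
    {"customer_id": 1, "income": 50000.0},
    {"customer_id": 2, "income": 32000.0},
    {"customer_id": 3, "income": 41000.0},
]
previous_loans = [
    {"customer_id": 1, "amount": 1000.0},
    {"customer_id": 1, "amount": 3000.0},
    {"customer_id": 2, "amount": 500.0},
]

# Aggregation functions applied to each selected child column.
AGGREGATIONS = {"count": len, "sum": sum, "mean": mean, "max": max, "min": min}


def aggregate_children(parents, children, key, child_name, columns):
    """Return copies of the parent rows with one new feature per (aggregation, column)."""
    # Group the child rows by the key they share with the parent table.
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row)

    result = []
    for parent in parents:
        features = dict(parent)
        rows = grouped.get(parent[key], [])
        for column in columns:
            values = [r[column] for r in rows]
            for agg_name, agg_func in AGGREGATIONS.items():
                feature_name = f"{agg_name.upper()}({child_name}.{column})"
                if values:
                    features[feature_name] = agg_func(values)
                else:
                    # Parents without child rows get a count of 0 and None elsewhere.
                    features[feature_name] = 0 if agg_name == "count" else None
        result.append(features)
    return result


feature_matrix = aggregate_children(
    customers, previous_loans, "customer_id", "previous_loans", ["amount"]
)
for row in feature_matrix:
    print(row)
```

Each child column yields five new parent-level features in this sketch, so applying the same idea to many columns across several related tables is one way a feature count can grow into the thousands.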
# Automated Modeling with *AutoGluon*
* The goal is to automate modeling with [AutoGluon](https://auto.gluon.ai/stable/index.html#).
* *AutoGluon* makes it easy to automatically experiment with a variety of algorithms, from tree ensembles to deep learning and even model stacking.
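* The top rows of the resulting leaderboard are shown in the following table. `pred_time_val` and `fit_time` are in seconds.

|model|score_val|pred_time_val|fit_time|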
|:------:|------:|------:|------:|
|weighted_ensemble_k0_l2|0.787430|3098.737486|95757.520068|
|weighted_ensemble_k0_l1|0.786499|601.329862|46704.633752|
|CatboostClassifier_STACKER_l1|0.786261|2511.553999|53790.340020|
|LightGBMClassifierXT_STACKER_l1|0.785994|2511.152501|53834.121477|
|LightGBMClassifier_STACKER_l1|0.785990|2511.691034|53782.292310|
|LightGBMClassifierCustom_STACKER_l1|0.785596|2510.629085|54090.092252|
|LightGBMClassifierCustom_STACKER_l0|0.782958|10.546562|1941.627757|
|CatboostClassifier_STACKER_l0|0.782336|7.888541|1890.238214|
|LightGBMClassifierXT_STACKER_l0|0.780601|11.507542|860.345474|
|LightGBMClassifier_STACKER_l0|0.780356|10.297791|824.519218|
| ... | | | |

* The stacked ensemble achieved the highest predictive performance: an AUROC of 0.78149 on the Kaggle public leaderboard and 0.78391 on the private leaderboard.
* Training took about 1 day and 6 hours on an AWS `m4.16xlarge` EC2 instance with 64 cores and 256GB of memory, and inference took about an hour and a half.
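
As a rough illustration of the two ideas behind these numbers, the sketch below computes AUROC from its pairwise definition and blends two models' predicted probabilities with a plain weighted average. This is not *AutoGluon*'s implementation: how *AutoGluon* fits its ensemble weights is not reproduced here, and the labels, scores, equal weights, and the `auroc` and `weighted_average` helpers are all hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def weighted_average(predictions, weights):
    """Blend per-model probability lists into one list using normalized weights."""
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Hypothetical labels (1 = default) and predicted probabilities from two models.
labels = [0, 0, 1, 1, 0, 1]
model_a = [0.1, 0.6, 0.7, 0.4, 0.2, 0.9]
model_b = [0.3, 0.2, 0.4, 0.8, 0.5, 0.6]

# Equal weights are an assumption made for this toy example.
blend = weighted_average([model_a, model_b], [0.5, 0.5])

print("model_a:", auroc(labels, model_a))
print("model_b:", auroc(labels, model_b))
print("blend:  ", auroc(labels, blend))
```

On this toy data each model misranks one positive-negative pair out of nine (AUROC 8/9), while the blend ranks every pair correctly (AUROC 1.0), because the two models make different mistakes. The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples; a rank-based computation would be needed at the scale of this dataset.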