Loan Prediction
-----------------------

Predict whether or not loans acquired by Fannie Mae will go into foreclosure. Fannie Mae acquires loans from other lenders as a way of inducing them to lend more. Fannie Mae releases data on the loans it has acquired and their performance afterwards [here](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).

Installation
----------------------

### Download the data

* Clone this repo to your computer.
* Get into the folder using `cd loan-prediction`.
* Run `mkdir data`.
* Switch into the `data` directory using `cd data`.
* Download the data files from Fannie Mae into the `data` directory.
    * You can find the data [here](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html).
    * You'll need to register with Fannie Mae to download the data.
    * It's recommended to download all the data from 2012 Q1 to the present.
* Extract all of the `.zip` files you downloaded.
    * On OSX, you can run `find ./ -name \*.zip -exec unzip {} \;`.
* At the end, you should have a set of text files named `Acquisition_YQX.txt` and `Performance_YQX.txt`, where `Y` is a year and `X` is a quarter number from `1` to `4` (see the sketch after this list for a quick way to verify them).
* Remove all the zip files by running `rm *.zip`.
* Switch back into the `loan-prediction` directory using `cd ..`.
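
If you want to double-check the extraction, a minimal sketch like the one below lists which quarters you have and whether either file of a pair is missing. It assumes the extracted files follow the `Acquisition_YYYYQN.txt` / `Performance_YYYYQN.txt` naming described above; the `check_data.py` name is hypothetical and the script is not part of this repo. Run it from the `loan-prediction` directory.

```python
# check_data.py -- a hypothetical helper, not part of this repo.
# Assumes files are named like Acquisition_2012Q1.txt / Performance_2012Q1.txt.
import os
import re

DATA_DIR = "data"
pattern = re.compile(r"^(Acquisition|Performance)_(\d{4})Q([1-4])\.txt$")

quarters = {}
for name in os.listdir(DATA_DIR):
    match = pattern.match(name)
    if match:
        kind, year, quarter = match.groups()
        quarters.setdefault((year, quarter), set()).add(kind)

for (year, quarter), kinds in sorted(quarters.items()):
    missing = {"Acquisition", "Performance"} - kinds
    status = "OK" if not missing else "missing " + ", ".join(sorted(missing))
    print(f"{year} Q{quarter}: {status}")

if not quarters:
    print("No data files found -- did you extract the zips into data/?")
```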

### Install the requirements

* Install the requirements using `pip install -r requirements.txt`.
    * Make sure you use Python 3 (a quick environment check appears after this list).
    * You may want to use a virtual environment for this.
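
If you want to confirm the environment is set up correctly, a quick check like the one below can help. It assumes the requirements include `pandas` and `scikit-learn` (the usual dependencies for this kind of processing and prediction code); check `requirements.txt` for the authoritative list.

```python
# Optional environment sanity check. Assumes pandas and scikit-learn
# are among the requirements; see requirements.txt for the actual list.
import sys

assert sys.version_info.major == 3, "This project expects Python 3."

import pandas
import sklearn

print("Python", sys.version.split()[0])
print("pandas", pandas.__version__)
print("scikit-learn", sklearn.__version__)
```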

Usage
-----------------------

* Run `mkdir processed` to create a directory for our processed datasets.
* Run `python assemble.py` to combine the `Acquisition` and `Performance` datasets.
    * This will create `Acquisition.txt` and `Performance.txt` in the `processed` folder.
* Run `python annotate.py`.
    * This will create training data from `Acquisition.txt` and `Performance.txt`.
    * It will add a file called `train.csv` to the `processed` folder.
* Run `python predict.py`.
    * This will run cross-validation across the training set and print the accuracy score (a sketch of this kind of cross-validation appears after this list).
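
For reference, the snippet below sketches the kind of cross-validation a script like `predict.py` might run. It assumes `processed/train.csv` contains a `foreclosure_status` target column (mentioned in the next section) plus numeric feature columns; the feature selection and classifier choice here are illustrative, not a copy of `predict.py`.

```python
# A rough sketch of cross-validated foreclosure prediction.
# Assumes processed/train.csv has a foreclosure_status column and numeric
# feature columns; adjust the column handling to match what annotate.py writes.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("processed/train.csv")

target = train["foreclosure_status"]
# Keep only numeric columns as features, and drop the target if it is numeric.
features = train.select_dtypes(include="number").drop(
    columns=["foreclosure_status"], errors="ignore"
)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, features, target, cv=3, scoring="accuracy")
print("Cross-validated accuracy:", scores.mean())
```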

Extending this
-------------------------

If you want to extend this work, here are a few places to start:

* Generate more features in `annotate.py`.
* Switch algorithms in `predict.py` (see the sketch after this list).
* Add in a way to make predictions on future data.
* Try seeing if you can predict whether a bank should have issued the loan.
    * Remove any columns from `train` that the bank wouldn't have known at the time of issuing the loan.
        * Some columns are known when Fannie Mae bought the loan, but not before.
    * Make predictions.
* Explore whether you can predict columns other than `foreclosure_status`.
    * Can you predict how much the property will be worth at sale time?
* Explore the nuances between performance updates.
    * Can you predict how many times the borrower will be late on payments?
    * Can you map out the typical loan lifecycle?
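
As an example of swapping algorithms, the sketch below substitutes a random forest for the linear model in the earlier cross-validation sketch. The data loading and column handling are the same illustrative assumptions as before, not the actual contents of `predict.py`.

```python
# Example of trying a different algorithm: a random forest instead of a
# linear model. Uses the same illustrative train.csv assumptions as the
# earlier cross-validation sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("processed/train.csv")
target = train["foreclosure_status"]
features = train.select_dtypes(include="number").drop(
    columns=["foreclosure_status"], errors="ignore"
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(model, features, target, cv=3, scoring="accuracy")
print("Random forest accuracy:", scores.mean())
```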