https://github.com/btrotta/kaggle-plasticc

14th place solution for the Kaggle Plasticc challenge to classify objects in space.

Host: GitHub
URL: https://github.com/btrotta/kaggle-plasticc
Owner: btrotta
License: mit
Created: 2018-12-18T06:47:27.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2019-01-08T07:39:39.000Z (almost 6 years ago)
Last Synced: 2024-11-24T20:10:56.702Z (about 1 month ago)
Language: Python
Homepage:
Size: 890 KB
Stars: 24
Watchers: 2
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE

README

# kaggle-plasticc

Code for the 14th place solution in the Kaggle PLAsTiCC competition. 

See `Modelling_approach.pdf` for a detailed discussion of the modelling approach.

#### Quick-start guide to running the code

Total runtime is around 5.5 hours on a laptop with 24 GB of RAM.

- Download the code. Create a subfolder called `data` and save the competition CSV files there.

- To reproduce the results exactly, create an environment with the specific package versions I used. (If you already have numpy, pandas, scikit-learn and lightgbm you can skip this step, but the results may differ slightly if you have different versions.) If you have conda, the easiest option is to build a conda environment using this command:

     ```
     conda env create -f environment.yml
     ```

     This will create an environment called `plasticc-bt`. The `requirements.txt` file is provided as well if you want to build an environment with pip.

- Run `split_test.py` to split the test data into 100 HDF5 files. They will be saved in an automatically created subfolder `split_100` of the `data` folder. Takes around 15 minutes.

- Run `calculate_features.py` to calculate the features. This will generate 3 files in a folder called `features` (the folder is created automatically). Takes around 3.5 hours.

- Run `predict.py` to train the model and make predictions on the test set. Takes around 1.5 hours.

- Run `scale.py` to apply regularisation to the class 99 predictions and generate the final submission file. Takes a couple of minutes.
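
The four scripts have to run in the order listed, since each one reads the output of the previous step. As a convenience, they can be chained from a small driver script. The sketch below is not part of the repository: the names `PIPELINE`, `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts take no command-line arguments and resolve the `data` folder relative to the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the quick-start steps; the scripts are assumed
# to take no arguments and to be run from the repository root.
PIPELINE = ["split_test.py", "calculate_features.py", "predict.py", "scale.py"]


def build_commands(python=sys.executable):
    """Return one command per pipeline step, in the order they must run."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(repo_dir="."):
    """Run every step from repo_dir, stopping at the first failure."""
    repo_dir = Path(repo_dir)
    # The quick-start guide asks for a `data` subfolder holding the csv files.
    if not (repo_dir / "data").is_dir():
        raise FileNotFoundError("expected a 'data' subfolder containing the csv files")
    for cmd in build_commands():
        print("Running", cmd[1])
        # check=True raises CalledProcessError if a script exits with an error,
        # so later steps never run on incomplete output.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `run_pipeline()` from the repository root, inside the environment created above, runs all four steps back to back. Running them by hand remains the safer option when you want to inspect the intermediate files in `data/split_100` and `features`.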