kaggle-galaxies
===============

Winning solution for the Galaxy Challenge on Kaggle (http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

Documentation about the method and the code is available in `doc/documentation.pdf`. Information on how to generate the solution file can also be found below.

## Generating the solution

### Install the dependencies

Instructions for installing Theano and getting it to run on the GPU can be found [here](http://deeplearning.net/software/theano/install.html). It should be possible to install NumPy, SciPy, scikit-image and pandas using `pip` or `easy_install`. To install pylearn2, simply run:

```
git clone https://github.com/lisa-lab/pylearn2.git
```

and add the resulting directory to your `PYTHONPATH`.
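
To quickly check that everything is importable afterwards, you can run a short script like the following (not part of the repository, just a convenience check; note that scikit-image is imported as `skimage`):

```
# Sanity check (not part of the repository): verify that all required
# packages can be imported before trying to run the training scripts.
for name in ("theano", "pylearn2", "numpy", "scipy", "skimage", "pandas"):
    try:
        __import__(name)
        print("OK: %s" % name)
    except ImportError as exc:
        print("MISSING: %s (%s)" % (name, exc))
```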

**The optional dependencies listed in the documentation don't have to be installed to reproduce the winning solution**: the generated data files are already provided, so they don't have to be regenerated (but of course you can if you want to). If you want to install them, please refer to their respective documentation.

### Download the code

To download the code, run:

```
git clone https://github.com/benanne/kaggle-galaxies.git
```

A bunch of data files (extracted SExtractor parameters, ID files, training labels in NumPy format, ...) are also included. I decided to include these since generating them is a bit tedious and requires extra dependencies. It's about 20MB in total, so depending on your connection speed it could take a minute. Cloning the repository should also create the necessary directory structure (see `doc/documentation.pdf` for more info).

### Download the training data

Download the data files from [Kaggle](http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/data). Place and extract the files in the following locations:

* `data/raw/training_solutions_rev1.csv`
* `data/raw/images_train_rev1/*.jpg`
* `data/raw/images_test_rev1/*.jpg`

Note that the zip file with the training images is called `images_training_rev1.zip`, but its contents should go in a directory called `images_train_rev1`. This is just for naming consistency.
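
To verify the layout before continuing, a quick check along these lines can help (this helper is not part of the repository):

```
import glob
import os.path

# Check that the Kaggle files ended up in the locations the scripts expect.
# This helper is not part of the repository.
assert os.path.exists("data/raw/training_solutions_rev1.csv")
for d in ("data/raw/images_train_rev1", "data/raw/images_test_rev1"):
    n = len(glob.glob(os.path.join(d, "*.jpg")))
    assert n > 0, "no images found in %s" % d
    print("%s: %d images" % (d, n))
```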

### Create data files

**This step may be skipped.** The necessary data files have been included in the git repository. Nevertheless, if you wish to regenerate them (or make changes to how they are generated), here's how to do it (a rough sketch of the ID-extraction step follows the list):

* create `data/train_ids.npy` by running `python create_train_ids_file.py`.
* create `data/test_ids.npy` by running `python create_test_ids_file.py`.
* create `data/solutions_train.npy` by running `python convert_training_labels_to_npy.py`.
* create `data/pysex_params_extra_*.npy.gz` by running `python extract_pysex_params_extra.py`.
* create `data/pysex_params_gen2_*.npy.gz` by running `python extract_pysex_params_gen2.py`.
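
For illustration, the ID files essentially list the numeric galaxy IDs taken from the image filenames. A minimal sketch of that idea, assuming the `<GalaxyID>.jpg` filename convention (the actual `create_train_ids_file.py` may differ in details):

```
import glob
import os.path

import numpy as np

# Hypothetical sketch: collect the numeric galaxy IDs from the training
# image filenames (e.g. 100008.jpg -> 100008) and store them as a NumPy
# array. The actual create_train_ids_file.py may differ in details.
paths = glob.glob("data/raw/images_train_rev1/*.jpg")
ids = sorted(int(os.path.splitext(os.path.basename(p))[0]) for p in paths)
np.save("data/train_ids.npy", np.array(ids))
```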

### Copy data to RAM

Copy the train and test images to `/dev/shm` by running:

```
python copy_data_to_shm.py
```

If you don't want to do this, you'll need to modify the `realtime_augmentation.py` file in a few places. Please refer to the documentation for more information.
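
Conceptually, the copy step just mirrors the two image directories into shared memory, along these lines (a sketch; the actual `copy_data_to_shm.py` may differ):

```
import shutil

# Hypothetical sketch: mirror the image directories into /dev/shm so that
# the data loading pipeline reads from RAM instead of disk during training.
# The actual copy_data_to_shm.py may differ in details.
for d in ("images_train_rev1", "images_test_rev1"):
    shutil.copytree("data/raw/" + d, "/dev/shm/" + d)
```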

### Train the networks

To train the best single model, run:

```
python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py
```

On a GeForce GTX 680, this took about 67 hours to run to completion. The prediction file generated by this script, `predictions/final/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.csv.gz`, **should get you a score that's good enough to land in the #1 position (without any model averaging)**. You can similarly run the other `try_*.py` scripts to train the other models I used in the winning ensemble.

If you have more than 2GB of GPU memory, I recommend disabling Theano's garbage collector with `allow_gc=False` in your `.theanorc` file or in the `THEANO_FLAGS` environment variable, for a nice speedup. Please refer to [the Theano documentation](http://deeplearning.net/software/theano/tutorial/using_gpu.html#tips-for-improving-performance-on-gpu) for more information on how to get the most out of Theano's GPU support.
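
For example, the flag can be set from the environment before Theano is first imported (equivalent to putting `allow_gc = False` in `.theanorc`):

```
import os

# Disable Theano's garbage collector for a speedup, at the cost of some
# extra GPU memory. This must happen before theano is first imported.
flags = os.environ.get("THEANO_FLAGS", "")
os.environ["THEANO_FLAGS"] = (flags + "," if flags else "") + "allow_gc=False"

import theano
print(theano.config.allow_gc)  # should print False
```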

### Generate augmented predictions

To generate predictions which are averaged across multiple transformations of the input, run:

```
python predict_augmented_npy_maxout2048_extradense.py
```

This takes just over 4 hours on a GeForce GTX 680, and will create two files `predictions/final/augmented/valid/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz` and `predictions/final/augmented/test/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz`. You can similarly run the corresponding `predict_augmented_npy_*.py` files for the other models you trained.
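
Conceptually, the augmented prediction step averages the model's output over several transformed copies of each input. A toy sketch of that idea (the `predict_fn` and `transforms` here are placeholders, not the repository's actual code):

```
import numpy as np

# Hypothetical sketch of test-time augmentation: average predictions over
# transformed copies of an input batch. The real predict_augmented_npy_*.py
# scripts use the same kinds of rotations/flips as during training.
def predict_augmented(predict_fn, transforms, batch):
    preds = [predict_fn(t(batch)) for t in transforms]
    return np.mean(preds, axis=0)

# Toy usage: a dummy "model" averaged over flips of a random batch.
batch = np.random.rand(4, 69, 69, 3)
transforms = [lambda x: x, lambda x: x[:, :, ::-1], lambda x: x[:, ::-1]]
print(predict_augmented(lambda x: x.mean(axis=(1, 2, 3)), transforms, batch))
```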

### Blend augmented predictions

To generate blended prediction files from all the models for which you generated augmented predictions, run:

```
python ensemble_predictions_npy.py
```

The script checks which files are present in `predictions/final/augmented/test/` and uses this to determine the models for which predictions are available. It will create three files (a sketch of the uniform blend follows the list):

* `predictions/final/blended/blended_predictions_uniform.npy.gz`: uniform blend.
* `predictions/final/blended/blended_predictions.npy.gz`: weighted linear blend.
* `predictions/final/blended/blended_predictions_separate.npy.gz`: weighted linear blend, with separate weights for each question.
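
For illustration, the uniform blend is just the mean of all available augmented test predictions, roughly like this (a sketch; the weighted blends computed by `ensemble_predictions_npy.py` are more involved and not shown):

```
import glob
import gzip

import numpy as np

# Hypothetical sketch of the uniform blend: average the augmented test
# predictions of every available model. The weighted blends computed by
# ensemble_predictions_npy.py are not shown here.
paths = glob.glob("predictions/final/augmented/test/*.npy.gz")
preds = [np.load(gzip.open(p)) for p in paths]
blend = np.mean(preds, axis=0)

with gzip.open("predictions/final/blended/blended_predictions_uniform.npy.gz", "wb") as f:
    np.save(f, blend)
```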

### Convert prediction file to CSV

Finally, in order to prepare the predictions for submission, the prediction file needs to be converted from `.npy.gz` format to `.csv.gz`. Run the following to do so (or similarly for any other prediction file in `.npy.gz` format):

```
python create_submission_from_npy.py predictions/final/blended/blended_predictions_uniform.npy.gz
```

### Submit predictions

Submit the file `predictions/final/blended/blended_predictions_uniform.csv.gz` on [Kaggle](http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/submit) to get it scored. Note that the process of generating this file involves considerable randomness: the weights of the networks are initialised randomly, the training data for each chunk is randomly selected, ... so I cannot guarantee that you will achieve the same score as I did. I did not use fixed random seeds. This might not have made much of a difference though, since different GPUs and CUDA toolkit versions will also introduce different rounding errors.