Winning solution for the Galaxy Challenge on Kaggle

kaggle-galaxies

Winning solution for the Galaxy Challenge on Kaggle (http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

Documentation about the method and the code is available in doc/documentation.pdf. Information on how to generate the solution file can also be found below.

Generating the solution

Install the dependencies

Instructions for installing Theano and getting it to run on the GPU can be found in the Theano documentation. It should be possible to install NumPy, SciPy, scikit-image and pandas using pip or easy_install. To install pylearn2, simply run:

git clone git://github.com/lisa-lab/pylearn2.git

and add the resulting directory to your PYTHONPATH.

The optional dependencies listed in the documentation don't have to be installed to reproduce the winning solution: the generated data files are already provided, so they don't have to be regenerated (but of course you can if you want to). If you want to install them, please refer to their respective documentation.

Download the code

To download the code, run:

git clone git://github.com/benanne/kaggle-galaxies.git

A bunch of data files (extracted SExtractor parameters, ID files, training labels in NumPy format, ...) are also included. I decided to include these since generating them is a bit tedious and requires extra dependencies. It's about 20MB in total, so depending on your connection speed it could take a minute. Cloning the repository should also create the necessary directory structure (see doc/documentation.pdf for more info).

Download the training data

Download the data files from Kaggle. Place and extract the files in the following locations:

  • data/raw/training_solutions_rev1.csv
  • data/raw/images_train_rev1/*.jpg
  • data/raw/images_test_rev1/*.jpg

Note that the zip file with the training images is called images_training_rev1.zip, but they should go in a directory called images_train_rev1. This is just for consistency.
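To check that the files ended up in the right place before moving on, something like the following could be used. It is only a sketch and not part of the repository: the name find_layout_problems is made up for illustration, and it assumes Python 3 and that it is run from the root of the cloned repository.

```python
from pathlib import Path

# Expected locations, taken from the list above.
EXPECTED_FILE = "data/raw/training_solutions_rev1.csv"
EXPECTED_IMAGE_DIRS = ["data/raw/images_train_rev1", "data/raw/images_test_rev1"]


def find_layout_problems(root="."):
    """Return a list of human-readable problems; empty if the layout looks right.

    Hypothetical helper, not part of the kaggle-galaxies code.
    """
    root = Path(root)
    problems = []
    if not (root / EXPECTED_FILE).is_file():
        problems.append("missing file: " + EXPECTED_FILE)
    for image_dir in EXPECTED_IMAGE_DIRS:
        path = root / image_dir
        if not path.is_dir():
            problems.append("missing directory: " + image_dir)
        elif not any(path.glob("*.jpg")):
            problems.append("no .jpg files in: " + image_dir)
    return problems


if __name__ == "__main__":
    for problem in find_layout_problems():
        print(problem)
```

An empty result means the three locations exist and both image directories contain at least one .jpg file. It does not check that all images were extracted.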

Create data files

This step may be skipped. The necessary data files have been included in the git repository. Nevertheless, if you wish to regenerate them (or make changes to how they are generated), here's how to do it.

  • create data/train_ids.npy by running python create_train_ids_file.py.
  • create data/test_ids.npy by running python create_test_ids_file.py.
  • create data/solutions_train.npy by running python convert_training_labels_to_npy.py.
  • create data/pysex_params_extra_*.npy.gz by running python extract_pysex_params_extra.py.
  • create data/pysex_params_gen2_*.npy.gz by running python extract_pysex_params_gen2.py.
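If you regenerate these files, it can be useful to load them back and look at their shapes. The sketch below shows one way to read both plain .npy files and gzip-compressed .npy.gz files with NumPy. The helper name load_array is an assumption made for this example (it is not claimed to match anything in the repository), and the code assumes Python 3.

```python
import gzip

import numpy as np


def load_array(path):
    """Load a NumPy array from a .npy file or a gzip-compressed .npy.gz file.

    Hypothetical helper for illustration only.
    """
    if path.endswith(".gz"):
        # np.load accepts an open file object, so the gzip stream can be
        # passed to it directly.
        with gzip.open(path, "rb") as f:
            return np.load(f)
    return np.load(path)


# Example usage (paths as listed above; uncomment once the files exist):
# train_ids = load_array("data/train_ids.npy")
# print(train_ids.shape, train_ids.dtype)
```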

Copy data to RAM

Copy the train and test images to /dev/shm by running:

python copy_data_to_shm.py

If you don't want to do this, you'll need to modify the realtime_augmentation.py file in a few places. Please refer to the documentation for more information.
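For readers who want a rough idea of what this step amounts to, the sketch below copies a list of image directories into a RAM-backed location such as /dev/shm. It is an illustration written for this document, not the contents of copy_data_to_shm.py: the function name, its arguments and the resulting directory names under /dev/shm are all assumptions, and it needs Python 3.8 or later for dirs_exist_ok.

```python
import shutil
from pathlib import Path


def copy_dirs_to_ram(src_dirs, dest_root="/dev/shm"):
    """Copy each directory in src_dirs to dest_root/<directory name>.

    Hypothetical helper; the real script may lay the files out differently.
    """
    dest_root = Path(dest_root)
    copied = []
    for src in src_dirs:
        src = Path(src)
        dest = dest_root / src.name
        # dirs_exist_ok lets the copy be re-run without failing (Python 3.8+).
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied


# Example usage (assumed paths, matching the raw data layout described earlier):
# copy_dirs_to_ram(["data/raw/images_train_rev1", "data/raw/images_test_rev1"])
```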

Train the networks

To train the best single model, run:

python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

On a GeForce GTX 680, this took about 67 hours to run to completion. The prediction file generated by this script, predictions/final/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.csv.gz, should get you a score that's good enough to land in the #1 position (without any model averaging). You can similarly run the other try_*.py scripts to train the other models I used in the winning ensemble.

If you have more than 2GB of GPU memory, I recommend disabling Theano's garbage collector with allow_gc=False in your .theanorc file or in the THEANO_FLAGS environment variable, for a nice speedup. Please refer to the Theano documentation for more information on how to get the most out of Theano's GPU support.
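As an illustration of the THEANO_FLAGS route, the sketch below adds allow_gc=False to the variable from Python while keeping any flags that are already set. The helper add_theano_flag is hypothetical and written for this example. Theano reads THEANO_FLAGS when it is first imported, so a call like this only has an effect if it runs before import theano.

```python
import os


def add_theano_flag(flag, environ=None):
    """Set a 'name=value' flag in THEANO_FLAGS, replacing an earlier value
    for the same name and keeping all other flags.

    Hypothetical helper for illustration only.
    """
    if environ is None:
        environ = os.environ
    name = flag.split("=", 1)[0]
    existing = [f for f in environ.get("THEANO_FLAGS", "").split(",") if f]
    kept = [f for f in existing if f.split("=", 1)[0] != name]
    kept.append(flag)
    environ["THEANO_FLAGS"] = ",".join(kept)
    return environ["THEANO_FLAGS"]


# Example usage, at the very top of a training script:
# add_theano_flag("allow_gc=False")
# import theano
```

Setting the variable in the shell before launching the script, or putting allow_gc = False in the [global] section of .theanorc, achieves the same thing without touching the code.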

Generate augmented predictions

To generate predictions which are averaged across multiple transformations of the input, run:

python predict_augmented_npy_maxout2048_extradense.py

This takes just over 4 hours on a GeForce GTX 680, and will create two files predictions/final/augmented/valid/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz and predictions/final/augmented/test/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz. You can similarly run the corresponding predict_augmented_npy_*.py files for the other models you trained.
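The idea of averaging predictions across transformations of the input can be sketched as follows. This is a simplified illustration, not the code of predict_augmented_npy_maxout2048_extradense.py: it assumes a batch of shape (N, H, W, C), uses only 90-degree rotations and horizontal flips as the set of transformations, and takes predict_fn as a stand-in for whatever function maps a batch of images to an (N, outputs) array.

```python
import numpy as np


def augmented_views(images):
    """Yield 8 views of a batch shaped (N, H, W, C): the four rotations by
    multiples of 90 degrees, each with and without a horizontal flip."""
    for k in range(4):
        rotated = np.rot90(images, k=k, axes=(1, 2))
        yield rotated
        yield rotated[:, :, ::-1]  # flip along the width axis


def predict_augmented(predict_fn, images):
    """Average the outputs of predict_fn over all augmented views."""
    predictions = [predict_fn(view) for view in augmented_views(images)]
    return np.mean(predictions, axis=0)
```

With a rotation- and flip-invariant predict_fn the average equals a single prediction. The averaging only helps when the model's output changes slightly from view to view.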

Blend augmented predictions

To generate blended prediction files from all the models for which you generated augmented predictions, run:

python ensemble_predictions_npy.py

The script checks which files are present in predictions/final/augmented/test/ and uses this to determine the models for which predictions are available. It will create three files:

  • predictions/final/blended/blended_predictions_uniform.npy.gz: uniform blend.
  • predictions/final/blended/blended_predictions.npy.gz: weighted linear blend.
  • predictions/final/blended/blended_predictions_separate.npy.gz: weighted linear blend, with separate weights for each question.
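The three kinds of blend can be illustrated with a few lines of NumPy. This is a sketch under assumptions, not the logic of ensemble_predictions_npy.py: predictions are taken to be stacked in an array of shape (models, samples, outputs), the weights are passed in rather than fitted, and the grouping of output columns into questions is supplied by the caller. All function names are made up for this example.

```python
import numpy as np


def uniform_blend(predictions):
    """Plain average over models; predictions has shape (models, samples, outputs)."""
    return np.mean(predictions, axis=0)


def weighted_blend(predictions, weights):
    """Weighted linear blend with one weight per model (normalised to sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, predictions, axes=1)


def blend_per_question(predictions, weights_per_question, question_slices):
    """Weighted blend with a separate weight vector for each group of columns."""
    predictions = np.asarray(predictions, dtype=float)
    blended = np.empty(predictions.shape[1:])
    for weights, columns in zip(weights_per_question, question_slices):
        blended[:, columns] = weighted_blend(predictions[:, :, columns], weights)
    return blended
```

For example, with two models and weights [1, 3], the second model contributes three quarters of the blended prediction.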

Convert prediction file to CSV

Finally, in order to prepare the predictions for submission, the prediction file needs to be converted from .npy.gz format to .csv.gz. Run the following to do so (or similarly for any other prediction file in .npy.gz format):

python create_submission_from_npy.py predictions/final/blended/blended_predictions_uniform.npy.gz
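In essence, the conversion loads the array and writes one CSV row per test image. The sketch below shows that idea using only gzip, csv and NumPy. It is not the code of create_submission_from_npy.py: the function name is hypothetical, the IDs and column names are passed in by the caller, and the header name GalaxyID is an assumption about the expected submission format.

```python
import csv
import gzip

import numpy as np


def npy_gz_to_csv_gz(pred_path, ids, column_names, id_column="GalaxyID"):
    """Write predictions stored in pred_path (.npy.gz) to a .csv.gz file next to it.

    Hypothetical helper; ids and column_names must match the rows and
    columns of the prediction array.
    """
    suffix = ".npy.gz"
    if not pred_path.endswith(suffix):
        raise ValueError("expected a path ending in " + suffix)
    out_path = pred_path[: -len(suffix)] + ".csv.gz"

    with gzip.open(pred_path, "rb") as f:
        predictions = np.load(f)

    with gzip.open(out_path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([id_column] + list(column_names))
        for image_id, row in zip(ids, predictions):
            writer.writerow([int(image_id)] + [float(v) for v in row])
    return out_path
```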

Submit predictions

Submit the file predictions/final/blended/blended_predictions_uniform.csv.gz on Kaggle to get it scored. Note that the process of generating this file involves considerable randomness: the weights of the networks are initialised randomly, the training data for each chunk is randomly selected, ... so I cannot guarantee that you will achieve the same score as I did. I did not use fixed random seeds. This might not have made much of a difference though, since different GPUs and CUDA toolkit versions will also introduce different rounding errors.

Comments
  • Example of convolutional maxout layer

    I'm playing around with your code to get a better insight into the inner workings of convnets, and was curious as to how you implemented maxout on convolutional layers, even if it is unused in your final scripts.

    (Interestingly, I've found that turning dropout on for the bottom convolutional layers tends to make the model overfit wildly early on, which is baffling.)

    opened by GregAtHeron 10
  • I encountered a problem when running this program. Can you help me?

    @benanne
    Thanks for sharing your project code. When I ran this program I hit the same problem as @nejyeah, but now I have encountered another one. When I run:

    python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

    Using gpu device 0: GeForce GTX 650
    Set up data loading
    Preprocess validation data upfront
    Error when tring to find the memory information on the GPU: initialization error
    Error freeing device pointer 0x700e80000 (initialization error). Driver report 0 bytes free and 0 bytes total
    CudaNdarray_uninit: error freeing self->devdata. (self=0x7f3680676670, self->devata=0x700e80000)
    Exception MemoryError: 'error freeing device pointer 0x700e80000 (initialization error)' in 'garbage collection' ignored
    Fatal Python error: unexpected exception during garbage collection
    Traceback (most recent call last):
      File "try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py", line 135, in <module>
        xs_valid = [np.vstack(x_valid) for x_valid in xs_valid]
      File "/home/lixun/anaconda/lib/python2.7/site-packages/numpy/core/shape_base.py", line 228, in vstack
        return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    ValueError: need at least one array to concatenate

    Thanks

    opened by lixunlove 5
  • Implementing realtime augmentation

    Hi, I was going through your code for realtime_augmentation.py, and I can see the training in all the try* files. I am trying to understand how a minibatch is augmented in the training function. xs_shared is used for training, but it is initialized to zero? I saw in the previous issue that you said the JPEGs are loaded on the fly, so I tried going through train_norm, but it's pretty messy and I don't get how ra is used. How should I use the functions in ra for realtime augmentation if I have a preloaded dataset train_set_x, like:

    train_model = theano.function(inputs=[index], outputs=cost, updates=updates,
    givens={
                    x: train_set_x[index * batch_size:(index + 1) * batch_size],
                    y: train_set_y[index * batch_size:(index + 1) * batch_size]
    })
    

    If not, I can load the image dataset on the fly, but where is that done in the code?

    opened by awhitesong 4
  • Help for running this code

    Thanks for sharing your code and for such excellent work. I have installed the dependencies and copied the data to RAM. But when I run python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py, the program stops and prints the following (the tail is Alex's cuda-convnet code):

    Using gpu device 0: GeForce GTS 450
    Set up data loading
    Preprocess validation data upfront
    /usr/local/lib/python2.7/dist-packages/scikit_image-0.10.1-py2.7-linux-x86_64.egg/skimage/transform/_geometric.py:126: UserWarning: _matrix attribute is deprecated, use params instead.
      warnings.warn('_matrix attribute is deprecated, '
    /usr/local/lib/python2.7/dist-packages/scikit_image-0.10.1-py2.7-linux-x86_64.egg/skimage/transform/_geometric.py:126: UserWarning: _matrix attribute is deprecated, use params instead.
      warnings.warn('_matrix attribute is deprecated, '
    /usr/local/lib/python2.7/dist-packages/scikit_image-0.10.1-py2.7-linux-x86_64.egg/skimage/transform/_geometric.py:126: UserWarning: _matrix attribute is deprecated, use params instead.
      warnings.warn('_matrix attribute is deprecated, '
    /usr/local/lib/python2.7/dist-packages/scikit_image-0.10.1-py2.7-linux-x86_64.egg/skimage/transform/_geometric.py:126: UserWarning: _matrix attribute is deprecated, use params instead.
      warnings.warn('_matrix attribute is deprecated, '
    /usr/local/lib/python2.7/dist-packages/theano/sandbox/rng_mrg.py:768: UserWarning: MRG_RandomStreams Can't determine #streams from size (Shape.0), guessing 60256
      nstreams = self.n_streams(size)
    /usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:110: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
      start in [None, 0] or
    /usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:114: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
      stop in [None, length, maxsize] or
    /usr/local/lib/python2.7/dist-packages/theano/tensor/subtensor.py:190: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
      if stop in [None, maxsize]:
    /home/gpu-server2/kaggle/pylearn2/pylearn2/sandbox/cuda_convnet/init.py:66: UserWarning: You are using probably a too old Theano version. That will cause compilation crash. If so, update Theano.
      "You are using probably a too old Theano version. That"
    1 /
    2 * Copyright (c) 2011, Alex Krizhevsky ([email protected])
    3 * All rights reserved.
    4 *
    5 * Redistribution and use in source and binary forms, with or without modification,
    6 * are permitted provided that the following conditions are met:
    7 *
    8 * - Redistributions of source code must retain the above copyright notice,
    9 * this list of conditions and the following disclaimer.

    opened by nejyeah 3
  • One Question about Data Augmentation

    @benanne First of all, thank you for sharing your code. I am especially interested in your online (or real-time) data augmentation module, which I would like to use to prevent overfitting. However, I am completely lost in realtime_augmentation.py. My main question is: if I have loaded the training dataset X_train (which should be a numpy.array), how can I get augmented data X_augment using your functions? Thank you!

    opened by pengpaiSH 2
  • error while running python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

    Hi, I was trying to run python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py, but I am getting some errors. I am running this on an Amazon instance with a 4 GB GPU.

    ubuntu@ip-172-31-19-167:~/kaggle-galaxies$ python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py
    Using gpu device 0: GRID K520
    Set up data loading
    Preprocess validation data upfront
    Process Process-1:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/home/ubuntu/kaggle-galaxies/load_data.py", line 563, in _buffered_generation_process
        data = source_gen.next()
      File "/home/ubuntu/kaggle-galaxies/realtime_augmentation.py", line 337, in realtime_fixed_augmented_data_gen
        for k, imgs_aug in enumerate(gen):
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 269, in <genexpr>
        return (item for chunk in result for item in chunk)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
        raise value
    AttributeError: 'builtin_function_or_method' object has no attribute 'iterkeys'
    Traceback (most recent call last):
      File "try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py", line 138, in <module>
        xs_valid = [np.vstack(x_valid) for x_valid in xs_valid]
      File "/usr/local/lib/python2.7/dist-packages/numpy/core/shape_base.py", line 228, in vstack
        return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    ValueError: need at least one array to concatenate
    
    opened by great-thoughts 1
Owner
Sander Dieleman