Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/farrajota/multi-gpu-torchnet
Train an object classifier using multiple GPUs in Torch7
- Host: GitHub
- URL: https://github.com/farrajota/multi-gpu-torchnet
- Owner: farrajota
- License: MIT
- Created: 2017-04-03T22:37:48.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-10-01T21:30:14.000Z (about 7 years ago)
- Last Synced: 2023-10-20T12:34:52.583Z (about 1 year ago)
- Topics: dbcollection, multi-gpu, torch7, torchnet
- Language: Lua
- Size: 24.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Train an object classifier on [ImageNet](http://image-net.org/download-images) using multiple GPUs in Torch7
This repo shows how to train an object classifier on ImageNet/Cifar10/Cifar100/MNIST using a multi-threaded, multi-GPU approach.
## Features
- Several network architectures (AlexNet, Overfeat, VGG, GoogLeNet, etc.) are available for training;
- Multi-GPU support;
- Data loading/processing using multiple threads;
- Easily apply data augmentation (see the sketch after this list);
- Integration with the `dbcollection` package.
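
As an illustration of the data-augmentation point above, transforms in the style of `transforms.lua` (which is mostly derived from fb.resnet.torch, see the code description below) are typically composed as follows. This is a sketch only; the exact function names are assumptions and may differ from this repo's file:

```lua
-- Illustrative only: composing augmentations in the fb.resnet.torch style
-- that this repo's transforms.lua is derived from; the exact function
-- names here are assumptions and may differ from transforms.lua.
local t = require 'transforms'

-- Standard ImageNet per-channel statistics.
local meanstd = {
   mean = {0.485, 0.456, 0.406},
   std  = {0.229, 0.224, 0.225},
}

local trainTransform = t.Compose{
   t.RandomSizedCrop(224),    -- random scale + 224x224 crop
   t.HorizontalFlip(0.5),     -- flip with probability 0.5
   t.ColorNormalize(meanstd), -- subtract mean, divide by std
}

local augmented = trainTransform(img)  -- img: a 3xHxW torch.Tensor
```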
## Requirements

- NVIDIA GPU with compute capability 3.5+ (2GB+ RAM)
- [torch7](http://torch.ch/docs/getting-started.html#_)
- [torchnet](https://github.com/torchnet/torchnet)
- [dbcollection](https://github.com/dbcollection/dbcollection)

## Running the code
The main script comes with several options, which can be listed by running it with the `--help` flag:
```bash
th main.lua --help
```

To train a network using the default settings, simply do:
```bash
th main.lua
```

> Note: You must have the ImageNet ILSVRC2012 dataset (or any other supported dataset) set up before running this script. For more information about how to set up your datasets using `dbcollection`, see [here](https://github.com/dbcollection/dbcollection).
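
As a rough illustration of what that setup step involves, here is a minimal, hypothetical sketch. The module name and call signature are assumptions based on dbcollection's documented Python API (`dbc.load('mnist')`); the actual Lua bindings used by this repo may differ:

```lua
-- Hypothetical sketch: the module name and call signature are assumptions
-- mirroring dbcollection's documented Python API; the actual Lua bindings
-- used by this repo may differ.
local dbc = require 'dbcollection'

-- The first call downloads/processes the dataset; later calls return a
-- loader backed by the metadata file stored on disk.
local loader = dbc.load('mnist')
```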
By default, the script trains the AlexNet model on 1 GPU with the cuDNN backend and loads data from disk using 4 CPU threads.
To run an AlexNet model using two or more GPUs, set `nGPU` to the number of GPUs you want to use (in this example, only two are used):
```bash
th main.lua -nGPU 2 -netType alexnet
```

In case you want to specify which GPUs to use, do the following:
```bash
CUDA_VISIBLE_DEVICES=0,1 th main.lua -nGPU 2 -netType alexnet
```

> Note: this will select the first two GPUs detected in your system.
To use more threads for data loading/processing, set the `nThreads` flag to the number of threads you want to use:
```bash
th main.lua -nThreads 2
```

For a complete list of available options, please see the `opts.lua` file or run `th main.lua --help` in the command line.
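
For reference, command-line options in Torch7 scripts like `opts.lua` are typically declared with `torch.CmdLine`. The sketch below illustrates that standard pattern; it is not a copy of this repo's `opts.lua`, and the defaults and help strings are assumptions (though `-nGPU`, `-nThreads` and `-netType` appear in the examples above):

```lua
require 'torch'

-- Illustrative torch.CmdLine option parsing, the standard pattern behind
-- files like opts.lua; defaults and help strings here are assumptions.
local cmd = torch.CmdLine()
cmd:option('-nGPU',     1,         'number of GPUs to use')
cmd:option('-nThreads', 4,         'number of data-loading threads')
cmd:option('-netType',  'alexnet', 'network architecture to train')
local opt = cmd:parse(arg or {})
print(opt.nGPU, opt.nThreads, opt.netType)
```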
## Data loading benchmark comparison
For most datasets, loading the necessary metadata (filenames, labels, etc.) from disk should carry a very small, almost negligible overhead compared to loading it from memory.
To showcase this, benchmarking scripts for the ImageNet ILSVRC2012 and Cifar10 datasets are available under `benchmark/`. The numbers below are the average time over 1000 data fetches with `batchsize=128` and `nThreads=4`.
The `train` timings include more data augmentation preprocessing than the `test` timings.
Dataset | train | test
--- | --- | ---
Cifar10 *(disk)* | 0.01509s | 0.00953s
Cifar10 *(ram)* | **0.00772s** | **0.00557s**
ILSVRC2012 *(disk)* | 0.34635s | **0.35729s**
ILSVRC2012 *(ram)* | **0.34553s** | 0.36107s

> Note: These tests were done using a 6-core Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz, 32GB RAM, a 2TB SSHD and Ubuntu 14.04. The overhead is very small when using datasets with bigger images like ImageNet, meaning that it can be hidden by using enough cores or a faster disk.
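
The measurement behind these numbers is straightforward to reproduce with `torch.Timer`. Below is a minimal sketch of such a timing loop; `getBatch` is a hypothetical stand-in for the repo's actual data-loading code under `benchmark/`:

```lua
require 'torch'

-- Minimal sketch of the measurement: average wall-clock time over N batch
-- fetches. `getBatch` is a hypothetical stand-in for the repo's actual
-- data-loading code under benchmark/.
local nFetches, batchsize = 1000, 128
local timer = torch.Timer()
for i = 1, nFetches do
   local inputs, targets = getBatch(batchsize)  -- hypothetical fetch call
end
print(('average fetch time: %.5fs'):format(timer:time().real / nFetches))
```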
## Code Description
- `main.lua` (~250 lines) - Script using torchnet's API for training and testing a network over ImageNet.
- `utils.lua` (~125 lines) - Multi-GPU functions for loading/storing/setting up a model (see the sketch after this list).
- `transforms.lua` (~500 lines) - Data augmentation functions, mostly derived from [here](https://github.com/facebook/fb.resnet.torch/blob/master/datasets/transforms.lua) and [here](https://github.com/NVIDIA/DIGITS/pull/777).
- `configs.lua` (~200 lines) - Setup configurations (options, model, logger, etc.).
- `statistics.lua` (~100 lines) - Computes the dataset's mean/std statistics over 10000 samples and stores them in the `./cache` dir.
- `model.lua` (~40 lines) - Creates/loads a model for training/testing.
- `data.lua` (~110 lines) - Contains the methods to fetch/load data for the available datasets.
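
The multi-GPU handling in `utils.lua` follows the pattern popularized by soumith's imagenet-multiGPU.torch (credited in the disclaimer below). The following is a minimal sketch of that standard pattern, not this repo's exact code:

```lua
require 'cunn'  -- provides nn.DataParallelTable (and pulls in cutorch)

-- Sketch of the standard Torch7 multi-GPU pattern: replicate the model on
-- each GPU; DataParallelTable splits every minibatch along dimension 1.
local function makeDataParallel(model, nGPU)
   if nGPU <= 1 then return model:cuda() end
   local dpt = nn.DataParallelTable(1)
   for i = 1, nGPU do
      cutorch.setDevice(i)
      dpt:add(model:clone():cuda(), i)  -- one replica per GPU
   end
   cutorch.setDevice(1)  -- restore the default device
   return dpt
end
```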
## License
MIT license (see the LICENSE file)
## Disclaimer
This code was inspired by torchnet's [mnist training example](https://github.com/torchnet/torchnet/blob/master/example/mnist.lua), soumith's [multi-gpu ImageNet training code](https://github.com/soumith/imagenet-multiGPU.torch) and @karandwivedi42's [multigpu-torchnet](https://github.com/karandwivedi42/imagenet-multiGPU.torchnet).