https://github.com/libffcv/ffcv-imagenet

Train ImageNet *fast* in 500 lines of code with FFCV

Host: GitHub
URL: https://github.com/libffcv/ffcv-imagenet
Owner: libffcv
License: apache-2.0
Created: 2022-01-18T07:10:26.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2024-05-10T18:32:18.000Z (about 2 years ago)
Last Synced: 2025-04-03T19:05:35.378Z (about 1 year ago)
Language: Python
Homepage:
Size: 56.6 KB
Stars: 141
Watchers: 2
Forks: 35
Open Issues: 14
Metadata Files:
- Readme: README.md
- License: LICENSE


# `ffcv` ImageNet Training

A minimal, single-file PyTorch ImageNet training script designed for hackability. Run `train_imagenet.py` to get...

- ...high accuracies on ImageNet

- ...with as many lines of code as the PyTorch ImageNet example

- ...in 1/10th the time.

## Results

Train models more efficiently, either with 8 GPUs in parallel or by training 8 ResNet-18s at once.



See benchmark setup here: [https://docs.ffcv.io/benchmarks.html](https://docs.ffcv.io/benchmarks.html).

## Citation

If you use this setup in your research, cite:

```
@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {ffcv},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}
```

(Make sure to replace ``xxxxxxx`` above with the hash of the commit used!)

## Configurations

The configuration files corresponding to the above results are:

| Link to Config | top_1 | top_5 | # Epochs | Time (mins) | Architecture | Setup    |
|:---------------|------:|------:|---------:|------------:|:-------------|:---------|
| Link           | 0.784 | 0.941 |       88 |        77.2 | ResNet-50    | 8 x A100 |
| Link           | 0.780 | 0.937 |       56 |        49.4 | ResNet-50    | 8 x A100 |
| Link           | 0.772 | 0.932 |       40 |        35.6 | ResNet-50    | 8 x A100 |
| Link           | 0.766 | 0.927 |       32 |        28.7 | ResNet-50    | 8 x A100 |
| Link           | 0.756 | 0.921 |       24 |        21.7 | ResNet-50    | 8 x A100 |
| Link           | 0.738 | 0.908 |       16 |        14.9 | ResNet-50    | 8 x A100 |
| Link           | 0.724 | 0.903 |       88 |       187.3 | ResNet-18    | 1 x A100 |
| Link           | 0.713 | 0.899 |       56 |       119.4 | ResNet-18    | 1 x A100 |
| Link           | 0.706 | 0.894 |       40 |        85.5 | ResNet-18    | 1 x A100 |
| Link           | 0.700 | 0.889 |       32 |        68.9 | ResNet-18    | 1 x A100 |
| Link           | 0.688 | 0.881 |       24 |        51.6 | ResNet-18    | 1 x A100 |
| Link           | 0.669 | 0.868 |       16 |        35.0 | ResNet-18    | 1 x A100 |
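
To compare rows across the two setups, it can help to convert wall-clock time into minutes per epoch and GPU-hours. The sketch below does that arithmetic for four rows copied from the table; the helper names `minutes_per_epoch` and `gpu_hours` are ours and are not part of the training script.

```python
# Rows copied from the table above: (architecture, epochs, minutes, number of GPUs).
RUNS = [
    ("ResNet-50", 88, 77.2, 8),
    ("ResNet-50", 16, 14.9, 8),
    ("ResNet-18", 88, 187.3, 1),
    ("ResNet-18", 16, 35.0, 1),
]


def minutes_per_epoch(epochs, minutes):
    """Average wall-clock minutes spent on one epoch."""
    return minutes / epochs


def gpu_hours(minutes, num_gpus):
    """Total GPU time: wall-clock minutes times GPU count, in hours."""
    return minutes * num_gpus / 60.0


for arch, epochs, minutes, gpus in RUNS:
    print(
        f"{arch} {epochs:>3} epochs: "
        f"{minutes_per_epoch(epochs, minutes):.2f} min/epoch, "
        f"{gpu_hours(minutes, gpus):.1f} GPU-hours"
    )
```

Note that the reported times cover the whole run, so the per-epoch figure is an average rather than a measurement of any single epoch.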

## Training Models

First pip install the requirements file in this directory:

```
pip install -r requirements.txt
```

Then, generate an ImageNet dataset. To make the dataset used for the results above, run the following command (`IMAGENET_DIR` should point to a PyTorch-style [ImageNet dataset](https://github.com/MadryLab/pytorch-imagenet-dataset)):

```bash
# Required environment variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting in the root of the Git repo:
cd examples;

# Serialize images with:
# - 500px side length maximum
# - 50% JPEG encoded
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90
```
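
As a rough illustration of what the first two arguments control, the sketch below shows one way to read them. It assumes that "500px side length maximum" means downscaling so the longer side is at most 500px, and that "50% JPEG encoded" means each image is stored as JPEG with probability 0.5 and raw otherwise. The functions `fit_within` and `choose_encoding` are hypothetical helpers written for this example; they are not FFCV's implementation.

```python
import random


def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def choose_encoding(rng, jpeg_fraction):
    """Pick 'jpeg' with probability jpeg_fraction, otherwise 'raw'."""
    return "jpeg" if rng.random() < jpeg_fraction else "raw"


rng = random.Random(0)  # fixed seed so the example is repeatable
for size in [(1000, 500), (300, 200), (640, 480)]:
    print(size, "->", fit_within(*size, max_side=500), choose_encoding(rng, 0.50))
```

Under this reading, a larger maximum side length or a lower JPEG fraction yields a bigger dataset file, which is the trade-off the dataset setup notes under Training Details refer to.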

Then, choose a configuration from the [configuration table](#configurations). With the config file path in hand, train as follows:

```bash
# 8 GPU training (use only 1 for ResNet-18 training)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Set the visible GPUs according to the `world_size` configuration parameter
# Modify `data.in_memory` and `data.num_workers` based on your machine
python train_imagenet.py --config-file rn50_configs/.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here
```

Adjust the configuration either by changing the YAML file you pass or by overriding individual parameters on the command line via [fastargs](https://github.com/GuillaumeLeclerc/fastargs), which is how the dataset paths were passed above.
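
The `--section.key=value` form used above can be thought of as overriding entries in a nested configuration. The following is a minimal sketch of that idea in plain Python; it is not how fastargs is implemented, and `apply_overrides` as well as the example YAML contents are hypothetical.

```python
def apply_overrides(config, overrides):
    """Return a copy of a nested dict with '--section.key=value' overrides applied."""
    merged = {section: dict(values) for section, values in config.items()}
    for item in overrides:
        dotted, value = item.lstrip("-").split("=", 1)
        section, key = dotted.split(".", 1)
        merged.setdefault(section, {})[key] = value  # values stay strings in this sketch
    return merged


# Hypothetical contents of a YAML config file, already loaded into a dict.
yaml_config = {"data": {"num_workers": "8", "in_memory": "0"}}
cli_args = ["--data.num_workers=12", "--data.in_memory=1", "--logging.folder=/your/path/here"]

merged = apply_overrides(yaml_config, cli_args)
print(merged)
```

In this sketch the command-line values win over the file values, and the original dict is left unchanged.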

## Training Details

System setup. We trained on p4.24xlarge EC2 instances (8 A100s).



Dataset setup. Generally, a larger side length will aid accuracy but decrease throughput:

 - ResNet-50 training: 50% JPEG 500px side length
 - ResNet-18 training: 10% JPEG 400px side length



Algorithmic details. We use a standard ImageNet training pipeline (à la the PyTorch ImageNet example) with only the following differences/highlights:

- SGD optimizer with momentum and weight decay on all non-batchnorm parameters

- Test-time augmentation over left/right flips

- Progressive resizing from 160px to 192px: 160px training until 75% of the way through training (by epochs), then 192px until the end of training.

- Validation set sizing according to ["Fixing the train-test resolution discrepancy"](https://arxiv.org/abs/1906.06423): 224px at test time.

- Label smoothing

- Cyclic learning rate schedule
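
The progressive-resizing rule in the list above can be sketched as a small function. This is a literal reading of the bullet (a single switch at 75% of the epochs); the training script may ramp between the two resolutions differently, and the name `training_resolution` and its parameters are our own.

```python
def training_resolution(epoch, total_epochs, low_res=160, high_res=192, switch_fraction=0.75):
    """Resolution for a 0-indexed epoch under the two-stage rule described above."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must be in [0, total_epochs)")
    switch_epoch = switch_fraction * total_epochs
    return low_res if epoch < switch_epoch else high_res


# Print the resolution used at each epoch of a 16-epoch schedule.
for epoch in range(16):
    print(epoch, training_resolution(epoch, total_epochs=16))
```

Training at the lower resolution for most of the run is what saves time, since each step processes fewer pixels.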



Refer to the code and configuration files for a more exact specification.

To obtain configurations, we first ran a grid search over hyperparameters at a 30 epoch schedule. Fixing these parameters, we then varied only the number of epochs (stretching the learning rate schedule across the number of epochs as motivated by [Budgeted Training](https://arxiv.org/abs/1905.04753)) and plotted the results above.
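
One way to picture "stretching" a schedule is to define its shape over training progress in [0, 1] and then map each epoch to its progress. The sketch below assumes a triangular shape with the peak at 10% of training and a peak learning rate of 0.5; both values and both function names are illustrative assumptions, not the settings in the configuration files.

```python
def schedule_shape(progress, peak_progress=0.1):
    """Unit-height triangle on [0, 1]: rises until peak_progress, then falls to 0."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress <= peak_progress:
        return progress / peak_progress
    return (1.0 - progress) / (1.0 - peak_progress)


def lr_for_epoch(epoch, total_epochs, peak_lr=0.5):
    """Learning rate at an epoch: the same shape, stretched over total_epochs."""
    return peak_lr * schedule_shape(epoch / total_epochs)


# The same fraction of training gives the same learning rate for any budget.
for total in (16, 88):
    print(total, [round(lr_for_epoch(e, total), 3) for e in (0, total // 2, total)])
```

Because the shape depends only on progress, a 16-epoch run and an 88-epoch run see the same learning rate at the same fraction of training.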

## FAQ

### Why is the first epoch slow?

The first epoch can be slow if the dataset hasn't been cached in memory yet.

### What if I can't fit my dataset in memory?

See [this guide](https://docs.ffcv.io/parameter_tuning.html#scenario-large-scale-datasets).

### Other questions

Please open up a [GitHub discussion](https://github.com/MadryLab/ffcv/discussions) for non-bug related questions; if you find a bug please report it on [GitHub issues](https://github.com/MadryLab/ffcv/issues).
