# B3 -- Basic Batchsize Benchmark

A quick benchmark with different batch sizes that was prompted by the discussion [here](https://twitter.com/rasbt/status/1542882893181108227?s=20&t=96dUITuyaNJUfw1TWxDLng), which was in turn prompted by the [Do Batch Sizes Actually Need to be Powers of 2?](https://wandb.ai/datenzauberai/Batch-Size-Testing/reports/Do-Batch-Sizes-Actually-Need-to-be-Powers-of-2---VmlldzoyMDkwNDQx) article.

Right now, this benchmark trains a [MobileNetV3 (large)](https://arxiv.org/abs/1905.02244) on CIFAR-10 (with the images resized to 224×224 to achieve proper GPU utilization). You can run it as follows:

**Step 1: Initial Setup**

```bash
git clone https://github.com/rasbt/b3-basic-batchsize-benchmark.git
cd b3-basic-batchsize-benchmark
conda create -n benchmark python=3.8
conda activate benchmark
pip install -r requirements.txt
```
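
Before starting a long training run, it can also be worth confirming that PyTorch sees a GPU. This generic sanity check is not part of the repository:

```bash
# Optional sanity check (not part of this repo): verify that PyTorch was
# installed with CUDA support and that at least one GPU is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```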

**Step 2: Running the Training Script**

```bash
python main.py --num_epochs 10 --batch_size 127 --mixed_precision true
```
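
For context, this is not the repository's actual `main.py`, but a minimal sketch of what a training script with this command-line interface might look like, assuming a PyTorch Lightning / torchvision setup. The flag names come from the commands in this README; the model wrapper, data pipeline, and optimizer choice are assumptions for illustration only:

```python
# Hypothetical sketch of a training entry point with the CLI shown above.
# The flag names match the commands in this README; everything else
# (LightningModule wrapper, data pipeline, optimizer) is assumed.
import argparse

import torch
import torchvision
from torchvision import transforms
import lightning as L


class LitMobileNet(L.LightningModule):
    def __init__(self, num_classes=10):
        super().__init__()
        self.model = torchvision.models.mobilenet_v3_large(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1, momentum=0.9)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--mixed_precision", type=str, default="true")
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("--strategy", type=str, default="auto")
    args = parser.parse_args()

    # CIFAR-10 images are 32x32; resizing them to 224x224 increases the
    # per-image compute enough to keep the GPU busy, as described above.
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=args.batch_size, shuffle=True,
        num_workers=args.num_workers,
    )

    trainer = L.Trainer(
        max_epochs=args.num_epochs,
        accelerator="gpu",
        precision="16-mixed" if args.mixed_precision.lower() == "true" else "32-true",
        strategy=args.strategy,
    )
    trainer.fit(LitMobileNet(), train_loader)


if __name__ == "__main__":
    main()
```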

### Additional Resources

- [Ross Wightman mentioning](https://twitter.com/wightmanr/status/1542917523556904960?s=20&t=96dUITuyaNJUfw1TWxDLng) that it might matter more for TPUs
- [Nvidia's Deep Learning Performance Documentation on matrix multiplication](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html) explaining the theoretical rationale behind choosing batch sizes as multiples of 8 for tensor cores (see the micro-benchmark sketch below)
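
The tensor-core argument can also be explored in isolation by timing a bare half-precision matrix multiplication at sizes just below, at, and just above a multiple of 8. The snippet below is not part of this repository; it is a hypothetical micro-benchmark that assumes a CUDA GPU is available:

```python
# Hypothetical micro-benchmark (not part of this repo): time a half-precision
# matmul for dimensions just below, at, and just above a multiple of 8.
import time

import torch


def time_matmul(n, repeats=100):
    a = torch.randn(n, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    # Warm-up so that kernel selection and launch overhead are excluded.
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


for n in (127, 128, 129):
    print(f"n={n}: {time_matmul(n) * 1e6:.1f} µs per matmul")
```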

### Results

| batch size | train time | inf. time | epochs | GPU | mixed prec. |
| ---------- | ---------- | --------- | ------ | ---- | ----------- |
| 100 | 10.50 min | 0.15 min | 10 | V100 | Yes |
| 127 | 9.80 min | 0.15 min | 10 | V100 | Yes |
| 128 | 9.78 min | 0.15 min | 10 | V100 | Yes |
| 129 | 9.92 min | 0.15 min | 10 | V100 | Yes |
| 156 | 9.38 min | 0.16 min | 10 | V100 | Yes |
| | | | | | |
| 511 | 8.74 min | 0.17 min | 10 | V100 | Yes |
| 512 | 8.71 min | 0.17 min | 10 | V100 | Yes |
| 513 | 8.72 min | 0.17 min | 10 | V100 | Yes |

Below, I trained the same neural network using 4 V100 GPUs with the distributed data parallel strategy:

```bash
python main.py --num_epochs 10 --batch_size 255 --mixed_precision true --num_workers 4 --strategy ddp
```

| batch size | train time | epochs | GPU | mixed prec. |
| ---------- | ---------- | ------ | ------ | ----------- |
| 255 | 2.95 min | 10 | 4xV100 | Yes |
| 256 | 2.87 min | 10 | 4xV100 | Yes |
| 257 | 2.86 min | 10 | 4xV100 | Yes |

Note that I removed the inference time (here: the evaluation on the test set) from this table because, in practice, you would still use a single V100 for inference.

Note that these numbers are all from a single run each. To get more reliable statistics, it would be worthwhile to repeat each run several times and report the average ± standard deviation (see the sketch below). However, even from the numbers above, it is apparent that the difference between batch sizes 127, 128, and 129 is small and barely noticeable.
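
For example, a small, hypothetical helper for aggregating repeated runs into a mean ± standard deviation could look like this (the values below are placeholders, not additional measurements):

```python
# Hypothetical helper (not part of this repo) for summarizing repeated runs.
import statistics


def summarize(train_times_min):
    mean = statistics.mean(train_times_min)
    sd = statistics.stdev(train_times_min)
    return f"{mean:.2f} ± {sd:.2f} min"


# e.g., five hypothetical repeats of the batch-size-128 run (placeholder values):
print(summarize([9.78, 9.81, 9.75, 9.80, 9.77]))
```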

**Or in other words, do you have a batch size of 128 that you would like to run, but it doesn't fit into memory? It's probably okay to train that model with a batch size of 120 or 100 instead of scaling it all the way down to 64** 😊.