https://github.com/tlkh/tf-metal-experiments
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)
https://github.com/tlkh/tf-metal-experiments
benchmark bert deep-learning gpu m1 m1-max tensorflow
Last synced: 4 months ago
JSON representation
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)
- Host: GitHub
- URL: https://github.com/tlkh/tf-metal-experiments
- Owner: tlkh
- License: mit
- Created: 2021-10-26T13:46:07.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-02-10T10:00:55.000Z (over 4 years ago)
- Last Synced: 2025-04-02T19:07:58.403Z (about 1 year ago)
- Topics: benchmark, bert, deep-learning, gpu, m1, m1-max, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 241 KB
- Stars: 277
- Watchers: 16
- Forks: 32
- Open Issues: 13
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# tf-metal-experiments
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)
## Setup
This is tested on M1 series Apple Silicon SOC only.
### TensorFlow 2.x
1. Follow the official instructions from Apple [here](https://developer.apple.com/metal/tensorflow-plugin/)
2. Test that your Metal GPU is working by running `tf.config.list_physical_devices("GPU")`, you should see 1 GPU present (it is not named). Later when you actually use the GPU, there will be a more informative printout that says `Metal device set to: Apple M1 Max` or similar.
3. Now you should be ready to run any TF code that doesn't require external libraries.
### HuggingFace Transformers library
If you want to play around with Transformer models (with TF Metal backend of course), you will need to install the HuggingFace Transformers library.
1. Install the `regex` library (I don't know why it has to be like this, but yeah): `python3 -m pip install --upgrade regex --no-use-pep517`. You might need do `xcode-select --install` if the above command doesn't work.
2. `pip install transformers ipywidgets`
## Experiments and Benchmarks
After some trial and error, some initial benchmarks for what should be the approx best capability of the M1 Max.
* For all the cases here, increasing batch size does not seem to increase the throughput.
* High Power Mode enabled + plugged into charger (this does not seem to affect the benchmarks anyway)
Power draw also doesn't seem to be able to go much higher than ~40W:
* Power draw from the GPU (averaged over 1 second) can be measured with `sudo powermetrics --samplers gpu_power -i1000 -n1`.
* I decided to report peak power as observed via `asitop` (see: [tlkh/asitop](https://github.com/tlkh/asitop))
| Model | GPU | BatchSize | Throughput | Peak Power | Memory |
| ----------- | ---------- | --------- | ----------- | ----- | ------ |
| ResNet50 | M1 Max 32c | 128 | 140 img/sec | 42W | 21 GB |
| MobileNetV2 | M1 Max 32c | 128 | 352 img/sec | 37W | 13 GB |
| DistilBERT | M1 Max 32c | 64 | 120 seq/sec | 35W | 9 GB |
| BERTLarge | M1 Max 32c | 16 | 19 seq/sec | 36W | 14 GB |
The benchmark scripts used are included in this repo.
```shell
python train_benchmark.py --type cnn --model resnet50
python train_benchmark.py --type cnn --model mobilenetv2
python train_benchmark.py --type transformer --model distilbert-base-uncased
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16
```
**Reference Benchmarks from RTX 3090**
| Model | GPU | BatchSize | Throughput | Power |
| ----------- | ---------- | --------- | ----------- | ----- |
| Same Batch Size as M1 | | | | |
| ResNet50 | 3090 | 128 | 1100 img/sec| 360W |
| MobileNetV2 | 3090 | 128 | 2001 img/sec| 340W |
| DistilBERT | 3090 | 64 | 1065 seq/sec| 360W |
| BERTLarge | 3090 | 16 | 131 seq/sec | 335W |
| Larger Batch Size | | | | |
| ResNet50 | 3090 | 256 | 1185 img/sec| 370W |
| MobileNetV2 | 3090 | 256 | 2197 img/sec| 350W |
| DistilBERT | 3090 | 256 | 1340 seq/sec| 380W |
| BERTLarge | 3090 | 64 | 193 seq/sec | 365W |
For 3090, same script is used, but additional optimization that leverage hardware (Tensor Core) and software (XLA compiler) not present/working on M1 is added. Also increase the length of an epoch, as sometimes 3090 is too fast and results in poorer measurement due to overhead of starting/ending the training which finishes in seconds.
Note: 3090 running at 400W power limit. CPU is 5600X.
```shell
# config for NVIDIA Tensor Core GPU
# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)
python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30
# If no Tensor Core, remove --fp16 flag
```
## Measuring Achievable TFLOPS
We can use TF to write a matrix multiplication benchmark to try and estimate what is the max compute performance we can get out of a M1 Max. It seems we can get around >8 TFLOPS for large enough problem sizes.

The plot can be generated using `tflops_sweep.py`.
Note that FP64 and FP16 performance appears to be non-existent. (the code automatically runs on CPU if FP64 or FP16 is specified as data type)