https://github.com/tlkh/tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

benchmark bert deep-learning gpu m1 m1-max tensorflow



Host: GitHub
URL: https://github.com/tlkh/tf-metal-experiments
Owner: tlkh
License: mit
Created: 2021-10-26T13:46:07.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-02-10T10:00:55.000Z (over 4 years ago)
Last Synced: 2025-04-02T19:07:58.403Z (over 1 year ago)
Topics: benchmark, bert, deep-learning, gpu, m1, m1-max, tensorflow
Language: Jupyter Notebook
Homepage:
Size: 241 KB
Stars: 277
Watchers: 16
Forks: 32
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE


# tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

## Setup

This has been tested on M1-series Apple Silicon SoCs only.

### TensorFlow 2.x

1. Follow the official instructions from Apple [here](https://developer.apple.com/metal/tensorflow-plugin/)

2. Test that your Metal GPU is working by running `tf.config.list_physical_devices("GPU")`. You should see 1 GPU present (it is not named). Later, when you actually use the GPU, there will be a more informative printout that says `Metal device set to: Apple M1 Max` or similar.

3. Now you should be ready to run any TF code that doesn't require external libraries.
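The GPU check in step 2 can be wrapped in a small script. This is only a minimal sketch: the helper names `describe_gpus` and `check_metal_gpu` are made up for illustration and are not part of this repo, and the TensorFlow import is kept inside the function so the script still runs (and reports the problem) when TensorFlow is missing.

```python
def describe_gpus(gpus):
    """Summarize a list of device objects that have a `.name` attribute."""
    if not gpus:
        return "No GPU visible to TensorFlow (is tensorflow-metal installed?)"
    names = ", ".join(d.name for d in gpus)
    return f"{len(gpus)} GPU(s) visible: {names}"


def check_metal_gpu():
    """Ask TensorFlow which GPUs it can see and return a one-line summary."""
    try:
        import tensorflow as tf  # imported lazily so a missing install is reported, not raised
    except ImportError:
        return "TensorFlow is not installed in this environment."
    return describe_gpus(tf.config.list_physical_devices("GPU"))


print(check_metal_gpu())
```

On a working setup this should report one GPU with a generic device name; the `Metal device set to: ...` line only appears once the GPU is actually used.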

### HuggingFace Transformers library

If you want to play around with Transformer models (with TF Metal backend of course), you will need to install the HuggingFace Transformers library.

1. Install the `regex` library (I don't know why it has to be like this, but yeah): `python3 -m pip install --upgrade regex --no-use-pep517`. You might need to run `xcode-select --install` if the above command doesn't work.

2. `pip install transformers ipywidgets`
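To confirm that both steps worked before opening a notebook, a quick check along these lines may help. It is a sketch using only the standard library; the helper name `missing_packages` is made up for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages installed in the two steps above
missing = missing_packages(["regex", "transformers", "ipywidgets"])
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All packages found")
```

This only checks that the packages can be located, not that they import cleanly or work with the Metal backend.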

## Experiments and Benchmarks

After some trial and error, here are some initial benchmarks for what should be approximately the best capability of the M1 Max.

* For all the cases here, increasing batch size does not seem to increase the throughput.

* High Power Mode enabled + plugged into charger (this does not seem to affect the benchmarks anyway)

Power draw also doesn't seem to be able to go much higher than ~40W:

* Power draw from the GPU (averaged over 1 second) can be measured with `sudo powermetrics --samplers gpu_power -i1000 -n1`.

* I decided to report peak power as observed via `asitop` (see: [tlkh/asitop](https://github.com/tlkh/asitop))

| Model       | GPU        | Batch Size | Throughput  | Peak Power | Memory |
| ----------- | ---------- | ---------- | ----------- | ---------- | ------ |
| ResNet50    | M1 Max 32c | 128        | 140 img/sec | 42W        | 21 GB  |
| MobileNetV2 | M1 Max 32c | 128        | 352 img/sec | 37W        | 13 GB  |
| DistilBERT  | M1 Max 32c | 64         | 120 seq/sec | 35W        | 9 GB   |
| BERTLarge   | M1 Max 32c | 16         | 19 seq/sec  | 36W        | 14 GB  |

The benchmark scripts used are included in this repo.

```shell
python train_benchmark.py --type cnn --model resnet50
python train_benchmark.py --type cnn --model mobilenetv2
python train_benchmark.py --type transformer --model distilbert-base-uncased
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16
```
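For readers who want to see what a throughput figure like "140 img/sec" means in practice, here is a minimal sketch of the usual approach: time a fixed number of training steps and divide the samples processed by the elapsed time. This is not the code in `train_benchmark.py`; `measure_throughput` and `fake_step` are hypothetical names, and the sleep stands in for a real training step.

```python
import time


def measure_throughput(step_fn, batch_size, steps, warmup_steps=2):
    """Return samples/sec over `steps` timed calls to `step_fn`.

    A few untimed warm-up calls run first so that one-off startup costs
    are not counted against the steady-state speed.
    """
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed


def fake_step():
    """Stand-in for one training step (assumption: roughly 10 ms per step)."""
    time.sleep(0.01)


throughput = measure_throughput(fake_step, batch_size=128, steps=10)
print(f"{throughput:.0f} samples/sec")
```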

**Reference Benchmarks from RTX 3090**

| Model       | GPU        | Batch Size | Throughput   | Power |
| ----------- | ---------- | ---------- | ------------ | ----- |
| Same Batch Size as M1 | | | | |
| ResNet50    | 3090       | 128        | 1100 img/sec | 360W  |
| MobileNetV2 | 3090       | 128        | 2001 img/sec | 340W  |
| DistilBERT  | 3090       | 64         | 1065 seq/sec | 360W  |
| BERTLarge   | 3090       | 16         | 131 seq/sec  | 335W  |
| Larger Batch Size | | | | |
| ResNet50    | 3090       | 256        | 1185 img/sec | 370W  |
| MobileNetV2 | 3090       | 256        | 2197 img/sec | 350W  |
| DistilBERT  | 3090       | 256        | 1340 seq/sec | 380W  |
| BERTLarge   | 3090       | 64         | 193 seq/sec  | 365W  |
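One way to read the two tables side by side is throughput per watt and the raw speedup of the 3090 at the same batch size. The sketch below only does arithmetic on the numbers reported above. Treat the per-watt figures as rough: the M1 Max column is peak GPU power as seen in `asitop`, while the tables do not say how the 3090 power was measured, so the two may not be directly comparable.

```python
# (throughput, power in watts), copied from the tables above at the same batch size
m1_max = {
    "ResNet50": (140, 42),
    "MobileNetV2": (352, 37),
    "DistilBERT": (120, 35),
    "BERTLarge": (19, 36),
}
rtx_3090 = {
    "ResNet50": (1100, 360),
    "MobileNetV2": (2001, 340),
    "DistilBERT": (1065, 360),
    "BERTLarge": (131, 335),
}

summary = {}
for model, (m1_tput, m1_watts) in m1_max.items():
    nv_tput, nv_watts = rtx_3090[model]
    summary[model] = {
        "m1_per_watt": m1_tput / m1_watts,    # samples/sec per watt on the M1 Max
        "3090_per_watt": nv_tput / nv_watts,  # samples/sec per watt on the 3090
        "speedup": nv_tput / m1_tput,         # how many times faster the 3090 is
    }

for model, row in summary.items():
    print(
        f"{model:12s} M1 Max {row['m1_per_watt']:.2f}/W, "
        f"3090 {row['3090_per_watt']:.2f}/W, "
        f"3090 is {row['speedup']:.1f}x faster"
    )
```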

For the 3090, the same script is used, with additional optimizations that leverage hardware (Tensor Cores) and software (the XLA compiler) that are not present or not working on the M1. The length of an epoch is also increased, because the 3090 is sometimes fast enough that training finishes in seconds, and the overhead of starting and ending the run then degrades the measurement.

Note: The 3090 is running at a 400W power limit. The CPU is a 5600X.

```shell
# config for NVIDIA Tensor Core GPU
# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)
python train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100
python train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100
python train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30
# If no Tensor Core, remove --fp16 flag
```
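The reason for running more steps on the 3090 can be shown with a little arithmetic: if each run carries a fixed start/end overhead, a short run under-reports the true throughput. The sketch below is illustrative only. The 2-second overhead and the step counts of 10 and 100 are assumptions, not measured values, and `measured_throughput` is a hypothetical helper.

```python
def measured_throughput(batch_size, steps, step_time_s, overhead_s):
    """Throughput as seen by a timer that also counts fixed start/end overhead."""
    return batch_size * steps / (steps * step_time_s + overhead_s)


batch_size = 128
true_throughput = 1100.0                     # ResNet50 on the 3090, from the table above
step_time_s = batch_size / true_throughput   # time per step at that speed
overhead_s = 2.0                             # assumed fixed overhead per run (hypothetical)

short_run = measured_throughput(batch_size, 10, step_time_s, overhead_s)
long_run = measured_throughput(batch_size, 100, step_time_s, overhead_s)

print(f"10 steps:  {short_run:.0f} img/sec")
print(f"100 steps: {long_run:.0f} img/sec")
```

With more steps the fixed overhead is spread over more work, so the measured number moves closer to the steady-state throughput.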

## Measuring Achievable TFLOPS

We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get more than 8 TFLOPS for large enough problem sizes.

![](gpu_tflops_plot.jpg)

The plot can be generated using `tflops_sweep.py`. 

Note that FP64 and FP16 performance appears to be non-existent (the code automatically runs on the CPU if FP64 or FP16 is specified as the data type).
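The idea behind the matrix multiplication estimate is that multiplying two n x n matrices takes about 2·n³ floating-point operations, so TFLOPS = 2·n³ / seconds / 10¹². The following is a minimal sketch of that calculation, not the contents of `tflops_sweep.py`; the function names, the default size of 8192, and the iteration count are assumptions made for illustration.

```python
import time


def matmul_tflops(n, seconds):
    """TFLOPS achieved by one (n x n) @ (n x n) matmul that took `seconds`."""
    return 2 * n**3 / seconds / 1e12


def benchmark_matmul(n=8192, iterations=10):
    """Time FP32 matmuls in TensorFlow and return the approximate TFLOPS."""
    import tensorflow as tf  # imported lazily; requires a working TF install

    a = tf.random.normal((n, n), dtype=tf.float32)
    b = tf.random.normal((n, n), dtype=tf.float32)
    tf.reduce_sum(tf.matmul(a, b)).numpy()  # warm-up, and wait for it to finish

    start = time.perf_counter()
    for _ in range(iterations):
        c = tf.matmul(a, b)
    tf.reduce_sum(c).numpy()  # pull a scalar back so queued GPU work must complete
    elapsed = time.perf_counter() - start
    return matmul_tflops(n, elapsed / iterations)


# Pure arithmetic example: an 8192 x 8192 matmul finishing in 0.1 s
print(f"{matmul_tflops(8192, 0.1):.2f} TFLOPS")
```

Calling `benchmark_matmul()` on a machine with TensorFlow installed runs the actual timing. The final reduction adds a small amount of extra work, so the result is an approximation.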
