{"id":18752731,"url":"https://github.com/tlkh/tf-metal-experiments","last_synced_at":"2026-03-10T18:06:56.391Z","repository":{"id":38799719,"uuid":"421442643","full_name":"tlkh/tf-metal-experiments","owner":"tlkh","description":"TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)","archived":false,"fork":false,"pushed_at":"2022-02-10T10:00:55.000Z","size":247,"stargazers_count":277,"open_issues_count":13,"forks_count":32,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-02T19:07:58.403Z","etag":null,"topics":["benchmark","bert","deep-learning","gpu","m1","m1-max","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tlkh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-26T13:46:07.000Z","updated_at":"2025-01-01T13:55:30.000Z","dependencies_parsed_at":"2022-09-12T01:11:30.561Z","dependency_job_id":null,"html_url":"https://github.com/tlkh/tf-metal-experiments","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlkh%2Ftf-metal-experiments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlkh%2Ftf-metal-experiments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlkh%2Ftf-metal-experiments/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tlkh%2Ftf-metal-experiments/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tlkh","download_url":"https://codeload.github.com/tl
kh/tf-metal-experiments/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248112908,"owners_count":21049749,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","bert","deep-learning","gpu","m1","m1-max","tensorflow"],"created_at":"2024-11-07T17:22:21.596Z","updated_at":"2026-03-10T18:06:51.336Z","avatar_url":"https://github.com/tlkh.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tf-metal-experiments\n\nTensorFlow Metal Backend on Apple Silicon Experiments (just for fun)\n\n## Setup\n\nThis is tested on M1 series Apple Silicon SoCs only. \n\n### TensorFlow 2.x\n\n1. Follow the official instructions from Apple [here](https://developer.apple.com/metal/tensorflow-plugin/)\n2. Test that your Metal GPU is working by running `tf.config.list_physical_devices(\"GPU\")`. You should see 1 GPU present (it is not named). Later, when you actually use the GPU, there will be a more informative printout that says `Metal device set to: Apple M1 Max` or similar.\n3. Now you should be ready to run any TF code that doesn't require external libraries.\n\n### HuggingFace Transformers library\n\nIf you want to play around with Transformer models (with the TF Metal backend, of course), you will need to install the HuggingFace Transformers library.\n\n1. Install the `regex` library (I don't know why it has to be like this, but yeah): `python3 -m pip install --upgrade regex --no-use-pep517`. You might need to run `xcode-select --install` if the above command doesn't work.\n2. 
`pip install transformers ipywidgets`\n\n## Experiments and Benchmarks\n\nAfter some trial and error, here are some initial benchmarks for what should be approximately the best capability of the M1 Max.\n\n* For all the cases here, increasing batch size does not seem to increase the throughput.\n* High Power Mode was enabled and the machine was plugged into the charger (this does not seem to affect the benchmarks anyway)\n\nPower draw also doesn't seem to be able to go much higher than ~40W:\n\n* Power draw from the GPU (averaged over 1 second) can be measured with `sudo powermetrics --samplers gpu_power -i1000 -n1`.\n* I decided to report peak power as observed via `asitop` (see: [tlkh/asitop](https://github.com/tlkh/asitop))\n\n\n| Model       | GPU        | BatchSize | Throughput  | Peak Power | Memory |\n| ----------- | ---------- | --------- | ----------- | ----- | ------ |\n| ResNet50    | M1 Max 32c | 128       | 140 img/sec | 42W   | 21 GB  |\n| MobileNetV2 | M1 Max 32c | 128       | 352 img/sec | 37W   | 13 GB  |\n| DistilBERT  | M1 Max 32c | 64        | 120 seq/sec | 35W   | 9 GB   |\n| BERTLarge   | M1 Max 32c | 16        | 19 seq/sec  | 36W   | 14 GB  |\n\nThe benchmark scripts used are included in this repo.\n\n```shell\npython train_benchmark.py --type cnn --model resnet50\npython train_benchmark.py --type cnn --model mobilenetv2\npython train_benchmark.py --type transformer --model distilbert-base-uncased\npython train_benchmark.py --type transformer --model bert-large-uncased --bs 16\n```\n\n**Reference Benchmarks from RTX 3090**\n\n| Model       | GPU        | BatchSize | Throughput  | Power |\n| ----------- | ---------- | --------- | ----------- | ----- |\n| Same Batch Size as M1 | | | | |\n| ResNet50    | 3090       | 128       | 1100 img/sec| 360W  |\n| MobileNetV2 | 3090       | 128       | 2001 img/sec| 340W  |\n| DistilBERT  | 3090       | 64        | 1065 seq/sec| 360W  |\n| BERTLarge   | 3090       | 16        | 131 seq/sec | 335W  |\n| Larger Batch Size | | | | |\n| ResNet50    | 
3090       | 256       | 1185 img/sec| 370W  |\n| MobileNetV2 | 3090       | 256       | 2197 img/sec| 350W  |\n| DistilBERT  | 3090       | 256       | 1340 seq/sec| 380W  |\n| BERTLarge   | 3090       | 64        | 193 seq/sec | 365W  |\n\nFor the 3090, the same script is used, but with additional optimizations that leverage hardware (Tensor Cores) and software (XLA compiler) features not present/working on M1. The length of an epoch is also increased, because the 3090 is sometimes fast enough that training finishes in seconds and the overhead of starting/ending the training makes the measurement less accurate.\n\nNote: the 3090 is running at a 400W power limit. The CPU is a 5600X.\n\n```shell\n# config for NVIDIA Tensor Core GPU\n# run with more steps, XLA and FP16 (enable tensor core aka mixed precision)\npython train_benchmark.py --type cnn --model resnet50 --xla --fp16 --steps 100\npython train_benchmark.py --type cnn --model mobilenetv2 --xla --fp16 --steps 100\npython train_benchmark.py --type transformer --model distilbert-base-uncased --xla --fp16 --steps 100\npython train_benchmark.py --type transformer --model bert-large-uncased --bs 16 --xla --fp16 --steps 30\n# If no Tensor Core, remove --fp16 flag\n```\n\n## Measuring Achievable TFLOPS\n\nWe can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get more than 8 TFLOPS for large enough problem sizes.\n\n![](gpu_tflops_plot.jpg)\n\nThe plot can be generated using `tflops_sweep.py`. \n\nNote that FP64 and FP16 performance appears to be non-existent (the code automatically runs on the CPU if FP64 or FP16 is specified as the data type).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlkh%2Ftf-metal-experiments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftlkh%2Ftf-metal-experiments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlkh%2Ftf-metal-experiments/lists"}