https://github.com/mysterionrise/tensorflow-metal-experiments
Example of training neural networks with TensorFlow Metal on Apple's ARM-based M-series chips
gpu m1 m1-mac m2 m2-mac m3-mac neural-network neural-networks tensorflow tensorflow-gpu tensorflow-tutorials
- Host: GitHub
- URL: https://github.com/mysterionrise/tensorflow-metal-experiments
- Owner: MysterionRise
- License: mit
- Created: 2015-04-24T21:22:04.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2026-02-04T13:21:48.000Z (5 months ago)
- Last Synced: 2026-02-05T00:48:10.571Z (5 months ago)
- Topics: gpu, m1, m1-mac, m2, m2-mac, m3-mac, neural-network, neural-networks, tensorflow, tensorflow-gpu, tensorflow-tutorials
- Language: Jupyter Notebook
- Homepage:
- Size: 7.14 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# TensorFlow Metal Experiments
Benchmarks of GPU vs CPU training performance across Apple Silicon, NVIDIA GPUs, and Intel CPUs, using TensorFlow (Metal and CUDA backends) and MLX.
## Key Findings

**TL;DR: For large models (VGG16), the Metal GPU is up to 17.5x faster than CPU-only training on the same Apple Silicon chip (M1 Max), and an RTX 4070 Super is 123x faster than the i7-8700 CPU baseline.**
| Hardware | GPU Cores | VGG16 (s/epoch) | Speedup vs i7-8700 |
|----------|-----------|-----------------|-------------------|
| RTX 4070 Super | 7168 CUDA | 7s | 123x |
| RTX 2070 | 2304 CUDA | 18s | 48x |
| M1 Max | 32 GPU | 21s | 41x |
| M4 Pro | 16 GPU | 26s | 33x |
| M2 | 10 GPU | 64s | 13x |
| i7-13700KF | - | 126s | 7x |
| M1 Max (CPU only) | - | 368s | 2.3x |
| i7-8700 | - | 863s | 1x (baseline) |
### Apple Silicon GPU Speedup
- **M1 Max**: 17.5x faster with Metal GPU vs CPU-only
- **M2**: 8.3x faster with Metal GPU vs CPU-only
- **M4 Pro**: See MLX vs TensorFlow comparison below
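The speedup figures above follow directly from the seconds-per-epoch measurements. As a quick check, the sketch below recomputes them in plain Python. The dictionary and function names are illustrative and not part of the repository; the timings are copied from the tables in this README.

```python
# Seconds per epoch for VGG16 on CIFAR-100, copied from the tables in this README.
EPOCH_SECONDS = {
    "RTX 4070 Super": 7,
    "RTX 2070": 18,
    "M1 Max (GPU)": 21,
    "M4 Pro (GPU)": 26,
    "M2 (GPU)": 64,
    "i7-13700KF": 126,
    "M1 Max (CPU only)": 368,
    "M2 (CPU only)": 528,
    "i7-8700": 863,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return EPOCH_SECONDS[baseline] / EPOCH_SECONDS[candidate]


# Speedup relative to the i7-8700 baseline (the "Speedup vs i7-8700" column).
for name in EPOCH_SECONDS:
    print(f"{name:20s} {speedup('i7-8700', name):7.2f}x")

# Metal GPU vs CPU-only on the same chip.
print(f"M1 Max GPU vs CPU: {speedup('M1 Max (CPU only)', 'M1 Max (GPU)'):.2f}x")
print(f"M2 GPU vs CPU:     {speedup('M2 (CPU only)', 'M2 (GPU)'):.2f}x")
```

Note that the two groups of numbers use different baselines: the table compares every machine against the i7-8700, while the Apple Silicon figures compare GPU and CPU runs on the same chip.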
## Project Structure
```
tensorflow-metal-experiments/
├── notebooks/
│   ├── tf_mnist_train.ipynb          # Simple CNN (93k params)
│   ├── tf_fashion_mnist_train.ipynb  # CNN with dropout (412k params)
│   ├── tf_cifar100-train.ipynb       # VGG16-style (34M params)
│   ├── mlx_comparison.ipynb          # MLX vs TensorFlow Metal (naive)
│   ├── optimized_benchmark.ipynb     # Naive vs Optimized comparison
│   └── benchmark_report.ipynb        # Generate benchmark charts
├── src/utils/
│   └── device_config.py              # Reusable GPU/CPU configuration
├── benchmarks/
│   └── results.json                  # Structured benchmark data
└── assets/
    └── vgg16_benchmark.png           # Benchmark visualization
```
## Installation
### Prerequisites: Install Python (macOS)
If you don't have Python installed, use Homebrew:
```bash
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Python 3.11+
brew install python@3.11
# Verify installation
python3.11 --version
```
### Apple Silicon Setup (M1/M2/M3/M4)
```bash
# Navigate to project directory
cd tensorflow-metal-experiments
# Create virtual environment
python3.11 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install dependencies (TF 2.18 is required for Metal compatibility)
pip install "tensorflow>=2.18,<2.19" tensorflow-metal mlx
pip install matplotlib seaborn pandas numpy jupyterlab
# Verify TensorFlow sees the GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Should show: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
# Verify MLX
python -c "import mlx.core as mx; print(mx.default_device())"
# Should show: Device(gpu, 0)
```
### Windows with NVIDIA GPU (WSL2)
```bash
# Create and activate venv
python -m venv venv
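source venv/bin/activate
# On native Windows the activation command is venv\Scripts\activate, but
# TensorFlow 2.11+ only supports NVIDIA GPUs on Windows through WSL2
# Install dependencies (quotes keep the shell from expanding the brackets)
pip install "tensorflow[and-cuda]"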
pip install matplotlib seaborn pandas numpy jupyterlab
```
### Run Experiments
```bash
# Make sure venv is activated
source venv/bin/activate
# Start Jupyter
jupyter lab
```
Open any notebook in `notebooks/` and run all cells.
### Deactivate Environment
```bash
deactivate
```
## Switching Between GPU and CPU
Each notebook uses a device configuration helper:
```python
from utils.device_config import configure_device
# Use GPU (Metal or CUDA)
device = configure_device(use_gpu=True)
# Force CPU only
device = configure_device(use_gpu=False)
```
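The helper itself is not reproduced in this README. As a rough idea of what such a function could look like, here is a minimal sketch built on `tf.config.list_physical_devices` and `tf.config.set_visible_devices`. The body and the returned device strings are assumptions, so the actual `src/utils/device_config.py` may behave differently.

```python
def configure_device(use_gpu: bool = True) -> str:
    """Select GPU or CPU for TensorFlow and return the device string in use.

    Hypothetical sketch: the real src/utils/device_config.py may differ.
    """
    # Imported lazily so this module can be imported without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if use_gpu and gpus:
        return "/GPU:0"

    # Hide all GPUs so every op runs on the CPU. This must happen before
    # TensorFlow initializes its devices (before any op or model is created);
    # otherwise TensorFlow raises a RuntimeError.
    tf.config.set_visible_devices([], "GPU")
    return "/CPU:0"
```

Because hiding GPUs only works before TensorFlow initializes its devices, a helper like this has to run at the top of a notebook, and switching from GPU to CPU usually means restarting the kernel. Since the file lives under `src/utils/`, the import shown above also assumes that `src/` is on the Python path.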
## Benchmarks
### VGG16 on CIFAR-100 (34M Parameters)
This is the primary benchmark. Large models show the most significant GPU acceleration.
| Hardware | Platform | GPU Used | Time/Epoch |
|----------|----------|-----|------------|
| RTX 4070 Super 12GB | Windows 11 | Yes | 7s |
| RTX 2070 8GB | Windows 10 | Yes | 18s |
| M1 Max 32-core GPU | macOS | Yes | 21s |
| M2 10-core GPU | macOS | Yes | 64s |
| i7-13700KF 3.4GHz | Windows 11 | No | 126s |
| M1 Max 10-core CPU | macOS | No | 368s |
| M2 8-core CPU | macOS | No | 528s |
| i9 2.4GHz 8-core | macOS | No | 630s |
| i7-8700 3.2GHz | Windows 10 | No | 863s |
### Small Model Caveat
For small models (MNIST CNN, 93k params), the CPU can sometimes match or beat the GPU because per-batch dispatch and data transfer overhead outweighs the compute time saved. GPU acceleration is most beneficial for:
- Models > 1M parameters
- Batch sizes >= 64
- Training runs with many epochs
## Performance Optimization
### Why GPU Utilization May Be Low (~40%)
If you observe low GPU utilization during training, these are the common causes:
1. **NumPy array bottleneck** - Using `model.fit(x_train, y_train)` with NumPy arrays is a major bottleneck
2. **Small batch sizes** - GPU dispatch overhead doesn't amortize for small batches
3. **Model too small** - GPU parallelism not fully utilized for models < 1M params
4. **Data loading on CPU** - Pipeline not optimized for GPU
### Optimization Tips
1. **Use tf.data.Dataset API** instead of NumPy arrays:
```python
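import tensorflow as tf

# Instead of: model.fit(x_train, y_train)
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# model.fit shuffles NumPy inputs by default; a Dataset must be shuffled explicitly
dataset = dataset.shuffle(len(x_train)).batch(128).prefetch(tf.data.AUTOTUNE)
model.fit(dataset)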
```
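This can achieve up to 5x acceleration and better GPU utilization. In the M4 Pro results below, tf.data combined with a batch size of 256 gave a 3.1x to 3.4x speedup.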
2. **Increase batch sizes** - Apple's unified memory allows larger batches (try 256, 512) without CPU-GPU transfer overhead
3. **Use mixed precision** where supported:
```python
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```
4. **Monitor GPU power** to verify GPU is being utilized:
```bash
sudo powermetrics --samplers gpu_power -i1000 -n1
```
5. **For MLX**: Use `mx.eval()` strategically to control lazy evaluation
Run `notebooks/optimized_benchmark.ipynb` to see the impact of these optimizations with real benchmarks comparing naive vs optimized implementations for both TensorFlow and MLX.
## MLX vs TensorFlow Metal
The `mlx_comparison.ipynb` notebook benchmarks Apple's MLX framework against TensorFlow Metal.
### M4 Pro Benchmark Results
Benchmarked on **M4 Pro (16-core GPU, 48GB RAM)** - Naive vs Optimized:
| Model | Params | TF Naive | TF Optimized | MLX Naive | MLX Optimized | Best |
|-------|--------|----------|--------------|-----------|---------------|------|
| MNIST CNN | 93K | 77.2s | 24.8s | 16.4s | **11.6s** | MLX Opt |
| Fashion CNN | 412K | 95.3s | 28.2s | 28.0s | **24.1s** | MLX Opt |
### Optimization Impact
| Framework | Optimization | MNIST Speedup | Fashion Speedup |
|-----------|--------------|---------------|-----------------|
| TensorFlow | tf.data + batch=256 | **3.1x faster** | **3.4x faster** |
| MLX | eval per epoch + batch=256 | 1.4x faster | 1.2x faster |
**Key Insights**:
- **TensorFlow benefits most from optimization** - tf.data.Dataset provides 3x+ speedup
- **MLX is fast out of the box** - Already optimized, less room for improvement
- **MLX wins for small/medium models** - Even optimized TensorFlow can't catch up
### When to Use Each Framework
**When to use MLX:**
- Small-to-medium models (< 10M parameters): fastest option in the benchmarks above, which cover models up to 412K parameters
- Rapid prototyping on Apple Silicon
- Apple-native applications (Core ML integration)
- When you want good performance without optimization work
**When to use TensorFlow Metal:**
- Cross-platform deployment requirements
- Access to TensorFlow Hub / Keras ecosystem
- Production pipelines with TensorFlow Serving
- When you'll invest in tf.data optimization
## Methodology
- All benchmarks run 3 times, median reported
- System was idle during benchmarks (no background tasks)
- Same model architecture across all hardware
- Data loading time excluded from measurements
- Batch sizes kept consistent (64 for MNIST, 128 for CIFAR-100)
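The "3 runs, median reported" rule can be reproduced with a small timing harness. The sketch below is one way to do it using only the standard library; `median_runtime` is an illustrative name and not a function from this repository.

```python
import statistics
import time
from typing import Callable


def median_runtime(run: Callable[[], object], repeats: int = 3) -> float:
    """Call `run` `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload. In a notebook this would wrap one training epoch on data
# that is already loaded, so that data loading stays out of the measurement.
elapsed = median_runtime(lambda: sum(range(1_000_000)))
print(f"median over 3 runs: {elapsed:.4f}s")
```

The median is less sensitive than the mean to a single slow run, for example one affected by first-run warm-up or a background process.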
## Contributing
1. Run benchmarks on your hardware
2. Add results to `benchmarks/results.json`
3. Run `notebooks/benchmark_report.ipynb` to regenerate charts
4. Submit a pull request
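For step 2, a result can be appended with a few lines of Python. The schema of `benchmarks/results.json` is not documented here, so the sketch below assumes the file holds a JSON list of objects, and every field name in `example_entry` is hypothetical. Check the existing entries and match their keys before submitting.

```python
import json
from pathlib import Path


def add_result(path: Path, entry: dict) -> list:
    """Append `entry` to the JSON list stored at `path` and write it back."""
    # Assumption: the file contains a top-level JSON list of result objects.
    results = json.loads(path.read_text()) if path.exists() else []
    results.append(entry)
    path.write_text(json.dumps(results, indent=2) + "\n")
    return results


# Hypothetical entry; the field names are assumptions, not the repository's schema.
# The values are taken from the M2 row of the VGG16 table in this README.
example_entry = {
    "hardware": "M2 10-core GPU",
    "platform": "macOS",
    "gpu": True,
    "benchmark": "vgg16_cifar100",
    "seconds_per_epoch": 64,
}
```

From the repository root, the call would be `add_result(Path("benchmarks/results.json"), example_entry)`.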
## License
MIT