# โšก๏ธ mlx-benchmark โšก๏ธ
### A comprehensive benchmark of MLX ops.

This repo aims to benchmark Apple's MLX operations and layers on all Apple Silicon chips, along with some CUDA GPUs.

**Contributions:** Everyone can contribute to the benchmark! If you have a missing device or if you want to add a missing layer/operation, please read the [contribution guidelines](CONTRIBUTING.md).

Current M chips: `M1`, `M1 Pro`, `M1 Max`, `M2`, `M2 Pro`, `M2 Max`, `M2 Ultra`, `M3`, `M3 Pro`, `M3 Max`.

Current CUDA GPUs: `RTX4090`, `Tesla V100`.

Missing devices: `M1 Ultra` and other CUDA GPUs.

> [!NOTE]
> You can submit your benchmark even for a device that is already listed, provided you use a newer version of MLX. Simply submit a PR that overwrites the old benchmark table. Also note that most of the existing benchmarks do not include the `mx.compile` feature, which was recently added to mlx-benchmark.

## Benchmarks 🧪

Benchmarks are generated by measuring the runtime of every `mlx` operation on GPU and CPU, along with its equivalent in PyTorch with the `mps`, `cpu`, and `cuda` backends. On MLX with GPU, operations compiled with `mx.compile` are included in the benchmark by default. To exclude the compiled functions from the benchmark, set `--compile=False`.
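
For illustration, the snippet below shows the general idea behind such a measurement. It is only a minimal sketch, not the repository's actual harness in `run_benchmark.py`: the operation (`matmul`), the matrix shapes, the iteration count, and the `time_fn` helper are placeholder choices.

```python
# Minimal sketch of timing one op (matmul) on MLX GPU, MLX GPU + mx.compile,
# and torch MPS. Shapes and iteration counts are illustrative only.
import time

import mlx.core as mx
import torch


def time_fn(fn, iters=10):
    # Warm up once, then average the wall-clock time of `iters` runs.
    fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000  # ms


# --- MLX (the GPU is the default device on Apple Silicon) ---
a = mx.random.uniform(shape=(1024, 1024))
b = mx.random.uniform(shape=(1024, 1024))

def mlx_matmul():
    # mx.eval forces MLX's lazy computation graph to actually execute.
    mx.eval(mx.matmul(a, b))

compiled_matmul = mx.compile(lambda x, y: mx.matmul(x, y))

def mlx_matmul_compiled():
    mx.eval(compiled_matmul(a, b))

# --- torch MPS ---
ta = torch.randn(1024, 1024, device="mps")
tb = torch.randn(1024, 1024, device="mps")

def torch_matmul_mps():
    torch.matmul(ta, tb)
    torch.mps.synchronize()  # wait for the GPU to finish before stopping the clock

print("mlx gpu          :", time_fn(mlx_matmul), "ms")
print("mlx gpu compiled :", time_fn(mlx_matmul_compiled), "ms")
print("torch mps        :", time_fn(torch_matmul_mps), "ms")
```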

For each operation, we measure the runtime of multiple experiments. We provide two benchmarks based on these experiments:

* [Detailed benchmark](benchmarks/detailed_benchmark.md): provides the runtime of each experiment.
* [Average runtime benchmark](benchmarks/average_benchmark.md): reports the mean runtime across experiments. Easier to navigate, with fewer details (a small sketch of this aggregation follows the list).
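
As a rough illustration of how the average table relates to the detailed one, assuming per-experiment timings collected in a plain dict (a hypothetical structure, not the repo's internal format):

```python
# Hypothetical per-experiment timings (ms) for one operation on each backend.
detailed = {
    "matmul / dim=1024x1024": {"mlx_gpu": [3.1, 3.0, 3.2], "mps": [4.0, 4.1, 3.9]},
    "matmul / dim=4096x4096": {"mlx_gpu": [45.2, 44.8, 45.5], "mps": [60.1, 59.7, 60.4]},
}

# The average benchmark simply reports the mean runtime per operation and backend.
average = {
    op: {backend: sum(times) / len(times) for backend, times in backends.items()}
    for op, backends in detailed.items()
}
print(average)
```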

## Installation 💻

### Installation on Mac devices

Running the benchmark locally is straightforward. Create a new env with `osx-arm64` architecture and install the dependencies.

```shell
CONDA_SUBDIR=osx-arm64 conda create -n mlx_benchmark python=3.10 numpy pytorch torchvision scipy requests -c conda-forge

pip install -r requirements.txt
```

### Installation on other devices
Operating systems other than macOS can only run the torch experiments, on CPU or on a CUDA device. Create a new env without the `CONDA_SUBDIR=osx-arm64` prefix, install the torch package that matches your CUDA version, and then install all the requirements within `requirements.txt` except `mlx`.

Finally, open the `config.py` file and set:
```python
USE_MLX = False
```
to avoid importing the mlx package, which cannot be installed on non-Mac devices.
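
One common way such a flag is consumed is to guard the import; this is a sketch of the pattern, not necessarily the exact code in this repo:

```python
# Sketch: guard the mlx import behind the config flag so non-Mac devices
# never try to import a package they cannot install.
from config import USE_MLX

if USE_MLX:
    import mlx.core as mx
else:
    mx = None  # MLX experiments are skipped on this device
```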

## Run the benchmark ๐Ÿง‘โ€๐Ÿ’ป

### Run on Mac

To run the benchmark on MPS, MLX (GPU and CPU), and the PyTorch CPU backend:

```shell
python run_benchmark.py --include_mps=True --include_mlx_gpu=True --include_mlx_cpu=True --include_cpu=True
```

### Run on other devices

To run the torch benchmark on CUDA and CPU:

```shell
python run_benchmark.py --include_mps=False --include_mlx_gpu=False --include_mlx_cpu=False --include_cuda=True --include_cpu=True
```

### Run only compiled functions

If you are only interested in comparing plain MLX operations against their `mx.compile`-compiled counterparts, you can run:

```shell
python run_benchmark.py --include_mps=False --include_cpu=False --include_mlx_cpu=False
```

## Contributing 🚀

If you have a device not yet featured in the benchmark, especially one of those listed as missing above, your PR is welcome to broaden the scope and accuracy of this project.