{"id":13409613,"url":"https://github.com/dionhaefner/pyhpc-benchmarks","last_synced_at":"2025-04-12T09:21:54.561Z","repository":{"id":38985062,"uuid":"212333820","full_name":"dionhaefner/pyhpc-benchmarks","owner":"dionhaefner","description":"A suite of benchmarks for CPU and GPU performance of the most popular high-performance libraries for Python :rocket:","archived":false,"fork":false,"pushed_at":"2024-10-08T12:43:53.000Z","size":1252,"stargazers_count":319,"open_issues_count":6,"forks_count":24,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-03T09:09:22.001Z","etag":null,"topics":["benchmarks","cupy","gpu","high-performance-computing","jax","parallel-computing","python","pytorch","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dionhaefner.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-02T12:22:16.000Z","updated_at":"2025-04-01T05:56:41.000Z","dependencies_parsed_at":"2024-01-16T07:23:17.621Z","dependency_job_id":"8b36d719-c01f-4140-a7d8-942abb75f127","html_url":"https://github.com/dionhaefner/pyhpc-benchmarks","commit_stats":{"total_commits":85,"total_committers":3,"mean_commits":"28.333333333333332","dds":0.02352941176470591,"last_synced_commit":"d438ef8b76ecb51a019b105a6e06150e5a35c177"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dionhaefner%2Fpyhpc-benchmarks","tags_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/dionhaefner%2Fpyhpc-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dionhaefner%2Fpyhpc-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dionhaefner%2Fpyhpc-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dionhaefner","download_url":"https://codeload.github.com/dionhaefner/pyhpc-benchmarks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248544039,"owners_count":21121882,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarks","cupy","gpu","high-performance-computing","jax","parallel-computing","python","pytorch","tensorflow"],"created_at":"2024-07-30T20:01:02.335Z","updated_at":"2025-04-12T09:21:54.535Z","avatar_url":"https://github.com/dionhaefner.png","language":"Python","funding_links":[],"categories":["Python","Python / C++","Benchmarks and Datasets"],"sub_categories":["Mojo🔥FastAPI Client","Benchmarks"],"readme":"[![DOI](https://zenodo.org/badge/212333820.svg)](https://zenodo.org/badge/latestdoi/212333820)\n\n# HPC benchmarks for Python\n\nThis is a suite of benchmarks to test the *sequential CPU* and GPU performance of various computational backends with Python frontends.\n\nSpecifically, we want to test which high-performance backend is best for *geophysical* (finite-difference based) *simulations*.\n\n**Contents**\n\n- [FAQ](#faq)\n- [Installation](#environment-setup)\n- [Usage](#usage)\n- [Example results](#example-results)\n- 
[Conclusion](#conclusion)\n- [Contributing](#contributing)\n\n## FAQ\n\n### Why?\n\nThe scientific Python ecosystem is thriving, but high-performance computing in Python isn't really a thing yet.\nWe try to change this [with our pure Python ocean simulator Veros](https://github.com/dionhaefner/veros), but which backend should we use for computations?\n\nTremendous amounts of time and resources go into the development of Python frontends to high-performance backends,\nbut those are usually tailored towards deep learning. We wanted to see whether we can profit from those advances, by\n(ab-)using these libraries for geophysical modelling.\n\n### Why do the benchmarks look so weird?\n\nThese are more or less verbatim copies from [Veros](https://github.com/dionhaefner/veros) (i.e., actual parts of a physical model).\nMost earth system and climate model components are based on finite-difference schemes to compute derivatives. This can be represented\nin vectorized form by index shifts of arrays (such as `0.5 * (arr[1:] + arr[:-1])`, which averages neighboring values of `arr`, i.e., interpolates it to the midpoints between grid points). The most common index range is `[2:-2]`, which represents the full domain (the two outermost grid cells are overlap / \"ghost cells\" that allow us to shift the array across the boundary).\n\nNow, maths is difficult, and numerics are weird. 
When many different physical quantities (defined on different grids) interact, things\nget messy very fast.\n\n### Why only test sequential CPU performance?\n\nTwo reasons:\n- I was curious to see how good the compilers are without being able to fall back to thread parallelism.\n- In many physical models, it is pretty straightforward to parallelize the model \"by hand\" via MPI.\n  Therefore, we are not really dependent on good parallel performance out of the box.\n\n### Which backends are currently supported?\n\n- [NumPy](https://numpy.org) (CPU only)\n- [Numba](https://numba.pydata.org) (CPU only)\n- [Aesara](https://github.com/aesara-devs/aesara) (CPU only)\n- [Jax](https://github.com/google/jax)\n- [Tensorflow](https://www.tensorflow.org)\n- [Pytorch](https://pytorch.org)\n- [CuPy](https://cupy.chainer.org/) (GPU only)\n- [Taichi](https://www.taichi-lang.org/)\n\n(not every backend is available for every benchmark)\n\n### What is included in the measurements?\n\nPure time spent number crunching. Preparing the inputs, copying stuff from and to GPU, compilation time, time it takes to check results etc. are excluded.\nThis is based on the assumption that these things are only done a few times per simulation (i.e., that their cost is\namortized during long-running simulations).\n\n### How does this compare to a low-level implementation?\n\nAs a rule of thumb (from our experience with Veros), the performance of a Fortran implementation is very close to that of the Numba backend, or ~3 times faster than NumPy.\n\n\n## Environment setup\n\nFor CPU:\n\n```bash\n$ conda env create -f environment-cpu.yml\n$ conda activate pyhpc-bench-cpu\n```\n\nGPU:\n\n```bash\n$ conda env create -f environment-gpu.yml\n$ conda activate pyhpc-bench-gpu\n```\n\nIf you prefer to install things by hand, just have a look at the environment files to see what you need. 
You don't need to install all backends; if a module is unavailable, it is skipped automatically.\n\n## Usage\n\nYour entrypoint is the script `run.py`:\n\n```bash\n$ python run.py --help\nUsage: run.py [OPTIONS] BENCHMARK\n\n  HPC benchmarks for Python\n\n  Usage:\n\n      $ python run.py benchmarks/\u003cBENCHMARK_FOLDER\u003e\n\n  Examples:\n\n      $ taskset -c 0 python run.py benchmarks/equation_of_state\n\n      $ python run.py benchmarks/equation_of_state -b numpy -b jax --device\n      gpu\n\n  More information:\n\n      https://github.com/dionhaefner/pyhpc-benchmarks\n\nOptions:\n  -s, --size INTEGER              Run benchmark for this array size\n                                  (repeatable)  [default: 4096, 16384, 65536,\n                                  262144, 1048576, 4194304]\n  -b, --backend [numpy|cupy|jax|aesara|numba|pytorch|taichi|tensorflow]\n                                  Run benchmark with this backend (repeatable)\n                                  [default: run all backends]\n  -r, --repetitions INTEGER       Fixed number of iterations to run for each\n                                  size and backend [default: auto-detect]\n  --burnin INTEGER                Number of initial iterations that are\n                                  disregarded for final statistics  [default:\n                                  1]\n  --device [cpu|gpu|tpu]          Run benchmarks on given device where\n                                  supported by the backend  [default: cpu]\n  --help                          Show this message and exit.\n```\n\nBenchmarks are run for all combinations of the chosen sizes (`-s`) and backends (`-b`), in random order.\n\n### CPU\n\nSome backends refuse to be confined to a single thread, so I recommend you wrap your benchmarks\nin `taskset` to set processor affinity to a single core (only works on Linux):\n\n```bash\n$ conda activate pyhpc-bench-cpu\n$ taskset -c 0 python run.py 
benchmarks/\u003cbenchmark_name\u003e\n```\n\n### GPU\n\nSome backends use all available GPUs by default, some don't. If you have multiple GPUs, you can set the\none to be used through `CUDA_VISIBLE_DEVICES`, to keep things fair.\n\nSome backends are greedy with allocating memory. On GPU, you can only run one backend at a time (add NumPy for reference):\n\n```bash\n$ conda activate pyhpc-bench-gpu\n$ export CUDA_VISIBLE_DEVICES=\"0\"\n$ for backend in jax cupy pytorch tensorflow; do\n...    python run.py benchmarks/\u003cbenchmark_name\u003e --device gpu -b $backend -b numpy -s 10_000_000\n...    done\n```\n\n## Example results\n\n### Summary\n\n#### Equation of state\n\n\u003cp align=\"middle\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-equation_of_state-CPU.png?raw=true\" width=\"400\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-equation_of_state-GPU.png?raw=true\" width=\"400\"\u003e\n\u003c/p\u003e\n  \n#### Isoneutral mixing\n\n\u003cp align=\"middle\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-isoneutral_mixing-CPU.png?raw=true\" width=\"400\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-isoneutral_mixing-GPU.png?raw=true\" width=\"400\"\u003e\n\u003c/p\u003e\n\n#### Turbulent kinetic energy\n\n\u003cp align=\"middle\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-turbulent_kinetic_energy-CPU.png?raw=true\" width=\"400\"\u003e\n  \u003cimg src=\"results/aws-plots/bench-turbulent_kinetic_energy-GPU.png?raw=true\" width=\"400\"\u003e\n\u003c/p\u003e\n\n### Full reports\n\n- [Example results on EC2 with Tesla V100 GPU](/results/aws.md) (more reliable)\n- [Example results on Google Colab](/results/colab.md) (easier to reproduce)\n- [Example results on bare metal](/results/magni.md) (most reliable, but outdated)\n\n## Conclusion\n\nLessons I learned by assembling these benchmarks: (your mileage may vary)\n\n- The performance of JAX is very competitive, both on GPU and CPU. 
It is consistently among the top implementations on both platforms.\n- Pytorch performs very well on GPU for large problems (slightly better than JAX), but its CPU performance is not great for tasks with many slicing operations.\n- Numba is a great choice on CPU if you don't mind writing explicit for loops (which can be more readable than a vectorized implementation), being slightly faster than JAX with little effort.\n- JAX performance on GPU seems to be quite hardware dependent. JAX performs significantly better (relatively speaking) on a Tesla P100 than a Tesla K80.\n- If you have embarrassingly parallel workloads, speedups of \u003e 1000x are easy to achieve on high-end GPUs.\n- TPUs are catching up to GPUs. We can now get similar performance to a high-end GPU on these workloads.\n- Tensorflow is not great for applications like ours, since it lacks tools to apply partial updates to tensors (such as `tensor[2:-2] = 0.`).\n- If you use Tensorflow on CPU, make sure to use XLA (`experimental_compile`) for tremendous speedups.\n- CuPy is nice! Often you don't need to change anything in your NumPy code to have it run on GPU (with decent, but not outstanding performance).\n- Reaching Fortran performance on CPU for non-trivial tasks is hard :)\n\n## Contributing\n\nCommunity contributions are encouraged! Whether you want to donate another benchmark, share your experience, optimize an implementation, or suggest another backend - [feel free to ask](https://github.com/dionhaefner/pyhpc-benchmarks/issues) or [open a PR](https://github.com/dionhaefner/pyhpc-benchmarks/pulls).\n\n### Adding a new backend\n\nAdding a new backend is easy!\n\nLet's assume that you want to add support for a library called `speedygonzales`. All you need to do is this:\n\n- Implement a benchmark to use your library, e.g. 
`benchmarks/equation_of_state/eos_speedygonzales.py`.\n- Register the benchmark in the respective `__init__.py` file (`benchmarks/equation_of_state/__init__.py`), by adding `\"speedygonzales\"` to its `__implementations__` tuple.\n- Register the backend, by adding its setup function to the `__backends__` dict in [`backends.py`](https://github.com/dionhaefner/pyhpc-benchmarks/blob/master/backends.py).\n\n   A setup function is what is called before every call to your benchmark, and can be used for custom setup and teardown. In the simplest case, it is just\n\n   ```python\n   def setup_speedygonzales(device='cpu'):\n       # code to run before benchmark\n       yield\n       # code to run after benchmark\n   ```\n\nThen, you can run the benchmark with your new backend:\n\n```bash\n$ python run.py benchmarks/equation_of_state -b speedygonzales\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdionhaefner%2Fpyhpc-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdionhaefner%2Fpyhpc-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdionhaefner%2Fpyhpc-benchmarks/lists"}