{"id":16976167,"url":"https://github.com/ashvardanian/parallelreductionsbenchmark","last_synced_at":"2025-04-06T18:14:18.298Z","repository":{"id":53699413,"uuid":"206343839","full_name":"ashvardanian/ParallelReductionsBenchmark","owner":"ashvardanian","description":"Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal - all it takes to sum a lot of numbers fast!","archived":false,"fork":false,"pushed_at":"2025-02-23T17:08:45.000Z","size":18175,"stargazers_count":92,"open_issues_count":4,"forks_count":9,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-03-30T17:08:46.724Z","etag":null,"topics":["apple","avx512","cuda","glsl","gpgpu","gpu","gpu-acceleration","gpu-computing","hpc","intel","metal","nvidia","opencl","openmp","parallel","simd","stl","tbb","thrust"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/cuda-parallel-reductions/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-04T14:50:38.000Z","updated_at":"2025-03-25T15:11:18.000Z","dependencies_parsed_at":"2025-01-15T22:14:25.282Z","dependency_job_id":"dc6e67b8-5a52-477e-ad86-b8979b3da3c9","html_url":"https://github.com/ashvardanian/ParallelReductionsBenchmark","commit_stats":{"total_commits":61,"total_committers":3,"mean_commits":"20.333333333333332","dds":0.180327868852459,"last_synced_commit":"1a271b060e5319d3320402913dce004e49c4b6e3"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FParallelReductionsBenchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FParallelReductionsBenchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FParallelReductionsBenchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FParallelReductionsBenchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/ParallelReductionsBenchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247526762,"owners_count":20953143,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple","avx512","cuda","glsl","gpgpu","gpu","gpu-acceleration","gpu-computing","hpc","intel","metal","nvidia","opencl","openmp","parallel","simd","stl","tbb","thrust"],"created_at":"2024-10-14T01:25:10.390Z","updated_at":"2025-04-06T18:14:18.277Z","avatar_url":"https://github.com/ashvardanian.png","language":"C++","readme":"# Parallel Reductions Benchmark for CPUs \u0026 GPUs\n\n![Parallel Reductions Benchmark](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/ParallelReductionsBenchmark.jpg?raw=true)\n\nOne of the canonical examples when designing parallel algorithms is implementing parallel tree-like reductions, which is a special case of accumulating a bunch of numbers located in a continuous block of memory.\nIn modern C++, most developers 
would call `std::accumulate(array.begin(), array.end(), 0)`, and in Python, it's just `sum(array)`.\nImplementing those operations with high utilization on many-core systems is surprisingly non-trivial and depends heavily on the hardware architecture.\nThis repository contains several educational examples showcasing the performance differences between different solutions:\n\n- Single-threaded but SIMD-accelerated code:\n  - SSE, AVX, AVX-512 on x86.\n  - 🔜 NEON and SVE on Arm.\n- OpenMP `reduction` clause.\n- Thrust with its `thrust::reduce`.\n- CUB with its `cub::DeviceReduce::Sum`.\n- CUDA kernels with and without [warp-primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/).\n- CUDA kernels with [Tensor-Core](https://www.nvidia.com/en-gb/data-center/tensor-cores/) acceleration.\n- [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and cuBLAS strided vector and matrix routines.\n- OpenCL kernels, eight of them.\n- Parallel STL `\u003calgorithm\u003e` in GCC with Intel oneTBB.\n\nNotably:\n\n- on arrays with billions of elements, the default `float` error mounts, and the results become inaccurate unless a [Kahan-like scheme](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) is used.\n- to minimize the overhead of [Translation Lookaside Buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) __(TLB)__ misses, the arrays are aligned to the OS page size and are allocated in [huge pages on Linux](https://wiki.debian.org/Hugepages), if possible.\n- to reduce the memory access latency on many-core [Non-Uniform Memory Access](https://en.wikipedia.org/wiki/Non-uniform_memory_access) __(NUMA)__ systems, `libnuma` and `pthread` help maximize data affinity.\n- to \"hide\" latency on wide CPU registers (like `ZMM`), expensive Assembly instructions executed on different [CPU ports](https://easyperf.net/blog/2018/03/21/port-contention#utilizing-full-capacity-of-the-load-instructions) are interleaved.\n\n---\n\nThe 
examples in this repository were originally written in the early 2010s and were updated in 2019, 2022, and 2025.\nPreviously, the repository also included ArrayFire, Halide, and Vulkan queues for SPIR-V kernels and SYCL.\n\n- [Lecture Slides](https://drive.google.com/file/d/16AicAl99t3ZZFnza04Wnw_Vuem0w8lc7/view?usp=sharing) from 2019.\n- [CppRussia Talk](https://youtu.be/AA4RI6o0h1U) in Russia in 2019.\n- [JetBrains Talk](https://youtu.be/BUtHOftDm_Y) in Germany \u0026 Russia in 2019.\n\n## Build \u0026 Run\n\nThis repository is a CMake project designed to be built on Linux with GCC, Clang, or NVCC.\nYou may need to install the following dependencies for complete functionality:\n\n```sh\nsudo apt install libblas-dev            # For OpenBLAS on Linux\nsudo apt install libnuma1 libnuma-dev   # For NUMA allocators on Linux\nsudo apt install cuda-toolkit           # This may not be as easy 😈\n```\n\nThe following script will, by default, generate a 1GB array of numbers and reduce them using every available backend.\nAll the classical Google Benchmark arguments are supported, including `--benchmark_filter=opencl`.\nAll the library dependencies, including GTest, GBench, Intel oneTBB, FMT, and Thrust with CUB, will be automatically fetched.\nYou are expected to build this on an x86 machine with CUDA drivers installed.\n\n```sh\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release         # Generate the build files\ncmake --build build_release --config Release               # Build the project\nbuild_release/reduce_bench                                 # Run all benchmarks\nbuild_release/reduce_bench --benchmark_filter=\"cuda\"       # Only CUDA-related\nPARALLEL_REDUCTIONS_LENGTH=1024 build_release/reduce_bench # Set a different input size\n```\n\nNeed more fine-grained control to run only CUDA-based backends?\n\n```sh\ncmake -DCMAKE_CUDA_COMPILER=nvcc -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -B build_release\ncmake --build build_release --config 
Release\nbuild_release/reduce_bench --benchmark_filter=cuda\n```\n\nTo debug or introspect, the procedure is similar:\n\n```sh\ncmake -DCMAKE_CUDA_COMPILER=nvcc -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_BUILD_TYPE=Debug -B build_debug\ncmake --build build_debug --config Debug\n```\n\nAnd then run your favorite debugger.\n\nOptional backends:\n\n- To enable [Intel OpenCL](https://github.com/intel/compute-runtime/blob/master/README.md) on CPUs: `apt-get install intel-opencl-icd`.\n- To run on an integrated Intel GPU, follow [this guide](https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/prerequisites.html).\n\n## Results\n\nDifferent hardware will yield different results, but the general trends and observations are:\n\n- Accumulating over 100M `float` values generally requires `double` precision or Kahan-like numerical tricks to avoid instability.\n- A carefully unrolled `for`-loop is easier for the compiler to vectorize and faster than `std::accumulate`.\n- For `float`, `double`, and even Kahan-like schemes, hand-written AVX2 code is faster than auto-vectorization.\n- Parallel `std::reduce` for large collections is naturally faster than serial `std::accumulate`, but you may not feel the difference between `std::execution::par` and `std::execution::par_unseq` on CPU.\n- CUB is always faster than Thrust, and even for trivial types and large jobs, the difference can be 50%.\n\n### Nvidia DGX-H100\n\nOn Nvidia DGX-H100 nodes, with GCC 12 and NVCC 12.1, one may expect the following results:\n\n```sh\n$ build_release/reduce_bench\nYou did not feed the size of arrays, so we will use a 1GB array!\n2024-05-06T00:11:14+00:00\nRunning build_release/reduce_bench\nRun on (160 X 2100 MHz CPU s)\nCPU Caches:\n  L1 Data 32 KiB (x160)\n  L1 Instruction 32 KiB (x160)\n  L2 Unified 4096 KiB (x80)\n  L3 Unified 16384 KiB (x2)\nLoad Average: 3.23, 19.01, 
13.71\n----------------------------------------------------------------------------------------------------------------\nBenchmark                                                      Time             CPU   Iterations UserCounters...\n----------------------------------------------------------------------------------------------------------------\nunrolled\u003cf32\u003e/min_time:10.000/real_time                149618549 ns    149615366 ns           95 bytes/s=7.17653G/s error,%=50\nunrolled\u003cf64\u003e/min_time:10.000/real_time                146594731 ns    146593719 ns           95 bytes/s=7.32456G/s error,%=0\nstd::accumulate\u003cf32\u003e/min_time:10.000/real_time         194089563 ns    194088811 ns           72 bytes/s=5.5322G/s error,%=93.75\nstd::accumulate\u003cf64\u003e/min_time:10.000/real_time         192657883 ns    192657360 ns           74 bytes/s=5.57331G/s error,%=0\nopenmp\u003cf32\u003e/min_time:10.000/real_time                    5061544 ns      5043250 ns         2407 bytes/s=212.137G/s error,%=65.5651u\nstd::reduce\u003cpar, f32\u003e/min_time:10.000/real_time          3749938 ns      3727477 ns         2778 bytes/s=286.336G/s error,%=0\nstd::reduce\u003cpar, f64\u003e/min_time:10.000/real_time          3921280 ns      3916897 ns         3722 bytes/s=273.824G/s error,%=100\nstd::reduce\u003cpar_unseq, f32\u003e/min_time:10.000/real_time    3884794 ns      3864061 ns         3644 bytes/s=276.396G/s error,%=0\nstd::reduce\u003cpar_unseq, f64\u003e/min_time:10.000/real_time    3889332 ns      3866968 ns         3585 bytes/s=276.074G/s error,%=100\nsse\u003cf32aligned\u003e@threads/min_time:10.000/real_time        5986350 ns      5193690 ns         2343 bytes/s=179.365G/s error,%=1.25021\navx2\u003cf32\u003e/min_time:10.000/real_time                    110796474 ns    110794861 ns          127 bytes/s=9.69112G/s error,%=50\navx2\u003cf32kahan\u003e/min_time:10.000/real_time               134144762 ns    134137771 ns          105 
bytes/s=8.00435G/s error,%=0\navx2\u003cf64\u003e/min_time:10.000/real_time                    115791797 ns    115790878 ns          121 bytes/s=9.27304G/s error,%=0\navx2\u003cf32aligned\u003e@threads/min_time:10.000/real_time       5958283 ns      5041060 ns         2358 bytes/s=180.21G/s error,%=1.25033\navx2\u003cf64\u003e@threads/min_time:10.000/real_time              5996481 ns      5123440 ns         2337 bytes/s=179.062G/s error,%=1.25001\ncub@cuda/min_time:10.000/real_time                        356488 ns       356482 ns        39315 bytes/s=3.012T/s error,%=0\nwarps@cuda/min_time:10.000/real_time                      486387 ns       486377 ns        28788 bytes/s=2.20759T/s error,%=0\nthrust@cuda/min_time:10.000/real_time                     500941 ns       500919 ns        27512 bytes/s=2.14345T/s error,%=0\n```\n\nObservations:\n\n- 286 GB/s upper bound on the CPU.\n- 2.2 TB/s using vanilla CUDA approaches.\n- 3 TB/s using CUB.\n\nOn Nvidia H200 GPUs, the numbers are even higher:\n\n```sh\n-------------------------------------------------------------------------------------------------------------\nBenchmark                                                   Time             CPU   Iterations UserCounters...\n-------------------------------------------------------------------------------------------------------------\ncuda/cub/min_time:10.000/real_time                     254609 ns       254607 ns        54992 bytes/s=4.21723T/s error,%=0\ncuda/thrust/min_time:10.000/real_time                  319709 ns       316368 ns        43846 bytes/s=3.3585T/s error,%=0\ncuda/thrust/interleaving/min_time:10.000/real_time     318598 ns       314996 ns        43956 bytes/s=3.37021T/s error,%=0\n```\n\n### AWS Zen4 `m7a.metal-48xl`\n\nOn AWS Zen4 `m7a.metal-48xl` instances with GCC 12, one may expect the following results:\n\n```sh\n$ build_release/reduce_bench\nYou did not feed the size of arrays, so we will use a 1GB array!\n2025-01-18T11:26:46+00:00\nRunning 
build_release/reduce_bench\nRun on (192 X 3701.95 MHz CPU s)\nCPU Caches:\n  L1 Data 32 KiB (x192)\n  L1 Instruction 32 KiB (x192)\n  L2 Unified 1024 KiB (x192)\n  L3 Unified 32768 KiB (x24)\nLoad Average: 4.54, 2.78, 4.94\n***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.\n--------------------------------------------------------------------------------------------------------------------\nBenchmark                                                          Time             CPU   Iterations UserCounters...\n--------------------------------------------------------------------------------------------------------------------\nunrolled\u003cf32\u003e/min_time:10.000/real_time                     30546168 ns     30416147 ns          461 bytes/s=35.1514G/s error,%=50\nunrolled\u003cf64\u003e/min_time:10.000/real_time                     31563095 ns     31447017 ns          442 bytes/s=34.0189G/s error,%=0\nstd::accumulate\u003cf32\u003e/min_time:10.000/real_time             219734340 ns    219326135 ns           64 bytes/s=4.88655G/s error,%=93.75\nstd::accumulate\u003cf64\u003e/min_time:10.000/real_time             219853985 ns    219429612 ns           64 bytes/s=4.88389G/s error,%=0\nopenmp\u003cf32\u003e/min_time:10.000/real_time                        5749979 ns      5709315 ns         1996 bytes/s=186.738G/s error,%=149.012u\nstd::reduce\u003cpar, f32\u003e/min_time:10.000/real_time              2913596 ns      2827125 ns         4789 bytes/s=368.528G/s error,%=0\nstd::reduce\u003cpar, f64\u003e/min_time:10.000/real_time              2899901 ns      2831183 ns         4874 bytes/s=370.268G/s error,%=0\nstd::reduce\u003cpar_unseq, f32\u003e/min_time:10.000/real_time        3026168 ns      2940291 ns         4461 bytes/s=354.819G/s error,%=0\nstd::reduce\u003cpar_unseq, f64\u003e/min_time:10.000/real_time        3053703 ns      2936506 ns         4797 bytes/s=351.62G/s 
error,%=0\nsse\u003cf32aligned\u003e@threads/min_time:10.000/real_time           10132563 ns      9734108 ns         1000 bytes/s=105.969G/s error,%=0.520837\navx2\u003cf32\u003e/min_time:10.000/real_time                         32225620 ns     32045487 ns          435 bytes/s=33.3195G/s error,%=50\navx2\u003cf32kahan\u003e/min_time:10.000/real_time                   110283627 ns    110023814 ns          127 bytes/s=9.73619G/s error,%=0\navx2\u003cf64\u003e/min_time:10.000/real_time                         55559986 ns     55422069 ns          247 bytes/s=19.3258G/s error,%=0\navx2\u003cf32aligned\u003e@threads/min_time:10.000/real_time           9612120 ns      9277454 ns         1467 bytes/s=111.707G/s error,%=0.521407\navx2\u003cf64\u003e@threads/min_time:10.000/real_time                 10091882 ns      9708706 ns         1389 bytes/s=106.397G/s error,%=0.520837\navx512\u003cf32streamed\u003e/min_time:10.000/real_time               55713332 ns     55615555 ns          243 bytes/s=19.2726G/s error,%=50\navx512\u003cf32streamed\u003e@threads/min_time:10.000/real_time        9701513 ns      9383267 ns         1435 bytes/s=110.678G/s error,%=50.2604\navx512\u003cf32unrolled\u003e/min_time:10.000/real_time               48203352 ns     48085623 ns          228 bytes/s=22.2753G/s error,%=50\navx512\u003cf32unrolled\u003e@threads/min_time:10.000/real_time        9275968 ns      8955543 ns         1508 bytes/s=115.755G/s error,%=50.2604\navx512\u003cf32interleaving\u003e/min_time:10.000/real_time           40012581 ns     39939290 ns          352 bytes/s=26.8351G/s error,%=50\navx512\u003cf32interleaving\u003e@threads/min_time:10.000/real_time    9477545 ns      9168739 ns         1488 bytes/s=113.293G/s error,%=50.2581\n```\n\nObservations:\n\n- 370 GB/s can be reached in dual-socket DDR5 setups with 12 channel memory.\n- Using Kahan-like schemes is 3x slower than pure `float` and 2x slower than `double`.\n\nOne of the interesting observations is the effect of latency 
hiding, interleaving the operations executing on different ports of the same CPU.\nIt is evident when benchmarking AVX-512 kernels on very small arrays:\n\n```sh\n--------------------------------------------------------------------------------------------------------------------\nBenchmark                                                          Time             CPU   Iterations UserCounters...\n--------------------------------------------------------------------------------------------------------------------\navx512/f32/streamed/min_time:10.000/real_time                    19.4 ns         19.4 ns    724081506 bytes/s=211.264G/s\navx512/f32/unrolled/min_time:10.000/real_time                    15.1 ns         15.1 ns    934282388 bytes/s=271.615G/s\navx512/f32/interleaving/min_time:10.000/real_time                12.3 ns         12.3 ns   1158791855 bytes/s=332.539G/s\n```\n\nThe reason this happens is that on Zen4:\n\n- Addition instructions like `vaddps zmm, zmm, zmm` and `vaddpd zmm, zmm, zmm` execute on ports 2 and 3.\n- Fused-Multiply-Add instructions like `vfmadd132ps zmm, zmm, zmm` execute on ports 0 and 1.\n\nSo if the CPU can fetch enough data in time, we can have at least 4 ports simultaneously busy, and the latency of the operation is hidden.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fparallelreductionsbenchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fparallelreductionsbenchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fparallelreductionsbenchmark/lists"}