{"id":29035725,"url":"https://github.com/ashvardanian/BenchmarkingTutorial","last_synced_at":"2025-06-26T12:31:40.105Z","repository":{"id":157821888,"uuid":"465608128","full_name":"ashvardanian/less_slow.cpp","owner":"ashvardanian","description":"Playing around \"Less Slow\" coding practices in C++ 20, C, CUDA, PTX, \u0026 Assembly, from numerics \u0026 SIMD to coroutines, ranges, exception handling, networking and user-space IO","archived":false,"fork":false,"pushed_at":"2025-05-19T06:37:45.000Z","size":2188,"stargazers_count":1792,"open_issues_count":12,"forks_count":67,"subscribers_count":19,"default_branch":"main","last_synced_at":"2025-06-21T14:52:59.394Z","etag":null,"topics":["assembly","assembly-language","avx512","benchmark","coroutines","cpp","cpp-programming","cpp17","cpp20","cuda","gcc","google-benchmark","hpc","io-uring","linux-kernel","llvm","ptx","ranges","tutorial","tutorials"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/tags/less-slow/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-03-03T07:10:25.000Z","updated_at":"2025-06-18T05:43:07.000Z","dependencies_parsed_at":"2024-12-16T16:28:10.152Z","dependency_job_id":"b7d1d16b-8b38-4121-b291-20d8137e5330","html_url":"https://github.com/ashvardanian/less_slow.cpp","commit_stats":{"total_commits":21,"total_committers":2,"mean_commits":10.5,"dds":"0.47619047619047616","last_synced_commit":"fd548fae731e4273a723b4541c47417c78b689d1"},"previous_names":["
ashvardanian/less_slow.cpp","ashvardanian/benchmarkingtutorial"],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/ashvardanian/less_slow.cpp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fless_slow.cpp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fless_slow.cpp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fless_slow.cpp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fless_slow.cpp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/less_slow.cpp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fless_slow.cpp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262067916,"owners_count":23253698,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembly","assembly-language","avx512","benchmark","coroutines","cpp","cpp-programming","cpp17","cpp20","cuda","gcc","google-benchmark","hpc","io-uring","linux-kernel","llvm","ptx","ranges","tutorial","tutorials"],"created_at":"2025-06-26T12:31:37.068Z","updated_at":"2025-06-26T12:31:40.065Z","avatar_url":"https://github.com/ashvardanian.png","language":"C++","readme":"# Playing Around _Less Slow_ Coding Practices for C++, CUDA, and Assembly Code\n\n\u003e The benchmarks in this repository don't aim to cover every topic 
entirely, but they help form a mindset and intuition for performance-oriented software design.\n\u003e It also provides an example of using some non-[STL](https://en.wikipedia.org/wiki/Standard_Template_Library) but de facto standard libraries in C++, importing them via CMake and compiling from source.\n\u003e For higher-level abstractions and languages, check out [`less_slow.rs`](https://github.com/ashvardanian/less_slow.rs) and [`less_slow.py`](https://github.com/ashvardanian/less_slow.py).\n\u003e I needed many of these measurements to reconsider my own coding habits, but hopefully they're helpful to others as well.\n\u003e Most of the code is organized in very long, ordered, and nested `#pragma` sections — not necessarily the preferred form for everyone.\n\nMuch of modern code suffers from common pitfalls — bugs, security vulnerabilities, and __performance bottlenecks__.\nUniversity curricula and coding bootcamps tend to stick to traditional coding styles and standard features, rarely exposing the more fun, unusual, and potentially efficient design opportunities.\nThis repository explores just that.\n\n![Less Slow C++](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/less_slow.cpp.jpg?raw=true)\n\nThe code leverages C++20 and CUDA features and is designed primarily for GCC, Clang, and NVCC compilers on Linux, though it may work on other platforms.\nThe topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism.\nSome of the highlights include:\n\n- __100x cheaper random inputs?!__ Discover how input generation sometimes costs more than the algorithm.\n- __1% error in trigonometry at 1/40 cost:__ Approximate STL functions like [`std::sin`](https://en.cppreference.com/w/cpp/numeric/math/sin) in just 3 lines of code.\n- __4x faster lazy-logic__ with custom [`std::ranges`](https://en.cppreference.com/w/cpp/ranges) and iterators!\n- __Compiler 
optimizations beyond `-O3`:__ Learn about less obvious flags and techniques for another 2x speedup.\n- __Multiplying matrices?__ Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.\n- __Scaling AI?__ Measure the gap between theoretical [ALU](https://en.wikipedia.org/wiki/Arithmetic_logic_unit) throughput and your [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms).\n- __How many if conditions are too many?__ Test your CPU's branch predictor with just 10 lines of code.\n- __Prefer recursion to iteration?__ Measure the depth at which your algorithm will [`SEGFAULT`](https://en.wikipedia.org/wiki/Segmentation_fault).\n- __Why avoid exceptions?__ Take `std::error_code` or [`std::variant`](https://en.cppreference.com/w/cpp/utility/variant)-like [ADTs](https://en.wikipedia.org/wiki/Algebraic_data_type)?\n- __Scaling to many cores?__ Learn how to use [OpenMP](https://en.wikipedia.org/wiki/OpenMP), Intel's oneTBB, or your custom thread pool.\n- __How to handle [JSON](https://www.json.org/json-en.html) avoiding memory allocations?__ Is it easier with C++ 20 or old-school C 99 tools?\n- __How to properly use STL's associative containers__ with custom keys and transparent comparators?\n- __How to beat a hand-written parser__ with [`consteval`](https://en.cppreference.com/w/cpp/language/consteval) RegEx engines?\n- __Is the pointer size really 64 bits__ and how to exploit [pointer-tagging](https://en.wikipedia.org/wiki/Tagged_pointer)?\n- __How many packets is [UDP](https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/) dropping__ and how to serve web requests in [`io_uring`](https://en.wikipedia.org/wiki/Io_uring) from user-space?\n- __Scatter and Gather__ for 50% faster vectorized disjoint memory operations.\n- __Intel's oneAPI vs Nvidia's CCCL?__ What's so special about `\u003cthrust\u003e` and `\u003ccub\u003e`?\n- __CUDA C++, [PTX](https://en.wikipedia.org/wiki/Parallel_Thread_Execution) Intermediate 
Representations, and SASS__, and how do they differ from CPU code?\n- __How to choose between intrinsics, inline `asm`, and separate `.S` files__ for your performance-critical code?\n- __Tensor Cores \u0026 Memory__ differences on CPUs, and Volta, Ampere, Hopper, and Blackwell GPUs!\n- __How coding FPGA differs from GPU__ and what are High-Level Synthesis, Verilog, and VHDL? 🔜 #36\n- __What are Encrypted Enclaves__ and what's the latency of Intel SGX, AMD SEV, and ARM Realm? 🔜 #31\n\nTo read, jump to the [`less_slow.cpp` source file](https://github.com/ashvardanian/less_slow.cpp/blob/main/less_slow.cpp) and read the code snippets and comments.\nKeep in mind that most modern IDEs have a navigation bar to help you view and jump between `#pragma region` sections.\nFollow the instructions below to run the code in your environment and compare it to the comments as you read through the source.\n\n## Running the Benchmarks\n\nThe project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, MacOS, and Windows.\nThat said, to cover the broadest functionality, using GCC on Linux is recommended:\n\n- If you are on Windows, it's recommended that you set up a Linux environment using [WSL](https://docs.microsoft.com/en-us/windows/wsl/install).\n- If you are on MacOS, consider using the non-native distribution of Clang from [Homebrew](https://brew.sh) or [MacPorts](https://www.macports.org).\n- If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.\n\nIf you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.\n\n```sh\ngit clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository\ncd less_slow.cpp                                            # Change the directory\n\npip install cmake --upgrade                                 # PyPI has a newer version of CMake\nsudo apt-get install -y 
build-essential g++                 # Install default build tools\nsudo apt-get install -y pkg-config liburing-dev             # Install liburing for kernel-bypass\nsudo apt-get install -y libopenblas-base                    # Install numerics libraries\n\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release          # Generate the build files\ncmake --build build_release --config Release                # Build the project\nbuild_release/less_slow                                     # Run the benchmarks\n```\n\nThe build will pull and compile several third-party dependencies from the source:\n\n- Google's [Benchmark](https://github.com/google/benchmark) is used for profiling.\n- Intel's [oneTBB](https://github.com/uxlfoundation/oneTBB) is used as the Parallel STL backend.\n- Meta's [libunifex](https://github.com/facebookexperimental/libunifex) is used for senders \u0026 executors.\n- Eric Niebler's [range-v3](https://github.com/ericniebler/range-v3) replaces `std::ranges`.\n- Victor Zverovich's [fmt](https://github.com/fmtlib/fmt) replaces `std::format`.\n- Ash Vardanian's [StringZilla](https://github.com/ashvardanian/stringzilla) replaces `std::string`.\n- Hana Dusíková's [CTRE](https://github.com/hanickadot/compile-time-regular-expressions) replaces `std::regex`.\n- Niels Lohmann's [json](https://github.com/nlohmann/json) is used for JSON deserialization.\n- Yaoyuan Guo's [yyjson](https://github.com/ibireme/yyjson) for faster JSON processing.\n- Google's [Abseil](https://github.com/abseil/abseil-cpp) replaces STL's associative containers.\n- Lewis Baker's [cppcoro](https://github.com/lewissbaker/cppcoro) implements C++20 coroutines.\n- Jens Axboe's [liburing](https://github.com/axboe/liburing) to simplify Linux kernel-bypass.\n- Chris Kohlhoff's [ASIO](https://github.com/chriskohlhoff/asio) as a [networking TS](https://en.cppreference.com/w/cpp/experimental/networking) extension.\n- Nvidia's [CCCL](https://github.com/nvidia/cccl) for GPU-accelerated algorithms.\n- 
Nvidia's [CUTLASS](https://github.com/nvidia/cutlass) for GPU-accelerated Linear Algebra.\n\nTo build without Parallel STL, Intel TBB, BLAS, and CUDA:\n\n```sh\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release -D USE_INTEL_TBB=OFF -D USE_NVIDIA_CCCL=OFF -D USE_BLAS=OFF\ncmake --build build_release --config Release\n```\n\nTo build on MacOS, pulling key dependencies from [Homebrew](https://brew.sh):\n\n```sh\nbrew install openblas\ncmake -B build_release \\\n      -D CMAKE_BUILD_TYPE=Release \\\n      -D CMAKE_C_FLAGS=\"-I$(brew --prefix openblas)/include\" \\\n      -D CMAKE_CXX_FLAGS=\"-I$(brew --prefix openblas)/include\" \\\n      -D CMAKE_EXE_LINKER_FLAGS=\"-L$(brew --prefix openblas)/lib\"\ncmake --build build_release --config Release\n```\n\nTo control the output or run specific benchmarks, use the following flags:\n\n```sh\nbuild_release/less_slow --benchmark_format=json             # Output in JSON format\nbuild_release/less_slow --benchmark_out=results.json        # Save the results to a file in addition to `stdout`\nbuild_release/less_slow --benchmark_filter=std_sort         # Run only benchmarks containing `std_sort` in their name\n```\n\nTo enhance stability and reproducibility, disable Simultaneous Multi-Threading __(SMT)__ on your CPU and use the `--benchmark_enable_random_interleaving=true` flag, which shuffles and interleaves benchmarks as described [here](https://github.com/google/benchmark/blob/main/docs/random_interleaving.md).\n\n```sh\nbuild_release/less_slow --benchmark_enable_random_interleaving=true\n```\n\nGoogle Benchmark supports [User-Requested Performance Counters](https://github.com/google/benchmark/blob/main/docs/perf_counters.md) through `libpfm`.\nNote that collecting these may require `sudo` privileges.\n\n```sh\nsudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters=\"CYCLES,INSTRUCTIONS\"\n```\n\nAlternatively, use the Linux `perf` tool for performance counter 
collection:\n\n```sh\nsudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort\n```\n\n## Project Structure\n\nThe primary file of this repository is the `less_slow.cpp` C++ file with CPU-side code.\nSeveral other files cover different hardware-specific optimizations:\n\n```sh\n$ tree .\n.\n├── CMakeLists.txt          # Build \u0026 assembly instructions for all files\n├── less_slow.cpp           # Primary CPU-side benchmarking code with the majority of examples\n├── less_slow_amd64.S       # Hand-written Assembly kernels for 64-bit x86 CPUs\n├── less_slow_aarch64.S     # Hand-written Assembly kernels for 64-bit Arm CPUs\n├── less_slow.cu            # CUDA C++ examples for parallel algorithms for Nvidia GPUs\n├── less_slow_sm70.ptx      # Hand-written PTX IR kernels for Nvidia Volta GPUs\n└── less_slow_sm90a.ptx     # Hand-written PTX IR kernels for Nvidia Hopper GPUs\n```\n\n## Memes and References\n\nEducational content without memes?!\nCome on!\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"https://github.com/ashvardanian/ashvardanian/blob/master/memes/ieee764-vs-gnu-compiler.jpg?raw=true\" alt=\"IEEE 754 vs GNU Compiler\"\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"https://github.com/ashvardanian/ashvardanian/blob/master/memes/no-easter-bunny-no-free-abstractions.jpg?raw=true\" alt=\"No Easter Bunny, No Free Abstractions\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n## Google Benchmark Functionality\n\nThis benchmark suite uses most of the features provided by Google Benchmark.\nIf you write a lot of benchmarks and want to avoid reading the full [User Guide](https://github.com/google/benchmark/blob/main/docs/user_guide.md), here is a condensed list of the most useful features:\n\n- `-\u003eArgs({x, y})` - Pass multiple arguments to parameterized benchmarks\n- `BENCHMARK()` - Register a basic 
benchmark function\n- `BENCHMARK_CAPTURE()` - Create variants of benchmarks with different captured values\n- `Counter::kAvgThreads` - Specify thread-averaged counters\n- `DoNotOptimize()` - Prevent compiler from optimizing away operations\n- `ClobberMemory()` - Force memory synchronization\n- `-\u003eComplexity(oNLogN)` - Specify and validate algorithmic complexity\n- `-\u003eSetComplexityN(n)` - Set input size for complexity calculations\n- `-\u003eComputeStatistics(\"max\", ...)` - Calculate custom statistics across runs\n- `-\u003eIterations(n)` - Control exact number of iterations\n- `-\u003eMinTime(n)` - Set minimum benchmark duration\n- `-\u003eMinWarmUpTime(n)` - To warm up the data caches\n- `-\u003eName(\"...\")` - Assign custom benchmark names\n- `-\u003eRange(start, end)` - Profile for a range of input sizes\n- `-\u003eRangeMultiplier(n)` - Set multiplier between range values\n- `-\u003eReportAggregatesOnly()` - Show only aggregated statistics\n- `state.counters[\"name\"]` - Create custom performance counters\n- `state.PauseTiming()`, `ResumeTiming()` - Control timing measurement\n- `state.SetBytesProcessed(n)` - Record number of bytes processed\n- `state.SkipWithError()` - Skip benchmark with error message\n- `-\u003eThreads(n)` - Run benchmark with specified number of threads\n- `-\u003eUnit(kMicrosecond)` - Set time unit for reporting\n- `-\u003eUseRealTime()` - Measure real time instead of CPU time\n- `-\u003eUseManualTime()` - To feed custom timings for GPU and IO benchmarks\n","funding_links":[],"categories":["Engineering \u0026 Performance"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2FBenchmarkingTutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2FBenchmarkingTutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2FBenchmarkingTutorial/lists"}