{"id":17980916,"url":"https://github.com/nvidia/nvbench","last_synced_at":"2025-04-14T13:46:58.155Z","repository":{"id":37028168,"uuid":"344241309","full_name":"NVIDIA/nvbench","owner":"NVIDIA","description":"CUDA Kernel Benchmarking Library","archived":false,"fork":false,"pushed_at":"2025-04-13T04:14:53.000Z","size":1072,"stargazers_count":617,"open_issues_count":67,"forks_count":74,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-13T20:59:54.647Z","etag":null,"topics":["benchmark","cuda","cuda-kernels","gpu","kernel-benchmark","nvidia","performance"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-03T19:29:55.000Z","updated_at":"2025-04-13T14:21:05.000Z","dependencies_parsed_at":"2022-06-29T08:04:25.651Z","dependency_job_id":"fb60f83f-5d45-4305-834b-f2968d29c436","html_url":"https://github.com/NVIDIA/nvbench","commit_stats":{"total_commits":403,"total_committers":20,"mean_commits":20.15,"dds":0.3052109181141439,"last_synced_commit":"c03033b50e46748207b27685b1cdfcbe4a2fec59"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvbench/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/nvbench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248890633,"owners_count":21178475,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","cuda","cuda-kernels","gpu","kernel-benchmark","nvidia","performance"],"created_at":"2024-10-29T18:06:55.022Z","updated_at":"2025-04-14T13:46:58.146Z","avatar_url":"https://github.com/NVIDIA.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\n\nThis project is a work-in-progress. Everything is subject to change.\n\nNVBench is a C++17 library designed to simplify CUDA kernel benchmarking. It\nfeatures:\n\n* [Parameter sweeps](docs/benchmarks.md#parameter-axes): a powerful and\n  flexible \"axis\" system explores a kernel's configuration space. 
Parameters may\n  be dynamic numbers/strings or [static types](docs/benchmarks.md#type-axes).\n* [Runtime customization](docs/cli_help.md): A rich command-line interface\n  allows [redefinition of parameter axes](docs/cli_help_axis.md), CUDA device\n  selection, locking GPU clocks (Volta+), changing output formats, and more.\n* [Throughput calculations](docs/benchmarks.md#throughput-measurements): Compute\n  and report:\n  * Item throughput (elements/second)\n  * Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)\n* Multiple output formats: Currently supports markdown (default) and CSV output.\n* [Manual timer mode](docs/benchmarks.md#explicit-timer-mode-nvbenchexec_tagtimer):\n  (optional) Explicitly start/stop timing in a benchmark implementation.\n* Multiple measurement types:\n  * Cold Measurements:\n    * Each sample runs the benchmark once with a clean device L2 cache.\n    * GPU and CPU times are reported.\n  * Batch Measurements:\n    * Executes the benchmark multiple times back-to-back and records total time.\n    * Reports the average execution time (total time / number of executions).\n  * [CPU-only Measurements](docs/benchmarks.md#cpu-only-benchmarks)\n    * Measures the host-side execution time of a non-GPU benchmark.\n    * Not suitable for microbenchmarking.\n\n# Supported Compilers and Tools\n\n- CMake \u003e 3.30.4\n- CUDA Toolkit + nvcc: 11.8 and above\n- g++: 7 -\u003e 14\n- clang++: 14 -\u003e 19\n- Headers are tested with C++17 -\u003e C++20.\n\n# Getting Started\n\n## Minimal Benchmark\n\nA basic kernel benchmark can be created with just a few lines of CUDA C++:\n\n```cpp\nvoid my_benchmark(nvbench::state\u0026 state) {\n  state.exec([](nvbench::launch\u0026 launch) {\n    my_kernel\u003c\u003c\u003cnum_blocks, 256, 0, launch.get_stream()\u003e\u003e\u003e();\n  });\n}\nNVBENCH_BENCH(my_benchmark);\n```\n\nSee [Benchmarks](docs/benchmarks.md) for information on customizing benchmarks\nand implementing parameter 
sweeps.\n\n## Command Line Interface\n\nEach benchmark executable produced by NVBench provides a rich set of\ncommand-line options for configuring benchmark execution at runtime. See the\n[CLI overview](docs/cli_help.md)\nand [CLI axis specification](docs/cli_help_axis.md) for more information.\n\n## Examples\n\nThis repository provides a number of [examples](examples/) that demonstrate\nvarious NVBench features and use cases:\n\n- [Runtime and compile-time parameter sweeps](examples/axes.cu)\n- [CPU-only benchmarking](examples/cpu_only.cu)\n- [Enums and compile-time-constant-integral parameter axes](examples/enums.cu)\n- [Reporting item/sec and byte/sec throughput statistics](examples/throughput.cu)\n- [Skipping benchmark configurations](examples/skip.cu)\n- [Benchmarking on a specific stream](examples/stream.cu)\n- [Adding / hiding columns (summaries) in markdown output](examples/summaries.cu)\n- [Benchmarks that sync CUDA devices: `nvbench::exec_tag::sync`](examples/exec_tag_sync.cu)\n- [Manual timing: `nvbench::exec_tag::timer`](examples/exec_tag_timer.cu)\n\n### Building Examples\n\nTo build the examples:\n```\nmkdir -p build\ncd build\ncmake -DNVBench_ENABLE_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES=70 .. 
\u0026\u0026 make\n```\nBe sure to set `CMAKE_CUDA_ARCHITECTURES` based on the GPU you are running on.\n\nExamples are built by default into `build/bin` and are prefixed with `nvbench.example`.\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample output from `nvbench.example.throughput`\u003c/summary\u003e\n\n```\n# Devices\n\n## [0] `Quadro GV100`\n* SM Version: 700 (PTX Version: 700)\n* Number of SMs: 80\n* SM Default Clock Rate: 1627 MHz\n* Global Memory: 32163 MiB Free / 32508 MiB Total\n* Global Memory Bus Peak: 870 GiB/sec (4096-bit DDR @850MHz)\n* Max Shared Memory: 96 KiB/SM, 48 KiB/Block\n* L2 Cache Size: 6144 KiB\n* Maximum Active Blocks: 32/SM\n* Maximum Active Threads: 2048/SM, 1024/Block\n* Available Registers: 65536/SM, 65536/Block\n* ECC Enabled: No\n\n# Log\n\nRun:  throughput_bench [Device=0]\nWarn: Current measurement timed out (15.00s) while over noise threshold (1.26% \u003e 0.50%)\nPass: Cold: 0.262392ms GPU, 0.267860ms CPU, 7.19s total GPU, 27393x\nPass: Batch: 0.261963ms GPU, 7.18s total GPU, 27394x\n\n# Benchmark Results\n\n## throughput_bench\n\n### [0] Quadro GV100\n\n| NumElements |  DataSize  | Samples |  CPU Time  | Noise |  GPU Time  | Noise | Elem/s  | GlobalMem BW  | BWPeak | Batch GPU  | Batch  |\n|-------------|------------|---------|------------|-------|------------|-------|---------|---------------|--------|------------|--------|\n|    16777216 | 64.000 MiB |  27393x | 267.860 us | 1.25% | 262.392 us | 1.26% | 63.940G | 476.387 GiB/s | 58.77% | 261.963 us | 27394x |\n```\n\n\u003c/details\u003e\n\n\n## Demo Project\n\nTo get started using NVBench with your own kernels, consider trying out\nthe [NVBench Demo Project](https://github.com/allisonvacanti/nvbench_demo).\n\n`nvbench_demo` provides a simple CMake project that uses NVBench to build an\nexample benchmark. 
It's a great way to experiment with the library without a lot\nof investment.\n\n# Contributing\n\nContributions are welcome!\n\nFor current issues, see the [issue board](https://github.com/NVIDIA/nvbench/issues). Issues labeled with [![](https://img.shields.io/github/labels/NVIDIA/nvbench/good%20first%20issue)](https://github.com/NVIDIA/nvbench/labels/good%20first%20issue) are good for first time contributors.\n\n## Tests\n\nTo build `nvbench` tests:\n```\nmkdir -p build\ncd build\ncmake -DNVBench_ENABLE_TESTING=ON .. \u0026\u0026 make\n```\n\nTests are built by default into `build/bin` and prefixed with `nvbench.test`.\n\nTo run all tests:\n```\nmake test\n```\nor\n```\nctest\n```\n# License\n\nNVBench is released under the Apache 2.0 License with LLVM exceptions.\nSee [LICENSE](./LICENSE).\n\n# Scope and Related Projects\n\nNVBench will measure the CPU and CUDA GPU execution time of a ***single\nhost-side critical region*** per benchmark. It is intended for regression\ntesting and parameter tuning of individual kernels. For in-depth analysis of\nend-to-end performance of multiple applications, the NVIDIA Nsight tools are\nmore appropriate.\n\nNVBench is focused on evaluating the performance of CUDA kernels. It also provides\nCPU-only benchmarking facilities intended for non-trivial CPU workloads, but is\nnot optimized for CPU microbenchmarks. This may change in the future, but for now,\nconsider using Google Benchmark for high resolution CPU benchmarks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fnvbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia%2Fnvbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fnvbench/lists"}