{"id":13670580,"url":"https://github.com/okuvshynov/b63","last_synced_at":"2025-10-04T10:32:06.729Z","repository":{"id":43638314,"uuid":"206397109","full_name":"okuvshynov/b63","owner":"okuvshynov","description":"Micro-benchmarking library for C and C++ with PMU counters tracking","archived":false,"fork":false,"pushed_at":"2023-01-31T15:20:44.000Z","size":75,"stargazers_count":56,"open_issues_count":0,"forks_count":9,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-01-10T22:44:57.385Z","etag":null,"topics":["benchmark","linux-perf","perf-events"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/okuvshynov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-04T19:28:08.000Z","updated_at":"2024-12-28T22:03:30.000Z","dependencies_parsed_at":"2024-10-27T21:53:46.473Z","dependency_job_id":null,"html_url":"https://github.com/okuvshynov/b63","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okuvshynov%2Fb63","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okuvshynov%2Fb63/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okuvshynov%2Fb63/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/okuvshynov%2Fb63/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/okuvshynov","download_url":"https://codeload.github.com/okuvshyno
v/b63/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235239082,"owners_count":18958091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","linux-perf","perf-events"],"created_at":"2024-08-02T09:00:45.711Z","updated_at":"2025-10-04T10:32:01.423Z","avatar_url":"https://github.com/okuvshynov.png","language":"C","funding_links":[],"categories":["Benchmarking","C","基准测试"],"sub_categories":[],"readme":"# B63\n\nLight-weight micro-benchmarking tool for C.\n\n## Motivation\nWhy was it built, given that quite a few already exist?\n- quick and easy benchmarking for C, not C++ only;\n- benchmarking custom counters, rather than time/cycles only, specifically:\n  - CPU Performance Monitoring Unit counters, for example number of cache misses or branch mispredictions;\n  - jemalloc memory allocations;\n  - custom measurements, like number of hash collisions.\n\n## Examples\nThe easiest way to get a sense of how it could be used is to look at and \nrun benchmarks from examples/ folder. 
The library is header-only, so examples only need to include:\n- b63.h header;\n- individual counter headers, if needed.\n\nThis is how benchmarking time, cpu cycles and cache misses might look like on Linux:\n\n```cpp\n#include \"../include/b63/b63.h\"\n#include \"../include/b63/counters/perf_events.h\"\n#include \u003calgorithm\u003e\n#include \u003ccstdint\u003e\n#include \u003ccstdlib\u003e\n#include \u003cctime\u003e\n#include \u003cnumeric\u003e\n#include \u003cvector\u003e\n\nconst size_t kSize = (1 \u003c\u003c 16);\nconst size_t kMask = kSize - 1;\n\n/* \n * B63_BASELINE defines a 'baseline' function to benchmark.\n * In this definition, 'sequential' is benchmark name,\n * and 'n' is the parameter the function needs to use as \n * 'how many iterations to run'. It is important to have this parameter\n * to be able to adjust the run time dynamically\n */\nB63_BASELINE(sequential, n) {\n  std::vector\u003cuint32_t\u003e v;\n\n  /* \n   * Anything within 'B63_SUSPEND' will not be counted\n   * towards benchmark score.\n   */\n  B63_SUSPEND {\n    v.resize(kSize);\n    std::iota(v.begin(), v.end(), 0);\n  }\n  int32_t res = 0;\n  for (size_t i = 0; i \u003c n; i++) {\n    for (size_t j = 0; j \u003c kSize; j++) {\n      res += v[j];\n    }\n  }\n  /* this is to prevent compiler from optimizing res out */\n  B63_KEEP(res);\n}\n\n/*\n * This is another benchmark, which will be compared to baseline\n */\nB63_BENCHMARK(random, n) {\n  std::vector\u003cuint32_t\u003e v;\n  B63_SUSPEND {\n    /* b63_seed is passed implicitly to every benchmark */\n    std::srand(b63_seed);\n    v.resize(kSize);\n    std::generate(v.begin(), v.end(), std::rand);\n  }\n  int32_t res = 0;\n  for (size_t i = 0; i \u003c n; i++) {\n    for (size_t j = 0; j \u003c kSize; j++) {\n      res += v[v[j] \u0026 kMask];\n    }\n  }\n  B63_KEEP(res);\n}\n\nint main(int argc, char **argv) {\n  srand(time(0));\n  /* \n   * This call starts benchmarking.\n   * Comma-separated list of counters to 
measure is passed explicitly here,\n   * but one can provide command-line flag -c to override.\n   * In this case, we are measuring 3 counters:\n   *  * lpe:cycles - CPU cycles spent in benchmark outside of B63_SUSPEND, as measured with Linux perf_events;\n   *  * lpe:L1-dcache-load-misses - CPU L1 Data cache misses during benchmark run outside of B63_SUSPEND;\n   *  * time - wall time outside of B63_SUSPEND.\n   */\n  B63_RUN_WITH(\"time,lpe:cycles,lpe:L1-dcache-load-misses\", argc, argv);\n  return 0;\n}\n```\n\nBuild and run:\n\nThis is the output of the sample run:\n```\n$ g++ -O3 bm_seed.cpp -o bm\n$ ./bm -i # i for interactive mode\nsequential                    time                : 52858.855\nrandom                        time                :148667.365 (+181.253% *)\nsequential                    lpe:cycles          : 132055.030\nrandom                        lpe:cycles          :372451.514 (+182.043% *)\nsequential                    lpe:L1-dcache-load-misses: 4969.704\nrandom                        lpe:L1-dcache-load-misses:80874.886 (+1527.358% *)\n```\nCurrently B63 repeats the run for every counter to reduce side-effects of measurement, but this might change in the future.\nThe way to read the results: for benchmark 'sequential', which is the baseline version, we spent 52 milliseconds per iteration.\nFor the 'random' version, we see a clear increase in time and an equivalent increase in CPU cycles (+181%), and a much more prominent increase in L1 data cache misses (+1527%). The asterisk means that the 99% confidence interval for the difference between benchmark and baseline does not contain 0; thus, you can be 99% confident that the result is directionally correct.\n\nExtra examples can be found in examples/ folder:\n1. Measuring time / iteration ([examples/basic.c](examples/basic.c));\n2. Suspending tracking ([examples/suspend.c](examples/suspend.c));\n3. Comparing implementations with baseline ([examples/baseline.c](examples/baseline.c));\n4. 
Using custom counter, number of function calls in this case ([examples/custom.c](examples/custom.c));\n5. Using cache miss counter from linux perf_events ([examples/l1d_miss.cpp](examples/l1d_miss.cpp));\n6. Using raw counter from linux perf_events ([examples/raw.c](examples/raw.c));\n7. Measuring jemalloc allocation stats ([examples/jemalloc.cpp](examples/jemalloc.cpp));\n8. Utilizing seed to keep benchmark results reproducible ([examples/bm_seed.cpp](examples/bm_seed.cpp));\n9. Multiple comparisons, including A/A test: ([examples/baseline_multi.c](examples/baseline_multi.c)).\n\n## Comparison and baselines\nWithin the benchmark suite, there's a way to define 'baseline', and compare all other benchmarks against it. When comparing, 99% confidence interval is computed using differences between individual epochs.\n\n## Output Modes\nTwo output modes are supported:\n - plaintext mode (default), which produces output suitable for scripting/parsing, printing out each epoch individually to leave an option for more advanced data studies.\n ```\n $ ./_build/bm_baseline\nbasic,time,16777215,233781738\nbasic,time,16777215,228961470\nbasic,time,16777215,230559174\nbasic,time,16777215,228707363\nbasic,time,16777215,228769396\nbasic_half,time,33554431,227525646\nbasic_half,time,33554431,228749848\nbasic_half,time,33554431,228985440\nbasic_half,time,33554431,228123909\nbasic_half,time,33554431,228560855\n```\n - interactive mode turned on with -i flag. There isn't much interactivity really, but the output is formatted and colored for human consumption, rather than other tool consumption.\n ```\n$ ./_build/bm_baseline -i\nbasic                         time                : 13.597\nbasic_half                    time                : 6.787 (-50.083% *)\n```\n\n## Configuration\n\n### CLI Flags\nFollowing CLI flags are supported:\n- -i if provided, interactive output mode will be used;\n- -c counter1[,counter2,counter3,...] 
-- override default counters for all benchmarks;\n- -e epochs_count -- override how many epochs to run the benchmark for;\n- -t timelimit_per_benchmark - time limit in seconds for how long to run the benchmark; includes the time the benchmark is suspended.\n- -d delimiter to use for plaintext. Comma is default.\n- -s seed. Optional, needed for reproducibility and A/B testing across binaries, for example, different versions of code or different hardware. If not provided, seed will be generated.\n\n### Configuration in code\nIt's possible to configure the counters to run within the code itself, by using B63_RUN_WITH(\"list,of,counters\", argc, argv);\n\n## Counters\nIn addition to measuring time, B63 lets you define and use custom counters, for example CPU perf events. Some counters are already built and provided in counters/ folder, but the framework is flexible and makes it easy to define new ones.\n\nFor now, the following counters are implemented:\n1) time - most basic counter, measures time in microseconds. [Linux, FreeBSD, MacOS]\n2) jemalloc - measures bytes allocated by jemalloc. [Linux, FreeBSD, MacOS]\n3) perf_events - measures custom CPU counters, like cache misses, branch mispredictions, etc. [Linux only, 2.6.31+]\n\n### Notes for building custom counters:\nCounters are expected to be additive and monotonic.\nThe implementation of the counting and suspension lives in [include/b63/run.h](include/b63/run.h); [examples/custom.c](examples/custom.c) is a simple case of custom counter definition. All counters shipped with the library can be used as examples, as they do not rely on anything internal from b63. \n\nCounter header files should be included from the benchmark c/cpp file directly; only the default timer counter is included from\nb63 itself. 
This is done to avoid having an insane amount of ifdefs in the code and complicated build rules, as counters have to be gated by compiler/os/libraries installed and used.\nWhen benchmarks are configured to run with multiple counters, each benchmark is re-run for each counter. This is an easy way to deal with measurement side effects, but has obvious disadvantages:\n- benchmark needs to run longer;\n- in cases when the variance between benchmark runs is high, results might look confusing.\n\nThe suspension is an important case to understand and interpret correctly. To illustrate this, let's look at the following example [benchmark](examples/suspend.c):\n\n```\n$ ./_build/bm_suspend\nwith_suspend,time,8388607,117749190\nwith_suspend,time,8388607,117033209\nwith_suspend,time,8388607,114440936\nwith_suspend,time,8388607,114655889\nwith_suspend,time,8388607,114215822\nbasic,time,16777215,228015817\nbasic,time,16777215,230814726\nbasic,time,16777215,227958139\nbasic,time,16777215,228723995\nbasic,time,16777215,229286180\n$ ./_build/bm_suspend -i\nwith_suspend                  time                : 13.672\nbasic                         time                : 13.528\n```\n\nIn interactive mode, the rate of events per iteration is reported, while in plaintext mode the number of iterations and the number of events are printed out directly. The time limit for running the benchmark takes time spent in suspension into account, to make run time predictable. Thus, the way to interpret the output is: 'with_suspend' is equivalent to 'basic' in non-suspended time, so the time/iteration is very close. However, the suspended activity takes a while, so we had to run fewer iterations overall. \n\n### Existing counters:\n#### Linux perf_events (\"lpe:...\")\nThe acronym/prefix used is 'lpe'.\nThis family of counters uses the perf_events interface, the same one the Linux perf tool uses. 
It allows counting performance events either by predefined names for popular counters (cycles, cache-misses, branches, page-faults) or custom CPU-specific raw codes in r\u003cMask\u003e\u003cEvent\u003e format. This makes answering questions like 'how many cache misses will different versions of the code have?' or\n'how different execution ports on CPU are used across several implementations of the algorithm?' much easier compared to building separate binaries, running them with the perf tool (or equivalent), drilling down to the function in question, etc. \n  \nExample usage:\n```\n$ ./bm_raw -c lpe:cycles,lpe:r04a1\n```\n\n#### Jemalloc thread allocations (\"jemalloc_thread_allocated\")\nThis counter tracks the number of bytes allocated by jemalloc in the calling thread. Example usage:\n```\n$ ./bm_jemalloc -c jemalloc_thread_allocated\n```\n\n#### Time (\"time\")\nDefault counter, counts microseconds.\n\n#### OS X kperf-based counters\nThe prefix is kperf. Currently only measures the main thread. For a list of events supported, check https://github.com/okuvshynov/b63/blob/master/include/b63/counters/osx_kperf.h#L67-L75\n\n## Dependencies and compatibility\n\nB63 requires the following C compiler attributes to be available:\n- \\_\\_attribute\\_\\_((cleanup))\n- \\_\\_attribute\\_\\_((used)) \n- \\_\\_attribute\\_\\_((section))\n\nReasonably recent GCC and Clang have them, but I'm not sure which versions started supporting them.\n\nIndividual counters can have specific requirements. For example, Linux perf_events, not surprisingly,\nwill only work on Linux; the jemalloc counter will only work/make sense if memory allocation is done via jemalloc.\n\n### Tested On\n1. MacBook Pro 2019\n  - OS: MacOS 10.14.6 (x86_64-apple-darwin18.7.0)\n  - CPU: Intel(R) Core(TM) i5-8257U\n  - Compiler: clang-1001.0.46.4 (Apple LLVM)\n2. MacBook Pro 2009 \n  - OS: Ubuntu 18.04.3 (Kernel: 4.15.0-58)\n  - CPU: Intel(R) Core(TM) 2 Duo P8700\n  - Compiler: GCC 7.4.0\n3. 
Raspberry Pi\n  - OS: Raspbian GNU/Linux 9 (Kernel: 4.14.71-v7+)\n  - CPU: ARMv7 Processor rev 4 (v7l)\n  - Compiler: GCC 6.3.0\n4. [VM] FreeBSD\n  - OS: FreeBSD 12.0\n  - Compiler: FreeBSD clang 6.0.1\n5. MacMini 2007\n  - OS: Ubuntu 11.10 (Kernel: 3.0.0-13-generic)\n  - CPU: Intel(R) Core(TM)2 CPU T5600\n  - Compiler: GCC 4.6.1\n  - Caveats:\n    - requires -lrt flag, as POSIX realtime extensions are not (yet) in libc.\n    - ref-cycles event from linux perf_events is not supported.\n\n## Internals\nThe library consists of a core part responsible for running the benchmarks, and pluggable counters. The library is header-only, thus, there isn't much encapsulation going on.  Every global symbol is prefixed with b63\\_.\n\nMain internal data structures are:\n1) b63_benchmark. Each function defined with a 'B63_BENCHMARK' or 'B63_BASELINE' macro corresponds to one benchmark instance.\n2) b63_suite. Set of all benchmarks defined in the translation unit.\n3) b63_ctype. Counter Type. Defines a type/family of a counter, for example, 'linux_perf_event' or 'jemalloc'\n4) b63_counter. Instance of a counter, which has to be of one of the defined counter types. \n5) b63_counter_list. Set of all counters to run benchmarks for.\n6) b63_run. Individual benchmark execution.\n\n## Next steps:\n- a convenient way to measure outliers. For example, as hash maps usually have amortized O(1) cost for lookup, what does p99 lookup time look like for some lookup distribution? 
What can be done to improve?\n- support CPU perf counters sources beyond Linux perf_events, for example [Intel's PCM](https://github.com/opcm/pcm) and [BSD pmcstat](https://www.freebsd.org/cgi/man.cgi?query=pmcstat).\n- GPU perf counters (at least for Nvidia).\n- [low-pri] disk access and network.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fokuvshynov%2Fb63","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fokuvshynov%2Fb63","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fokuvshynov%2Fb63/lists"}