{"id":43363905,"url":"https://github.com/grahamking/perf-ninja-rs","last_synced_at":"2026-02-02T04:30:16.283Z","repository":{"id":50531022,"uuid":"517515594","full_name":"grahamking/perf-ninja-rs","owner":"grahamking","description":"Rust port of dendibakh/perf-ninja - an online course where you can learn and master the skill of low-level performance analysis and tuning. ","archived":false,"fork":false,"pushed_at":"2025-09-06T17:39:10.000Z","size":46562,"stargazers_count":246,"open_issues_count":0,"forks_count":19,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-09-06T19:26:59.968Z","etag":null,"topics":["performance","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/grahamking.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-07-25T04:23:28.000Z","updated_at":"2025-09-06T17:39:14.000Z","dependencies_parsed_at":"2024-08-25T00:40:40.048Z","dependency_job_id":null,"html_url":"https://github.com/grahamking/perf-ninja-rs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/grahamking/perf-ninja-rs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grahamking%2Fperf-ninja-rs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grahamking%2Fperf-ninja-rs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grahamking%2Fperf
-ninja-rs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grahamking%2Fperf-ninja-rs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/grahamking","download_url":"https://codeload.github.com/grahamking/perf-ninja-rs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grahamking%2Fperf-ninja-rs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29004972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T04:25:24.522Z","status":"ssl_error","status_checked_at":"2026-02-02T04:24:51.069Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["performance","rust"],"created_at":"2026-02-02T04:30:11.909Z","updated_at":"2026-02-02T04:30:16.277Z","avatar_url":"https://github.com/grahamking.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rust labs for Performance Ninja Class\n\nRust port of the exercises in https://github.com/dendibakh/perf-ninja\n\nYou will need to watch the videos at the parent project, that's the course. 
To do the course in Rust, use this code instead of the parent C++ code.\n\nI recommend reading Denis' free ebook [Performance Analysis and Tuning on Modern CPUs](https://book.easyperf.net/perf_book) as you do the course. Things can get a little confusing otherwise, and the book all by itself is excellent; real practical performance tuning advice from an expert.\n\n## Lab assignments\n\n* Core Bound:\n  * [Vectorization 1](labs/core_bound/vectorization_1)\n  * [Vectorization 2](labs/core_bound/vectorization_2)\n  * [Function Inlining](labs/core_bound/function_inlining_1)\n  * [Dependency Chains 1](labs/core_bound/dep_chains_1)\n  * [Compiler Intrinsics 1](labs/core_bound/compiler_intrinsics_1)\n  * [Compiler Intrinsics 2](labs/core_bound/compiler_intrinsics_2)\n* Memory Bound:\n  * [Data Packing](labs/memory_bound/data_packing)\n  * [Loop Interchange 1](labs/memory_bound/loop_interchange_1): Rust version does not appear to be memory bound, see the README.\n  * [Loop Interchange 2](labs/memory_bound/loop_interchange_2): Rust version does not appear to be memory bound, see the README.\n  * [Loop Tiling](labs/memory_bound/loop_tiling_1)\n  * [SW memory prefetching](labs/memory_bound/swmem_prefetch_1)\n  * [False Sharing](labs/memory_bound/false_sharing_1)\n  * [Huge Pages](labs/memory_bound/huge_pages_1)\n* Bad Speculation:\n  * [Conditional Store](labs/bad_speculation/conditional_store_1)\n  * [Replacing Branches With Lookup Tables](labs/bad_speculation/lookup_tables_1)\n  * [Rust Virtual Calls](labs/bad_speculation/virtual_call_mispredict)\n* Misc:\n  * [Warmup](labs/misc/warmup)\n  * LTO: TODO\n  * PGO: TODO\n  * [Optimize IO](labs/misc/io_opt1)\n\nThe two Loop Interchange labs do not match their C++ version. 
They are probably not an accurate port and need changing.\n\nThese two labs match the bottlenecks of their C++ versions (under Clang 14), but those bottlenecks differ from the ones the course indicates.\n - Core Bound / Vectorization 1: Try debug mode, which has the correct bottleneck.\n - Memory Bound / SW memory prefetching: Not memory bound, bottleneck seems to be branch prediction.\n\nAside from those differences, the Rust code should serve you well in your studies to become a performance ninja!\n\n## Setup\n\nYou need:\n - [Rust](https://www.rust-lang.org/tools/install), switched to the [nightly](https://rust-lang.github.io/rustup/concepts/channels.html) release channel.\n - The videos from the parent project: https://github.com/dendibakh/perf-ninja\n - [pmu-tools](https://github.com/andikleen/pmu-tools) to do the investigation.\n\n## Layout\n\nEach lab is a cargo project. The corresponding C++ files are given in parentheses.\n\n - `src/lib.rs`: The code you need to optimize (solution.cpp, solution.h, init.cpp)\n - `src/tests.rs`: A unit test (validate.cpp) to check your code still works.\n - `benches/bench_\u003ccrate\u003e.rs`: The benchmark (bench.cpp). This will tell you when you have made src/lib.rs:solution faster.\n\nYou will only need to touch the code in `lib.rs`. The unit test and the benchmark both call that code. The benchmark uses [criterion](https://docs.rs/criterion/latest/criterion/) to produce accurate numbers.\n\n## Work loop\n\n 1. `cargo bench`: How fast is it now?\n 1. Improve the code in `lib.rs`.\n 1. `cargo test --release`: Is it still correct?\n 1. Goto 1.\n\n### Better benchmarks\n\nCriterion (which `cargo bench` is using) does statistical benchmarking, but even with that I get a lot of variance between runs. We can do much better:\n\n 1. Download [runperf](https://gist.github.com/grahamking/9c8c91b871843a9a6ce2bec428b8f48d). This adjusts a bunch of things on Linux to provide repeatable, reliable benchmarks.\n 1. Find the benchmark binary. 
`cargo bench` builds it as `target/release/deps/bench_\u003ccrate\u003e_\u003chash\u003e`.\n 1. Run it directly: `runperf \u003cbenchmark_binary\u003e --bench`. You should get the same results every time.\n\n### Find bottlenecks\n\nThe videos often walk through this part. Profile the benchmark binary (in `target/release/deps/`). We need to disable criterion's overhead by passing `--profile-time \u003cseconds\u003e`. We always need to pass `--bench` to a Criterion benchmark binary. Use `runperf` (see above) for reliable results.\n\nExamples:\n   - `runperf perf stat ./target/release/deps/bench_\u003ccrate\u003e_\u003chash\u003e --bench --profile-time 5`\n   - `runperf perf record \u003cbinary\u003e --bench --profile-time 5` then `perf report -Mintel`.\n   - `runperf ~/src/pmu-tools/toplev --core S0-C0,S0-C1 -l1 -v --no-desc \u003cbinary\u003e --bench --profile-time 5` (then try with `-l2` instead of `-l1`)\n\n## Misc / Tips\n\nOptimize Rust for your CPU, and include frame pointers: `export RUSTFLAGS=\"-Ctarget-cpu=native -Cforce-frame-pointers=yes\"`.\n\nHave `perf report` display the call graph: `perf record --call-graph fp \u003cprog\u003e`. You need to build with `force-frame-pointers` (above in RUSTFLAGS).\n\nShow assembly: `objdump -Mintel -S -d target/release/deps/bench_vectorization_2 | rustfilt`.\n - `rustfilt` de-mangles Rust symbols: `cargo install rustfilt`\n - `-S` includes source code in the output\n\nBy default `perf record` uses the `cycles` events (number of CPU cycles). 
If you want to dig into a specific event provide that directly to perf:\n - Branch misses (bad speculation): `runperf perf record --call-graph fp --event=branch-misses:P \u003cprog\u003e`\n - Main memory load (backend bound): `--event=cycle_activity.stalls_l3_miss:P` (An L3 cache miss means we have to go to main memory)\n\nThe `:P` denotes a [Precise Event](https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/custom-analysis/custom-analysis-options/hardware-event-list/precise-events.html).\n\n`runperf` restricts execution to two cores and the `toplev` command above watches both those cores. The hope is that one core gets the tool (`toplev`, `perf`, etc) and the other core gets the program you're testing, and they both run without context switches (Linux tries to avoid moving programs between cores if possible). The downside is that it's not obvious which core your program ran on, and `toplev` output includes both. To simplify, edit `runperf`, replace `taskset -c 0,1 sudo nice -n -5 runuser -u $USERNAME -- $@` with `taskset -c 1 sudo nice -n -5 runuser -u $USERNAME -- $@` (ask taskset to only use core 1) and change the `toplev` command to `--core S0-C1` (only watch Socket 0, Core 1).\n\n## Notes on the port\n\nBest effort was made to keep the code as close to the C++ original as possible. That meant resisting iterator chaining, using C++ names (e.g. `ClassA`), and even sometimes ignoring `clippy`. 
The hope is that this makes it easier to follow along with the original videos.\n\n## Thanks\n\nThanks to my employer Dropbox for supporting this project during Hack Week 2022.\n\nIf this course is useful to you please consider supporting the parent project's Patreon or GitHub Sponsors.\n\n## License\n\nOriginal problems and ideas Copyright © 2021 by Denis Bakhvalov under Creative Commons license (CC BY 4.0).\nRust port Copyright © 2022 by Graham King under Creative Commons license (CC BY 4.0).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrahamking%2Fperf-ninja-rs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgrahamking%2Fperf-ninja-rs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgrahamking%2Fperf-ninja-rs/lists"}