{"id":31892265,"url":"https://github.com/ashvardanian/forkunion","last_synced_at":"2025-12-16T09:05:47.859Z","repository":{"id":291327821,"uuid":"974172475","full_name":"ashvardanian/ForkUnion","owner":"ashvardanian","description":"Lower-latency OpenMP-style minimalistic scoped thread-pool designed for 'Fork-Join' parallelism in Rust and C++, avoiding memory allocations, mutexes, CAS-primitives, and false-sharing on the hot path 🍴","archived":false,"fork":false,"pushed_at":"2025-10-11T21:28:57.000Z","size":606,"stargazers_count":260,"open_issues_count":6,"forks_count":22,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-10-13T04:03:24.223Z","etag":null,"topics":["arm","atomics","compare-and-swap","concurrency","memory-model","mpi","multithreading","openmp","parallel-computing","parallel-stl","parallelism","rayon","thread-pool","threadpool"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/beyond-openmp-in-cpp-rust/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-04-28T11:13:59.000Z","updated_at":"2025-10-12T10:50:47.000Z","dependencies_parsed_at":"2025-06-13T16:53:24.379Z","dependency_job_id":"427e7743-3f2d-4376-a6c5-030ed1570cd4","html_url":"https://github.com/ashvardanian/ForkUnion","commit_stats":null,"previous_names":["ashvardanian/fork_union","ashvardanian/forkunion"],"tags_count":32,"template":false,"
template_full_name":null,"purl":"pkg:github/ashvardanian/ForkUnion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FForkUnion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FForkUnion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FForkUnion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FForkUnion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/ForkUnion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2FForkUnion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279014328,"owners_count":26085492,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arm","atomics","compare-and-swap","concurrency","memory-model","mpi","multithreading","openmp","parallel-computing","parallel-stl","parallelism","rayon","thread-pool","threadpool"],"created_at":"2025-10-13T08:47:35.358Z","updated_at":"2025-10-13T08:47:36.431Z","avatar_url":"https://github.com/ashvardanian.png","language":"C++","readme":"# Fork Union 🍴\n\nFork Union is 
arguably the lowest-latency OpenMP-style NUMA-aware minimalistic scoped thread-pool designed for 'Fork-Join' parallelism in C++, C, and Rust, avoiding × [mutexes \u0026 system calls](#locks-and-mutexes), × [dynamic memory allocations](#memory-allocations), × [CAS-primitives](#atomics-and-cas), and × [false-sharing](#alignment--false-sharing) of CPU cache-lines on the hot path 🍴\n\n## Motivation\n\nMost \"thread-pools\" are not, in fact, thread-pools, but rather \"task-queues\" that are designed to synchronize a concurrent, dynamically growing list of heap-allocated, globally accessible shared objects.\nIn C++ terms, think of it as a `std::queue\u003cstd::function\u003cvoid()\u003e\u003e` protected by a `std::mutex`, where each thread waits for the next task to be available and then executes it on some random core chosen by the OS scheduler.\nAll of that is slow... and true across C++, C, and Rust projects.\nShort of [OpenMP](https://en.wikipedia.org/wiki/OpenMP), practically every other solution has high dispatch latency and noticeable memory overhead.\nOpenMP, however, is not ideal for fine-grained parallelism and is less portable than the C++ and Rust standard libraries.\n\n[![`fork_union` banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/fork_union.jpg?raw=true)](https://github.com/ashvardanian/fork_union)\n\nThis is where __`fork_union`__ comes in.\nIt's a C++17 library with C99 and Rust bindings ([previously, the Rust implementation was standalone in v1](#why-not-reimplement-it-in-rust)).\nIt supports pinning threads to specific [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes or individual CPU cores, making it much easier to ensure data locality and halving the latency of individual loads in Big Data applications.\n\n## Basic Usage\n\n__`Fork Union`__ is dead-simple to use!\nThere is no nested parallelism, exception handling, or \"future promises\"; they are banned.\nThe thread pool itself has a few core 
operations:\n\n- `try_spawn` to initialize worker threads, and\n- `for_threads` to launch a blocking callback on all threads.\n\nHigher-level APIs for index-addressable tasks are also available:\n\n- `for_n` - for individual evenly-sized tasks,\n- `for_n_dynamic` - for individual unevenly-sized tasks,\n- `for_slices` - for slices of evenly-sized tasks.\n\nFor additional flow control and tuning, the following helpers are available:\n\n- `sleep(microseconds)` - for longer naps,\n- `terminate` - to kill the threads before the destructor is called,\n- `unsafe_for_threads` - to broadcast a callback without blocking,\n- `unsafe_join` - to block until the completion of the current broadcast.\n\nOn Linux, in C++, given the maturity and flexibility of the HPC ecosystem, the library provides [NUMA extensions](#non-uniform-memory-access-numa).\nThat includes the `linux_colocated_pool` analog of the `basic_pool` and the `linux_numa_allocator` for allocating memory on a specific NUMA node.\nThose are out-of-the-box compatible with the higher-level APIs.\nMost interestingly, for Big Data applications, a higher-level `distributed_pool` class will address and balance the work across all NUMA nodes.\n\n### Intro in Rust\n\nTo integrate into your Rust project, add the following lines to Cargo.toml:\n\n```toml\n[dependencies]\nfork_union = \"2.3.0\"                                    # default\nfork_union = { version = \"2.3.0\", features = [\"numa\"] } # with NUMA support on Linux\n```\n\nOr for the preview development version:\n\n```toml\n[dependencies]\nfork_union = { git = \"https://github.com/ashvardanian/fork_union.git\", branch = \"main-dev\" }\n```\n\nA minimal example may look like this:\n\n```rust\nuse fork_union as fu;\nlet mut pool = fu::spawn(2);\npool.for_threads(|thread_index, colocation_index| {\n    println!(\"Hello from thread # {} on colocation # {}\", thread_index + 1, colocation_index + 1);\n});\n```\n\nHigher-level APIs distribute index-addressable tasks across the threads 
in the pool:\n\n```rust\npool.for_n(100, |prong| {\n    println!(\"Running task {} on thread # {}\",\n        prong.task_index + 1, prong.thread_index + 1);\n});\npool.for_slices(100, |prong, count| {\n    println!(\"Running slice [{}, {}) on thread # {}\",\n        prong.task_index, prong.task_index + count, prong.thread_index + 1);\n});\npool.for_n_dynamic(100, |prong| {\n    println!(\"Running task {} on thread # {}\",\n        prong.task_index + 1, prong.thread_index + 1);\n});\n```\n\nA more realistic example with named threads and error handling may look like this:\n\n```rust\nuse std::error::Error;\nuse fork_union as fu;\n\nfn heavy_math(_: usize) {}\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    // Or `fu::ThreadPool::try_spawn(4)?` for unnamed threads:\n    let mut pool = fu::ThreadPool::try_named_spawn(\"heavy-math\", 4)?;\n    pool.for_n_dynamic(400, |prong| {\n        heavy_math(prong.task_index);\n    });\n    Ok(())\n}\n```\n\nFor advanced usage, refer to the [NUMA section below](#non-uniform-memory-access-numa).\nFor convenient Rayon-style parallel iterators, pull in the `prelude` module and [check out the related examples](#rayon-style-parallel-iterators).\n\n### Intro in C++\n\nTo integrate into your C++ project, either copy the `include/fork_union.hpp` file into your project, add a Git submodule, or use CMake.\nFor a Git submodule, run:\n\n```bash\ngit submodule add https://github.com/ashvardanian/fork_union.git extern/fork_union\n```\n\nAlternatively, using CMake:\n\n```cmake\nFetchContent_Declare(\n    fork_union\n    GIT_REPOSITORY https://github.com/ashvardanian/fork_union\n    GIT_TAG v2.3.0\n)\nFetchContent_MakeAvailable(fork_union)\ntarget_link_libraries(your_target PRIVATE fork_union::fork_union)\n```\n\nThen, include the header in your C++ code:\n\n```cpp\n#include \u003cfork_union.hpp\u003e   // `basic_pool_t`\n#include \u003ccstdio\u003e           // `stderr`\n#include \u003ccstdlib\u003e          // `EXIT_SUCCESS`\n\nnamespace 
fu = ashvardanian::fork_union;\n\nint main() {\n    alignas(fu::default_alignment_k) fu::basic_pool_t pool;\n    if (!pool.try_spawn(std::thread::hardware_concurrency())) {\n        std::fprintf(stderr, \"Failed to fork the threads\\n\");\n        return EXIT_FAILURE;\n    }\n\n    // Dispatch a callback to each thread in the pool\n    pool.for_threads([\u0026](std::size_t thread_index) noexcept {\n        std::printf(\"Hello from thread # %zu (of %zu)\\n\", thread_index + 1, pool.threads_count());\n    });\n\n    // Execute 1000 tasks in parallel, expecting them to have comparable runtimes\n    // and mostly co-locating subsequent tasks on the same thread. Analogous to:\n    //\n    //      #pragma omp parallel for schedule(static)\n    //      for (int i = 0; i \u003c 1000; ++i) { ... }\n    //\n    // You can also think about it as a shortcut for the `for_slices` + `for`.\n    pool.for_n(1000, [](std::size_t task_index) noexcept {\n        std::printf(\"Running task %zu of 1000\\n\", task_index + 1);\n    });\n    pool.for_slices(1000, [](std::size_t first_index, std::size_t count) noexcept {\n        std::printf(\"Running slice [%zu, %zu)\\n\", first_index, first_index + count);\n    });\n\n    // Like `for_n`, but each thread greedily steals tasks, without waiting for\n    // the others or expecting individual tasks to have the same runtimes. Analogous to:\n    //\n    //      #pragma omp parallel for schedule(dynamic, 1)\n    //      for (int i = 0; i \u003c 3; ++i) { ... }\n    pool.for_n_dynamic(3, [](std::size_t task_index) noexcept {\n        std::printf(\"Running dynamic task %zu of 3\\n\", task_index + 1);\n    });\n    return EXIT_SUCCESS;\n}\n```\n\nFor advanced usage, refer to the [NUMA section below](#non-uniform-memory-access-numa).\nNUMA detection on Linux defaults to AUTO. 
Override with `-D FORK_UNION_ENABLE_NUMA=ON` or `OFF`.\n\n## Alternatives \u0026 Differences\n\nMany other thread-pool implementations are more feature-rich but have different limitations and design goals.\n\n- Modern C++: [`taskflow/taskflow`](https://github.com/taskflow/taskflow), [`progschj/ThreadPool`](https://github.com/progschj/ThreadPool), [`bshoshany/thread-pool`](https://github.com/bshoshany/thread-pool)\n- Traditional C++: [`vit-vit/CTPL`](https://github.com/vit-vit/CTPL), [`mtrebi/thread-pool`](https://github.com/mtrebi/thread-pool)\n- Rust: [`tokio-rs/tokio`](https://github.com/tokio-rs/tokio), [`rayon-rs/rayon`](https://github.com/rayon-rs/rayon), [`smol-rs/smol`](https://github.com/smol-rs/smol)\n\nThose are not designed for the same OpenMP-like use cases as __`fork_union`__.\nInstead, they primarily focus on task queuing, which requires significantly more work.\n\n### Locks and Mutexes\n\nUnlike `std::atomic` operations, locking a `std::mutex` may require a system call, and it can be expensive to acquire and release.\nIts implementations generally have 2 executable paths:\n\n- the fast path, where the mutex is not contended: the thread first tries to grab the mutex via a compare-and-swap operation, and if it succeeds, returns immediately;\n- the slow path, where the mutex is contended: the thread has to go through the kernel to block until the mutex is available.\n\nOn Linux, the latter translates to [\"futex\"](https://en.wikipedia.org/wiki/Futex) [\"system calls\"](https://en.wikipedia.org/wiki/System_call), which are expensive.\n\n### Memory Allocations\n\nC++ has rich functionality for concurrent applications, like `std::future`, `std::packaged_task`, `std::function`, `std::queue`, `std::condition_variable`, and so on.\nMost of those, I believe, are unusable in Big-Data applications, where you always operate in memory-constrained environments:\n\n- The idea of raising a `std::bad_alloc` exception when there is no memory left and just hoping that someone up the 
call stack will catch it is not a great design for any Systems Engineering.\n- The threat of having to synchronize ~200 physical CPU cores across 2-8 sockets and potentially dozens of [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes around a shared global memory allocator practically means you can't have predictable performance.\n\nAs we focus on a simpler ~~concurrency~~ parallelism model, we can avoid the complexity of allocating shared states, wrapping callbacks into some heap-allocated \"tasks\", and other boilerplate code.\nLess work - more performance.\n\n### Atomics and [CAS](https://en.wikipedia.org/wiki/Compare-and-swap)\n\nOnce you get to the lowest-level primitives of concurrency, you end up with `std::atomic` and a small set of hardware-supported atomic instructions.\nHardware implements atomics differently:\n\n- x86 is built around the \"Total Store Order\" (TSO) [memory consistency model](https://en.wikipedia.org/wiki/Memory_ordering) and provides `LOCK` variants of the `ADD` and `CMPXCHG`, which act as full-blown \"fences\" - no loads or stores can be reordered across them.\n- Arm, on the other hand, has a \"weak\" memory model and provides a set of atomic instructions that are not full fences and that match the C++ concurrency model, offering `acquire`, `release`, and `acq_rel` variants of each atomic instruction - such as `LDADD`, `STADD`, and `CAS` - which allow precise control over visibility and ordering, especially with the introduction of \"Large System Extension\" (LSE) instructions in Armv8.1.\n\nIn practice, a locked atomic on x86 requires the cache line to be in the Exclusive state in the requester's L1 cache.\nThis would incur a coherence transaction (Read-for-Ownership) if some other core had the line.\nBoth Intel and AMD handle this similarly.\n\nIt makes [Arm and Power much more suitable for lock-free programming](https://arangodb.com/2021/02/cpp-memory-model-migrating-from-x86-to-arm/) and concurrent data structures, but some 
observations hold for both platforms.\nMost importantly, \"Compare and Swap\" (CAS) is a costly operation and should be avoided whenever possible.\n\nOn x86, for example, the `LOCK ADD` [can easily take 50 CPU cycles](https://travisdowns.github.io/blog/2020/07/06/concurrency-costs), being 50x slower than a regular `ADD` instruction, but still easily 5-10x faster than a `LOCK CMPXCHG` instruction.\nOnce contention rises, the gap naturally widens and is further amplified by the increased \"failure\" rate of the CAS operation, particularly when the value being compared has already changed.\nThat's why, for the \"dynamic\" mode, we resort to using an additional atomic variable as opposed to more typical CAS-based implementations.\n\n### Alignment \u0026 False Sharing\n\nThe thread-pool needs several atomic variables to synchronize the state.\nIf those variables are located on the same cache line, they will be \"falsely shared\" between threads.\nThis means that when one thread updates one of the variables, it will invalidate the cache line in all other threads, causing them to reload it from memory.\nThis is a common problem, and the C++ standard recommends addressing it with `alignas(std::hardware_destructive_interference_size)` for your hot variables.\n\nThere are, however, caveats.\nThe `std::hardware_destructive_interference_size` is [generally 64 bytes on x86](https://stackoverflow.com/a/39887282), matching the size of a single cache line.\nBut in reality, most x86 machines, [depending on the BIOS \"spatial prefetcher\" settings](https://www.techarp.com/bios-guide/cpu-adjacent-sector-prefetch/), will [fetch 2 cache lines at a time, starting with Sandy Bridge](https://stackoverflow.com/a/72127222).\nBecause of these rules, padding hot variables to 128 bytes is a conservative but often sensible defensive measure adopted by Folly's `cacheline_align` and Java's `jdk.internal.vm.annotation.Contended`. 
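\n\nTo make the padding advice concrete, here is a minimal, self-contained C++17 sketch (a hypothetical illustration, not part of Fork Union's API): each thread increments only its own counter, and `alignas(128)` keeps every counter in its own pair of cache lines, so the relaxed increments never contend.\n\n```cpp\n#include \u003catomic\u003e  // `std::atomic`\n#include \u003ccstddef\u003e // `std::size_t`\n#include \u003ccstdio\u003e  // `std::printf`\n#include \u003cthread\u003e  // `std::thread`\n#include \u003cvector\u003e  // `std::vector`\n\n/// Each counter occupies a 128-byte slot, covering the 2-cache-line\n/// spatial-prefetcher granularity discussed above.\nstruct alignas(128) padded_counter_t {\n    std::atomic\u003cstd::size_t\u003e value {0};\n};\n\nstd::size_t sum_with_padded_counters(std::size_t threads, std::size_t increments) {\n    std::vector\u003cpadded_counter_t\u003e counters(threads);\n    std::vector\u003cstd::thread\u003e workers;\n    for (std::size_t t = 0; t != threads; ++t)\n        workers.emplace_back([\u0026counters, t, increments] {\n            for (std::size_t i = 0; i != increments; ++i)\n                counters[t].value.fetch_add(1, std::memory_order_relaxed);\n        });\n    for (auto \u0026worker : workers) worker.join();\n    std::size_t total = 0;\n    for (auto \u0026counter : counters) total += counter.value.load();\n    return total;\n}\n\nint main() {\n    std::printf(\"Total: %zu\\n\", sum_with_padded_counters(4, 100000));\n    return 0;\n}\n```\n\nDropping the `alignas(128)` would leave the result identical, but all the counters would land on the same cache line, and every increment would ping-pong that line between cores.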
\n\n## Pro Tips\n\n### Non-Uniform Memory Access (NUMA)\n\nHandling NUMA isn't trivial and is only supported on Linux with the help of the [`libnuma` library](https://github.com/numactl/numactl).\nIt provides the `mbind` interface to pin specific memory regions to particular NUMA nodes, as well as helper functions to query the system topology, which are exposed via the `fork_union::numa_topology` template.\n\nLet's say you are working on a Big Data application, like brute-forcing Vector Search using the [SimSIMD](https://github.com/ashvardanian/simsimd) library on a dual-socket CPU system, similar to [USearch](https://github.com/unum-cloud/usearch/pulls).\nThe first part of that program may be responsible for sharding the incoming stream of data between distinct memory regions.\nThat part, in our simple example, will be single-threaded:\n\n```cpp\n#include \u003cvector\u003e // `std::vector`\n#include \u003cspan\u003e // `std::span`\n#include \u003cfork_union.hpp\u003e // `linux_numa_allocator`, `numa_topology_t`, `linux_distributed_pool_t`\n#include \u003csimsimd/simsimd.h\u003e // `simsimd_f32_cos`, `simsimd_distance_t`\n\nnamespace fu = ashvardanian::fork_union;\nusing floats_alloc_t = fu::linux_numa_allocator\u003cfloat\u003e;\n\nconstexpr std::size_t dimensions = 768; /// Matches most BERT-like models\nstatic std::vector\u003cfloat, floats_alloc_t\u003e first_half(floats_alloc_t(0));\nstatic std::vector\u003cfloat, floats_alloc_t\u003e second_half(floats_alloc_t(1));\nstatic fu::numa_topology_t numa_topology;\nstatic fu::linux_distributed_pool_t distributed_pool;\n\n/// Dynamically shards incoming vectors across 2 nodes in a round-robin fashion.\nvoid append(std::span\u003cfloat, dimensions\u003e vector) {\n    bool put_in_second = first_half.size() \u003e second_half.size();\n    if (put_in_second) second_half.insert(second_half.end(), vector.begin(), vector.end());\n    else first_half.insert(first_half.end(), vector.begin(), vector.end());\n}\n```\n\nThe 
concurrent part would involve spawning threads adjacent to every memory pool to find the best `search_result_t`.\nThe primary `search` function, in an ideal world, would look like this:\n\n1. Each thread finds the best match within its \"slice\" of a NUMA node, tracking the best distance and index in a local CPU register.\n2. All threads in each NUMA node atomically synchronize using a NUMA-local instance of `search_result_t`.\n3. The main thread collects aggregates of partial results from all NUMA nodes.\n\nThat is, however, overly complicated to implement.\nSuch tree-like hierarchical reductions are optimal in a theoretical sense. Still, weighing the low cost of spin-locking once at the end of a thread scope against the complexity of organizing the code, the more straightforward path is better.\nA minimal example would look like this:\n\n```cpp\n/// On each NUMA node we'll synchronize the threads\nstruct search_result_t {\n    simsimd_distance_t best_distance {std::numeric_limits\u003csimsimd_distance_t\u003e::max()};\n    std::size_t best_index {0};\n};\n\ninline search_result_t pick_best(search_result_t const\u0026 a, search_result_t const\u0026 b) noexcept {\n    return a.best_distance \u003c b.best_distance ? a : b;\n}\n\n/// Uses all CPU threads to search for the closest vector to the @p query.\nsearch_result_t search(std::span\u003cfloat, dimensions\u003e query) {\n\n    bool const need_to_spawn_threads = distributed_pool.threads_count() == 0;\n    if (need_to_spawn_threads) {\n        // ! Keep side effects outside of `assert`, which vanishes in \"release\" builds\n        bool const harvested = numa_topology.try_harvest();\n        assert(harvested \u0026\u0026 \"Failed to harvest NUMA topology\");\n        assert(numa_topology.nodes_count() == 2 \u0026\u0026 \"Expected exactly 2 NUMA nodes\");\n        bool const spawned = distributed_pool.try_spawn(numa_topology, sizeof(search_result_t));\n        assert(spawned \u0026\u0026 \"Failed to spawn NUMA pools\");\n    }\n\n    search_result_t result;\n    fu::spin_mutex_t result_update; // ? 
Lighter `std::mutex` alternative w/out system calls\n\n    std::size_t const total_vectors =\n        (first_half.size() + second_half.size()) / dimensions;\n\n    auto slices = distributed_pool.for_slices(total_vectors,\n        [\u0026](fu::colocated_prong\u003c\u003e first, std::size_t count) noexcept {\n\n        bool const in_second = first.colocation != 0;\n        auto const \u0026shard = in_second ? second_half : first_half;\n        std::size_t const shard_base = in_second ? first_half.size() / dimensions : 0;\n        std::size_t const local_begin = first.task - shard_base;\n\n        search_result_t thread_local_result;\n        for (std::size_t i = 0; i \u003c count; ++i) {\n            std::size_t const local_index = local_begin + i;\n            std::size_t const global_index = shard_base + local_index;\n\n            simsimd_distance_t distance;\n            simsimd_f32_cos(query.data(), shard.data() + local_index * dimensions, dimensions, \u0026distance);\n            thread_local_result = pick_best(thread_local_result, {distance, global_index});\n        }\n\n        // ! 
Still synchronizing over a shared mutex for brevity.\n        std::lock_guard\u003cfu::spin_mutex_t\u003e lock(result_update);\n        result = pick_best(result, thread_local_result);\n    });\n    slices.join();\n    return result;\n}\n```\n\nIn a dream world, we would call `distributed_pool.for_n`, but there is no clean way to make the scheduling processes aware of the data distribution in an arbitrary application, so that's left to the user.\nThe `for_slices` helper provides colocated metadata (`fu::colocated_prong`) that lets you pick the right shard of data based on the NUMA node, while keeping scheduling inside the distributed pool.\nFor more flexibility around building higher-level low-latency systems, there are unsafe APIs expecting you to manually \"join\" the broadcasted calls, like `unsafe_for_threads` and `unsafe_join`.\n\n### Efficient Busy Waiting\n\nHere's what \"busy waiting\" looks like in C++:\n\n```cpp\nwhile (!has_work_to_do())\n    std::this_thread::yield();\n```\n\nOn Linux, the `std::this_thread::yield()` translates into a `sched_yield` system call, which means context switching to the kernel and back.\nInstead, you can replace the `standard_yield_t` wrapper with a platform-specific micro-wait instruction, which is much cheaper.\nThose instructions, like [`WFET` on Arm](https://developer.arm.com/documentation/ddi0602/2025-03/Base-Instructions/WFET--Wait-for-event-with-timeout-), generally hint the CPU to transition to a low-power state.\n\n| Wrapper         | ISA          | Instruction | Privileges |\n| --------------- | ------------ | ----------- | ---------- |\n| `x86_pause_t`   | x86          | `PAUSE`     | R3         |\n| `x86_tpause_t`  | x86+WAITPKG  | `TPAUSE`    | R3         |\n| `arm64_yield_t` | AArch64      | `YIELD`     | EL0        |\n| `arm64_wfet_t`  | AArch64+WFXT | `WFET`      | EL0        |\n| `risc5_pause_t` | RISC-V       | `PAUSE`     | U          |\n\nNo kernel calls.\nNo futexes.\nWorks in tight loops.\n\n### 
Rayon-style Parallel Iterators\n\nFor Rayon-style ergonomics, use the parallel iterator API with the `prelude`.\nUnlike Rayon, Fork Union's parallel iterators don't depend on the global state and allow explicit control over the thread pool and scheduling strategy.\nFor statically shaped workloads, the default static scheduling is more efficient: \n\n```rust\nuse fork_union as fu;\nuse fork_union::prelude::*;\n\nlet mut pool = fu::spawn(4);\nlet mut data: Vec\u003cusize\u003e = (0..1000).collect();\n\n(\u0026data[..])\n    .into_par_iter()\n    .with_pool(\u0026mut pool)\n    .for_each(|value| {\n        println!(\"Value: {}\", value);\n    });\n```\n\nFor dynamic work-stealing, use `with_schedule` with `DynamicScheduler`:\n\n```rust\n(\u0026mut data[..])\n    .into_par_iter()\n    .with_schedule(\u0026mut pool, DynamicScheduler)\n    .for_each(|value| {\n        *value *= 2;\n    });\n```\n\nThis easily composes with other iterator adaptors, like `map`, `filter`, and `zip`:\n\n```rust\n(\u0026data[..])\n    .into_par_iter()\n    .filter(|\u0026x| x % 2 == 0)\n    .map(|x| x * x)\n    .with_pool(\u0026mut pool)\n    .for_each(|value| {\n        println!(\"Squared even: {}\", value);\n    });\n```\n\nMoreover, each thread can maintain its own scratch space to avoid contention during reductions.\nCache-line alignment via `CacheAligned` prevents false sharing:\n\n```rust\n// Cache-line aligned wrapper to prevent false sharing\nlet mut scratch: Vec\u003cCacheAligned\u003cusize\u003e\u003e =\n    (0..pool.threads()).map(|_| CacheAligned(0)).collect();\n\n(\u0026data[..])\n    .into_par_iter()\n    .with_pool(\u0026mut pool)\n    .fold_with_scratch(scratch.as_mut_slice(), |acc, value, _prong| {\n        acc.0 += *value;\n    });\nlet total: usize = scratch.iter().map(|a| a.0).sum();\n```\n\n## Performance\n\nOne of the most common parallel workloads is the N-body simulation ¹.\nImplementations are available in both C++ and Rust in `scripts/nbody.cpp` and 
`scripts/nbody.rs`, respectively.\nBoth are lightweight and involve little logic outside of number-crunching, so both can be easily profiled with `time` and introspected with `perf` Linux tools.\nAdditional NUMA-aware Search examples are available in `scripts/search.rs`.\n\n---\n\nC++ benchmarking results for $N=128$ bodies and $I=1e6$ iterations:\n\n| Machine        | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |\n| :------------- | ---------: | ---------: | -------------: | -------------: |\n| 16x Intel SPR  |      18.9s |      12.4s |          16.8s |           8.7s |\n| 12x Apple M2   | 1m:34.8s ² | 1m:25.9s ² |          31.5s |          20.3s |\n| 96x Graviton 4 |      32.2s |      20.8s |          39.8s |          26.0s |\n\nRust benchmarking results for $N=128$ bodies and $I=1e6$ iterations:\n\n| Machine        |  Rayon (D) |  Rayon (S) | Fork Union (D) | Fork Union (S) |\n| :------------- | ---------: | ---------: | -------------: | -------------: |\n| 16x Intel SPR  |    🔄 45.4s |    🔄 32.1s | 18.1s, 🔄 22.4s | 12.4s, 🔄 12.9s |\n| 12x Apple M2   | 🔄 1m:47.8s | 🔄 1m:07.1s | 24.5s, 🔄 26.8s | 11.0s, 🔄 11.8s |\n| 96x Graviton 4 | 🔄 2m:13.9s | 🔄 1m:35.6s |          18.9s |          10.1s |\n\n\u003e ¹ Another common workload is \"Parallel Reductions\" covered in a separate [repository](https://github.com/ashvardanian/ParallelReductionsBenchmark).\n\u003e ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing. 
It's also fair to say that OpenMP is not well-optimized for AppleClang.\n\u003e 🔄 The rotation emoji marks runs that use iterators - the default way to use Rayon, and an opt-in, slower, but more convenient variant for Fork Union.\n\nYou can rerun those benchmarks with the following commands:\n\n```bash\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release\ncmake --build build_release --config Release\ntime NBODY_COUNT=128 NBODY_ITERATIONS=1000000 NBODY_BACKEND=fork_union_static build_release/fork_union_nbody\ntime NBODY_COUNT=128 NBODY_ITERATIONS=1000000 NBODY_BACKEND=fork_union_dynamic build_release/fork_union_nbody\n```\n\n\u003e Consult the header of `scripts/nbody.cpp` and `scripts/nbody.rs` for additional benchmarking options.\n\n## Safety \u0026 Logic\n\nThere are only 3 core atomic variables in this thread-pool, plus 1 more for dynamic task-stealing.\nLet's call every invocation of a `for_*` API a \"fork\", and every exit from it a \"join\".\n\n| Variable           | User's Perspective           | Internal Usage                        |\n| :----------------- | :--------------------------- | :------------------------------------ |\n| `stop`             | Stop the entire thread-pool  | Tells workers when to exit the loop   |\n| `fork_generation`  | \"Forks\" called since init    | Tells workers to wake up on new forks |\n| `threads_to_sync`  | Threads not joined this fork | Tells main thread when workers finish |\n| `dynamic_progress` | Progress within this fork    | Tells workers which jobs to take      |\n\n### Why don't we need atomics for \"total_threads\"?\n\nThe only way to change the number of threads is to `terminate` the entire thread-pool and then `try_spawn` it again.\nEither of those operations can only be called from one thread at a time and never coincide with any running tasks.\nThat's ensured by the `stop` flag.\n\n### Why don't we need atomics for a \"job pointer\"?\n\nA new task can only be submitted from one thread that updates the number of parts for each new 
fork.\nDuring that update, the workers are asleep, spinning on old values of `fork_generation` and `stop`.\nThey only wake up and access the new value once `fork_generation` increments, ensuring safety.\n\n### How do we deal with overflows and `SIZE_MAX`-sized tasks?\n\nThe library entirely avoids saturating multiplication and only uses one saturating addition in \"release\" builds.\nTo test the consistency of arithmetic, the C++ template class can be instantiated with a custom `index_t`, such as `std::uint8_t` or `std::uint16_t`.\nIn the former case, no more than 255 threads can operate, and no more than 255 tasks can be addressed, allowing us to easily test every weird corner case of [0:255] threads competing for [0:255] tasks.\n\n### Why not reimplement it in Rust?\n\nThe original Rust implementation was a standalone library, but in essence, Rust doesn't feel designed for parallelism, concurrency, and expert Systems Engineering.\nIt enforces stringent safety rules, which is excellent for building trustworthy software, but realistically, it makes lock-free concurrent programming with minimal memory allocations too complicated.\nNow, the Rust library is a wrapper over the C binding of the C++ core implementation.\n\n## Testing and Benchmarking\n\nTo run the C++ tests, use CMake:\n\n```bash\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release -D BUILD_TESTING=ON\ncmake --build build_release --config Release -j\nctest --test-dir build_release                  # run all tests\nbuild_release/fork_union_nbody                  # run the benchmarks\n```\n\nFor C++ debug builds, consider using the VS Code debugger presets or the following commands:\n\n```bash\ncmake -B build_debug -D CMAKE_BUILD_TYPE=Debug -D BUILD_TESTING=ON\ncmake --build build_debug --config Debug        # build with Debug symbols\nbuild_debug/fork_union_test_cpp20               # run a single test executable\n```\n\nTo run static analysis:\n\n```bash\nsudo apt install cppcheck clang-tidy\ncmake --build 
build_debug --target cppcheck     # detects bugs \u0026 undefined behavior\ncmake --build build_debug --target clang-tidy   # suggest code improvements\n```\n\nTo include NUMA, Huge Pages, and other optimizations on Linux, make sure to install dependencies:\n\n```bash\nsudo apt-get -y install libnuma-dev libnuma1                # NUMA\nsudo apt-get -y install libhugetlbfs-dev libhugetlbfs-bin   # Huge Pages\nsudo ln -s /usr/bin/ld.hugetlbfs /usr/share/libhugetlbfs/ld # Huge Pages linker\n```\n\nTo build with an alternative compiler, like LLVM Clang, use the following command:\n\n```bash\nsudo apt-get install libomp-15-dev clang++-15 # OpenMP version must match Clang\ncmake -B build_debug -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_COMPILER=clang++-15\ncmake --build build_debug --config Debug\nbuild_debug/fork_union_test_cpp20\n```\n\nOr on macOS with Apple Clang:\n\n```bash\nbrew install llvm@20\ncmake -B build_debug -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_COMPILER=$(brew --prefix llvm@20)/bin/clang++\ncmake --build build_debug --config Debug\nbuild_debug/fork_union_test_cpp20\n```\n\nFor Rust, use the following command:\n\n```bash\nrustup toolchain install                # for Alloc API\ncargo miri test                         # to catch UBs\ncargo build --features numa             # for NUMA support on Linux\ncargo test --release                    # to run the tests fast\ncargo test --features numa --release    # for NUMA tests on Linux\n```\n\nTo automatically detect the Minimum Supported Rust Version (MSRV):\n\n```sh\ncargo +stable install cargo-msrv\ncargo msrv find --ignore-lockfile\n```\n\n## License\n\nLicensed under the Apache License, Version 2.0. 
See `LICENSE` for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fforkunion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fforkunion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fforkunion/lists"}