{"id":28605645,"url":"https://github.com/ashvardanian/fork_union","last_synced_at":"2025-06-13T22:09:57.527Z","repository":{"id":291327821,"uuid":"974172475","full_name":"ashvardanian/fork_union","owner":"ashvardanian","description":"Low(est?)-latency OpenMP-style minimalistic scoped thread-pool designed for 'Fork-Join' parallelism in Rust and C++, avoiding memory allocations, mutexes, CAS-primitives, and false-sharing on the hot path 🍴","archived":false,"fork":false,"pushed_at":"2025-06-13T15:53:18.000Z","size":228,"stargazers_count":80,"open_issues_count":6,"forks_count":10,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-06-13T16:58:11.540Z","etag":null,"topics":["arm","atomics","compare-and-swap","concurrency","memory-model","mpi","multithreading","openmp","parallel-computing","parallel-stl","parallelism","rayon","thread-pool","threadpool"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/posts/beyond-openmp-in-cpp-rust/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-28T11:13:59.000Z","updated_at":"2025-06-13T15:53:21.000Z","dependencies_parsed_at":"2025-06-13T16:53:24.379Z","dependency_job_id":"427e7743-3f2d-4376-a6c5-030ed1570cd4","html_url":"https://github.com/ashvardanian/fork_union","commit_stats":null,"previous_names":["ashvardanian/fork_union"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/ashvardanian/fork_union","repository_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Ffork_union","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Ffork_union/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Ffork_union/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Ffork_union/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/fork_union/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Ffork_union/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259727148,"owners_count":22902183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arm","atomics","compare-and-swap","concurrency","memory-model","mpi","multithreading","openmp","parallel-computing","parallel-stl","parallelism","rayon","thread-pool","threadpool"],"created_at":"2025-06-11T19:01:25.383Z","updated_at":"2025-06-13T22:09:57.516Z","avatar_url":"https://github.com/ashvardanian.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# Fork Union 🍴\n\nThe __`fork_union`__ library is a thread-pool for \"Fork-Join\" [SIMT-style](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads) parallelism for Rust and C++.\nIt's quite different from most open-source thread-pool implementations, generally designed around heap-allocated \"queues of tasks\", synchronized by a \"mutex\".\nIn 
C++, wrapping tasks into a `std::function` is expensive, as is growing the `std::queue` and locking the `std::mutex` under contention.\nSame for Rust.\nWhen you can avoid it - you should.\nOpenMP-like use-cases are the perfect example of that!\n\n![`fork_union` banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/fork_union.jpg?raw=true)\n\nOpenMP, however, isn't great for fine-grained parallelism, when different pieces of your application logic need to work on different sets of threads.\nThis is where __`fork_union`__ comes in with a minimalistic STL implementation of a thread-pool, avoiding dynamic memory allocations and exceptions on the hot path, and prioritizing lock-free and [CAS](https://en.wikipedia.org/wiki/Compare-and-swap)-free user-space \"atomics\" over [system calls](https://en.wikipedia.org/wiki/System_call).\n\n## Usage\n\nThe __`fork_union`__ is dead-simple!\nThere is no nested parallelism, exception-handling, or \"futures and promises\".\nThe thread pool has just one core API - `broadcast` to launch a callback on each thread.\nThe higher-level APIs for index-addressable tasks are:\n\n- `for_n` - for individual evenly-sized tasks.\n- `for_n_dynamic` - for individual unevenly-sized tasks.\n- `for_slices` - for slices of evenly-sized tasks.\n\nAll of them are available in C++ and Rust.\n\n### Usage in Rust\n\nA minimal example may look like this:\n\n```rs\nuse fork_union as fu;\nlet pool = fu::spawn(2);\npool.broadcast(|thread_index| {\n    println!(\"Hello from thread # {}\", thread_index + 1);\n});\n```\n\nHigher-level APIs distribute tasks across the threads in the pool:\n\n```rs\nfu::for_n(pool, 100, |prong| {\n    println!(\"Running task {} on thread # {}\",\n        prong.task_index + 1, prong.thread_index + 1);\n});\nfu::for_slices(pool, 100, |prong, count| {\n    println!(\"Running slice [{}, {}) on thread # {}\",\n        prong.task_index, prong.task_index + count, prong.thread_index + 1);\n});\nfu::for_n_dynamic(pool, 100, 
|prong| {\n    println!(\"Running task {} on thread # {}\",\n        prong.task_index + 1, prong.thread_index + 1);\n});\n```\n\nA safer `try_spawn_in` interface is recommended, using the Allocator API.\nA more realistic example may look like this:\n\n```rs\n#![feature(allocator_api)]\nuse std::thread;\nuse std::error::Error;\nuse std::alloc::Global;\nuse fork_union as fu;\n\nfn heavy_math(_: usize) {}\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn Error\u003e\u003e {\n    let pool = fu::ThreadPool::try_spawn(4)?;\n    let pool = fu::ThreadPool::try_spawn_in(4, Global)?;\n    let pool = fu::ThreadPool::try_named_spawn(\"heavy-math\", 4)?;\n    let pool = fu::ThreadPool::try_named_spawn_in(\"heavy-math\", 4, Global)?;\n    fu::for_n_dynamic(pool, 400, |prong| {\n        heavy_math(prong.1);\n    });\n    Ok(())\n}\n```\n\n### Usage in C++\n\nTo integrate into your C++ project, either just copy the `include/fork_union.hpp` file into your project, add a Git submodule, or CMake.\nFor a Git submodule, run:\n\n```bash\ngit submodule add https://github.com/ashvardanian/fork_union.git extern/fork_union\n```\n\nAlternatively, using CMake:\n\n```cmake\nFetchContent_Declare(\n    fork_union\n    GIT_REPOSITORY\n    https://github.com/ashvardanian/fork_union\n)\nFetchContent_MakeAvailable(fork_union)\ntarget_link_libraries(your_target PRIVATE fork_union::fork_union)\n```\n\nThen, include the header in your C++ code:\n\n```cpp\n#include \u003cfork_union.hpp\u003e   // `thread_pool_t`\n#include \u003ccstdio\u003e           // `stderr`\n#include \u003ccstdlib\u003e          // `EXIT_SUCCESS`\n\nnamespace fu = ashvardanian::fork_union;\n\nint main() {\n    fu::thread_pool_t pool;\n    if (!pool.try_spawn(std::thread::hardware_concurrency())) {\n        std::fprintf(stderr, \"Failed to fork the threads\\n\");\n        return EXIT_FAILURE;\n    }\n\n    // Dispatch a callback to each thread in the pool\n    pool.broadcast([\u0026](std::size_t thread_index) noexcept {\n        
std::printf(\"Hello from thread # %zu (of %zu)\\n\", thread_index + 1, pool.count_threads());\n    });\n\n    // Execute 1000 tasks in parallel, expecting them to have comparable runtimes\n    // and mostly co-locating subsequent tasks on the same thread. Analogous to:\n    //\n    //      #pragma omp parallel for schedule(static)\n    //      for (int i = 0; i \u003c 1000; ++i) { ... }\n    //\n    // You can also think about it as a shortcut for the `for_slices` + `for`.\n    fu::for_n(pool, 1000, [](std::size_t task_index) noexcept {\n        std::printf(\"Running task %zu of 1000\\n\", task_index + 1);\n    });\n    fu::for_slices(pool, 1000, [](std::size_t first_index, std::size_t count) noexcept {\n        std::printf(\"Running slice [%zu, %zu)\\n\", first_index, first_index + count);\n    });\n\n    // Like `for_n`, but each thread greedily steals tasks, without waiting for\n    // the others or expecting individual tasks to have the same runtimes. Analogous to:\n    //\n    //      #pragma omp parallel for schedule(dynamic, 1)\n    //      for (int i = 0; i \u003c 3; ++i) { ... 
}\n    fu::for_n_dynamic(pool, 3, [](std::size_t task_index) noexcept {\n        std::printf(\"Running dynamic task %zu of 3\\n\", task_index + 1);\n    });\n    return EXIT_SUCCESS;\n}\n```\n\nThat's it.\n\n## Why Not Use $𝑋$\n\nThere are many other thread-pool implementations that are more feature-rich but have different limitations and design goals.\n\n- Modern C++: [`taskflow/taskflow`](https://github.com/taskflow/taskflow), [`progschj/ThreadPool`](https://github.com/progschj/ThreadPool), [`bshoshany/thread-pool`](https://github.com/bshoshany/thread-pool)\n- Traditional C++: [`vit-vit/CTPL`](https://github.com/vit-vit/CTPL), [`mtrebi/thread-pool`](https://github.com/mtrebi/thread-pool)\n- Rust: [`tokio-rs/tokio`](https://github.com/tokio-rs/tokio), [`rayon-rs/rayon`](https://github.com/rayon-rs/rayon), [`smol-rs/smol`](https://github.com/smol-rs/smol)\n\nThose are not designed for the same OpenMP-like use-cases as __`fork_union`__.\nInstead, they primarily focus on task queueing, which requires a lot more work.\n\n### Locks and Mutexes\n\nUnlike `std::atomic`, a `std::mutex` may involve a system call, and it can be expensive to acquire and release.\nIts implementations generally have 2 executable paths:\n\n- the fast path, where the mutex is not contended: it first tries to grab the mutex via a compare-and-swap operation and, if that succeeds, returns immediately.\n- the slow path, where the mutex is contended, and it has to go through the kernel to block the thread until the mutex is available.\n\nOn Linux, the latter translates to a [\"futex\" syscall](https://en.wikipedia.org/wiki/Futex), which is expensive.\n\n### Memory Allocations\n\nC++ has rich functionality for concurrent applications, like `std::future`, `std::packaged_task`, `std::function`, `std::queue`, `std::condition_variable`, and so on.\nMost of those, I believe, aren't usable in Big-Data applications, where you always operate in memory-constrained environments:\n\n- The idea of 
raising a `std::bad_alloc` exception when there is no memory left, and just hoping that someone up the call stack will catch it, is simply not a great design for Systems Engineering.\n- The threat of having to synchronize ~200 physical CPU cores across 2-8 sockets, and potentially dozens of [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes around a shared global memory allocator, practically means you can't have predictable performance.\n\nAs we focus on a simpler ~~concurrency~~ parallelism model, we can avoid the complexity of allocating shared states, wrapping callbacks into some heap-allocated \"tasks\", and a lot of other boilerplate.\nLess work - more performance.\n\n### Atomics and CAS\n\nOnce you get to the lowest-level primitives of concurrency, you end up with `std::atomic` and a small set of hardware-supported atomic instructions.\nHardware implements them differently:\n\n- x86 is built around the \"Total Store Order\" (TSO) [memory consistency model](https://en.wikipedia.org/wiki/Memory_ordering) and provides `LOCK` variants of the `ADD` and `CMPXCHG`, which act as full-blown \"fences\" - no loads or stores can be reordered across them.\n- Arm, on the other hand, has a \"weak\" memory model and provides a set of atomic instructions that are not fences and that match the C++ concurrency model. It offers `acquire`, `release`, and `acq_rel` variants of each atomic instruction, such as `LDADD`, `STADD`, and `CAS`, which allow precise control over visibility and ordering, especially with the introduction of \"Large System Extensions\" (LSE) instructions in Armv8.1.\n\nIn practice, a locked atomic on x86 requires the cache line in the Exclusive state in the requester's L1 cache.\nThis will incur a coherence transaction (Read-for-Ownership) if some other core had the line.\nBoth Intel and AMD handle this similarly.\n\nIt makes [Arm and Power much more suitable for lock-free 
programming](https://arangodb.com/2021/02/cpp-memory-model-migrating-from-x86-to-arm/) and concurrent data structures, but some observations hold for both platforms.\nMost importantly, \"Compare and Swap\" (CAS) is a very expensive operation, and should be avoided at all costs.\n\nOn x86, for example, the `LOCK ADD` [can easily take 50 CPU cycles](https://travisdowns.github.io/blog/2020/07/06/concurrency-costs), being 50x slower than a regular `ADD` instruction, but still easily 5-10x faster than a `LOCK CMPXCHG` instruction.\nOnce contention rises, the gap naturally widens, and is further amplified by the increased \"failure\" rate of the CAS operation, when the value being compared has already changed.\nThat's why for the \"dynamic\" mode, we resort to using an additional atomic variable as opposed to more typical CAS-based implementations.\n\n### Alignment\n\nAssuming a thread-pool is a heavy object anyway, nobody will care if it's a bit larger than expected.\nThat allows us to over-align the internal counters to `std::hardware_destructive_interference_size` to avoid false sharing.\nIn that case, even on x86, where the entire cache line will be exclusively owned by a single thread, in eager mode, we end up effectively \"pipelining\" the execution, where one thread may be incrementing the \"in-flight\" counter, while the other is decrementing the \"remaining\" counter, and others are executing the loop body in-between.\n\n## Performance\n\nOne of the most common parallel workloads is the N-body simulation ¹.\nAn implementation is available in both C++ and Rust in `scripts/nbody.cpp` and `scripts/nbody.rs` respectively.\nBoth are extremely light-weight and involve little logic outside of number-crunching, so both can be easily profiled with `time` and introspected with `perf` Linux tools. 
\n\n---\n\nC++ benchmarking results for $N=128$ bodies and $I=1e6$ iterations:\n\n| Machine        | OpenMP (D) | OpenMP (S) | Fork Union (D) | Fork Union (S) |\n| :------------- | ---------: | ---------: | -------------: | -------------: |\n| 16x Intel SPR  |      20.3s |      16.0s |          18.1s |          10.3s |\n| 12x Apple M2   |          ? |   1m:16.7s |     1m:30.3s ² |     1m:40.7s ² |\n| 96x Graviton 4 |      32.2s |      20.8s |          39.8s |          26.0s |\n\nRust benchmarking results for $N=128$ bodies and $I=1e6$ iterations:\n\n| Machine        | Rayon (D) | Rayon (S) | Fork Union (D) | Fork Union (S) |\n| :------------- | --------: | --------: | -------------: | -------------: |\n| 16x Intel SPR  |     51.4s |     38.1s |          15.9s |           9.8s |\n| 12x Apple M2   |  3m:23.5s |   2m:0.6s |        4m:8.4s |       1m:20.8s |\n| 96x Graviton 4 |  2m:13.9s |  1m:35.6s |          18.9s |          10.1s |\n\n\u003e ¹ Another common workload is \"Parallel Reductions\" covered in a separate [repository](https://github.com/ashvardanian/ParallelReductionsBenchmark).\n\u003e ² When a combination of performance and efficiency cores is used, dynamic stealing may be more efficient than static slicing.\n\n## Safety \u0026 Logic\n\nThere are only 3 core atomic variables in this thread-pool, and some of them are practically optional.\nLet's call every invocation of a `for_*` API - a \"fork\", and every exit from it a \"join\".\n\n| Variable          | Users Perspective            | Internal Usage                        |\n| :---------------- | :--------------------------- | :------------------------------------ |\n| `stop`            | Stop the entire thread-pool  | Tells workers when to exit the loop   |\n| `fork_generation` | \"Forks\" called since init    | Tells workers to wake up on new forks |\n| `threads_to_sync` | Threads not joined this fork | Tells main thread when workers finish |\n\n__Why don't we need atomics for 
\"total_threads\"?__\nThe only way to change the number of threads is to `stop_and_reset` the entire thread-pool and then `try_spawn` it again.\nEither of those operations can only be called from one thread at a time and never coincides with any running tasks.\nThat's ensured by the `stop`.\n\n__Why don't we need atomics for a \"job pointer\"?__\nA new task can only be submitted from one thread, that updates the number of parts for each new fork.\nDuring that update, the workers are asleep, spinning on old values of `fork_generation` and `stop`.\nThey only wake up and access the new value once `fork_generation` increments, ensuring safety.\n\n__How do we deal with overflows and `SIZE_MAX`-sized tasks?__\nThe library entirely avoids saturating multiplication and only uses one saturating addition in \"release\" builds.\nTo test the consistency of arithmetic, the C++ template class can be instantiated with a custom `index_t`, such as `std::uint8_t` or `std::uint16_t`.\nIn the former case, no more than 255 threads can operate and no more than 255 tasks can be addressed, allowing us to easily test every weird corner case of [0:255] threads competing for [0:255] tasks.\n\n## Testing and Benchmarking\n\nTo run the C++ tests, use CMake:\n\n```bash\ncmake -B build_release -D CMAKE_BUILD_TYPE=Release -D CMAKE_CXX_COMPILER=clang++-15\ncmake --build build_release --config Release\nbuild_release/scripts/fork_union_test_cpp20\n```\n\nFor C++ debug builds, consider using the VS Code debugger presets or the following commands:\n\n```bash\ncmake --build build_debug --config Debug\nbuild_debug/scripts/fork_union_test_cpp20\n```\n\nTo build with an alternative compiler, like LLVM Clang, use the following command:\n\n```bash\nsudo apt-get install libomp-15-dev clang++-15 # OpenMP version must match Clang\ncmake -B build_debug -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_COMPILER=clang++-15\ncmake --build build_debug --config Debug\nbuild_debug/scripts/fork_union_test_cpp20\n```\n\nFor Rust, 
use the following command:\n\n```bash\nrustup toolchain install # for Alloc API\ncargo miri test          # to catch UBs\ncargo test --release     # to run the tests fast\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Ffork_union","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Ffork_union","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Ffork_union/lists"}