{"id":13611004,"url":"https://github.com/mratsim/weave","last_synced_at":"2025-04-05T18:05:00.540Z","repository":{"id":36197965,"uuid":"197920590","full_name":"mratsim/weave","owner":"mratsim","description":"A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead","archived":false,"fork":false,"pushed_at":"2024-06-29T05:19:43.000Z","size":9001,"stargazers_count":555,"open_issues_count":44,"forks_count":21,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-03T19:52:09.201Z","etag":null,"topics":["data-parallelism","fork-join","message-passing","multithreading","openmp","parallelism","runtime","scheduler","task-parallelism","task-scheduler","threadpool","work-stealing"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mratsim.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":null,"funding":null,"license":"LICENSE-APACHEv2","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-20T11:22:00.000Z","updated_at":"2025-03-31T15:01:22.000Z","dependencies_parsed_at":"2024-01-12T03:35:39.660Z","dependency_job_id":"7fb92c72-17bb-4564-924a-ceba71d7c833","html_url":"https://github.com/mratsim/weave","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fweave","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fweave/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fweave/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fweave/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mratsim","download_url":"https://codeload.github.com/mratsim/weave/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247378135,"owners_count":20929296,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-parallelism","fork-join","message-passing","multithreading","openmp","parallelism","runtime","scheduler","task-parallelism","task-scheduler","threadpool","work-stealing"],"created_at":"2024-08-01T19:01:50.806Z","updated_at":"2025-04-05T18:05:00.516Z","avatar_url":"https://github.com/mratsim.png","language":"Nim","funding_links":[],"categories":["Nim","Language Features"],"sub_categories":["Threading"],"readme":"# Weave, a state-of-the-art multithreading runtime\n[![Github Actions CI](https://github.com/mratsim/weave/workflows/Weave%20CI/badge.svg)](https://github.com/mratsim/weave/actions?query=workflow%3A%22Weave+CI%22+branch%3Amaster)\\\n[![License: Apache](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n![Stability: experimental](https://img.shields.io/badge/stability-experimental-orange.svg)\n\n_\"Good artists borrow, great artists steal.\"_ -- Pablo Picasso\n\nWeave (codenamed \"Project Picasso\") is a multithreading runtime for the [Nim programming language](https://nim-lang.org/).\n\nIt is 
continuously tested on Linux, MacOS and Windows for the following CPU architectures: x86, x86_64 and ARM64 with the C and C++ backends.\n\nWeave aims to provide a composable, high-performance, ultra-low overhead and fine-grained parallel runtime that frees developers from the common worries of\n\"are my tasks big enough to be parallelized?\", \"what should be my grain size?\", \"what if the time they take is completely unknown or different?\" or \"is parallel-for worth it if it's just a matrix addition? On what CPUs? What if it's exponentiation?\".\n\nThorough benchmarks track Weave performance against industry standard runtimes in C, C++ and Cilk on both Task parallelism and Data parallelism with a variety of workloads:\n- Compute-bound\n- Memory-bound\n- Load Balancing\n- Runtime-overhead bound (i.e. trillions of tasks in a couple milliseconds)\n- Nested parallelism\n\nBenchmarks are drawn from recursive tree algorithms, finance, linear algebra, High Performance Computing and game simulations.\nIn particular Weave displays 3x to 10x less overhead than Intel TBB and GCC OpenMP\non overhead-bound benchmarks.\n\nAt the implementation level, Weave's unique feature is being based on Message-Passing\ninstead of traditional work-stealing with shared-memory deques.\n\n\u003e ⚠️ Disclaimer:\n\u003e\n\u003e Only 1 out of 2 complex synchronization primitives was formally verified\n\u003e to be deadlock-free. 
They were not submitted to an additional data race\n\u003e detection tool to ensure proper implementation.\n\u003e\n\u003e Furthermore worker threads are state-machines and\n\u003e were not formally verified either.\n\u003e\n\u003e Weave does limit synchronization to only simple SPSC and MPSC channels which greatly reduces\n\u003e the potential bug surface.\n\n## Installation\n\nWeave can be simply installed with\n```bash\nnimble install weave\n```\n\nor for the devel version\n```bash\nnimble install weave@#master\n```\n\nWeave requires at least Nim v1.2.0\n\n## Changelog\n\nThe latest changes are available in the ![changelog.md](changelog.md) file.\n\n## Demos\n\nA raytracing demo is available, head over to [demos/raytracing](demos/raytracing).\n\n![ray_trace_300samples_nim_threaded.png](demos/raytracing/ray_trace_300samples_nim_threaded.png)\n\n\n## Table of Contents\n\n- [Weave, a state-of-the-art multithreading runtime](#weave-a-state-of-the-art-multithreading-runtime)\n  - [Installation](#installation)\n  - [Changelog](#changelog)\n  - [Demos](#demos)\n  - [Table of Contents](#table-of-contents)\n  - [API](#api)\n    - [Task parallelism](#task-parallelism)\n    - [Data parallelism](#data-parallelism)\n      - [Strided loops](#strided-loops)\n    - [Complete list](#complete-list)\n      - [Root thread](#root-thread)\n      - [Weave worker thread](#weave-worker-thread)\n      - [Foreign thread \u0026 Background service (experimental)](#foreign-thread--background-service-experimental)\n  - [Platforms supported](#platforms-supported)\n    - [C++ compilation](#c-compilation)\n    - [Windows 32-bit](#windows-32-bit)\n    - [Resource-restricted devices](#resource-restricted-devices)\n  - [Backoff mechanism](#backoff-mechanism)\n    - [Weave using all CPUs](#weave-using-all-cpus)\n  - [Experimental features](#experimental-features)\n    - [Data parallelism (experimental features)](#data-parallelism-experimental-features)\n      - [Awaitable loop](#awaitable-loop)\n    
  - [Parallel For Staged](#parallel-for-staged)\n      - [Parallel Reduction](#parallel-reduction)\n    - [Dataflow parallelism](#dataflow-parallelism)\n      - [Delayed computation with single dependencies](#delayed-computation-with-single-dependencies)\n      - [Delayed computation with multiple dependencies](#delayed-computation-with-multiple-dependencies)\n      - [Delayed loop computation](#delayed-loop-computation)\n    - [Lazy Allocation of Flowvars](#lazy-allocation-of-flowvars)\n  - [Limitations](#limitations)\n  - [Statistics](#statistics)\n  - [Tuning](#tuning)\n  - [Unique features](#unique-features)\n  - [Research](#research)\n  - [License](#license)\n\n## API\n\n### Task parallelism\n\nWeave provides a simple API based on spawn/sync which works like async/await for IO-based futures.\n\nThe traditional parallel recursive Fibonacci would be written like this:\n```Nim\nimport weave\n\nproc fib(n: int): int =\n  # int64 on x86-64\n  if n \u003c 2:\n    return n\n\n  let x = spawn fib(n-1)\n  let y = fib(n-2)\n\n  result = sync(x) + y\n\nproc main() =\n  var n = 20\n\n  init(Weave)\n  let f = fib(n)\n  exit(Weave)\n\n  echo f\n\nmain()\n```\n\n### Data parallelism\n\nWeave provides nestable parallel for loop.\n\nA nested matrix transposition would be written like this:\n\n```Nim\nimport weave\n\nfunc initialize(buffer: ptr UncheckedArray[float32], len: int) =\n  for i in 0 ..\u003c len:\n    buffer[i] = i.float32\n\nproc transpose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =\n  ## Transpose a MxN matrix into a NxM matrix with nested for loops\n\n  parallelFor j in 0 ..\u003c N:\n    captures: {M, N, bufIn, bufOut}\n    parallelFor i in 0 ..\u003c M:\n      captures: {j, M, N, bufIn, bufOut}\n      bufOut[j*M+i] = bufIn[i*N+j]\n\nproc main() =\n  let M = 200\n  let N = 2000\n\n  let input = newSeq[float32](M*N)\n  # We can't work with seq directly as it's managed by GC, take a ptr to the buffer.\n  let bufIn = cast[ptr 
UncheckedArray[float32]](input[0].unsafeAddr)\n  bufIn.initialize(M*N)\n\n  var output = newSeq[float32](N*M)\n  let bufOut = cast[ptr UncheckedArray[float32]](output[0].addr)\n\n  init(Weave)\n  transpose(M, N, bufIn, bufOut)\n  exit(Weave)\n\nmain()\n```\n\n#### Strided loops\n\nYou might want to use loops with a non unit-stride, this can be done with the following syntax.\n\n```Nim\nimport weave\n\ninit(Weave)\n\n# expandMacros:\nparallelForStrided i in 0 ..\u003c 100, stride = 30:\n  parallelForStrided j in 0 ..\u003c 200, stride = 60:\n    captures: {i}\n    log(\"Matrix[%d, %d] (thread %d)\\n\", i, j, myID())\n\nexit(Weave)\n```\n\n### Complete list\n\nWe separate the list depending on the threading context\n\n#### Root thread\n\nThe root thread is the thread that started the Weave runtime. It has special privileges.\n\n- `init(Weave)`, `exit(Weave)` to start and stop the runtime. Forgetting this will give you nil pointer exceptions on spawn.\\\n  The thread that calls `init` will become the root thread.\n- `syncRoot(Weave)` is a global barrier. The root thread will not continue beyond\n  until all tasks in the runtime are finished.\n\n#### Weave worker thread\n\nA worker thread is automatically created per (logical) core on the machine.\nThe root thread is also a worker thread.\nWorker threads are tuned to maximize throughput of computational **tasks**.\n\n- `spawn fnCall(args)` which spawns a function that may run on another thread and gives you an awaitable `Flowvar` handle.\n- `newFlowEvent`, `trigger`, `spawnOnEvent` and `spawnOnEvents` (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.\n- `sync(Flowvar)` will await a Flowvar and block until you receive a result.\n- `isReady(Flowvar)` will check if `sync` will actually block or return the result immediately.\n\n- `syncScope` is a scope barrier. 
The thread will not move beyond the scope until\n  all tasks and parallel loops spawned and their descendants are finished.\n  `syncScope` is composable, it can be called by any thread, it can be nested.\n  It has the syntax of a block statement:\n  ```Nim\n  syncScope():\n    parallelFor i in 0 ..\u003c N:\n      captures: {a, b}\n      parallelFor j in 0 ..\u003c N:\n        captures: {i, a, b}\n    spawn foo()\n  ```\n  In this example, the thread encountering syncScope will create all the tasks for parallel loop i, will spawn foo() and then will be waiting at the end of the scope.\n  A thread blocked at the end of its scope is not idle, it still helps processing all the work existing and that\n  may be created by the current tasks.\n- `parallelFor`, `parallelForStrided`, `parallelForStaged`, `parallelForStagedStrided` are described above and in the experimental section.\n- `loadBalance(Weave)` gives the runtime the opportunity to distribute work. Insert this within long computation as due to Weave design, it's the busy workers that are also in charge of load balancing. This is done automatically when using `parallelFor`.\n- `isSpawned(Flowvar)` allows you to build speculative algorithm where a thread is spawned only if certain conditions are valid. See the `nqueens` benchmark for an example.\n- `getThreadId(Weave)` returns a unique thread ID. 
The thread ID is in the range 0 ..\u003c number of threads.\n\nThe max number of worker threads can be configured by the environment variable WEAVE_NUM_THREADS\nand default to your number of logical cores (including HyperThreading).\nWeave uses Nim's `countProcessors()` in `std/cpuinfo`\n\n#### Foreign thread \u0026 Background service (experimental)\n\nWeave can also be run as a background service and process `jobs` similar to the `Executor` concept in C++.\nJobs will be processed in FIFO order.\n\n\u003e **Experimental**:\n\u003e   The distinction between spawn/sync on a Weave thread\n\u003e   and submit/waitFor on a foreign thread may be removed in the future.\n\nA background service can be started with either:\n- `thr.runInBackground(Weave)`\n- or `thr.runInBackground(Weave, signalShutdown: ptr Atomic[bool])`\n\nwith `thr` an uninitialized `Thread[void]` or `Thread[ptr Atomic[bool]]`\n\nThen the foreign thread should call:\n- `setupSubmitterThread(Weave)`: Configure a thread so that it can send jobs to a background Weave service\nand on shutdown\n- `waitUntilReady(Weave)`: Block the foreign thread until the Weave runtime is ready to accept jobs.\n\nand for shutdown\n- `teardownSubmitterThread(Weave)`: Cleanup Weave resources allocated on the thread.\n\nOnce setup, a foreign thread can submit jobs via:\n\n- `submit fnCall(args)` which submits a function to the Weave runtime and gives you an awaitable `Pending` handle.\n- `newFlowEvent`, `trigger`, `submitOnEvent` and `submitOnEvents` (experimental) to delay a task until some dependencies are met. 
This allows expressing precise data dependencies and producer-consumer relationships.\n- `waitFor(Pending)` which await a Pending job result and blocks the current thread\n- `isReady(Pending)` will check if `waitFor` will actually block or return the result immediately.\n- `isSubmitted(job)` allows you to build speculative algorithm where a job is submitted only if certain conditions are valid.\n\nWithin a job, tasks can be spawned and parallel for constructs can be used.\n\nIf `runInBackground()` does not provide fine enough control, a Weave background event loop\ncan be customized using the following primitive:\n- at a very low-level:\n  - The root thread primitives: `init(Weave)` and `exit(Weave)`\n  - `processAllandTryPark(Weave)`: Process all pending jobs and try sleeping. The sleep may fail to avoid deadlocks\n      if a job is submitted concurrently. This should be used in a `while true` event loop.\n- at a medium level:\n  - `runForever(Weave)`: Start a never-ending event loop that processes all pending jobs and sleep until new work arrives.\n  - `runUntil(Weave, signalShutdown: ptr Atomic[bool])`: Start an event-loop that quits on signal.\n\nFor example:\n```Nim\nproc runUntil*(_: typedesc[Weave], signal: ptr Atomic[bool]) =\n  ## Start a Weave event loop until signal is true on the current thread.\n  ## It wakes-up on job submission, handles multithreaded load balancing,\n  ## help process tasks\n  ## and spin down when there is no work anymore.\n  preCondition: not signal.isNil\n  while not signal[].load(moRelaxed):\n    processAllandTryPark(Weave)\n  syncRoot(Weave)\n\nproc runInBackground*(\n       _: typedesc[Weave],\n       signalShutdown: ptr Atomic[bool]\n     ): Thread[ptr Atomic[bool]] =\n  ## Start the Weave runtime on a background thread.\n  ## It wakes-up on job submissions, handles multithreaded load balancing,\n  ## help process tasks\n  ## and spin down when there is no work anymore.\n  proc eventLoop(shutdown: ptr Atomic[bool]) {.thread.} 
=\n    init(Weave)\n    Weave.runUntil(shutdown)\n    exit(Weave)\n  result.createThread(eventLoop, signalShutdown)\n```\n\n## Platforms supported\n\nWeave supports all platforms with `pthread` and Windows.\nMissing pthread functionality may be emulated or unused.\nFor example on MacOS, the `pthread` implementation does not expose barrier functionality or affinity settings.\n\n### C++ compilation\n\nThe `syncScope` feature will not compile correctly in C++ mode if it is used in a for loop.\nUpstream: https://github.com/nim-lang/Nim/issues/14118\n\n### Windows 32-bit\n\nWindows 32-bit targets cannot use the MinGW compiler as it is missing support\nfor `EnterSynchronizationBarrier`. MSVC should work instead.\n\n### Resource-restricted devices\n\nWeave uses a flexible and efficient memory subsystem that has been optimized for a wide range of hardware: low power Raspberry Pi, phones, laptops, desktops and 30+ cores workstations.\nIt currently assumes by default that 16KB at least are available on your hardware for a memory pool and that this memory pool can grow as needed.\nThis can be tuned with `-d:WV_MemArenaSize=2048` to have the base pool use 2KB for example.\nThe pool size should be a multiple of 256 bytes.\nPRs to improve support of very restricted devices are welcome.\n\n## Backoff mechanism\n\nA Backoff mechanism is enabled by default. It allows workers with no tasks to sleep instead of spinning aimlessly and burning CPU cycles.\n\nIt can be disabled with `-d:WV_Backoff=off`.\n\n### Weave using all CPUs\n\nWeave multithreading is cooperative, idle threads send steal requests instead of actively stealing in other workers queue. This is called \"work-requesting\" in the literature as opposed to \"work-stealing\".\n\nThis means that a thread sleeping or stuck in a long computation may starve other threads and they will spin burning CPU cycles.\n\n- Don't sleep or block a thread as this blocks Weave scheduler. 
This is similar to `async`/`await` libraries.\n- If you really need to sleep or block the root thread, make sure to empty all the tasks beforehand with `syncRoot(Weave)` in the root thread. The child threads will be put to sleep until new tasks are spawned.\n- The `loadBalance(Weave)` call can be used in the middle of heavy computations to force the worker to answer steal requests. This is automatically done in `parallelFor` loops.\n  `loadBalance(Weave)` is a very fast call that makes a worker thread check its queue\n  and dispatch its pending tasks to others. It does not block.\n\nThe root thread is the thread that called `init(Weave)`.\n\n## Experimental features\n\nExperimental features might see API and/or implementation changes.\n\nFor example both parallelForStaged and parallelReduce allow reductions but\nparallelForStaged is more flexible; however, it requires explicit use of locks and/or atomics.\n\nLazyFlowvars may be enabled by default for certain sizes or if escape analysis becomes possible\nor if we prevent Flowvars from escaping their scope.\n\n### Data parallelism (experimental features)\n\n#### Awaitable loop\n\nLoops can be awaited. Awaitable loops return a normal Flowvar.\n\nThis blocks the thread that spawned the parallel loop from continuing until the loop is resolved. The thread does not stay idle and will steal and run other tasks while being blocked.\n\nCalling `sync` on the awaitable loop Flowvar will return `true` for the last thread to exit the loop and `false` for the others.\n- Due to dynamic load-balancing, an unknown number of threads will execute the loop.\n- It's the thread that spawned the loop task that will always be the last thread to exit.\n  The `false` value is only internal to `Weave`.\n\n\u003e ⚠️ This is not a barrier: if that loop spawns tasks (including via a nested loop) and exits, the thread will continue; it will not wait for the grandchildren tasks to be finished. 
Use a `syncScope` section to wait on all tasks and descendants including grandchildren.\n\n```Nim\nimport weave\n\ninit(Weave)\n\n# expandMacros:\nparallelFor i in 0 ..\u003c 10:\n  awaitable: iLoop\n  echo \"iteration: \", i\n\nlet wasLastThread = sync(iLoop)\necho wasLastThread\n\nexit(Weave)\n```\n\n\n#### Parallel For Staged\n\nWeave provides a `parallelForStaged` construct with supports for thread-local prologue and epilogue.\n\nA parallel sum would look like this:\n```Nim\nproc sumReduce(n: int): int =\n  let res = result.addr # For mutation we need to capture the address.\n\n  parallelForStaged i in 0 .. n:\n    captures: {res}\n    awaitable: iLoop\n    prologue:\n      var localSum = 0\n    loop:\n      localSum += i\n    epilogue:\n      echo \"Thread \", getThreadID(Weave), \": localsum = \", localSum\n      res[].atomicInc(localSum)\n\n  let wasLastThread = sync(iLoop)\n\ninit(Weave)\nlet sum1M = sumReduce(1000000)\necho \"Sum reduce(0..1000000): \", sum1M\ndoAssert sum1M == 500_000_500_000\nexit(Weave)\n```\n\n`parallelForStagedStrided` is also provided.\n\n#### Parallel Reduction\n\nWeave provides a parallel reduction construct that avoids having to use explicit synchronization like atomics or locks\nbut instead uses Weave `sync(Flowvar)` under-the-hood.\n\nSyntax is the following:\n\n```Nim\nproc sumReduce(n: int): int =\n  var waitableSum: Flowvar[int]\n\n  # expandMacros:\n  parallelReduceImpl i in 0 .. 
n, stride = 1:\n    reduce(waitableSum):\n      prologue:\n        var localSum = 0\n      fold:\n        localSum += i\n      merge(remoteSum):\n        localSum += sync(remoteSum)\n      return localSum\n\n  result = sync(waitableSum)\n\ninit(Weave)\nlet sum1M = sumReduce(1000000)\necho \"Sum reduce(0..1000000): \", sum1M\ndoAssert sum1M == 500_000_500_000\nexit(Weave)\n```\n\nIn the future the `waitableSum` will probably be not required to be declared beforehand.\nOr parallel reduce might be removed to only keep parallelForStaged.\n\n### Dataflow parallelism\n\nDataflow parallelism allows expressing fine-grained data dependencies between tasks.\nConcretely a task is delayed until all its dependencies are met and once met,\nit is triggered immediately.\n\nThis allows precise specification of data producer-consumer relationships.\n\nIn contrast, classic task parallelism can only express control-flow dependencies (i.e. parent-child function calls relationships) and classic tasks are eagerly scheduled.\n\nIn the literature, it is also called:\n- Stream parallelism\n- Pipeline parallelism\n- Graph parallelism\n- Data-driven task parallelism\n\nTagged experimental as the API and its implementation are unique\ncompared to other libraries/language-extensions. Feedback welcome.\n\nNo specific ordering is required between calling the event producer and its consumer(s).\n\nDependencies are expressed by a handle called `FlowEvent`.\nAn flow event can express either a single dependency, initialized with `newFlowEvent()`\nor a dependencies on parallel for loop iterations, initialized with `newFlowEvent(start, exclusiveStop, stride)`\n\nTo await on a single event pass it to `spawnOnEvent` or the `parallelFor` invocation.\nTo await on an iteration, pass a tuple:\n- `(FlowEvent, 0)` to await precisely and only for iteration 0. 
This works with both `spawnOnEvent` or `parallelFor` (via a dependsOnEvent statement)\n- `(FlowEvent, loop_index_variable)` to await on a whole iteration range.\n  For example\n  ```Nim\n  parallelFor i in 0 ..\u003c n:\n    dependsOnEvent: (e, i) # Each \"i\" will independently depends on their matching event\n    body\n  ```\n  This only works with `parallelFor`. The `FlowEvent` iteration domain and the `parallelFor` domain must be the same. As soon as a subset of the pledge is ready, the corresponding `parallelFor` tasks will be scheduled.\n\n#### Delayed computation with single dependencies\n\n```Nim\nimport weave\n\nproc echoA(eA: FlowEvent) =\n  echo \"Display A, sleep 1s, create parallel streams 1 and 2\"\n  sleep(1000)\n  eA.trigger()\n\nproc echoB1(eB1: FlowEvent) =\n  echo \"Display B1, sleep 1s\"\n  sleep(1000)\n  eB1.trigger()\n\nproc echoB2() =\n  echo \"Display B2, exit stream\"\n\nproc echoC1() =\n  echo \"Display C1, exit stream\"\n\nproc main() =\n  echo \"Dataflow parallelism with single dependency\"\n  init(Weave)\n  let eA = newFlowEvent()\n  let eB1 = newFlowEvent()\n  spawnOnEvent eB1, echoC1()\n  spawnOnEvent eA, echoB2()\n  spawnOnEvent eA, echoB1(eB1)\n  spawn echoA(eA)\n  exit(Weave)\n\nmain()\n```\n\n#### Delayed computation with multiple dependencies\n\n```Nim\nimport weave\n\nproc echoA(eA: FlowEvent) =\n  echo \"Display A, sleep 1s, create parallel streams 1 and 2\"\n  sleep(1000)\n  eA.trigger()\n\nproc echoB1(eB1: FlowEvent) =\n  echo \"Display B1, sleep 1s\"\n  sleep(1000)\n  eB1.trigger()\n\nproc echoB2(eB2: FlowEvent) =\n  echo \"Display B2, no sleep\"\n  eB2.trigger()\n\nproc echoC12() =\n  echo \"Display C12, exit stream\"\n\nproc main() =\n  echo \"Dataflow parallelism with multiple dependencies\"\n  init(Weave)\n  let eA = newFlowEvent()\n  let eB1 = newFlowEvent()\n  let eB2 = newFlowEvent()\n  spawnOnEvents eB1, eB2, echoC12()\n  spawnOnEvent eA, echoB2(eB2)\n  spawnOnEvent eA, echoB1(eB1)\n  spawn echoA(eA)\n  
exit(Weave)\n\nmain()\n```\n\n#### Delayed loop computation\n\nYou can combine data parallelism and dataflow parallelism.\n\nCurrently parallel loops only support one dependency (single, fixed iteration or range iteration).\n\nHere is an example with a range iteration dependency. _Note: sleeping threads are unresponsive, meaning a sleeping thread cannot schedule other ready tasks._\n\n```Nim\nimport weave\n\nproc main() =\n  init(Weave)\n\n  let eA = newFlowEvent(0, 10, 1)\n  let pB = newFlowEvent(0, 10, 1)\n\n  parallelFor i in 0 ..\u003c 10:\n    captures: {eA}\n    sleep(i * 10)\n    eA.trigger(i)\n    echo \"Step A - stream \", i, \" at \", i * 10, \" ms\"\n\n  parallelFor i in 0 ..\u003c 10:\n    dependsOn: (eA, i)\n    captures: {pB}\n    sleep(i * 10)\n    pB.trigger(i)\n    echo \"Step B - stream \", i, \" at \", 2 * i * 10, \" ms\"\n\n  parallelFor i in 0 ..\u003c 10:\n    dependsOn: (pB, i)\n    sleep(i * 10)\n    echo \"Step C - stream \", i, \" at \", 3 * i * 10, \" ms\"\n\n  exit(Weave)\n\nmain()\n```\n\n### Lazy Allocation of Flowvars\n\nFlowvars can be lazily allocated. This reduces overhead by at least 2x on very fine-grained tasks like Fibonacci or Depth-First-Search that may spawn trillions of tasks in less than\na couple hundred milliseconds. This can be enabled with `-d:WV_LazyFlowvar`.\n\n⚠️ This only works for Flowvars of a size up to your machine word size (int64, float64, pointer on 64-bit machines)\n⚠️ Flowvars cannot be returned in that mode, you will at best trigger stack smashing protection or crash\n\n## Limitations\n\nWeave has not been tested with GC-ed types. 
Pass a pointer around or use Nim channels which are GC-aware.\nIf it works, a heads-up would be valuable.\n\nThis might improve with Nim ARC/newruntime.\n\n## Statistics\n\nCurious minds can access the low-level runtime statistics with the flag `-d:WV_metrics`\nwhich reports the number of tasks executed, steal requests sent, etc.\n\nVery curious minds can also enable high resolution timers with `-d:WV_metrics -d:WV_profile -d:CpuFreqMhz=3000` assuming you have a 3GHz CPU.\n\nThe timers report, in this order:\n```\nTime spent running tasks, Time spent recv/send steal requests, Time spent recv/send tasks, Time spent caching tasks, Time spent idle, Total\n```\n\n## Tuning\n\nA number of configuration options are available in [weave/config.nim](weave/config.nim).\n\nIn particular:\n- `-d:WV_StealAdaptativeInterval=25` defines the number of steal requests after which thieves reevaluate their steal strategy (steal one task or steal half the victim's tasks). Default: 25\n- `-d:WV_StealEarly=0` allows workers to steal early, when only `WV_StealEarly` tasks are left in their queue. 
Default: don't steal early\n\n## Unique features\n\nWeave provides a unique scheduler with the following properties:\n- Message-Passing based:\n  unlike alternative work-stealing schedulers, this means that Weave is usable\n  on any architecture where message queues, channels or locks are available and not only atomics.\n  Architectures without atomics include distributed clusters or non-cache coherent processors\n  like the Cell Broadband Engine (for the PS3) that favors Direct Memory Access (DMA),\n  the many-core mesh Tile CPU from Mellanox (EzChip/Tilera) with 64 to 100 ARM cores,\n  or the network-on-chip (NOC) CPU Epiphany V from Adapteva with 1024 cores,\n  or the research CPU Intel SCC.\n- Scalable:\n  As the number of cores in computers is growing steadily, developers need to find new avenues of parallelism\n  to exploit them.\n  Unfortunately existing frameworks require a computation to take at least 10000 cycles (Intel TBB),\n  which corresponds to 3.33 µs on a 3 GHz CPU, to amortize the cost of scheduling.\n  This burdens developers with questions of grain size, heuristics on distributing parallel loops\n  for the common case and mis-scheduling on recursive tree algorithms with potentially very low compute-intensive leaves.\n  - Weave uses an adaptative work-stealing scheduler that adapts its stealing strategy depending\n    on each core load and the intensity of tasks.\n    Small tasks will be packaged into chunks to amortize scheduling overhead.\n  - Weave also uses an adaptative lazy loop splitting strategy.\n    Loops will only be split when needed. 
There is no partitioning or grain size issue,\n    and no need to estimate whether the workload is memory-bound or compute-bound; see [PyTorch OpenMP woes on parallel map](https://github.com/zy97140/omp-benchmark-for-pytorch).\n  - Weave aims for efficient multicore scaling for very fine-grained tasks starting from the 2000 cycles range upward (0.67 µs on 3GHz).\n- Fast and low-overhead:\n  While the number of cores has been growing steadily, many programs\n  are now hitting the limit of memory bandwidth and require tuning allocators,\n  cache lines, CPU caches.\n  Enormous care has been given to optimize Weave to keep it very low-overhead.\n  Weave uses efficient memory allocation and caches to avoid stressing\n  the system allocator and prevent memory fragmentation.\n  Soon, a thread-safe caching system that can release memory to the OS will be added\n  to prevent reserving memory for a long time.\n- Ergonomic and composable:\n  The Weave API is based on futures similar to async/await for concurrency.\n  The task dependency graph is implicitly built when awaiting a result.\n  An OpenMP-syntax is planned.\n\nThe \"Project Picasso\" RFC is available for discussion in [Nim RFC #160](https://github.com/nim-lang/RFCs/issues/160)\nor in the (potentially outdated) [picasso_RFC.md](Weave_RFC.md) file.\n\n## Research\n\nWeave is based on the research by [Andreas Prell](https://github.com/aprell/).\nYou can read his [PhD Thesis](https://epub.uni-bayreuth.de/2990) or access his [C implementation](https://github.com/aprell/tasking-2.0).\n\nSeveral enhancements were built into Weave, in particular:\n\n- Memory management was carefully studied to allow releasing memory to the OS\n  while still providing very high performance and solving the decades-old cactus stack problem.\n  The solution, coupling a threadsafe memory pool with a lookaside buffer, is\n  inspired by Microsoft's Mimalloc and Snmalloc, a message-passing based allocator (also by Microsoft). 
Details are provided in the Markdown files in the [memory folder](weave/memory).\n- The channels were reworked to not use locks. In particular the MPSC channel (Multi-Producer Single-Consumer) supports batching for both producers and consumers without any lock.\n\n## License\n\nLicensed and distributed under either of\n\n* MIT license: [LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT\n\nor\n\n* Apache License, Version 2.0, ([LICENSE-APACHEv2](LICENSE-APACHEv2) or http://www.apache.org/licenses/LICENSE-2.0)\n\nat your option. These files may not be copied, modified, or distributed except according to those terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Fweave","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmratsim%2Fweave","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Fweave/lists"}