{"id":19115407,"url":"https://github.com/polytypic/par-ml","last_synced_at":"2025-06-24T10:38:06.428Z","repository":{"id":78838732,"uuid":"580673882","full_name":"polytypic/par-ml","owner":"polytypic","description":"Experimental parallel and concurrent OCaml","archived":false,"fork":false,"pushed_at":"2023-01-28T08:51:12.000Z","size":415,"stargazers_count":14,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-30T23:03:20.334Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/polytypic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-12-21T06:39:29.000Z","updated_at":"2024-04-10T12:23:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"d0208cc2-e072-4569-9e3c-5d696aff53da","html_url":"https://github.com/polytypic/par-ml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/polytypic/par-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polytypic%2Fpar-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polytypic%2Fpar-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polytypic%2Fpar-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polytypic%2Fpar-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/polytypic","download_url":"https://codeload.github.com/polytypi
c/par-ml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/polytypic%2Fpar-ml/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261653244,"owners_count":23190420,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T04:46:18.343Z","updated_at":"2025-06-24T10:38:06.403Z","avatar_url":"https://github.com/polytypic.png","language":"OCaml","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Experimental parallel and concurrent OCaml\n\n_*NOTE*_: There are multiple different approaches implemented in this\nrepository. See the different\n[branches](https://github.com/polytypic/par-ml/branches/all).\n\nThis particular approach uses the DCYL work-stealing deque. 
This gives\nreasonable overhead and good parallelization of the last few work items.\n\nKey differences between the\n[lockfree](https://github.com/ocaml-multicore/lockfree) work-stealing deque and\nthe version implemented [here](src/main/DCYL.ml):\n\n- Padding is added, see the\n  [`multicore-magic`](https://github.com/polytypic/multicore-magic) library, to\n  long-lived objects to avoid false-sharing.\n- Fewer atomic variables and fenced operations are used (closer to the\n  [original paper](https://www.semanticscholar.org/paper/Dynamic-circular-work-stealing-deque-Chase-Lev/f856a996e7aec0ea6db55e9247a00a01cb695090)).\n- A level of (pointer) indirection is avoided by using a different technique to\n  release stolen elements (look for `clear`).\n- [`mark` and `drop_at` operations](src/main/DCYL.mli) are provided to allow\n  the owner to remove elements from the deque without dropping to the main loop.\n\nKey differences between the worker pool of\n[domainslib](https://github.com/ocaml-multicore/domainslib) and the approach\nimplemented [here](src/main/Par.ml):\n\n- A more general `Suspend` effect is used to allow synchronization primitives to\n  be built on top of the scheduler.\n- The pool of workers is not exposed. The idea is that there is only one\n  system-level pool of workers to be used by all parallel and concurrent code.\n  - `Domain.self ()` is used as an index for efficient per-domain storage. 
The\n    domain ids are assumed to be consecutive numbers in the range `[0, n[`.\n- A lower-overhead [`par`](src/main/Par.mli) operation is provided for parallel\n  execution.\n- Work items are defunctionalized, replacing a closure with an existential inside\n  an atomic.\n- A simple parking/wake-up mechanism using a `Mutex`, a `Condition` variable,\n  and a shared non-atomic flag (look for `num_waiters_non_zero`) is used.\n\nIt would seem that the ability to drop work items from the owned deque, and\nthereby avoid accumulation of stale work items, and, at the same time, the ability\nto avoid capturing continuations, can provide major performance benefits\n(roughly 5x) in cases where they apply. Other optimizations provide small\nbenefits (combining to roughly 2x).\n\nAvoiding false-sharing is crucial for stable performance.\n\nTODO:\n\n- Support for cancellation.\n- `sleep` mechanism.\n- Composable synchronization primitives (e.g. ability to `race` fibers).\n- Various synchronization primitives (mutex, condition, ...) as examples.\n\n## Benchmarks to be taken with plenty of salt\n\nThese have been run on an Apple M1 with 4 + 4 cores (in normal mode).\n\n\u003e As an aside, let's assume cache size differences do not matter. 
As Apple M1\n\u003e has 4 cores at 3228 MHz and 4 cores at 2064 MHz, one could estimate that the\n\u003e best possible parallel speed up would be (4 \\* (3228 + 2064)) / 3228 or\n\u003e roughly 6.5.\n\n```sh\n➜  P=FibFiber.exe; N=37; hyperfine --warmup 1 --shell none \"$P 1 $N\" \"$P 2 $N\" \"$P 4 $N\" \"$P 8 $N\"\nBenchmark 1: FibFiber.exe 1 37\n  Time (mean ± σ):      1.179 s ±  0.004 s    [User: 1.175 s, System: 0.004 s]\n  Range (min … max):    1.174 s …  1.184 s    10 runs\n\nBenchmark 2: FibFiber.exe 2 37\n  Time (mean ± σ):     642.4 ms ±   0.9 ms    [User: 1271.2 ms, System: 3.9 ms]\n  Range (min … max):   641.0 ms … 644.0 ms    10 runs\n\nBenchmark 3: FibFiber.exe 4 37\n  Time (mean ± σ):     344.3 ms ±   0.8 ms    [User: 1336.9 ms, System: 7.5 ms]\n  Range (min … max):   343.2 ms … 345.6 ms    10 runs\n\nBenchmark 4: FibFiber.exe 8 37\n  Time (mean ± σ):     412.5 ms ±  13.6 ms    [User: 2697.6 ms, System: 85.2 ms]\n  Range (min … max):   390.1 ms … 432.3 ms    10 runs\n\nSummary\n  'FibFiber.exe 4 37' ran\n    1.20 ± 0.04 times faster than 'FibFiber.exe 8 37'\n    1.87 ± 0.01 times faster than 'FibFiber.exe 2 37'\n    3.43 ± 0.01 times faster than 'FibFiber.exe 1 37'\n```\n\n```sh\n➜  P=FibPar.exe; N=37; hyperfine --warmup 1 --shell none \"$P 1 $N\" \"$P 2 $N\" \"$P 4 $N\" \"$P 8 $N\"\nBenchmark 1: FibPar.exe 1 37\n  Time (mean ± σ):     872.9 ms ±   1.4 ms    [User: 869.5 ms, System: 2.8 ms]\n  Range (min … max):   870.0 ms … 875.4 ms    10 runs\n\nBenchmark 2: FibPar.exe 2 37\n  Time (mean ± σ):     517.4 ms ±   2.7 ms    [User: 1021.1 ms, System: 4.0 ms]\n  Range (min … max):   513.9 ms … 522.3 ms    10 runs\n\nBenchmark 3: FibPar.exe 4 37\n  Time (mean ± σ):     278.9 ms ±   1.5 ms    [User: 1075.9 ms, System: 7.4 ms]\n  Range (min … max):   277.1 ms … 281.6 ms    10 runs\n\nBenchmark 4: FibPar.exe 8 37\n  Time (mean ± σ):     432.4 ms ±  13.5 ms    [User: 2734.2 ms, System: 91.0 ms]\n  Range (min … max):   410.5 ms … 451.2 ms    10 runs\n\nSummary\n 
 'FibPar.exe 4 37' ran\n    1.55 ± 0.05 times faster than 'FibPar.exe 8 37'\n    1.86 ± 0.01 times faster than 'FibPar.exe 2 37'\n    3.13 ± 0.02 times faster than 'FibPar.exe 1 37'\n```\n\nIn the following, the `fib_par` example of domainslib\n\n```ocaml\nlet rec fib_par pool n =\n  if n \u003c 2 then n\n  else\n    let b = T.async pool (fun _ -\u003e fib_par pool (n - 1)) in\n    let a = fib_par pool (n - 2) in\n    a + T.await pool b\n```\n\nhas been modified as above\n\n- to not drop down to a sequential version (the intention is to measure overheads),\n- to perform better (using fewer `async`/`await`s), and\n- to give the same numerical result as the `par-ml` versions.\n\n```sh\n➜  P=fib_par.exe; N=37; hyperfine --warmup 1 --shell none \"$P 1 $N\" \"$P 2 $N\" \"$P 4 $N\" \"$P 8 $N\"\nBenchmark 1: fib_par.exe 1 37\n  Time (mean ± σ):      7.101 s ±  0.027 s    [User: 7.084 s, System: 0.017 s]\n  Range (min … max):    7.065 s …  7.172 s    10 runs\n\nBenchmark 2: fib_par.exe 2 37\n  Time (mean ± σ):      4.647 s ±  0.038 s    [User: 9.264 s, System: 0.016 s]\n  Range (min … max):    4.610 s …  4.712 s    10 runs\n\nBenchmark 3: fib_par.exe 4 37\n  Time (mean ± σ):      3.095 s ±  0.062 s    [User: 12.309 s, System: 0.018 s]\n  Range (min … max):    3.028 s …  3.205 s    10 runs\n\nBenchmark 4: fib_par.exe 8 37\n  Time (mean ± σ):      4.950 s ±  0.053 s    [User: 36.023 s, System: 0.269 s]\n  Range (min … max):    4.852 s …  5.040 s    10 runs\n\nSummary\n  'fib_par.exe 4 37' ran\n    1.50 ± 0.03 times faster than 'fib_par.exe 2 37'\n    1.60 ± 0.04 times faster than 'fib_par.exe 8 37'\n    2.29 ± 0.05 times faster than 'fib_par.exe 1 37'\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolytypic%2Fpar-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpolytypic%2Fpar-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolytypic%2Fpar-ml/lists"}