# Experimental parallel and concurrent OCaml
**NOTE**: There are multiple approaches implemented in this repository. See
the different [branches](https://github.com/polytypic/par-ml/branches/all).
This particular approach uses the DCYL work-stealing deque. This gives
reasonable overhead and good parallelization of the last few work items.
Key differences between the
[lockfree](https://github.com/ocaml-multicore/lockfree) work-stealing deque
and the version implemented [here](src/main/DCYL.ml):
- Padding is added to long-lived objects to avoid false sharing; see the
  [`multicore-magic`](https://github.com/polytypic/multicore-magic) library.
- Fewer atomic variables and fenced operations are used (closer to the
  [original paper](https://www.semanticscholar.org/paper/Dynamic-circular-work-stealing-deque-Chase-Lev/f856a996e7aec0ea6db55e9247a00a01cb695090)).
- A level of (pointer) indirection is avoided by using a different technique to
release stolen elements (look for `clear`).
- [`mark` and `drop_at` operations](src/main/DCYL.mli) are provided to allow
  the owner to remove elements from the deque without dropping to the main
  loop (a sketch of such an interface follows this list).
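The following is a minimal sketch, based only on the description above, of
what a `mark`/`drop_at` style interface could look like. The names and types
are assumptions for illustration, not the actual contents of
[DCYL.mli](src/main/DCYL.mli):

```ocaml
(* A hedged sketch of a mark/drop_at style work-stealing deque
   interface; the names and types are assumptions, not the actual
   DCYL.mli. *)
module type WS_DEQUE = sig
  type 'a t

  (* A position in the deque as recorded by the owner. *)
  type pos

  val make : unit -> 'a t

  (* Owner end: push and pop work at the bottom of the deque. *)
  val push : 'a t -> 'a -> unit
  val pop : 'a t -> 'a option

  (* Record the current bottom so that items pushed after this point
     can later be dropped without executing them. *)
  val mark : 'a t -> pos

  (* Drop not-yet-stolen items pushed at or after [pos], letting the
     owner discard stale work without returning to the scheduler's
     main loop. *)
  val drop_at : 'a t -> pos -> unit

  (* Thief end: steal work from the top of the deque. *)
  val steal : 'a t -> 'a option
end
```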
Key differences between the worker pool of
[domainslib](https://github.com/ocaml-multicore/domainslib) and the approach
implemented [here](src/main/Par.ml):
- A more general `Suspend` effect is used to allow synchronization primitives
  to be built on top of the scheduler (see the sketches after this list).
- The pool of workers is not exposed. The idea is that there is only one
  system-level pool of workers to be used by all parallel and concurrent
  code.
- `Domain.self ()` is used as an index for efficient per-domain storage. The
  domain ids are assumed to be consecutive numbers in the range `[0, n[`.
- A lower overhead [`par`](src/main/Par.mli) operation is provided for parallel
execution.
- Work items are defunctionalized, replacing a closure with an existential
  packed inside an atomic (also sketched after this list).
- A simple parking/wake-up mechanism, based on a `Mutex`, a `Condition`
  variable, and a shared non-atomic flag (look for `num_waiters_non_zero`),
  is used.
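The `Suspend` effect can be pictured as follows. This is a minimal sketch
assuming OCaml 5 effect handlers; the definitions are illustrative and not
the actual ones in [Par.ml](src/main/Par.ml):

```ocaml
open Effect
open Effect.Deep

(* The suspender receives a resume function to be called exactly once,
   possibly from another domain, when the fiber may continue. *)
type _ Effect.t += Suspend : (('a -> unit) -> unit) -> 'a Effect.t

let suspend f = perform (Suspend f)

(* A toy handler that services [Suspend] by handing a resume function
   for the captured continuation to the suspender.  A real scheduler
   would instead park the fiber and resume it through its
   work-stealing machinery. *)
let run (main : unit -> unit) : unit =
  match_with main ()
    { retc = Fun.id;
      exnc = raise;
      effc =
        (fun (type b) (eff : b Effect.t) ->
          match eff with
          | Suspend register ->
            Some (fun (k : (b, unit) continuation) -> register (continue k))
          | _ -> None) }
```

A synchronization primitive such as a one-shot ivar can then be written
against `suspend` alone: its blocking read calls `suspend` with a function
that either resumes immediately, if the value is already present, or stores
the resume function for the eventual writer.

Defunctionalized work items can similarly be pictured as an existential
packed inside an atomic. Again a hedged sketch, not the repository's actual
definitions:

```ocaml
(* Instead of storing a closure, store the payload and the code
   separately, packed as an existential. *)
type work = Nop | Work : 'a * ('a -> unit) -> work

(* Workers claim an item by atomically exchanging the cell's contents,
   so each item runs at most once. *)
let run_item (cell : work Atomic.t) =
  match Atomic.exchange cell Nop with
  | Nop -> ()
  | Work (x, f) -> f x
```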
It would seem that the ability to drop work items from the owned deque, which
avoids the accumulation of stale work items, combined with the ability to
avoid capturing continuations, can provide major performance benefits
(roughly 5x) in the cases where it applies. The other optimizations provide
smaller benefits (combining to roughly 2x).
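The `par` case can be made concrete with `fib`. A hedged sketch, assuming an
interface along the lines of `val par : (unit -> 'a) -> (unit -> 'b) -> 'a * 'b`
in [Par.mli](src/main/Par.mli); the actual interface may differ:

```ocaml
(* Hedged sketch: assumes Par.par runs the two thunks, potentially in
   parallel, and returns both results.  When the second thunk is not
   stolen, the owner can run it itself, without capturing a
   continuation, and drop the now-stale work item from its deque. *)
let rec fib n =
  if n < 2 then n
  else
    let a, b = Par.par (fun () -> fib (n - 2)) (fun () -> fib (n - 1)) in
    a + b
```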
Avoiding false sharing is crucial for stable performance.
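For instance, long-lived shared state can be copied into padded memory at
allocation time. A hedged sketch, assuming `multicore-magic` provides
`copy_as_padded : 'a -> 'a` (check the library itself for its actual API):

```ocaml
(* Hedged sketch: allocate long-lived, contended state so that it does
   not share cache lines with its neighbors. *)
let num_waiters : int Atomic.t =
  Multicore_magic.copy_as_padded (Atomic.make 0)
```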
TODO:
- Support for cancellation.
- `sleep` mechanism.
- Composable synchronization primitives (e.g. ability to `race` fibers).
- Various synchronization primitives (mutex, condition, ...) as examples.
## Benchmarks to be taken with plenty of salt
These benchmarks were run on an Apple M1 with 4 + 4 cores (in normal mode).
> As an aside, let's assume cache size differences do not matter. As the Apple
> M1 has 4 performance cores at 3228 MHz and 4 efficiency cores at 2064 MHz,
> one could estimate that the best possible parallel speed-up, relative to a
> single performance core, would be (4 \* 3228 + 4 \* 2064) / 3228, or
> roughly 6.5.
```sh
➜ P=FibFiber.exe; N=37; hyperfine --warmup 1 --shell none "$P 1 $N" "$P 2 $N" "$P 4 $N" "$P 8 $N"
Benchmark 1: FibFiber.exe 1 37
Time (mean ± σ): 1.179 s ± 0.004 s [User: 1.175 s, System: 0.004 s]
Range (min … max): 1.174 s … 1.184 s 10 runs
Benchmark 2: FibFiber.exe 2 37
Time (mean ± σ): 642.4 ms ± 0.9 ms [User: 1271.2 ms, System: 3.9 ms]
Range (min … max): 641.0 ms … 644.0 ms 10 runs
Benchmark 3: FibFiber.exe 4 37
Time (mean ± σ): 344.3 ms ± 0.8 ms [User: 1336.9 ms, System: 7.5 ms]
Range (min … max): 343.2 ms … 345.6 ms 10 runs
Benchmark 4: FibFiber.exe 8 37
Time (mean ± σ): 412.5 ms ± 13.6 ms [User: 2697.6 ms, System: 85.2 ms]
Range (min … max): 390.1 ms … 432.3 ms 10 runs
Summary
'FibFiber.exe 4 37' ran
1.20 ± 0.04 times faster than 'FibFiber.exe 8 37'
1.87 ± 0.01 times faster than 'FibFiber.exe 2 37'
3.43 ± 0.01 times faster than 'FibFiber.exe 1 37'
```
```sh
➜ P=FibPar.exe; N=37; hyperfine --warmup 1 --shell none "$P 1 $N" "$P 2 $N" "$P 4 $N" "$P 8 $N"
Benchmark 1: FibPar.exe 1 37
Time (mean ± σ): 872.9 ms ± 1.4 ms [User: 869.5 ms, System: 2.8 ms]
Range (min … max): 870.0 ms … 875.4 ms 10 runs
Benchmark 2: FibPar.exe 2 37
Time (mean ± σ): 517.4 ms ± 2.7 ms [User: 1021.1 ms, System: 4.0 ms]
Range (min … max): 513.9 ms … 522.3 ms 10 runs
Benchmark 3: FibPar.exe 4 37
Time (mean ± σ): 278.9 ms ± 1.5 ms [User: 1075.9 ms, System: 7.4 ms]
Range (min … max): 277.1 ms … 281.6 ms 10 runs
Benchmark 4: FibPar.exe 8 37
Time (mean ± σ): 432.4 ms ± 13.5 ms [User: 2734.2 ms, System: 91.0 ms]
Range (min … max): 410.5 ms … 451.2 ms 10 runs
Summary
'FibPar.exe 4 37' ran
1.55 ± 0.05 times faster than 'FibPar.exe 8 37'
1.86 ± 0.01 times faster than 'FibPar.exe 2 37'
3.13 ± 0.02 times faster than 'FibPar.exe 1 37'
```
In the following, the `fib_par` example of domainslib
```ocaml
(* [T] abbreviates domainslib's Task module. *)
module T = Domainslib.Task

let rec fib_par pool n =
if n < 2 then n
else
let b = T.async pool (fun _ -> fib_par pool (n - 1)) in
let a = fib_par pool (n - 2) in
a + T.await pool b
```
has been modified as shown above
- to not drop down to a sequential version (the intention is to measure
  overheads),
- to perform better (using fewer `async`/`await`s), and
- to give the same numerical result as the `par-ml` versions.
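For contrast, domainslib examples typically drop to a sequential `fib` below
a cutoff, roughly as in the hedged sketch below (the threshold and `fib_seq`
are illustrative). The modification above removes this fast path so that
scheduler overhead is paid at every level of the recursion:

```ocaml
module T = Domainslib.Task

let rec fib_seq n = if n < 2 then n else fib_seq (n - 1) + fib_seq (n - 2)

(* Below the cutoff no work items are created; the benchmarked version
   removes this fast path. *)
let rec fib_par pool n =
  if n < 20 then fib_seq n
  else
    let b = T.async pool (fun _ -> fib_par pool (n - 1)) in
    let a = fib_par pool (n - 2) in
    a + T.await pool b
```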
```sh
➜ P=fib_par.exe; N=37; hyperfine --warmup 1 --shell none "$P 1 $N" "$P 2 $N" "$P 4 $N" "$P 8 $N"
Benchmark 1: fib_par.exe 1 37
Time (mean ± σ): 7.101 s ± 0.027 s [User: 7.084 s, System: 0.017 s]
Range (min … max): 7.065 s … 7.172 s 10 runs
Benchmark 2: fib_par.exe 2 37
Time (mean ± σ): 4.647 s ± 0.038 s [User: 9.264 s, System: 0.016 s]
Range (min … max): 4.610 s … 4.712 s 10 runs
Benchmark 3: fib_par.exe 4 37
Time (mean ± σ): 3.095 s ± 0.062 s [User: 12.309 s, System: 0.018 s]
Range (min … max): 3.028 s … 3.205 s 10 runs
Benchmark 4: fib_par.exe 8 37
Time (mean ± σ): 4.950 s ± 0.053 s [User: 36.023 s, System: 0.269 s]
Range (min … max): 4.852 s … 5.040 s 10 runs
Summary
'fib_par.exe 4 37' ran
1.50 ± 0.03 times faster than 'fib_par.exe 2 37'
1.60 ± 0.04 times faster than 'fib_par.exe 8 37'
2.29 ± 0.05 times faster than 'fib_par.exe 1 37'
```