{"id":13726222,"url":"https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml","last_synced_at":"2026-03-02T20:36:11.731Z","repository":{"id":44746278,"uuid":"276060993","full_name":"ocaml-multicore/parallel-programming-in-multicore-ocaml","owner":"ocaml-multicore","description":"Tutorial on Multicore OCaml parallel programming with domainslib","archived":false,"fork":false,"pushed_at":"2024-03-12T03:54:04.000Z","size":817,"stargazers_count":288,"open_issues_count":6,"forks_count":7,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-20T06:05:52.130Z","etag":null,"topics":["multicore","ocaml"],"latest_commit_sha":null,"homepage":"","language":"OCaml","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ocaml-multicore.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-06-30T09:54:26.000Z","updated_at":"2025-01-16T18:58:44.000Z","dependencies_parsed_at":"2024-02-01T20:07:36.547Z","dependency_job_id":"b58a6ea2-3403-4c18-9c99-e6e5d86bbeb1","html_url":"https://github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ocaml-multicore/parallel-programming-in-multicore-ocaml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-multicore%2Fparallel-programming-in-multicore-ocaml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-multicore%2Fparallel-programming-in-multicore-ocaml/tags","releases_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/ocaml-multicore%2Fparallel-programming-in-multicore-ocaml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-multicore%2Fparallel-programming-in-multicore-ocaml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ocaml-multicore","download_url":"https://codeload.github.com/ocaml-multicore/parallel-programming-in-multicore-ocaml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ocaml-multicore%2Fparallel-programming-in-multicore-ocaml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30018584,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T17:00:27.440Z","status":"ssl_error","status_checked_at":"2026-03-02T17:00:03.402Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multicore","ocaml"],"created_at":"2024-08-03T01:02:56.218Z","updated_at":"2026-03-02T20:36:11.697Z","avatar_url":"https://github.com/ocaml-multicore.png","language":"OCaml","funding_links":[],"categories":["OCaml","Resources","Parallelism"],"sub_categories":["Wiki"],"readme":"# Parallel Programming in Multicore OCaml\n\nThis tutorial will help you get started with writing parallel programs in\nMulticore OCaml. 
All the code examples along with their corresponding **dune** file\nare available in the `code/` directory. The tutorial is organised into the\nfollowing sections:\n\n- [Introduction](#introduction)\n  * [Installation](#installation)\n  * [Compatibility with existing code](#compatibility-with-existing-code)\n- [Domains](#domains)\n- [Domainslib](#domainslib)\n  * [Task pool](#task-pool)\n  * [Parallel for](#parallel-for)\n  * [Async-Await](#async-await)\n    + [Fibonacci numbers in parallel](#fibonacci-numbers-in-parallel)\n- [Channels](#channels)\n  * [Bounded Channels](#bounded-channels)\n  * [Task Distribution using Channels](#task-distribution-using-channels)\n- [Profiling your code](#profiling-your-code)\n  * [Perf](#perf)\n  * [Eventlog](#eventlog)\n\n# Introduction\n\nMulticore OCaml is an extension of OCaml with native support for Shared-Memory\nParallelism (SMP) through `Domains` and Concurrency through `Algebraic Effects`.\nIt is merged to trunk OCaml. OCaml 5.0 will be the first release to officially\nsupport Multicore.\n\n**Concurrency** is how we partition multiple computations such that they can\nrun in overlapping time periods rather than strictly sequentially.\n**Parallelism** is the act of running multiple computations simultaneously,\nprimarily by using multiple cores on a multicore machine. The Multicore Wiki\nhas [comprehensive notes](https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes) on the design decisions and\ncurrent status of Concurrency and Parallelism in Multicore OCaml.\n\nThe Multicore OCaml compiler ships with a concurrent major and a stop-the-world\nminor *garbage collector* (GC). The parallel minor GC doesn't require any\nchanges to the C API, thereby not breaking any associated code with C API.\nOCaml 5.0 is expected to land with support for Shared-Memory Parallelism and\nAlgebraic Effects. A historical variant of the Multicore minor\ngarbage collector is the concurrent minor collector. 
Benchmarking experiments\nshowed better results in terms of throughput and latency on the stop-the-world\nparallel minor collector, hence that's chosen to be the default minor collector\non Multicore OCaml, and the concurrent minor collector is not actively developed.\nFor the intrigued, details on the design and evaluation of the Multicore GC and\ncompiler are in our\n[academic publications](https://github.com/ocaml-multicore/ocaml-multicore/wiki#articles).\n\nThe Multicore ecosystem also has the following libraries to complement the\ncompiler:\n\n* [**Domainslib**](https://github.com/ocaml-multicore/domainslib): data and\ncontrol structures for parallel programming\n* [**Eio**](https://github.com/ocaml-multicore/eio): effects-based direct-style IO for multicore OCaml\n* [**Saturn**](https://github.com/ocaml-multicore/saturn): [lock-free](https://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom) data\nstructures (list, hash, bag and queue)\n* [**Reagents**](https://github.com/ocaml-multicore/reagents): composable lock-free \nconcurrency library for expressing fine grained parallel programs on\nMulticore OCaml\n* [**Kcas**](https://github.com/ocaml-multicore/kcas): software\n  transactional memory (STM) implementation based on an atomic\n  lock-free multi-word compare-and-set (MCAS) algorithm\n\nFind ways to profitably write parallel programs in Multicore OCaml. The reader\nis assumed to be familiar with OCaml. If not, they are encouraged to read [Real\nWorld OCaml](https://dev.realworldocaml.org/toc.html). The effect handlers'\nstory is not covered here. 
For anyone interested, please check out this\n[tutorial](https://github.com/ocamllabs/ocaml-effects-tutorial) and some\n[examples](https://github.com/ocaml-multicore/effects-examples).\n\n## Installation\n\nInstructions to install the OCaml 5 compiler are [here](https://github.com/ocaml-multicore/awesome-multicore-ocaml#installation).\n\nIt will also be useful to install `utop` on your Multicore switch by running\n`opam install utop`, which should work out of the box.\n\n# Domains\n\nDomains are the basic unit of Parallelism in Multicore OCaml.\n\n```ocaml\nlet square n = n * n\n\nlet x = 5\nlet y = 10\n\nlet _ =\n  let d = Domain.spawn (fun _ -\u003e square x) in\n  let sy = square y in\n  let sx = Domain.join d in\n  Printf.printf \"x = %d, y = %d\\n\" sx sy\n```\n`Domain.spawn` creates a new domain that runs in parallel with the\ncurrent domain.\n\n`Domain.join d` blocks until the domain `d` runs to completion. If the domain\nreturns a result after its execution, `Domain.join d` also returns that value.\nIf it raises an uncaught exception, `Domain.join d` re-raises that exception. When the parent domain\nterminates, all other domains also terminate. 
To ensure that a domain runs to\ncompletion, we have to join the domain.\n\nNote that the square of x is computed in a new domain and that of y in the\nparent domain.\n\nIts corresponding **dune** file contains:\n\n```\n(executable\n  (name square_domain)\n  (modules square_domain))\n```\n\nMake sure to use a Multicore switch to build this and all other subsequent\nexamples in this tutorial.\n\nTo execute the code:\n\n```\n$ dune build square_domain.exe\n$ ./_build/default/square_domain.exe\nx = 25, y = 100\n```\n\nAs expected, the squares of x and y are 25 and 100.\n\n**Common Error Messages**\n\nSome common errors while compiling Multicore code are:\n\n```\nError: Unbound module Domain\n```\n\n```\nError: Unbound module Atomic\n```\n\n```\nError: Library \"domainslib\" not found.\n```\n\nThese errors usually mean that the compiler switch used is\nnot a Multicore switch. Using a Multicore compiler variant should resolve them\n(for the last one, also check that `domainslib` is installed in that switch).\n\n# Domainslib\n\n`Domainslib` is a parallel programming library for Multicore OCaml. It provides\nthe following APIs which enable easy ways to parallelise OCaml code with only a few\nmodifications to sequential code:\n\n* **Task**: Work stealing task pool with async/await Parallelism and `parallel_{for, scan}`\n* **Channels**: Multiple Producer Multiple Consumer channels which come in two flavours: bounded and unbounded\n\n`Domainslib` is effective in scaling performance when parallelisable\nworkloads are available.\n\n## Task.pool\n\nIn the **Domains** section, we saw how to run programs on multiple cores by\nspawning new domains. If we were to use that approach for executing code in\nparallel, we would often find ourselves spawning and joining new domains\nnumerous times in the same program. Creating new domains is an expensive operation, so \nwe should attempt to limit it when possible. `Task.pool` allows \nexecution of all parallel workloads in the same set of domains spawned at\nthe beginning of the program. 
Here is how it works:\n\nNote: If you are running this on `utop`, run `#require \"domainslib\"` (including the hash) before this.\n\n```ocaml\n# open Domainslib\n\n# let pool = Task.setup_pool ~num_domains:3 ()\nval pool : Task.pool = \u003cabstr\u003e\n```\nWe have created a new *task pool* with three new domains. The parent domain is\nalso part of this pool, thus making it a pool of four domains. After the pool is\nset up, we can use it to execute all tasks we want to run in parallel. The\n`setup_pool` function requires us to specify the number of new domains to be\nspawned in the task pool. Ideally, the number of domains in a task pool \nwill match the number of available cores. Since the parent domain also\ntakes part in the pool, the `num_domains` parameter should be one\nless than the number of available cores.\n\nAlthough not strictly necessary, we highly recommend closing the task pool \nafter execution of all tasks. This can be done as follows:\n\n```ocaml\n# Task.teardown_pool pool\n```\n\nThis deactivates the pool, so it's no longer usable. Make sure to do this only\nafter all tasks are done.\n\n## Parallel_for\n\nIn the Task API, a powerful primitive called `parallel_for` can be used to\nparallelise computations used in `for` loops. 
It scales well with very little\nchange to the sequential code.\n\nConsider the example of matrix multiplication.\n\nFirst, write the sequential version of a function which performs\nmatrix multiplication of two matrices and returns the result:\n\n```ocaml\nlet matrix_multiply a b =\n  let i_n = Array.length a in\n  let j_n = Array.length b.(0) in\n  let k_n = Array.length b in\n  let res = Array.make_matrix i_n j_n 0 in\n  for i = 0 to i_n - 1 do\n    for j = 0 to j_n - 1 do\n      for k = 0 to k_n - 1 do\n        res.(i).(j) \u003c- res.(i).(j) + a.(i).(k) * b.(k).(j)\n      done\n    done\n  done;\n  res\n```\n\nTo make this function run in parallel, one might be inclined to spawn a new\ndomain for every iteration in the loop, which would look like:\n\n```ocaml\n  let domains = Array.init i_n (fun i -\u003e\n    Domain.spawn(fun _ -\u003e\n      for j = 0 to j_n - 1 do\n        for k = 0 to k_n - 1 do\n          res.(i).(j) \u003c- res.(i).(j) + a.(i).(k) * b.(k).(j)\n        done\n      done)) in\n   Array.iter Domain.join domains\n```\nThis will be *disastrous* in terms of performance, mostly because \nspawning a new domain is an expensive operation. Alternatively, a task pool offers \na finite set of available domains that can be used to run your\ncomputations in parallel.\n\nArrays are usually more efficient compared with lists in Multicore OCaml. 
\nAlthough they are not generally favoured in functional\nprogramming, using arrays for the sake of efficiency is a reasonable trade-off.\n\nA better way to parallelise matrix multiplication is with the help of a\n`parallel_for`.\n\n```ocaml\nlet parallel_matrix_multiply pool a b =\n  let i_n = Array.length a in\n  let j_n = Array.length b.(0) in\n  let k_n = Array.length b in\n  let res = Array.make_matrix i_n j_n 0 in\n\n  Task.parallel_for pool ~start:0 ~finish:(i_n - 1) ~body:(fun i -\u003e\n    for j = 0 to j_n - 1 do\n      for k = 0 to k_n - 1 do\n        res.(i).(j) \u003c- res.(i).(j) + a.(i).(k) * b.(k).(j)\n      done\n    done);\n  res\n```\n\nObserve the main difference between the parallel and sequential\nversions: the parallel version takes an additional parameter `pool` because \nthe `parallel_for` executes the `for` loop on the domains present in\nthat task pool. While it is possible to initialise a task pool inside the\nfunction itself, it's always better to have a single task pool used across the\nentire program. As mentioned earlier, this is to minimise the cost involved in\nspawning a new domain. It's also possible to create a global task pool to use \nacross the program, but for the sake of reasoning better about your code, it's recommended \nto pass it as a function parameter.\n\nLet's examine the parameters of `parallel_for`. It takes:\n\n- `pool`, as discussed earlier\n- `start` and `finish`, which, as the names suggest, are the starting\nand ending values of the loop iterations\n- `body`, which contains the actual loop body to be executed\n\n`parallel_for` also has an optional parameter: `chunk_size`, which determines the\ngranularity of tasks when executing on multiple domains. If no value\nis given for `chunk_size`, a default chunk size is chosen that performs\nwell in most cases. Only if the default chunk size doesn't work well is it\nrecommended to experiment with different chunk sizes. 
The ideal `chunk_size`\ndepends on a combination of factors:\n\n* **Nature of the Loop:** There are two things to consider pertaining to the\nloop when deciding on a `chunk_size`: the *number of iterations* in the\nloop and the *amount of time* each iteration takes. If each iteration takes roughly the same amount of time, \nthen the `chunk_size` could be the number of\niterations divided by the number of cores. On the other hand, if the amount of\ntime taken is different for every iteration, the chunks should be smaller. If\nthe total number of iterations is a sizeable number, a `chunk_size` like 32 or\n16 is safe to use, whereas if the number of iterations is low, like 10, a\n`chunk_size` of 1 would perform best.\n\n* **Machine:** Optimal chunk size varies across machines, so it's recommended\nto experiment with a range of values to find out what works best on yours.\n\n### Speedup\n\nLet's find how the parallel matrix multiplication scales on multiple cores.\n\nThe speedup against the number of cores is listed below for input matrices of size 1024x1024:\n\n| Cores | Time (s) | Speedup     |\n|-------|----------|-------------|\n| 1     | 9.172    | 1           |\n| 2     | 4.692    | 1.954816709 |\n| 4     | 2.293    | 4           |\n| 8     | 1.196    | 7.668896321 |\n| 12    | 0.854    | 10.74004684 |\n| 16    | 0.76     | 12.06842105 |\n| 20    | 0.66     | 13.8969697  |\n| 24    | 0.587    | 15.62521295 |\n\n![matrix-graph](images/matrix_multiplication.png)\n\nWe've achieved a speedup of nearly 16 on 24 cores with the help of a `parallel_for`. It's very\nmuch possible to achieve near-linear speedups when parallelisable workloads are\navailable.\n\nNote that parallel code performance heavily depends on the machine. 
Some\nmachine settings specific to Linux systems for obtaining optimal results are\ndescribed [here](https://github.com/ocaml-bench/ocaml_bench_scripts#notes-on-hardware-and-os-settings-for-linux-benchmarking).\n\n### Properties and Caveats of `parallel_for`\n\n#### Implicit Barrier\n\nThe `parallel_for` has an implicit barrier, meaning any other tasks \nwaiting to be executed in the same pool will start only after all chunks\nin the `parallel_for` are complete, so we need not worry about creating and\ninserting barriers explicitly between two `parallel_for` loops (or some other\noperation) after a `parallel_for`. Consider this scenario: we have three\nmatrices `m1`, `m2`, and `m3`. We want to compute `(m1*m2) * m3`, where `*`\nindicates matrix multiplication. For the sake of simplicity, let's assume all\nthree are square matrices of the same size.\n\n```ocaml\nlet parallel_matrix_multiply_3 pool m1 m2 m3 =\n  let size = Array.length m1 in\n  let t = Array.make_matrix size size 0 in (* stores m1*m2 *)\n  let res = Array.make_matrix size size 0 in\n\n  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i -\u003e\n    for j = 0 to size - 1 do\n      for k = 0 to size - 1 do\n        t.(i).(j) \u003c- t.(i).(j) + m1.(i).(k) * m2.(k).(j)\n      done\n    done);\n\n  Task.parallel_for pool ~start:0 ~finish:(size - 1) ~body:(fun i -\u003e\n    for j = 0 to size - 1 do\n      for k = 0 to size - 1 do\n        res.(i).(j) \u003c- res.(i).(j) + t.(i).(k) * m3.(k).(j)\n      done\n    done);\n\n    res\n```\n\nIn a hypothetical situation where `parallel_for` didn't have an implicit\nbarrier, as in the example above, it's very likely that the computation of `res`\nwouldn't be correct. 
Since `parallel_for` has an implicit barrier, the code performs \nthe right computation.\n\n#### Order of Execution\n\n```\nfor i = start to finish do\n  \u003cbody\u003e\ndone\n```\n\nA sequential `for` loop, like the one above, runs its iterations in the exact\nsame order, from `start` to `finish`. However, with `parallel_for` the order of\nexecution is arbitrary and can vary between two runs of the exact same code. If\nthe iteration order is important for your code, it's\nadvisable to use `parallel_for` with some caution.\n\n#### Dependencies Within the Loop\n\nIf there are any dependencies within the loop, such as a current iteration\ndepending on the result of a previous iteration, odds are very high that the\ncorrectness of the code no longer holds if `parallel_for` is used. The Task API has\na primitive `parallel_scan` which might come in handy in scenarios like this.\n\n## Async-Await\n\nA `parallel_for` loop easily parallelises iterative tasks. *Async-Await* offers more\nflexibility to execute parallel tasks, which is especially useful in\nrecursive functions. Earlier we saw how to set up and tear down a task\npool. The Task API also has the facility to run specific tasks on a task pool.\n\n### Fibonacci Numbers in Parallel\n\nTo calculate Fibonacci numbers in parallel, first write a sequential function to calculate them. \nThe following is a naive Fibonacci function without tail-recursion:\n\n```ocaml\nlet rec fib n =\n  if n \u003c 2 then 1\n  else fib (n - 1) + fib (n - 2)\n```\n\nObserve that the two operations in the recursive case, `fib (n - 1)` and `fib (n - 2)`, \ndo not have any mutual dependencies, which makes it convenient to\ncompute them in parallel. Essentially, we can calculate `fib (n - 1)` and `fib (n - 2)` \nin parallel and then add the results to get the answer.\n\nDo this by spawning a new domain for performing the calculation and joining\nit to obtain the result. 
Be careful not to spawn more domains\nthan the number of cores available.\n\n```ocaml\nlet rec fib_par n d =\n  if d \u003c= 1 then fib n\n  else\n    (* spawn first, so that both branches actually run in parallel *)\n    let b = Domain.spawn (fun _ -\u003e fib_par (n-2) (d-1)) in\n    let a = fib_par (n-1) (d-1) in\n    a + Domain.join b\n```\nWe can also use task pools to execute tasks asynchronously, which is less tedious and scales better.\n\n```ocaml\nlet rec fib_par pool n =\n  if n \u003c= 40 then fib n\n  else\n    let a = Task.async pool (fun _ -\u003e fib_par pool (n-1)) in\n    let b = Task.async pool (fun _ -\u003e fib_par pool (n-2)) in\n    Task.await pool a + Task.await pool b\n```\n\nNote some differences from the sequential version of Fibonacci:\n\n* `pool` -\u003e an additional parameter, for the same reasons as in `parallel_for`\n\n* `if n \u003c= 40 then fib n` -\u003e when the input is at most 40, run the\nsequential `fib` function. When the input number is small enough, it's better \nto perform the calculations sequentially. We've taken `40` as the\nthreshold (above). Some experimentation would help find an acceptable \nthreshold, below which the computation can be performed sequentially.\n\n* `Task.async` and `Task.await` -\u003e used to run the tasks in parallel\n  + **Task.async** executes the task in the pool asynchronously and returns\n  a promise, a computation that is not yet complete. After the execution finishes, \n  its result will be stored in the promise.\n\n  + **Task.await** waits for the promise to complete its execution. Once it's \n  done, it returns the result of the task. In case the task raises an\n  uncaught exception, `await` also raises the same exception.\n\n\n# Channels\n\n## Bounded Channels\n\nChannels act as a medium to communicate data between domains and can be shared\nbetween multiple sending and receiving domains. Channels in Multicore OCaml\ncome in two flavours:\n\n* **Bounded**: buffered channels with a fixed size. 
A channel with the buffer size\n0 corresponds to a synchronised channel, and buffer size 1 gives the `MVar`\nstructure. Bounded channels can be created with any buffer size.\n\n* **Unbounded**: unbounded channels have no limit on the number of objects they\ncan hold, so they are only constrained by memory availability.\n\n```ocaml\nopen Domainslib\n\nlet c = Chan.make_bounded 0\n\nlet _ =\n  let send = Domain.spawn(fun _ -\u003e Chan.send c \"hello\") in\n  let msg =  Chan.recv c in\n  Domain.join send;\n  Printf.printf \"Message: %s\\n\" msg\n```\n\nIn the above example, we have a bounded channel `c` of size 0. Any `send` to the channel will be blocked \nuntil a corresponding receive (`recv`) is encountered. So, if we\nremove the `recv`, the program would be blocked indefinitely.\n\n```ocaml\nopen Domainslib\n\nlet c = Chan.make_bounded 0\n\nlet _ =\n  let send = Domain.spawn(fun _ -\u003e Chan.send c \"hello\") in\n  Domain.join send;\n```\n\nThe above example would block indefinitely because the `send`\ndoes not have a corresponding `recv`. If we instead create a bounded channel\nwith buffer size n, it can store up to [n] objects in the channel without a\ncorresponding receive, exceeding which the sending would block. We can try it\nwith the same example as above by changing the buffer size to 1:\n\n```ocaml\nopen Domainslib\n\nlet c = Chan.make_bounded 1\n\nlet _ =\n  let send = Domain.spawn(fun _ -\u003e Chan.send c \"hello\") in\n  Domain.join send;\n```\n\nNow the send will not block anymore.\n\nIf you don't want to block in `send` or `recv`, `send_poll` and `recv_poll` might\ncome in handy. 
They return a Boolean value, so if the operation was successful we\nget a `true`, otherwise a `false`.\n\n```ocaml\nopen Domainslib\n\nlet c = Chan.make_bounded 0\n\nlet _ =\n  let send = Domain.spawn(fun _ -\u003e\n          let b = Chan.send_poll c \"hello\" in\n          Printf.printf \"%B\\n\" b) in\n  Domain.join send;\n```\n\nHere the buffer size is 0 and the channel cannot hold any object, so this program\nprints a false.\n\nThe same channel may be shared by multiple sending and receiving domains.\n\n```ocaml\nopen Domainslib\n\nlet num_domains = try int_of_string Sys.argv.(1) with _ -\u003e 4\n\nlet c = Chan.make_bounded num_domains\n\nlet send c =\n  Printf.printf \"Sending from: %d\\n\" (Domain.self () :\u003e int);\n  Chan.send c \"howdy!\"\n\nlet recv c =\n  Printf.printf \"Receiving at: %d\\n\" (Domain.self () :\u003e int);\n  Chan.recv c |\u003e ignore\n\nlet _ =\n  let senders = Array.init num_domains\n                  (fun _ -\u003e Domain.spawn(fun _ -\u003e send c )) in\n  let receivers = Array.init num_domains\n                  (fun _ -\u003e Domain.spawn(fun _ -\u003e recv c)) in\n\n  Array.iter Domain.join senders;\n  Array.iter Domain.join receivers\n```\n\n`(Domain.self () :\u003e int)` returns the id of current domain.\n\n## Task Distribution Using Channels\n\nNow that we have some idea about how channels work, let's consider a more\nrealistic example by writing a generic task distributor that\nexecutes tasks on multiple domains:\n\n```ocaml\nmodule C = Domainslib.Chan\nlet num_domains = try int_of_string Sys.argv.(1) with _ -\u003e 4\nlet n = try int_of_string Sys.argv.(2) with _ -\u003e 100\n\ntype 'a message = Task of 'a | Quit\n\nlet c = C.make_unbounded ()\n\nlet create_work tasks =\n  Array.iter (fun t -\u003e C.send c (Task t)) tasks;\n  for _ = 1 to num_domains do\n    C.send c Quit\n  done\n\nlet rec worker f () =\n  match C.recv c with\n  | Task a -\u003e\n      f a;\n      worker f ()\n  | Quit -\u003e ()\n\nlet _ =\n  let tasks 
= Array.init n (fun i -\u003e i) in\n  create_work tasks ;\n  let factorial n =\n    let rec aux n acc =\n        if (n \u003e 0) then aux (n-1) (acc*n)\n        else acc in\n    aux n 1\n  in\n  let results = Array.make n 0 in\n  let update r i = r.(i) \u003c- factorial i in\n  let domains = Array.init (num_domains - 1)\n              (fun _ -\u003e Domain.spawn(worker (update results))) in\n  worker (update results) ();\n  Array.iter Domain.join domains;\n  Array.iter (Printf.printf \"%d \") results\n```\n\nWe have created an unbounded channel `c` which acts as a store for all \ntasks. We'll pay attention to two functions here: `create_work` and `worker`.\n\n`create_work` takes an array of tasks and pushes all task elements to the\nchannel `c`. The `worker` function receives tasks from the channel and executes\na function `f` with the received task as a parameter. It keeps repeating until it\nencounters a `Quit` message, which indicates `worker` can terminate.\n\nUse this template to run any task on multiple cores by running the\n`worker` function on all domains. This example runs a naive factorial\nfunction. The granularity of a task could also be tweaked by changing it in\nthe `worker` function. For instance, `worker` can run for a range of tasks instead\nof a single one.\n\n\n# Profiling Your Code\n\nWhile writing parallel programs in Multicore OCaml, it's quite common to\nencounter overheads that might deteriorate the code's performance. This\nsection describes ways to discover and fix those overheads. The Linux `perf` tool \nand the Multicore runtime's eventlog are particularly useful tools for\nperformance debugging. Let's see how with the help of an example:\n\n## Perf\n\nThe Linux `perf` tool has proven to be very useful when profiling Multicore\nOCaml code.\n\n**Profiling Serial Code**\n\nProfiling serial code can help identify parts of code that can potentially\nbenefit from parallelising. 
Let's do it for the sequential version of matrix\nmultiplication:\n\n```\nperf record --call-graph dwarf ./matrix_multiplication.exe 1024\n```\n\nThis results in a profile showing how much time is spent in the `matrix_multiply`\nfunction, which we wanted to parallelise. Remember, if a lot more time is spent \noutside the function we'd like to parallelise,\nthe maximum speedup possible to achieve would be lower.\n\nProfiling serial code can help reveal the hotspots where we might want to\nintroduce parallelism.\n\n```\nSamples: 51K of event 'cycles:u', Event count (approx.): 28590830181\n  Children      Self  Command     Shared Object     Symbol\n+   99.84%     0.00%  matmul.exe  matmul.exe        [.] caml_start_program\n+   99.84%     0.00%  matmul.exe  matmul.exe        [.] caml_program\n+   99.84%     0.00%  matmul.exe  matmul.exe        [.] camlDune__exe__Matmul__entry\n+   99.32%    99.31%  matmul.exe  matmul.exe        [.] camlDune__exe__Matmul__matrix_multiply_211\n+    0.57%     0.04%  matmul.exe  matmul.exe        [.] camlStdlib__array__init_104\n     0.47%     0.37%  matmul.exe  matmul.exe        [.] camlStdlib__random__intaux_278\n```\n\n\n\n### Overheads in Parallel Code\n\nLinux `perf` can be helpful when identifying overheads in parallel code, which can improve \nthe performance by removing overheads.\n\n**Parallel Initialisation of a Float Array with Random Numbers**\n\nArray initialisation using the standard library's `Array.init` is sequential.\nA program's parallel workloads scale according to the number of cores\nused, although the initialisation takes the same amount of time in all cases.\nThis might become a bottleneck for parallel workloads.\n\nFor float arrays, we have `Array.create_float` to create a fresh float\narray. Use this to allocate an array and perform the initialisation in\nparallel. 
Let's do the initialisation of a float array with random numbers in\nparallel.\n\n**Naive Implementation**\n\nBelow is a naive implementation that will initialise all array elements \nwith a random number:\n\n```ocaml\nopen Domainslib\n\nlet num_domains = try int_of_string Sys.argv.(1) with _ -\u003e 4\nlet n = try int_of_string Sys.argv.(2) with _ -\u003e 100000\nlet a = Array.create_float n\n\nlet _ =\n  let pool = Task.setup_pool ~num_domains:(num_domains - 1) () in\n  Task.run pool (fun () -\u003e Task.parallel_for pool ~start:0\n  ~finish:(n - 1) ~body:(fun i -\u003e Array.set a i (Random.float 1000.)));\n  Task.teardown_pool pool\n```\n\nMeasure how it scales:\n\n| #Cores | Time(s) |\n|--------|---------|\n| 1      | 3.136   |\n| 2      | 10.19   |\n| 4      | 11.815  |\n\nAlthough we expected to see a speedup when executing on multiple cores, the code \nactually slows down as the number of cores increases. Something is\nwrong with the code that isn't obvious at first glance.\n\nLet's profile the performance with the Linux `perf` profiler:\n\n```\n$ perf record ./_build/default/float_init_par.exe 4 100_000_000\n$ perf report\n```\n\nThe `perf` report would look something like this:\n\n![perf-report-1](images/perf_random_1.png)\n\nThe overhead at Random bits is a whopping 87.99%! Typically there's\nno single cause that we can attribute to such overheads, since they are very\nspecific to the program. It might need a little careful inspection to find out\nwhat is causing them. 
In this case, the Random module shares the same state\namongst all domains, which causes contention when multiple domains are\ntrying to access it simultaneously.\n\nTo overcome that, use a different state for every domain so there\nisn't any contention from a shared state.\n\n```ocaml\nmodule T = Domainslib.Task\nlet n = try int_of_string Sys.argv.(2) with _ -\u003e 1000\nlet num_domains = try int_of_string Sys.argv.(1) with _ -\u003e 4\n\nlet arr = Array.create_float n\n\nlet _ =\n  let domains = T.setup_pool ~num_domains:(num_domains - 1) () in\n  let states = Array.init num_domains (fun _ -\u003e Random.State.make_self_init()) in\n  T.run domains (fun () -\u003e T.parallel_for domains ~start:0 ~finish:(n-1)\n  ~body:(fun i -\u003e\n    let d = (Domain.self() :\u003e int) mod num_domains in\n    Array.unsafe_set arr i (Random.State.float states.(d) 100. )))\n```\n\nWe have created `num_domains` different Random States, each to be used by a different domain. This might come \nacross as a hack, but if it helps achieve better performance, there is no harm in using it, \nas long as the correctness is intact.\n\nLet's run this on multiple cores:\n\n| #Cores | Time(s) |\n|--------|---------|\n| 1      | 3.828   |\n| 2      | 3.641   |\n| 4      | 3.119   |\n\nThe times are not as bad as in the previous case, but they still aren't \nclose to what we expected. Here's the `perf` report:\n\n![perf-report-2](images/perf_random_2.png)\n\nThe overhead at Random bits is lower than in the previous case, but it's still\nquite high at 59.73%. We've used a separate Random State for every domain, so\nthe overheads aren't caused by any shared state; however, if we look closely, the\nRandom States are all allocated by the same domain in an array with a small\nnumber of elements, possibly located close to each other in physical memory.\nWhen multiple domains try to access them, they might be sharing cache\nlines, a situation known as *false sharing*. 
We can confirm our suspicion with the\nhelp of `perf c2c` on Intel machines:\n\n```\n$ perf c2c record _build/default/float_init_par2.exe 4 100_000_000\n$ perf c2c report\n\nShared Data Cache Line Table     (2 entries, sorted on Total HITMs)\n       ----------- Cacheline ----------    Total      Tot  ----- LLC Load Hitm -----  ---- Store Reference ----  --- Loa\nIndex             Address  Node  PA cnt  records     Hitm    Total      Lcl      Rmt    Total    L1Hit   L1Miss       Lc\n    0      0x7f2bf49d7dc0     0   11473    13008   94.23%     1306     1306        0     1560      595      965        ◆\n    1      0x7f2bf49a7b80     0     271      368    5.48%       76       76        0      123       76       47\n```\n\nAs evident from the report, there's quite a considerable amount of false sharing happening in\nthe code. To eliminate false sharing, allocate the Random State in the\ndomain that is going to use it, so the states will be allocated with\nmemory locations far from each other.\n\n```ocaml\nmodule T = Domainslib.Task\nlet n = try int_of_string Sys.argv.(2) with _ -\u003e 1000\nlet num_domains = try int_of_string Sys.argv.(1) with _ -\u003e 4\n\nlet arr = Array.create_float n\n\nlet init_part s e arr =\n    let my_state = Random.State.make_self_init () in\n    for i = s to e do\n      Array.unsafe_set arr i (Random.State.float my_state 100.)\n    done\n\nlet _ =\n  let domains = T.setup_pool ~num_domains:(num_domains - 1) () in\n  T.run domains (fun () -\u003e T.parallel_for domains ~chunk_size:1 ~start:0 ~finish:(num_domains - 1)\n  ~body:(fun i -\u003e init_part (i * n / num_domains) ((i+1) * n / num_domains - 1) arr));\n  T.teardown_pool domains\n```\n\nNow the results are:\n\n| Cores | Time  | Speedup     |\n|-------|-------|-------------|\n| 1     | 3.055 | 1           |\n| 2     | 1.552 | 1.968427835 |\n| 4     | 0.799 | 3.823529412 |\n| 8     | 0.422 | 7.239336493 |\n| 12    | 0.302 | 10.11589404 |\n| 16    | 0.242 | 12.62396694 |\n| 20    | 0.208 
| 14.6875     |\n| 24    | 0.186 | 16.42473118 |\n\n\n![initialisation](images/initialisation.png)\n\nIn this process, we have essentially identified bottlenecks for scaling and\neliminated them to achieve better speedups. For more details on profiling with\n`perf`, please refer to [these notes](https://github.com/ocaml-bench/notes/blob/master/profiling_notes.md).\n\n## Eventlog\n\nThe Multicore runtime supports the [OCaml instrumented\nruntime](https://ocaml.org/manual/runtime-tracing.html).\nThe instrumented runtime enables capturing metrics about various GC activities.\n[Eventlog-tools](https://github.com/ocaml-multicore/eventlog-tools/tree/multicore)\nis a library that provides tools to parse the instrumentation logs generated by\nthe runtime. Some handy tools are described [in the\nREADME](https://github.com/ocaml-multicore/eventlog-tools/tree/multicore).\n\nEventlog tools can be useful for optimizing Multicore programs.\n\n**Identify Large Pausetimes**\n\nIdentifying and fixing events that cause maximum latency can improve the overall\nthroughput of the program. `ocaml-eventlog-pausetimes` displays statistics from\nthe generated trace files. For Multicore programs, every domain has its own\ntrace file, and all of them need to be passed as input.\n\n```\n$ ocaml-eventlog-pausetimes caml-10599-0.eventlog caml-10599-2.eventlog caml-10599-4.eventlog caml-10599-6.eventlog\n{\n  \"name\": \"caml-10599-6.eventlog\",\n  \"mean_latency\": 78328,\n  \"max_latency\": 5292643,\n  \"distr_latency\": [85,89,104,231,303,9923,117639,145118,179488,692880,2728990]\n}\n```\n\n**Diagnose Imbalance in Task Distribution**\n\n*Eventlog* can be useful to find imbalance in task distribution \nin a parallel program. Imbalance in task distribution essentially means that\nnot all domains are provided with an equal amount of computation to perform, so some \ndomains take longer than others to finish their computations, while the idle domains \nkeep waiting. 
This can occur when a suboptimal\n`chunk_size` is picked in a `parallel_for`.\n\nTime periods during which a domain is idle are recorded as `domain/idle_wait` in the\n`eventlog`. Here is an example `eventlog` generated by a program with unbalanced\ntask distribution.\n\n![eventlog_task_imbalance](images/unbalanced_task1.png)\n\nIf we zoom in further, we see many `domain/idle_wait` events.\n\n![eventlog_task_imbalance_zoomed](images/unbalanced_zoomed.png)\n\nSo far we've only identified an imbalance in task distribution\nin the code; we still need to change the code to make the task\ndistribution more balanced, which could increase the speedup.\n\n---\n\nPerformance debugging can be quite tricky at times, so if you could use some help in\ndebugging your Multicore OCaml code, feel free to create an Issue in the\nMulticore OCaml [issue tracker](https://github.com/ocaml-multicore/ocaml-multicore/issues) along with a minimal code example.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focaml-multicore%2Fparallel-programming-in-multicore-ocaml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Focaml-multicore%2Fparallel-programming-in-multicore-ocaml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focaml-multicore%2Fparallel-programming-in-multicore-ocaml/lists"}