{"id":17862672,"url":"https://github.com/stevana/pipelining-with-disruptor","last_synced_at":"2025-08-24T13:33:41.086Z","repository":{"id":191644760,"uuid":"685023301","full_name":"stevana/pipelining-with-disruptor","owner":"stevana","description":"Experiment in creating parallel pipelines using the Disruptor.","archived":false,"fork":false,"pushed_at":"2024-01-31T07:33:12.000Z","size":159,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-08-02T20:57:06.605Z","etag":null,"topics":["dataflow","disruptor","parallel-programming","pipelining"],"latest_commit_sha":null,"homepage":"","language":"Haskell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stevana.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-30T10:54:41.000Z","updated_at":"2024-08-31T04:51:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"27316901-b24b-4939-8300-1b970f31d65e","html_url":"https://github.com/stevana/pipelining-with-disruptor","commit_stats":null,"previous_names":["stevana/pipelining-with-disruptor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/stevana/pipelining-with-disruptor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stevana%2Fpipelining-with-disruptor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stevana%2Fpipelining-with-disruptor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stevana%2Fpipelining-with-disruptor/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stevana%2Fpipelining-with-disruptor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stevana","download_url":"https://codeload.github.com/stevana/pipelining-with-disruptor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stevana%2Fpipelining-with-disruptor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271875578,"owners_count":24837304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-24T02:00:11.135Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataflow","disruptor","parallel-programming","pipelining"],"created_at":"2024-10-28T08:54:37.709Z","updated_at":"2025-08-24T13:33:41.044Z","avatar_url":"https://github.com/stevana.png","language":"Haskell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parallel stream processing with zero-copy fan-out and sharding\n\nIn a previous [post](https://stevana.github.io/pipelined_state_machines.html) I\nexplored how we can make better use of our parallel hardware by means of\npipelining.\n\nIn a nutshell the idea of pipelining is to break up the problem in stages and\nhave one (or more) thread(s) per stage and then connect the stages with queues.\nFor example, imagine a service where we read some 
request from a socket, parse\nit, validate, update our state and construct a response, serialise the response\nand send it back over the socket. These are six distinct stages and we could\ncreate a pipeline with six CPUs/cores each working on their own stage and\nfeeding the output to the queue of the next stage. If one stage is slow we can\nshard the input, e.g. even requests go to one worker and odd requests go to\nanother, thereby nearly doubling the throughput for that stage.\n\nOne of the concluding remarks to the previous post is that we can gain even more\nperformance by using a better implementation of queues, e.g. the [LMAX\nDisruptor](https://en.wikipedia.org/wiki/Disruptor_(software)).\n\nThe Disruptor is a low-latency high-throughput queue implementation with support\nfor multi-cast (many consumers can in parallel process the same event), batching\n(both on producer and consumer side), back-pressure, sharding (for scalability)\nand dependencies between consumers.\n\nIn this post we'll recall the problem of using \"normal\" queues, discuss how the\nDisruptor helps solve this problem and have a look at how we can provide a\ndeclarative high-level language for expressing pipelines backed by Disruptors\nwhere all low-level details are hidden away from the user of the library. We'll\nalso have a look at how we can monitor and visualise such pipelines for\ndebugging and performance troubleshooting purposes.\n\n## Motivation and inspiration\n\nBefore we dive into *how* we can achieve this, let's start with the question of\n*why* I'd like to do it.\n\nI believe the way we write programs for multiprocessor networks, i.e. multiple\nconnected computers each with multiple CPUs/cores, can be improved upon. 
Instead\nof focusing on the pitfalls of the current mainstream approaches to these\nproblems, let's have a look at what to me seems like the most promising way\nforward.\n\nJim Gray gave a great explanation of dataflow programming in this Turing Award\nRecipient [interview](https://www.youtube.com/watch?v=U3eo49nVxcA\u0026t=1949s). He\nuses props to make his point, which makes it a bit difficult to summarise in\ntext here. I highly recommend watching the video clip; the relevant part is only\nthree minutes long.\n\nThe key point is exactly that of pipelining. Each stage is running on a\nCPU/core and is itself completely sequential, but by connecting several\nstages we create a parallel pipeline. Further parallelism (what Jim calls\npartitioned parallelism) can be gained by partitioning the inputs, by say odd\nand even sequence number, and feeding one half of the inputs to one copy of the\npipeline and the other half to another copy, thereby almost doubling the\nthroughput. Jim calls this a \"natural\" way to achieve parallelism.\n\nWhile I'm not sure if \"natural\" is the best word, I do agree that it's a nice\nway to make good use of CPUs/cores on a single computer without introducing\nnon-determinism. Pipelining is also effectively used to achieve parallelism in\nmanufacturing and hardware, perhaps that's why Jim calls it \"natural\"?\n\nThings get a bit more tricky if we want to involve more computers. Part of the\nreason, I believe, is that we run into the problem highlighted by Barbara Liskov\nat the very end of her Turing award\n[lecture](https://youtu.be/qAKrMdUycb8?t=3058) (2009):\n\n\u003e \"There's a funny disconnect in how we write distributed programs. You\n\u003e  write your individual modules, but then when you want to connect\n\u003e  them together you're out of the programming language and into this\n\u003e  other world. 
Maybe we need languages that are a little bit more\n\u003e  complete now, so that we can write the whole thing in the language.\"\n\nIdeally we'd like our pipelines to seamlessly span over multiple computers. In\nfact it should be possible to deploy the same pipeline to different configurations\nof processors without changing the pipeline code (nor having to add any\nnetworking related code).\n\nA pipeline that is redeployed with additional CPUs or computers might or might\nnot scale; it depends on whether it makes sense to partition the input of a\nstage further or if perhaps the introduction of an additional computer merely\nadds more overhead. How exactly the pipeline is best spread over the available\ncomputers and CPUs/cores will require some combination of domain knowledge,\nmeasurement and judgment. Depending on how quickly we can redeploy\npipelines, it might be possible to autoscale them using a program that monitors\nthe queue lengths.\n\nAlso related to redeploying, but even more important than autoscaling, are\nupgrades of pipelines. That covers both upgrading the code running at the individual\nstages, as well as how the stages are connected to each other, i.e. the\npipeline itself.\n\nMartin Thompson has given many\n[talks](https://www.youtube.com/watch?v=_KvFapRkR9I) which echo the general\nideas of Jim and Barbara. 
If you prefer reading then you can also have a look at\nthe [reactive manifesto](https://www.reactivemanifesto.org/) which he cowrote.\nMartin is also one of the people behind the Disruptor, which we will come back\nto soon, and he also [said](https://youtu.be/OqsAGFExFgQ?t=2532) the following:\n\n\u003e \"If there's one thing I'd say to the Erlang folks, it's you got the stuff right\n\u003e from a high-level, but you need to invest in your messaging infrastructure so\n\u003e it's super fast, super efficient and obeys all the right properties to let this\n\u003e stuff work really well.\"\n\nThis quote together with Joe Armstrong's\n[anecdote](https://youtu.be/bo5WL5IQAd0?t=2494) of an unmodified Erlang program\n*only* running 33 times faster on a 64 core machine, rather than 64 times faster\nas per the Ericsson higher-up's expectations, inspired me to think about how one\ncan improve upon the already excellent work that Erlang is doing in this space.\n\nLonger term, I like to think of pipelines spanning computers as a building block\nfor what Barbara [calls](https://www.youtube.com/watch?v=8M0wTX6EOVI) a\n\"substrate for distributed systems\". Unlike Barbara I don't think this substrate\nshould be based on shared memory, but overall I agree with her goal of making it\neasier to program distributed systems by providing generic building blocks.\n\n## Prior work\n\nWorking with streams of data is common. The reason for this is that it's a nice\nabstraction when dealing with data that cannot fit in memory. 
The alternative is\nto manually load chunks of data one wants to process into memory, load the next\nchunk, etc.; when we process streams this is hidden away from us.\n\nParallelism is a related problem, in that when one has big volumes of data it's\nalso common to care about performance and how we can utilise multiple\nprocessors.\n\nSince dealing with limited memory and multiprocessors is a problem that has\nbothered programmers and computer scientists for a long time, at least since the\n1960s, there's a lot of work that has been done in this area. I'm at best\nfamiliar with a small fraction of this work, so please bear with me but also do\nlet me know if I missed any important development.\n\nIn 1963 Melvin Conway proposed\n[coroutines](https://dl.acm.org/doi/10.1145/366663.366704), which allow the\nuser to conveniently process very large, or even infinite, lists of items\nwithout first loading the list into memory, i.e. streaming.\n\nShortly after, in 1965, Peter Landin introduced\n[streams](https://dl.acm.org/doi/10.1145/363744.363749) as a functional analogue\nof Melvin's imperative coroutines.\n\nA more radical departure from Von Neumann style sequential programming can be\nseen in the work on [dataflow\nprogramming](https://en.wikipedia.org/wiki/Dataflow_programming) in general and\nespecially in Paul Morrison's [flow-based\nprogramming](https://jpaulm.github.io/fbp/index.html) (late 1960s). Paul uses\nthe following picture to illustrate the similarity between flow-based\nprogramming and an assembly line in manufacturing:\n\n![](https://raw.githubusercontent.com/stevana/pipelining-with-disruptor/main/data/bottling_factory.png)\n\nEach stage is its own process running in parallel with the other stages. 
In\nflow-based programming the stages are computations and the conveyor belts are queues.\nThis gives us implicit parallelism and a determinate outcome.\n\nDoug McIlroy, who was aware of some of the dataflow work[^1], wrote a\n[memo](http://doc.cat-v.org/unix/pipes/) in 1964 about the idea of pipes,\nalthough it took until 1973 for them to get implemented in Unix by Ken Thompson.\nUnix pipes have a strong feel of flow-based programming, although all data is of\ntype string. A pipeline of commands will start a process per command, so there's\nimplicit parallelism as well (assuming the operating system schedules different\nprocesses on different CPUs/cores). Fanning out can be done with `tee` and\nprocess substitution, e.g. `echo foo | tee \u003e(cat) \u003e(cat) | cat`, and more\ncomplicated non-linear flows can be achieved with `mkfifo`.\n\nWith the release of GNU [`parallel`](https://en.wikipedia.org/wiki/GNU_parallel)\nin 2010 more explicit control over parallelism was introduced as well as the\nability to run jobs on remote computers.\n\nAround the same time many (functional) programming languages started getting\nstreaming libraries. Haskell's\n[conduit](https://hackage.haskell.org/package/conduit) library had its first\nrelease in 2011 and Haskell's [pipes](https://hackage.haskell.org/package/pipes)\nlibrary came shortly after (2012). Java version 8, which has streams, was\nreleased in 2014. Both [Clojure](https://clojure.org/reference/transducers) and\n[Scala](https://doc.akka.io/docs/akka/current/stream/index.html), which also use\nthe JVM, got streams that same year (2014).\n\nAmong the more imperative programming languages, JavaScript and Python both have\nhad generators (a simple form of coroutines) since around 2006. Go has had \"goroutines\",\na clear nod to coroutines, since its first version (2009). Coroutines are also\npart of the C++20 standard.\n\nAlmost all of the above mentioned streaming libraries are intended to be run on\na single computer. 
Often they even run in a single thread, i.e. not exploiting\nparallelism at all. Sometimes concurrent/async constructs are available which\ncreate a pool of worker threads that process the items concurrently, but they\noften break determinism (i.e. rerunning the same computation can yield\ndifferent results, because the workers do not preserve the order of the inputs).\n\nIf the data volumes are too big for a single computer then there's a different\nset of streaming tools, such as Apache Hadoop (2006), Apache Spark (2009),\nApache Kafka (2011), Apache Storm (2011), and Apache Flink (2011). While the\nApache tools can often be deployed locally for testing purposes, they are\nintended for distributed computations and are therefore perhaps a bit more\ncumbersome to deploy and use than the streaming libraries we mentioned earlier.\n\nInitially it might not seem like a big deal that streaming libraries don't\n\"scale up\" or distribute over multiple computers, and that streaming tools like\nthe Apache ones don't gracefully \"scale down\" to a single computer. Just pick\nthe right tool for the right job, right? Well, it turns out that\n[40-80%](https://youtu.be/XPlXNUXmcgE?t=2783) of jobs submitted to MapReduce\nsystems (such as Apache Hadoop) would run faster if they were run on a single\ncomputer instead of a distributed cluster of computers, so picking the right\ntool is perhaps not as easy as it first seems.\n\nThere are two exceptions, that I know of, of streaming libraries that also work\nin a distributed setting: Scala's Akka/Pekko\n[streams](https://doc.akka.io/docs/akka/current/stream/stream-refs.html) (2014)\nwhen combined with Akka/Pekko\n[clusters](https://github.com/apache/incubator-pekko-management), and\n[Aeron](https://aeron.io/) (2014). Aeron is the spiritual successor of the\nDisruptor, also written by Martin Thompson et al. The Disruptor's main use case\nwas as part of the LMAX exchange. 
From what I understand exchanges close in the\nevening (or at least did back then in the case of LMAX), which allows for\nupdates etc. These requirements changed for Aeron, where 24/7 operation was\nnecessary, and so distributed stream processing is needed, where upgrades can\nhappen without processing stopping (or even slowing down).\n\nFinally, I'd also like to mention functional reactive programming, or FRP,\n(1997). I like to think of it as a neat way of expressing stream processing\nnetworks. Disruptor's\n[\"wizard\"](https://github.com/LMAX-Exchange/disruptor/wiki/Disruptor-Wizard) DSL\nand Akka's [graph\nDSL](https://doc.akka.io/docs/akka/current/stream/stream-graphs.html) try to add\na high-level syntax for expressing networks, but they both have an\nimperative rather than declarative feel. It's however not clear (to me) how to\neffectively implement, parallelise[^2], or distribute FRP. Some interesting work\nhas been done with hot code swapping in the FRP\n[setting](https://github.com/turion/essence-of-live-coding), which is\npotentially useful for telling a good upgrade story.\n\nTo summarise, while there are many streaming libraries there seem to be few (at\nleast that I know of) that tick all of the following boxes:\n\n  1. Parallel processing:\n     * in a determinate way;\n     * fanning out and sharding without copying data (when run on a single\n       computer).\n  2. Potentially distributed over multiple computers for fault tolerance and\n     upgrades, without the need to change the code of the pipeline;\n  3. Observable, to ease debugging and performance analysis;\n  4. Declarative high-level way of expressing stream processing networks (i.e.\n     the pipeline);\n  5. Good deploy, upgrade, rescale story for stateful systems;\n  6. Elastic, i.e. ability to rescale automatically to meet the load.\n\nI think we need all of the above in order to build Barbara's \"substrate for\ndistributed systems\". 
We'll not get all the way there in this post, but at least\nthis should give you a sense of the direction I'd like to go.\n\n## Plan\n\nThe rest of this post is organised as follows.\n\nFirst we'll have a look at how to model pipelines as a transformation of lists.\nThe purpose of this is to give us an easy to understand sequential specification\nof what we would like our pipelines to do.\n\nWe'll then give our first parallel implementation of pipelines using \"normal\"\nqueues. The main point here is to recap the problem with copying data that\narises from using \"normal\" queues, but we'll also sketch how one can test the\nparallel implementation using the model.\n\nAfter that we'll have a look at the Disruptor API, sketch its single producer\nimplementation and discuss how it helps solve the problems we identified in the\nprevious section.\n\nFinally we'll have enough background to be able to sketch the Disruptor\nimplementation of pipelines. We'll also discuss how monitoring/observability can\nbe added.\n\n## List transformer model\n\nLet's first introduce the type for our pipelines. We index our pipeline datatype\nby two types, in order to be able to precisely specify its input and output\ntypes. 
For example, the `Id`entity pipeline has the same input as output type,\nwhile pipeline composition (`:\u003e\u003e\u003e`) expects its first argument to be a pipeline\nfrom `a` to `b`, and the second argument a pipeline from `b` to `c` in order for\nthe resulting composed pipeline to be from `a` to `c` (similar to functional\ncomposition).\n\n```haskell\ndata P :: Type -\u003e Type -\u003e Type where\n  Id      :: P a a\n  (:\u003e\u003e\u003e)  :: P a b -\u003e P b c -\u003e P a c\n  Map     :: (a -\u003e b) -\u003e P a b\n  (:***)  :: P a c -\u003e P b d -\u003e P (a, b) (c, d)\n  (:\u0026\u0026\u0026)  :: P a b -\u003e P a c -\u003e P a (b, c)\n  (:+++)  :: P a c -\u003e P b d -\u003e P (Either a b) (Either c d)\n  (:|||)  :: P a c -\u003e P b c -\u003e P (Either a b) c\n  Shard   :: P a b -\u003e P a b\n```\n\nHere's a pipeline that takes a stream of integers as input and outputs a stream\nof pairs where the first component is the input integer and the second component\nis a boolean indicating if the first component was an even integer or not.\n\n```haskell\nexamplePipeline :: P Int (Int, Bool)\nexamplePipeline = Id :\u0026\u0026\u0026 Map even\n```\n\nSo far our pipelines are merely data which describes what we'd like to do. 
In\norder to actually perform a stream transformation we'd need to give semantics to\nour pipeline datatype[^3].\n\nThe simplest semantics we can give our pipelines is in terms of list\ntransformations.\n\n```haskell\nmodel :: P a b -\u003e [a] -\u003e [b]\nmodel Id         xs  = xs\nmodel (f :\u003e\u003e\u003e g) xs  = model g (model f xs)\nmodel (Map f)    xs  = map f xs\nmodel (f :*** g) xys =\n  let\n    (xs, ys) = unzip xys\n  in\n    zip (model f xs) (model g ys)\nmodel (f :\u0026\u0026\u0026 g) xs = zip (model f xs) (model g xs)\nmodel (f :+++ g) es =\n  let\n    (xs, ys) = partitionEithers es\n  in\n    -- Note that we pass in the input list, in order to preserve the order.\n    merge es (model f xs) (model g ys)\n  where\n    merge []             []       []       = []\n    merge (Left  _ : es) (l : ls) rs       = Left  l : merge es ls rs\n    merge (Right _ : es) ls       (r : rs) = Right r : merge es ls rs\nmodel (f :||| g) es =\n  let\n    (xs, ys) = partitionEithers es\n  in\n    merge es (model f xs) (model g ys)\n  where\n    merge []             []       []       = []\n    merge (Left  _ : es) (l : ls) rs       = l : merge es ls rs\n    merge (Right _ : es) ls       (r : rs) = r : merge es ls rs\nmodel (Shard f) xs = model f xs\n```\n\nNote that this semantics is completely sequential and preserves the order of the\ninputs (determinism). Also note that since we don't have parallelism yet,\n`Shard`ing doesn't do anything. We'll introduce parallelism without breaking\ndeterminism in the next section.\n\nWe can now run our example pipeline in the REPL:\n\n```\n\u003e model examplePipeline [1,2,3,4,5]\n[(1,False),(2,True),(3,False),(4,True),(5,False)]\n```\n\n## Queue pipeline deployment\n\nIn the previous section we saw how to deploy pipelines in a purely sequential\nway in order to process lists. 
The purpose of this is merely to give ourselves\nan intuition of what pipelines should do as well as an executable model which we\ncan test our intuition against.\n\nNext we shall have a look at our first parallel deployment. The idea here is to\nshow how we can involve multiple threads in the stream processing, without\nmaking the output non-deterministic (same input should always give the same\noutput).\n\nWe can achieve this as follows:\n\n```haskell\ndeploy :: P a b -\u003e TQueue a -\u003e IO (TQueue b)\ndeploy Id         xs = return xs\ndeploy (f :\u003e\u003e\u003e g) xs = deploy g =\u003c\u003c deploy f xs\ndeploy (Map f)    xs = deploy (MapM (return . f)) xs\ndeploy (MapM f)   xs = do\n  -- (Where `MapM :: (a -\u003e IO b) -\u003e P a b` is the monadic generalisation of\n  -- `Map` from the list model that we saw earlier.)\n  ys \u003c- newTQueueIO\n  forkIO $ forever $ do\n    x \u003c- atomically (readTQueue xs)\n    y \u003c- f x\n    atomically (writeTQueue ys y)\n  return ys\ndeploy (f :\u0026\u0026\u0026 g) xs = do\n  xs1 \u003c- newTQueueIO\n  xs2 \u003c- newTQueueIO\n  forkIO $ forever $ do\n    x \u003c- atomically (readTQueue xs)\n    atomically $ do\n      writeTQueue xs1 x\n      writeTQueue xs2 x\n  ys \u003c- deploy f xs1\n  zs \u003c- deploy g xs2\n  yzs \u003c- newTQueueIO\n  forkIO $ forever $ do\n    y \u003c- atomically (readTQueue ys)\n    z \u003c- atomically (readTQueue zs)\n    atomically (writeTQueue yzs (y, z))\n  return yzs\n```\n\n(I've omitted the cases for `:|||` and `:+++` to not take up too much space.\nWe'll come back and handle `Shard` separately later.)\n\n```haskell\nexample' :: [Int] -\u003e IO [(Int, Bool)]\nexample' xs0 = do\n  xs \u003c- newTQueueIO\n  mapM_ (atomically . 
writeTQueue xs) xs0\n  ys \u003c- deploy (Id :\u0026\u0026\u0026 Map even) xs\n  replicateM (length xs0) (atomically (readTQueue ys))\n```\n\nRunning\n[this](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/QueueDeployment.hs)\nin our REPL gives the same result as in the model:\n\n```\n\u003e example' [1,2,3,4,5]\n[(1,False),(2,True),(3,False),(4,True),(5,False)]\n```\n\nIn fact, we can use our model to define a property-based test which asserts that\nour queue deployment is faithful to the model:\n\n```haskell\nprop_commute :: Eq b =\u003e P a b -\u003e [a] -\u003e PropertyM IO ()\nprop_commute p xs = do\n  ys \u003c- run $ do\n    qxs \u003c- newTQueueIO\n    mapM_ (atomically . writeTQueue qxs) xs\n    qys \u003c- deploy p qxs\n    replicateM (length xs) (atomically (readTQueue qys))\n  assert (model p xs == ys)\n```\n\nActually running this property for arbitrary pipelines would require us to first\ndefine a pipeline generator, which is a bit tricky given the indexes of the\ndatatype[^4]. It can still be used as a helper for testing specific pipelines\nthough, e.g. `prop_commute examplePipeline`.\n\nA bigger problem is that we've spawned two threads, when deploying `:\u0026\u0026\u0026`, whose\nonly job is to copy elements from the input queue (`xs`) to the input queues of\n`f` and `g` (`xs{1,2}`), and from the outputs of `f` and `g` (`ys` and `zs`) to\nthe output of `f :\u0026\u0026\u0026 g` (`yzs`). Copying data is expensive.\n\nWhen we shard a pipeline we effectively clone it and send half of the traffic to\none clone and the other half to the other. 
One way to achieve this is as\nfollows, notice how in `shard` we swap `qEven` and `qOdd` when we recurse:\n\n```haskell\ndeploy (Shard f) xs = do\n  xsEven \u003c- newTQueueIO\n  xsOdd  \u003c- newTQueueIO\n  _pid   \u003c- forkIO (shard xs xsEven xsOdd)\n  ysEven \u003c- deploy f xsEven\n  ysOdd  \u003c- deploy f xsOdd\n  ys     \u003c- newTQueueIO\n  _pid   \u003c- forkIO (merge ysEven ysOdd ys)\n  return ys\n  where\n    shard :: TQueue a -\u003e TQueue a -\u003e TQueue a -\u003e IO ()\n    shard  qIn qEven qOdd = do\n      atomically (readTQueue qIn \u003e\u003e= writeTQueue qEven)\n      shard qIn qOdd qEven\n\n    merge :: TQueue a -\u003e TQueue a -\u003e TQueue a -\u003e IO ()\n    merge qEven qOdd qOut = do\n      atomically (readTQueue qEven \u003e\u003e= writeTQueue qOut)\n      merge qOdd qEven qOut\n```\n\nThis alteration will shard the input queue (`qIn`) on even and odd indices, and\nwe can `merge` it back without losing determinism. Note that if we'd simply had\na pool of worker threads taking items from the input queue and putting them on\nthe output queue (`qOut`) after processing, then we wouldn't have a\ndeterministic outcome. Also notice that in the `deploy`ment of `Shard`ing we\nalso end up copying data between the queues, similar to the fan-out case\n(`:\u0026\u0026\u0026`)!\n\nBefore we move on to show how to avoid doing this copying, let's have a look at\na couple of examples to get a better feel for pipelining and sharding. 
If we\ngeneralise `Map` to `MapM` in our\n[model](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/ModelIO.hs)\nwe can write the following contrived program:\n\n```haskell\nmodelSleep :: P () ()\nmodelSleep =\n  MapM (const (threadDelay 250000)) :\u0026\u0026\u0026 MapM (const (threadDelay 250000)) :\u003e\u003e\u003e\n  MapM (const (threadDelay 250000)) :\u003e\u003e\u003e\n  MapM (const (threadDelay 250000))\n```\n\nThe argument to `threadDelay` (or sleep) is microseconds, so at each point in\nthe pipeline we are sleeping 1/4 of a second.\n\nIf we feed this pipeline `5` items:\n\n```haskell\nrunModelSleep :: IO ()\nrunModelSleep = void (model modelSleep (replicate 5 ()))\n```\n\nWe see that it takes roughly 5 seconds:\n\n```\n\u003e :set +s\n\u003e runModelSleep\n(5.02 secs, 905,480 bytes)\n```\n\nThis is expected, even though we pipeline and fan-out, as the model is completely\nsequential.\n\nIf we instead run the same pipeline using the queue deployment, we get:\n\n```\n\u003e runQueueSleep\n(1.76 secs, 907,160 bytes)\n```\n\nThe reason for this is that the two sleeps in the fan-out happen in parallel now\nand when the first item is at the second stage the first stage starts processing\nthe second item, and so on, i.e. 
we get pipelining parallelism.\n\nIf we, for some reason, wanted to achieve a sequential running time using the\nqueue deployment, we'd have to write a one stage pipeline like so:\n\n```haskell\nqueueSleepSeq :: P () ()\nqueueSleepSeq =\n  MapM $ \\() -\u003e do\n    ()       \u003c- threadDelay 250000\n    ((), ()) \u003c- (,) \u003c$\u003e threadDelay 250000 \u003c*\u003e threadDelay 250000\n    ()       \u003c- threadDelay 250000\n    return ()\n```\n\n```\n\u003e runQueueSleepSeq\n(5.02 secs, 898,096 bytes)\n```\n\nUsing sharding we can get an even shorter running time:\n\n```haskell\nqueueSleepSharded :: P () ()\nqueueSleepSharded = Shard queueSleep\n```\n\n```\n\u003e runQueueSleepSharded\n(1.26 secs, 920,888 bytes)\n```\n\nThis is pretty much where we left off in my previous post. If the speed ups we\nare seeing from pipelining don't make sense, it might help to go back and reread\nthe [old post](https://stevana.github.io/pipelined_state_machines.html), as I\nspent some more time constructing an intuitive example there.\n\n## Disruptor\n\nBefore we can understand how the Disruptor can help us avoid the problem of copying\nbetween queues that we just saw, we need to first understand a bit about how the\nDisruptor is implemented.\n\nWe will be looking at the implementation of the single-producer Disruptor,\nbecause in our pipelines there will never be more than one producer per queue\n(the stage before it)[^5].\n\nLet's first have a look at the datatype and then explain each field:\n\n```haskell\ndata RingBuffer a = RingBuffer\n  { capacity             :: Int\n  , elements             :: IOArray Int a\n  , cursor               :: IORef SequenceNumber\n  , gatingSequences      :: IORef (IOArray Int (IORef SequenceNumber))\n  , cachedGatingSequence :: IORef SequenceNumber\n  }\n\nnewtype SequenceNumber = SequenceNumber Int\n```\n\nThe Disruptor is a ring buffer queue with a fixed `capacity`. 
It's backed by an\narray whose length is equal to the capacity; this is where the `elements` of the\nring buffer are stored. There's a monotonically increasing counter called the\n`cursor` which keeps track of how many elements we have written. By taking the\nvalue of the `cursor` modulo the `capacity` we get the index into the array\nwhere we are supposed to write our next element (this is how we wrap around the\narray, i.e. forming a ring). In order to avoid overwriting elements which have\nnot yet been consumed we also need to keep track of the cursors of all consumers\n(`gatingSequences`). As an optimisation we cache where the last (i.e. furthest\nbehind) consumer is (`cachedGatingSequence`).\n\nThe API from the producing side looks as follows:\n\n```haskell\ntryClaimBatch   :: RingBuffer a -\u003e Int -\u003e IO (Maybe SequenceNumber)\nwriteRingBuffer :: RingBuffer a -\u003e SequenceNumber -\u003e a -\u003e IO ()\npublish         :: RingBuffer a -\u003e SequenceNumber -\u003e IO ()\n```\n\nWe first try to claim `n :: Int` slots in the ring buffer. If that fails\n(returns `Nothing`) then we know that there isn't space in the ring buffer and\nwe should apply backpressure upstream (e.g. if the producer is a web server, we\nmight want to temporarily reject clients with status code 503). Once we\nsuccessfully get a sequence number, we can start writing our data. Finally we\npublish the sequence number; this makes the written data available on the consumer side.\n\nThe consumer side of the API looks as follows:\n\n```haskell\naddGatingSequence :: RingBuffer a -\u003e IO (IORef SequenceNumber)\nwaitFor           :: RingBuffer a -\u003e SequenceNumber -\u003e IO SequenceNumber\nreadRingBuffer    :: RingBuffer a -\u003e SequenceNumber -\u003e IO a\n```\n\nFirst we need to add a consumer to the ring buffer (to avoid overwriting on wrap\naround of the ring), which gives us a consumer cursor. 
The consumer is\nresponsible for updating this cursor, the ring buffer will only read from it to\navoid overwriting. After the consumer reads the cursor, it calls `waitFor` on\nthe read value, this will block until an element has been `publish`ed on that\nslot by the producer. In the case that the producer is ahead it will return the\ncurrent sequence number of the producer, hence allowing the consumer to do a\nbatch of reads (from where it currently is to where the producer currently is).\nOnce the consumer has caught up with the producer it updates its cursor.\n\nHere's an example which hopefully makes things more concrete:\n\n```haskell\nexample :: IO ()\nexample = do\n  rb \u003c- newRingBuffer_ 2\n  c \u003c- addGatingSequence rb\n  let batchSize = 2\n  Just hi \u003c- tryClaimBatch rb batchSize\n  let lo = hi - (coerce batchSize - 1)\n  assertIO (lo == 0)\n  assertIO (hi == 1)\n  -- Notice that these writes are batched:\n  mapM_ (\\(i, c) -\u003e writeRingBuffer rb i c) (zip [lo..hi] ['a'..])\n  publish rb hi\n  -- Since the ring buffer size is only two and we've written two\n  -- elements, it's full at this point:\n  Nothing \u003c- tryClaimBatch rb 1\n  consumed \u003c- readIORef c\n  produced \u003c- waitFor rb consumed\n  -- The consumer can do batched reads, and only do some expensive\n  -- operation once it reaches the end of the batch:\n  xs \u003c- mapM (readRingBuffer rb) [consumed + 1..produced]\n  assertIO (xs == \"ab\")\n  -- The consumer updates its cursor:\n  writeIORef c produced\n  -- Now there's space again for the producer:\n  Just 2 \u003c- tryClaimBatch rb 1\n  return ()\n```\n\nSee the `Disruptor` [module](src/Disruptor.hs) in case you are interested in the\nimplementation details.\n\nHopefully by now we've seen enough internals to be able to explain why the\nDisruptor performs well. First of all, by using a ring buffer we only allocate\nmemory when creating the ring buffer, it's then reused when we wrap around the\nring. 
The ring buffer is implemented using an array, so the memory access\npatterns are predictable and the CPU can do prefetching. The consumers don't\nhave a copy of the data, they merely have a pointer (the sequence number) to how\nfar in the producer's ring buffer they are, which allows for fanning out or\nsharding to multiple consumers without copying data. The fact that we can batch\non both the write side (with `tryClaimBatch`) and on the reader side (with\n`waitFor`) also helps. All this taken together contributes to the Disruptor's\nperformance.\n\n## Disruptor pipeline deployment\n\nRecall that the reason we introduced the Disruptor was to avoid copying elements\nof the queue when fanning out (using the `:\u0026\u0026\u0026` combinator) and sharding.\n\nThe idea would be to have the workers we fan out to both be consumers of the\nsame Disruptor; that way the inputs don't need to be copied. Avoiding copying\nthe individual outputs from the workers' queues (of `a`s and `b`s) into the\ncombined output (of `(a, b)`s) is a bit trickier.\n\nOne way that I think works is to do something reminiscent of what\n[`Data.Vector.Unboxed`](https://hackage.haskell.org/package/vector) does for pairs.\nThat is, an unboxed vector of pairs (`Vector (a, b)`) is actually represented as a pair of\nvectors (`(Vector a, Vector b)`)[^6].\n\nWe can achieve this with [associated\ntypes](http://simonmar.github.io/bib/papers/assoc.pdf) as follows:\n\n```haskell\nclass HasRB a where\n  data RB a :: Type\n  newRB               :: Int -\u003e IO (RB a)\n  tryClaimBatchRB     :: RB a -\u003e Int -\u003e IO (Maybe SequenceNumber)\n  writeRingBufferRB   :: RB a -\u003e SequenceNumber -\u003e a -\u003e IO ()\n  publishRB           :: RB a -\u003e SequenceNumber -\u003e IO ()\n  addGatingSequenceRB :: RB a -\u003e IO Counter\n  waitForRB           :: RB a -\u003e SequenceNumber -\u003e IO SequenceNumber\n  readRingBufferRB    :: RB a -\u003e SequenceNumber -\u003e IO a\n```\n\nThe instances for this class for types 
that are not pairs will just use the\nDisruptor that we defined above.\n\n```haskell\ninstance HasRB String where\n  data RB String = RB (RingBuffer String)\n  newRB n        = RB \u003c$\u003e newRingBuffer_ n\n  ...\n```\n\nWhile the instance for pairs will use a pair of Disruptors:\n\n```haskell\ninstance (HasRB a, HasRB b) =\u003e HasRB (a, b) where\n  data RB (a, b) = RBPair (RB a) (RB b)\n  newRB n = RBPair \u003c$\u003e newRB n \u003c*\u003e newRB n\n  ...\n```\n\nThe `deploy` function for the fan-out combinator can now avoid copying:\n\n```haskell\ndeploy :: (HasRB a, HasRB b) =\u003e P a b -\u003e RB a -\u003e IO (RB b)\ndeploy (p :\u0026\u0026\u0026 q) xs = do\n  ys \u003c- deploy p xs\n  zs \u003c- deploy q xs\n  return (RBPair ys zs)\n```\n\nSharding, or partition parallelism as Jim calls it, is a way to make a copy of a\npipeline and divert half of the events to the first copy and the other half to\nthe other copy. Assuming there are enough unused CPUs/cores, this could\neffectively double the throughput. It might be helpful to think of the events at\neven positions in the stream going to the first pipeline copy while the events\nin the odd positions in the stream go to the second copy of the pipeline.\n\nWhen we shard in the `TQueue` deployment of pipelines we end up copying events\nfrom the original stream into the two pipeline copies. 
This is similar to\ncopying when fanning out, which we discussed above, and the solution is similar.\n\nFirst we need to change the pipeline type so that the shard constructor has an\noutput type that's `Sharded`.\n\n```diff\ndata P :: Type -\u003e Type -\u003e Type where\n  ...\n- Shard :: P a b -\u003e P a b\n+ Shard :: P a b -\u003e P a (Sharded b)\n```\n\nThis type is in fact merely the identity type:\n\n```haskell\nnewtype Sharded a = Sharded a\n```\n\nBut it allows us to define a `HasRB` instance which does the sharding without\ncopying as follows:\n\n```haskell\ninstance HasRB a =\u003e HasRB (Sharded a) where\n  data RB (Sharded a) = RBShard Partition Partition (RB a) (RB a)\n  readRingBufferRB (RBShard p1 p2 xs ys) i\n    | partition i p1 = readRingBufferRB xs i\n    | partition i p2 = readRingBufferRB ys i\n  ...\n```\n\nThe idea is that we split the ring buffer into two, like when fanning out,\nand then we have a way of taking an index and figuring out which of the two ring\nbuffers it's actually in.\n\nThis partitioning information, `p`, is threaded through while deploying:\n\n```haskell\ndeploy (Shard f) p xs = do\n  let (p1, p2) = addPartition p\n  ys1 \u003c- deploy f p1 xs\n  ys2 \u003c- deploy f p2 xs\n  return (RBShard p1 p2 ys1 ys2)\n```\n\nFor the details of how this works see the following footnote[^7] and the `HasRB\n(Sharded a)` instance in the following\n[module](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/RingBufferClass.hs).\n\nIf we\n[run](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/LibMain/Sleep.hs)\nour sleep pipeline from before using the Disruptor\n[deployment](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/Pipeline.hs)\nwe get similar timings as with the queue deployment:\n\n```\n\u003e runDisruptorSleep False\n(2.01 secs, 383,489,976 bytes)\n\n\u003e runDisruptorSleepSharded False\n(1.37 secs, 286,207,264 bytes)\n```\n\nIn order to get a better understanding of how 
not copying when fanning out and\nsharding improves performance, let's instead have a look at this pipeline which\nfans out five times:\n\n```haskell\ncopyP :: P () ()\ncopyP =\n  Id :\u0026\u0026\u0026 Id :\u0026\u0026\u0026 Id :\u0026\u0026\u0026 Id :\u0026\u0026\u0026 Id\n  :\u003e\u003e\u003e Map (const ())\n```\n\nIf we deploy this pipeline using queues and feed it five million items we get\nthe following statistics from the profiler:\n\n```\n11,457,369,968 bytes allocated in the heap\n   198,233,200 bytes copied during GC\n     5,210,024 bytes maximum residency (27 sample(s))\n     4,841,208 bytes maximum slop\n           216 MiB total memory in use (0 MB lost due to fragmentation)\n\nreal    0m8.368s\nuser    0m10.647s\nsys     0m0.778s\n```\n\nWhile the same setup but using the Disruptor deployment gives us:\n\n```\n6,629,305,096 bytes allocated in the heap\n  110,544,544 bytes copied during GC\n    3,510,424 bytes maximum residency (17 sample(s))\n    5,090,472 bytes maximum slop\n          214 MiB total memory in use (0 MB lost due to fragmentation)\n\nreal    0m5.028s\nuser    0m7.000s\nsys     0m0.626s\n```\n\nSo the Disruptor deployment allocates a bit more than half as many bytes in the\nheap as the queue deployment.\n\nIf we double the fan-out factor from five to ten, we get the following stats with\nthe queue deployment:\n\n```\n35,552,340,768 bytes allocated in the heap\n 7,355,365,488 bytes copied during GC\n    31,518,256 bytes maximum residency (295 sample(s))\n       739,472 bytes maximum slop\n           257 MiB total memory in use (0 MB lost due to fragmentation)\n\nreal    0m46.104s\nuser    3m35.192s\nsys     0m1.387s\n```\n\nand the following for the Disruptor deployment:\n\n```\n11,457,369,968 bytes allocated in the heap\n   198,233,200 bytes copied during GC\n     5,210,024 bytes maximum residency (27 sample(s))\n     4,841,208 bytes maximum slop\n           216 MiB total memory in use (0 MB lost due to fragmentation)\n\nreal    0m8.368s\nuser    0m10.647s\nsys     
0m0.778s\n```\n\nSo it seems that the gap between the two deployments widens as we introduce more\nfan-out; this is expected as the queue implementation will have more copying of\ndata to do[^8].\n\n## Observability\n\nGiven that pipelines are directed acyclic graphs and that we have a concrete\ndatatype constructor for each pipeline combinator, it's relatively\nstraightforward to add a visualisation of a deployment.\n\nFurthermore, since each Disruptor has a `cursor` keeping track of how many\nelements it produced and all the consumers of a Disruptor have one keeping track\nof how many elements they have consumed, we can annotate our deployment\nvisualisation with this data and get a good idea of the progress the pipeline is\nmaking over time as well as spot potential bottlenecks.\n\nHere's an example of such a visualisation, for a\n[word count](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/LibMain/WordCount.hs)\npipeline, as an interactive SVG (you need to click on the image):\n\n[![Demo](https://stevana.github.io/svg-viewer-in-svg/wordcount-pipeline.svg)](https://stevana.github.io/svg-viewer-in-svg/wordcount-pipeline.svg)\n\nThe way it's implemented is that we spawn a separate thread that reads the\nproducers' cursors and consumers' gating sequences (`IORef SequenceNumber` in\nboth cases) every millisecond and saves the `SequenceNumber`s (integers). After\ncollecting this data we can create one dot diagram for every time the data\nchanged. 
In the demo above, we also collected all the elements of the Disruptor;\nthis is useful for debugging (the implementation of the pipeline library), but\nit would probably be too expensive to enable this when there are a lot of items to\nbe processed.\n\nI have written a separate write-up on how to make the SVG interactive over\n[here](https://stevana.github.io/visualising_datastructures_over_time_using_svg.html).\n\n## Running\n\nAll of the above Haskell code is available on\n[GitHub](https://github.com/stevana/pipelining-with-disruptor/). The easiest way\nto install the right version of GHC and cabal is probably via\n[ghcup](https://www.haskell.org/ghcup/). Once installed the\n[examples](https://github.com/stevana/pipelining-with-disruptor/tree/main/src/LibMain)\ncan be run as follows:\n\n```bash\ncat data/test.txt | cabal run uppercase\ncat data/test.txt | cabal run wc # word count\n```\n\nThe [sleep\nexamples](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/LibMain/Sleep.hs)\nare run like this:\n\n```bash\ncabal run sleep\ncabal run sleep -- --sharded\n```\n\nThe different [copying\nbenchmarks](https://github.com/stevana/pipelining-with-disruptor/blob/main/src/LibMain/Copying.hs)\ncan be reproduced as follows:\n\n```bash\nfor flag in \"--no-sharding\" \\\n            \"--copy10\" \\\n            \"--tbqueue-no-sharding\" \\\n            \"--tbqueue-copy10\"; do \\\n  cabal build copying \u0026\u0026 \\\n    time cabal run copying -- \"$flag\" \u0026\u0026 \\\n    eventlog2html copying.eventlog \u0026\u0026 \\\n    ghc-prof-flamegraph copying.prof \u0026\u0026 \\\n    firefox copying.eventlog.html \u0026\u0026 \\\n    firefox copying.svg\ndone\n```\n\n## Further work and contributing\n\nThere's still a lot to do, but I thought it would be a good place to stop for\nnow. 
Here are a bunch of possible improvements, in no particular order:\n\n- [ ] Implement the `Arrow` instance for Disruptor `P`ipelines, this isn't as\n      straightforward as in the model case, because the combinators are littered\n      with `HasRB` constraints, e.g.: `(:\u0026\u0026\u0026) :: (HasRB b, HasRB c) =\u003e P a b -\u003e\n      P a c -\u003e P a (b, c)`. Perhaps taking inspiration from\n      constrained/restricted monads? In the `r/haskell` discussion, the user\n      `ryani` [pointed\n      out](https://old.reddit.com/r/haskell/comments/19ef2b6/parallel_stream_processing_with_zerocopy_fanout/kjhfyfk/)\n      a promising solution involving adding `Constraint`s to the `HasRB` class.\n      This would allow us to specify pipelines using the [arrow\n      notation](https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/arrows.html).\n- [ ] I believe the current pipeline combinators allow for arbitrary directed\n      acyclic graphs (DAGs), but what if feedback cycles are needed? Does an\n      `ArrowLoop` instance make sense in that case?\n- [ ] Can we avoid copying when using `Either` via `(:|||)` or `(:+++)`, e.g.\n      can we store all `Left`s in one ring buffer and all `Right`s in another?\n- [ ] Use unboxed arrays for types that can be unboxed in the `HasRB` instances?\n- [ ] In the word count example we get an input stream of lines, but we only\n      want to produce a single line as output when we reach the end of the input\n      stream. In order to do this I added a way for workers to say that\n      `NoOutput` was produced in one step. Currently that constructor still gets\n      written to the output Disruptor, would it be possible to not write it but\n      still increment the sequence number counter?\n- [ ] Add more monitoring? Currently we only keep track of the queue length,\n      i.e. saturation. Adding service time, i.e. how long it takes to process an\n      item, per worker shouldn't be hard. 
Latency (how long an item has been\n      waiting in the queue) would be more tricky as we'd need to annotate and\n      propagate a timestamp with the item?\n- [ ] Since monitoring adds a bit of overhead, it would be neat to be able to\n      turn monitoring on and off at runtime.\n- [ ] The `HasRB` instances are incomplete, and it's not clear if they need to\n      be completed? More testing and examples could help answer this question,\n      or perhaps a better visualisation?\n- [ ] Actually test using `prop_commute` partially applied to a concrete\n      pipeline?\n- [ ] Implement a property-based testing generator for pipelines and test using\n      `prop_commute` using random pipelines?\n- [ ] Add network/HTTP source and sink?\n- [ ] Deploy across a network of computers?\n- [ ] Hot-code upgrades of workers/stages with zero downtime, perhaps continuing\n      on my earlier\n      [attempt](https://stevana.github.io/hot-code_swapping_a_la_erlang_with_arrow-based_state_machines.html)?\n- [ ] In addition to upgrading the workers/stages, one might also want to rewire\n      the pipeline itself. Doug made me aware of an old\n      [paper](https://inria.hal.science/inria-00306565) by Gilles Kahn and David\n      MacQueen (1976), where they reconfigure their network on the fly. Perhaps\n      some ideas can be stolen from there?\n- [ ] Related to reconfiguring is being able to shard/scale/reroute pipelines and\n      add more machines without downtime. Can we do this automatically based on\n      our monitoring? 
Perhaps building upon my earlier\n      [attempt](https://stevana.github.io/elastically_scalable_thread_pools.html)?\n- [ ] More benchmarks, in particular trying to confirm that we indeed don't\n      allocate when fanning out and sharding[^8], as well as benchmarks against\n      other streaming libraries.\n\nIf any of this seems interesting, feel free to get involved.\n\n## See also\n\n* Guy Steele's talk [How to Think about Parallel Programming:\n  Not!](https://www.infoq.com/presentations/Thinking-Parallel-Programming/)\n  (2011);\n* [Understanding the Disruptor, a Beginner's Guide to Hardcore\n  Concurrency](https://youtube.com/watch?v=DCdGlxBbKU4) by Trisha Gee and Mike\n  Barker (2011);\n* Mike Barker's [brute-force solution to Guy's problem and\n  benchmarks](https://github.com/mikeb01/folklore/tree/master/src/main/java/performance);\n* [Streaming 101: The world beyond\n  batch](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/)\n  (2015);\n* [Streaming 102: The world beyond\n  batch](https://www.oreilly.com/radar/the-world-beyond-batch-streaming-102/)\n  (2016);\n* [*SEDA: An Architecture for Well-Conditioned Scalable Internet\n  Services*](https://people.eecs.berkeley.edu/~brewer/papers/SEDA-sosp.pdf)\n  (2001);\n* [Microsoft\n  Naiad](https://www.microsoft.com/en-us/research/publication/naiad-a-timely-dataflow-system-2/):\n  a timely dataflow system (with stage notifications) (2013);\n* Elixir's ALF flow-based programming\n  [library](https://www.youtube.com/watch?v=2XrYd1W5GLo) (2021);\n* [How fast are Linux pipes anyway?](https://mazzo.li/posts/fast-pipes.html)\n  (2022);\n* [netmap](https://man.freebsd.org/cgi/man.cgi?query=netmap\u0026sektion=4): a\n  framework for fast packet I/O;\n* [The output of Linux pipes can be\n  indeterministic](https://www.gibney.org/the_output_of_linux_pipes_can_be_indeter)\n  (2019);\n* [Programming Distributed Systems](https://www.youtube.com/watch?v=Mc3tTRkjCvE)\n  by Mae Milano (Strange Loop, 
2023);\n* [Pipeline-oriented programming](https://www.youtube.com/watch?v=ipceTuJlw-M)\n  by Scott Wlaschin (NDC Porto, 2023).\n\n## Discussion\n\n* [discourse.haskell.org](https://discourse.haskell.org/t/parallel-stream-processing-with-zero-copy-fan-out-and-sharding/8632);\n* [r/haskell](https://old.reddit.com/r/haskell/comments/19ef2b6/parallel_stream_processing_with_zerocopy_fanout/);\n* [lobste.rs](https://lobste.rs/s/mvgdev/parallel_stream_processing_with_zero).\n\n\n[^1]: I noticed that the Wikipedia page for [dataflow\n    programming](https://en.wikipedia.org/wiki/Dataflow_programming) mentions\n    that Jack Dennis and his graduate students pioneered that style of\n    programming while he was at MIT in the 60s. I knew Doug was at MIT around\n    that time as well, and so I sent an email to Doug asking if he knew of\n    Jack's work. Doug replied saying he had left MIT by the 60s, but was still\n    collaborating with people at MIT and was aware of Jack's work and also\n    the work by Kelly, Lochbaum and Vyssotsky on\n    [BLODI](https://archive.org/details/bstj40-3-669) (1961) was on his mind\n    when he wrote the garden hose memo (1964).\n\n[^2]: There's a paper called [Parallel Functional Reactive\n    Programming](http://flint.cs.yale.edu/trifonov/papers/pfrp.pdf) by Peterson\n    et al. (2000), but as Conal Elliott\n    [points](http://conal.net/papers/push-pull-frp/push-pull-frp.pdf) out:\n\n    \u003e \"Peterson et al. (2000) explored opportunities for parallelism in\n    \u003e implementing a variation of FRP. While the underlying semantic\n    \u003e model was not spelled out, it seems that semantic determinacy was\n    \u003e not preserved, in contrast to the semantically determinate concurrency\n    \u003e used in this paper (Section 11).\"\n\n    Conal's approach (his Section 11) seems to build upon very fine grained\n    parallelism provided by an \"unambiguous choice\" operator which is implemented\n    by spawning two threads. 
I don't understand where exactly this operator is\n    used in the implementation, but if it's used every time an element is\n    processed (in parallel) then the overhead of spawning the threads could\n    be significant?\n\n[^3]: The design space of what pipeline combinators to include in the pipeline\n    datatype is very big. I've chosen the ones I did because they are\n    instances of already well-established type classes:\n\n    ```haskell\n    instance Category P where\n      id    = Id\n      g . f = f :\u003e\u003e\u003e g\n\n    instance Arrow P where\n      arr     = Map\n      f *** g = f :*** g\n      f \u0026\u0026\u0026 g = f :\u0026\u0026\u0026 g\n\n    instance ArrowChoice P where\n      f +++ g = f :+++ g\n      f ||| g = f :||| g\n    ```\n\n    Ideally we'd also like to be able to use `Arrow` notation/syntax to describe our\n    pipelines. Even better would be if arrow notation worked for Cartesian categories.\n    See Conal Elliott's work on [compiling to\n    categories](http://conal.net/papers/compiling-to-categories/), as well as\n    Oleg Grenrus' GHC\n    [plugin](https://github.com/phadej/overloaded/blob/master/src/Overloaded/Categories.hs)\n    that does the right thing and translates arrow syntax into Cartesian\n    categories.\n\n[^4]: Search for \"QuickCheck GADTs\" if you are interested in finding out more\n    about this topic.\n\n[^5]: The Disruptor also comes in a multi-producer variant, see the following\n    [repository](https://github.com/stevana/pipelined-state-machines/tree/main/src/Disruptor/MP)\n    for a Haskell version or the\n    [LMAX](https://github.com/LMAX-Exchange/disruptor) repository for the\n    original Java implementation.\n\n[^6]: See also [array of structures vs structure of\n    arrays](https://en.wikipedia.org/wiki/AoS_and_SoA) in other programming\n    languages.\n\n[^7]: The partitioning information consists of the total number of partitions\n    and the index of the current partition.\n\n    
```haskell\n    data Partition = Partition\n      { pIndex :: Int\n      , pTotal :: Int\n      }\n    ```\n\n    No partitioning is represented as follows:\n\n    ```haskell\n    noPartition :: Partition\n    noPartition = Partition 0 1\n    ```\n\n    While creating a new partition is done as follows:\n\n    ```haskell\n    addPartition :: Partition -\u003e (Partition, Partition)\n    addPartition (Partition i total) =\n      ( Partition i (total * 2)\n      , Partition (i + total) (total * 2)\n      )\n    ```\n\n    So, for example, if we partition twice we get:\n\n    ```\n    \u003e let (p1, p2) = addPartition noPartition in (addPartition p1, addPartition p2)\n    ((Partition 0 4, Partition 2 4), (Partition 1 4, Partition 3 4))\n    ```\n\n    From this information we can compute if an index is in a partition or not as\n    follows:\n\n    ```haskell\n    partition :: SequenceNumber -\u003e Partition -\u003e Bool\n    partition i (Partition n total) = i `mod` total == 0 + n\n    ```\n\n    To understand why this works, it might be helpful to consider the case where we\n    only have two partitions. We can partition on even or odd indices as follows:\n    ``even i = i `mod` 2 == 0 + 0`` and ``odd i = i `mod` 2 == 0 + 1``. Written this\n    way we can more easily see how to generalise to `total` partitions: ``partition i\n    (Partition n total) = i `mod` total == 0 + n``. So for `total = 2`,\n    `partition i (Partition 0 2) == even` while `partition i (Partition 1 2) ==\n    odd`.\n\n    Since partitioning and partitioning a partition, etc, always introduce a power\n    of two we can further optimise to use bitwise and as follows: `partition i\n    (Partition n total) = i .\u0026. (total - 1) == 0 + n` thereby avoiding the expensive\n    modulus computation. 
This is a trick used in Disruptor as well, and the reason\n    why the capacity of a Disruptor always needs to be a power of two.\n\n[^8]: I'm not sure why \"bytes allocated in the heap\" gets doubled in the\n    Disruptor case and tripled in the queue cases though?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstevana%2Fpipelining-with-disruptor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstevana%2Fpipelining-with-disruptor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstevana%2Fpipelining-with-disruptor/lists"}