{"id":13671612,"url":"https://github.com/arindas/laminarmq","last_synced_at":"2026-03-12T14:17:15.927Z","repository":{"id":57133806,"uuid":"524087609","full_name":"arindas/laminarmq","owner":"arindas","description":"A scalable, distributed message queue powered by a segmented, partitioned, replicated and immutable log.","archived":false,"fork":false,"pushed_at":"2024-05-25T15:21:28.000Z","size":44478,"stargazers_count":66,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-07-13T07:37:15.062Z","etag":null,"topics":["io-uring","message-queue","segmented-log"],"latest_commit_sha":null,"homepage":"https://docs.rs/laminarmq/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arindas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-12T12:53:31.000Z","updated_at":"2025-06-20T08:25:26.000Z","dependencies_parsed_at":"2023-11-21T05:37:35.910Z","dependency_job_id":"48e92bc0-a37b-4ec8-b98f-d434562bca31","html_url":"https://github.com/arindas/laminarmq","commit_stats":{"total_commits":141,"total_committers":1,"mean_commits":141.0,"dds":0.0,"last_synced_commit":"a778f8179e441b8f73b6c4a1102d4e7a23060ad4"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/arindas/laminarmq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arindas%2Flaminarmq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arindas%2Flaminarmq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/arindas%2Flaminarmq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arindas%2Flaminarmq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arindas","download_url":"https://codeload.github.com/arindas/laminarmq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arindas%2Flaminarmq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30428016,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T14:00:25.264Z","status":"ssl_error","status_checked_at":"2026-03-12T13:59:52.690Z","response_time":114,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["io-uring","message-queue","segmented-log"],"created_at":"2024-08-02T09:01:14.504Z","updated_at":"2026-03-12T14:17:15.883Z","avatar_url":"https://github.com/arindas.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/logo.png\" alt=\"laminarmq\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/arindas/laminarmq/actions/workflows/rust-ci.yml\"\u003e\n  \u003cimg src=\"https://github.com/arindas/laminarmq/actions/workflows/rust-ci.yml/badge.svg\" /\u003e\n  \u003c/a\u003e\n  \u003ca 
href=\"https://codecov.io/gh/arindas/laminarmq\" \u003e\n  \u003cimg src=\"https://codecov.io/gh/arindas/laminarmq/branch/main/graph/badge.svg?token=6VLETF5REC\"/\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://crates.io/crates/laminarmq\"\u003e\n  \u003cimg src=\"https://img.shields.io/crates/v/laminarmq\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/arindas/laminarmq/actions/workflows/rustdoc.yml\"\u003e\n  \u003cimg src=\"https://github.com/arindas/laminarmq/actions/workflows/rustdoc.yml/badge.svg\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\nA scalable, distributed message queue powered by a segmented, partitioned, replicated and immutable\nlog.\u003cbr\u003e\u003ci\u003eThis is currently a work in progress.\u003c/i\u003e\n\u003c/p\u003e\n\n## Usage\n\n`laminarmq` provides a library crate and two binaries for managing `laminarmq` deployments. In order\nto use `laminarmq` as a library, add the following to your `Cargo.toml`:\n\n```toml\n[dependencies]\nlaminarmq = \"0.0.5\"\n```\n\nRefer to latest git [API Documentation](https://arindas.github.io/laminarmq/docs/laminarmq/) or\n[Crate Documentation](https://docs.rs/laminarmq) for more details. There's also a\n[book](https://arindas.github.io/laminarmq/book) being written to further describe design decisions,\nimplementation details and recipes.\n\n`laminarmq` presents an elementary commit-log abstraction (a series of records ordered by indices),\non top of which several message queue semantics such as publish subscribe or even full blown\nprotocols like MQTT could be implemented. 
Users are free to read the messages with offsets in any\norder they need.\n\n## Major milestones for `laminarmq`\n\n- [x] Locally persistent queue of records\n- [ ] Single node, multi threaded, eBPF based request to thread routed message queue\n- [ ] Service discovery with\n      [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf).\n- [ ] Replication and consensus of replicated records with [Raft](https://raft.github.io/raft.pdf).\n\n## Examples\n\nFind examples demonstrating different capabilities of `laminarmq` in the\n[examples](https://github.com/arindas/laminarmq/tree/main/examples) directory.\n\nInvoke any example as follows:\n\n```sh\ncargo run --example \u003cexample-name\u003e --release\n```\n\n## Media\n\nMedia associated with the `laminarmq` project.\n\n- `[BLOG]` [Building Segmented Logs in Rust: From Theory to Production!](https://arindas.github.io/blog/segmented-log-rust/)\n\n## Design\n\nThis section describes the internal design of `laminarmq`.\n\n### Cluster Hierarchy\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-cluster-hierarchy.svg\" alt=\"cluster-hierarchy\" /\u003e\n\u003c/p\u003e\n\n```text\npartition_id_x is of the form (topic_id, partition_idx)\n\nIn this example, consider:\n\npartition_id_0 = (topic_id_0, partition_idx_0)\npartition_id_1 = (topic_id_0, partition_idx_1)\n\npartition_id_2 = (topic_id_1, partition_idx_0)\n\n```\n\n\u003e The exact numerical ids don't have any pattern with respect to partition_id and topic_id; there can\n\u003e be multiple topics, each of which can have multiple partitions (identified by partition_idx).\n\n… alternatively:\n\n```text\n[cluster]\n├── node#001\n│   ├── (topic#001, partition#001) [L]\n│   │   └── segmented_log{[segment#001, segment#002, ...]}\n│   ├── (topic#001, partition#002) [L]\n│   │   └── segmented_log{[segment#001, segment#002, ...]}\n│   └── (topic#002, partition#001) [F]\n│  
     └── segmented_log{[segment#001, segment#002, ...]}\n├── node#002\n│   ├── (topic#001, partition#002) [F]\n│   │   └── segmented_log{[segment#001, segment#002, ...]}\n│   └── (topic#002, partition#001) [L]\n│       └── segmented_log{[segment#001, segment#002, ...]}\n└── ...other nodes\n\n```\n\n```text\n[L] := leader; [F] := follower\n```\n\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e \u003ccode\u003elaminarmq\u003c/code\u003e cluster hierarchy depicting partitioning and replication.\n\u003c/p\u003e\n\nA \"topic\" is a collection of records. A topic is divided into multiple \"partition\"(s). Each\n\"partition\" is then further replicated across multiple \"node\"(s). A \"node\" may contain some or all\n\"partition\"(s) of a \"topic\". In this way a topic is both partitioned and replicated across the\nnodes in the cluster.\n\nThere is no ordering of messages at the \"topic\" level. However, a \"partition\" is an ordered\ncollection of records, ordered by record indices.\n\nAlthough we conceptually maintain a hierarchy of partitions and topics, at the cluster level, we\nhave chosen to maintain a flat representation of topic partitions. We present an elementary\ncommit-log API at the partition level.\n\nUsers may hence do the following:\n\n- Directly interact with our message queue at the partition level\n- Use client side load balancing between topic partitions\n\nThis alleviates the burden of load balancing messages among partitions and message stream ownership\nrecord keeping from the cluster. Higher level constructs can be built on top of the partition\ncommit-log based API as necessary.\n\nEach partition replica group has a leader where writes go, and a set of followers which follow the\nleader and may be read from. 
Users may again use client side load balancing to balance reads across\nthe leader and all the followers.\n\nEach partition replica is backed by a segmented log for storage.\n\n### Service discovery and partition distribution to nodes\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-service-discovery-and-partition-distribution.svg\" alt=\"service-discovery-and-partition-distribution-to-nodes\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Rendezvous hashing based partition distribution and gossip style service discovery\nmechanism used by \u003ccode\u003elaminarmq\u003c/code\u003e\n\u003c/p\u003e\n\nIn our cluster, nodes discover each other using gossip style peer to peer mechanisms. One such\nmechanism is [SWIM](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf) (Scalable\nWeakly-consistent Infection-style Process Group Membership).\n\nIn this mechanism, each member node notifies other members in the group whether a node is joining or\nleaving the cluster by using a gossip style information dissemination mechanism (a node gossips to\nneighbouring nodes, they in turn gossip to their neighbours and so on).\n\nIn order to see whether a node has failed, the nodes randomly probe individual nodes in the\ncluster. For instance, node A probes node B directly. If node B responds, it has not failed. If node\nB does not respond, A attempts to probe node B indirectly through other nodes in the cluster, e.g.\nnode A might ask node C to probe node B. Node A continues to indirectly probe node B with all the\nother nodes in the cluster. If node B responds to any of the indirect probes, it is still considered\nto not have failed. It is otherwise declared failed and removed from the cluster.\n\nThere are mechanisms in place to reduce false failures caused by temporary hiccups. 
The paper\ngoes into detail about those mechanisms.\n\nThis is also the core technology used in [Hashicorp Serf](https://www.serf.io/), where there are\nfurther enhancements to improve failure detection and convergence speed.\n\nUsing this mechanism we can obtain a list of all the members in our cluster, along with their unique\nids and capacity weights. We then use their ids and weights to determine where to place a partition\nusing Rendezvous hashing.\n\nFrom the Wikipedia [article](https://en.wikipedia.org/wiki/Rendezvous_hashing):\n\n\u003e Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve\n\u003e distributed agreement on a set of _k_ options out of a possible set of _n_ options. A typical\n\u003e application is when clients need to agree on which sites (or proxies) objects are assigned to.\n\nIn our case, we use rendezvous hashing to determine the subset of nodes to use for placing the\nreplicas of a partition.\n\nFor some hashing function `H`, some weight function `f(w, hash)` and partition id `P_x`, we proceed\nas follows:\n\n- For every node `N_i` in the cluster with a weight `w_i`, we compute `R_i = f(w_i, H(concat(P_x,\nN_i)))`\n- We rank all nodes `N_i` belonging to the set of nodes `N` with respect to their rank value `R_i`.\n- For some replication factor `k`, we select the top `k` nodes to place the `k` replicas of the\n  partition with id `P_x`\n\n(We assume `k \u003c= |N|`; where `|N|` is the number of nodes and `k` is the number of replicas)\n\nWith this mechanism, anyone with the ids and weights of all the members in the cluster can compute\nthe destination nodes for the replicas of a partition. 
This knowledge can also be used to route\npartition requests to the appropriate nodes at both the client side and the server side.\n\nIn our case, we use client side load balancing to load balance all idempotent partition requests\nacross all the possible nodes where a replica of the request's partition can be present. For\nnon-idempotent requests, if we send one to any of the candidate nodes, that node redirects it to the\ncurrent leader of the replica set.\n\n### Supported execution models\n\n`laminarmq` supports two execution models:\n\n- General async execution model used by various async runtimes in the Rust ecosystem (e.g. `tokio`)\n- Thread per core execution model\n\nIn the thread-per-core execution model individual processor cores are limited to single threads.\nThis model encourages design that minimizes inter-thread contention and locks, thereby improving\ntail latencies in software services. Read: [The Impact of Thread per Core Architecture on\nApplication Tail Latency.](https://helda.helsinki.fi//bitstream/handle/10138/313642/tpc_ancs19.pdf?sequence=1)\n\nIn the thread per core execution model, we have to leverage application level partitioning such that\neach individual thread is responsible for a subset of requests and/or responsibilities. We also have\nto complement this model with proper routing of requests to the threads to ensure locality of\nrequests. In our case this translates to having each thread be responsible for only a subset of the\npartition replicas in a node. Requests pertaining to a partition replica are always routed to the\nsame thread. The following sections will go into more detail as to how this is achieved.\n\nWe realize that although the thread per core execution model has some inherent advantages, being\ncompatible with the existing Rust ecosystem will significantly increase adoption. 
Therefore, we have\ndesigned our system with reusable components which can be organized to suit both execution models.\n\n### Request routing in nodes\n\n#### General design\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-node-request-routing-general.svg\" alt=\"request-rouing-general\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Request routing mechanism in \u003ccode\u003elaminarmq\u003c/code\u003e nodes using the general execution\nmodel.\n\u003c/p\u003e\n\nIn our cluster, we have two kinds of requests:\n\n- **membership requests**: used by the gossip style service discovery system for maintaining cluster\n  membership.\n- **partition requests**: used to interact with `laminarmq` topic partitions.\n\nWe use an [eBPF](https://ebpf.io/what-is-ebpf/) XDP filter to classify request packets at the socket\nlayer into membership request packets and partition request packets. Next we use eBPF to route\nmembership packets to a different socket which is exclusively used by the membership management\nsubsystem in that node. The partition request packets are left to flow as is.\n\nNext we have an \"HTTP server\", which parses the incoming partition request packets from the original\nsocket into valid `partition::*` requests. For every `partition::*` request, the HTTP server spawns\na future to handle it. This request handler future does the following:\n\n- Create a new channel `(tx, rx)` for the request.\n- Send the parsed partition request along with send end of the channel `(partition::*, tx)` to the\n  \"Request Router\" over the request router's receiving channel.\n- Await on the recv. end of the channel created by this future for the response. 
`res = rx.await`\n- When we receive the response from this future's channel, we serialize it and respond back to the\n  socket we had received the packets from.\n\nNext we have a \"Request Router / Partition manager\" responsible for routing various requests to the\npartition serving futures. The request router unit receives both `membership::*` requests from the\nmembership subsystem and `partition::*` requests received from the \"HTTP server\" request handler\nfutures (also called request poller futures from here on since they poll for the response from the\nchannel recv. `rx` end). The request router unit routes requests as follows:\n\n- `membership::*` requests are broadcast to all the partition serving futures\n- `(partition::*_request(partition_id_x, …), tx)` tuples are routed to their destination partitions\n  using the `partition_id`.\n- `(partition::create(partition_id_x, …), tx)` tuples are handled by the request router / partition\n  manager itself. For this, the request router / partition manager creates a new partition serving\n  future, allocates the required storage units for it and sends an appropriate response on `tx`.\n\nFinally, the individual partition server futures receive both `membership::*` and `(partition::*,\ntx)` requests as they arrive at our node and are routed. 
They handle the requests as necessary and send a\nresponse back to `tx` where applicable.\n\n#### Thread per core execution model compatible design\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-node-request-routing-thread-per-core.svg\" alt=\"request-routing-thread-per-core\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Request routing mechanism in \u003ccode\u003elaminarmq\u003c/code\u003e nodes using the thread per core\nexecution model.\n\u003c/p\u003e\n\nIn the thread per core execution model each thread is responsible for a subset of the partitions.\nHence each thread has its own \"Request Router / Partition Manager\", \"HTTP Server\" and a set of\npartition serving futures. We run multiple such threads on different processor cores.\n\nNow, as discussed before we need to route individual requests to the correct destination thread to\nensure request locality. We use a dedicated \"Thread Router\" eBPF XDP filter to route partition\nrequest packets to their destination threads.\n\nThe \"Thread Router\" eBPF XDP filter keeps an eBPF `sockmap` which contains the sockets on which each of\nthe threads listens for requests. For every incoming request, we route it to its destination\nthread using this `sockmap`. Now we can again leverage rendezvous hashing here to determine the\nthread to be used for a request. We use the `partition_id` and `thread_id` for rendezvous hashing.\nSince all the threads run on different processor cores, they will have similar request handling\ncapacity and hence will have equal weights. Using this, requests belonging to a particular partition\nwill always be routed to the same thread on a particular node. This ensures a high level of request\nlocality.\n\nThe remaining components behave as discussed above. 
Notice how we are able to reuse the same\ncomponents in a drastically different execution model, as promised before.\n\n### Partition control flow and replication\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-partition-control-flow-and-replication.svg\" alt=\"partition-control-flow-replication\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Partition serving future control flow and partition replication mechanism in\n\u003ccode\u003elaminarmq\u003c/code\u003e\n\u003c/p\u003e\n\nThe partition controller future receives membership event requests `membership::{join, leave,\nupdate_weight}` or `(partition::*, tx)` requests from the request router future.\n\nThe partition request handler handles the different requests as follows:\n\n- Idempotent `partition::*_request`: performs the necessary idempotent operation on the underlying\n  segmented log and responds with the result on the response channel.\n\n- Non-idempotent `partition::*_request`: Leader and follower replicas handle non-idempotent requests\n  differently:\n\n  - Leader replicas: Replicate non-idempotent operations on all follower partitions in the Raft\n    group, and then apply the operation locally. This might involve\n    first sending a Raft append request, writing locally once a majority of replicas respond,\n    then committing it locally and finally relaying the commit to all other replicas. Leader replicas\n    respond to requests only from clients. Non-idempotent requests from follower replicas are\n    ignored.\n  - Follower replicas: Follower replicas respond to non-idempotent requests only from leaders.\n    Non-idempotent requests from clients are redirected to the leader. 
A follower replica handles\n    non-idempotent requests by applying the changes locally in accordance with Raft.\n\n  Once the replicas are done handling the request, they send back an appropriate response to the\n  response channel, if present. (A redirect response is also encoded properly and sent back to the\n  response channel).\n\n- `membership::join(i)`: add `{node #i}` to the local priority queue. If the required number of replicas\n  is more than the current number, `pop()` one member from the priority queue and add it to the Raft\n  group (making it an eligible candidate in the Raft leader election process). If the current\n  replica is a leader, we send a `partition::create(...)` request. If there is no leader among the\n  replicas, we initiate the leadership election process with each replica as a candidate.\n\n- `membership::leave(j)`: remove `{node #j}` from the priority queue and Raft group if present. If `{node\n#j}` was not present in the Raft group no further action is necessary. If it was present in the\n  Raft group, `pop()` another member from the priority queue, add it to the Raft group and proceed\n  similarly as in the case of `membership::join(j)`.\n\n- `membership::update_weight(k)`: updates the priority for `{node #k}` by recomputing the rendezvous hash\n  for `{node #k}` with this partition replica's `partition_id`. Next, if any node in the priority queue\n  has a higher priority than any of the nodes in the Raft group, the node with the least priority\n  is replaced by the highest priority element from the queue. We send a\n  `partition::remove(partition_id, ...)` request to `{node #k}`. Afterwards we proceed similarly\n  to `membership::{leave, join}` requests.\n\nWhen a node goes down the appropriate `membership::leave(i)` message (where `i` is the node that\nwent down) is sent to all the nodes in the cluster. The partition replica controllers in each node\nhandle the membership request accordingly. 
In effect:\n\n- For every leader partition in that node:\n  - if there are no other follower replicas in other nodes in its Raft group, that partition goes\n    down.\n  - if there are other follower replicas in other nodes, there are leader elections among them and\n    after a leader is elected, reads and writes for that partition proceed normally\n- For every follower partition in that node:\n  - the remaining replicas in the same Raft group continue to function in accordance with Raft's\n    mechanisms.\n\nFor each of the partition replicas on the node that went down, new host nodes are selected using\nrendezvous hash priority.\n\nIn our system, we use different Raft groups for different data buckets (replica groups).\n[CockroachDB](https://www.cockroachlabs.com/) and [TiKV](https://tikv.org) call this manner of using\ndifferent Raft groups for different data buckets on the same node MultiRaft.\n\nRead more here:\n\n- \u003chttps://tikv.org/deep-dive/scalability/multi-raft/\u003e\n- \u003chttps://www.cockroachlabs.com/blog/scaling-raft/\u003e\n\nEvery partition controller is backed by a `segmented_log` for persisting records.\n\n### Persistence mechanism\n\n#### `segmented_log`: Persistent data structure for storing records in a partition\n\nThe segmented-log data structure for storing records was originally described in the [Apache\nKafka](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf) paper.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-segmented-log.svg\" alt=\"segmented_log\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Data organisation for persisting the \u003ccode\u003esegmented_log\u003c/code\u003e data structure on a\n\u003ccode\u003e*nix\u003c/code\u003e file system.\n\u003c/p\u003e\n\nA segmented log is a collection of read segments and a single write segment. 
Each \"segment\" is\nbacked by a storage file on disk called \"store\".\n\nThe log is:\n\n- \"immutable\", since only \"append\", \"read\" and \"truncate\" operations are allowed. It is not possible\n  to update or delete records from the middle of the log.\n- \"segmented\", since it is composed of segments, where each segment services records from a\n  particular range of offsets.\n\nAll writes go to the write segment. A new record is written at `offset = write_segment.next_offset`\nin the write segment. When we max out the capacity of the write segment, we close the write segment\nand reopen it as a read segment. The re-opened segment is added to the list of read segments. A new\nwrite segment is then created with `base_offset` equal to the `next_offset` of the previous write\nsegment.\n\nWhen reading from a particular offset, we linearly check which read segment contains the given\noffset. If a segment capable of servicing a read from the given offset is found, we read from that\nsegment. If no such segment is found among the read segments, we default to the write segment. The\nfollowing scenarios may occur when reading from the write segment in this case:\n\n- The write segment has synced the messages including the message at the given offset. In this case\n  the record is read successfully and returned.\n- The write segment hasn't synced the data at the given offset. 
In this case the read fails with a\n  segment I/O error.\n- If the offset is out of bounds of even the write segment, we return an \"out of bounds\" error.\n\n#### `laminarmq` specific enhancements to the `segmented_log` data structure\n\nWhile the conventional `segmented_log` data structure is quite performant for a `commit_log`\nimplementation, it still requires the following properties to hold true for the record being\nappended:\n\n- We have the entire record in memory\n- We know the record bytes' length and record bytes' checksum before the record is appended\n\nIt's not possible to know this information when the record bytes are read from an asynchronous\nstream of bytes. Without the enhancements, we would have to concatenate intermediate byte buffers to\na vector. This would not only incur more allocations, but also slow down our system.\n\nHence, to accommodate this use case, we introduced an intermediate indexing layer to our design.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-indexed-segmented-log-landscape.svg\" alt=\"segmented_log\" /\u003e\n\u003c/p\u003e\n\n```text\n//! 
Index and position invariants across segmented_log\n\n// segmented_log index invariants\nsegmented_log.lowest_index  = segmented_log.read_segments[0].lowest_index\nsegmented_log.highest_index = segmented_log.write_segment.highest_index\n\n// record position invariants in store\nrecords[i+1].position = records[i].position + records[i].record_header.length\n\n// segment index invariants in segmented_log\nsegments[i+1].base_index = segments[i].highest_index = segments[i].index[index.len-1].index + 1\n```\n\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Data organisation for persisting the \u003ccode\u003esegmented_log\u003c/code\u003e data structure on a\n\u003ccode\u003e*nix\u003c/code\u003e file system.\n\u003c/p\u003e\n\nIn the new design, instead of referring to records with a raw offset, we refer to them with indices.\nThe index in each segment translates the record indices to raw file positions in the segment store\nfile.\n\nNow, the store append operation accepts an asynchronous stream of bytes instead of a contiguously\nlaid out slice of bytes. We use this operation to write the record bytes, and at the time of writing\nthe record bytes, we calculate the record bytes' length and checksum. Once we are done writing the\nrecord bytes to the store, we write its corresponding `record_header` (containing the checksum and\nlength), position and index as an `index_record` in the segment index.\n\nThis provides two quality of life enhancements:\n\n- Asynchronous streaming writes are allowed, without having to concatenate intermediate byte buffers\n- Records are accessed much more easily with easy to use indices\n\nNow, to prevent a malicious user from overloading our storage capacity and memory with a maliciously\ncrafted request which infinitely loops over some data and sends it to our server, we have provided\nan optional `append_threshold` parameter to all append operations. 
When provided, it prevents\nstreaming append writes from writing more bytes than the provided `append_threshold`.\n\nAt the segment level, this requires us to keep a segment overflow capacity. All segment append\noperations now use `segment_capacity - segment.size + segment_overflow_capacity` as the\n`append_threshold` value. A good `segment_overflow_capacity` value could be `segment_capacity / 2`.\n\n### Execution Model\n\n#### General async runtime (e.g. `tokio`, `async-std` etc.)\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-async-execution-model-general.svg\" alt=\"async-execution-model-general\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e General async runtime based execution model for \u003ccode\u003elaminarmq\u003c/code\u003e\n\u003c/p\u003e\n\nThis execution model is based on the executor, reactor, waker abstractions used by all Rust async\nruntimes. We don't have to specifically care about how and where a particular future is executed.\n\nThe data flow in this execution model is as follows:\n\n- An HTTP server future parses HTTP requests from the request socket\n- For every HTTP request it creates a new future to handle it\n- The HTTP handler future sends the request and a response channel tx to the request router via a channel.\n  It also awaits on the response rx end.\n- The request router future maintains a map of partition_id to designated request channel tx for each\n  partition controller future.\n- For every partition request received it forwards the request on the appropriate partition request\n  channel tx. 
If a `partition::create(...)` request is received it creates a new partition controller\n  future.\n- The partition controller future sends back the response to the provided response channel tx.\n- The response poller future receives it and responds back with a serialized response to the socket.\n\nAll futures are spawned using the async runtime's designated `{…}::spawn(…)` method. We don't have\nto specify any details as to how and where the future's corresponding task will be executed.\n\n#### Thread per core async runtime (e.g. `glommio`)\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/arindas/laminarmq/assets/assets/diagrams/laminarmq-async-execution-model-thread-per-core.svg\" alt=\"async-execution-model-thread-per-core\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Thread per core async runtime based execution model for \u003ccode\u003elaminarmq\u003c/code\u003e\n\u003c/p\u003e\n\nIn the thread per core model, since each processor core is limited to a single thread, tasks in a\nthread need to be scheduled efficiently. Hence each worker thread runs its own task scheduler.\n\nWe currently use [`glommio`](https://docs.rs/glommio) as our thread-per-core runtime.\n\nHere, tasks can be scheduled on different task queues and different task queues can be provisioned\nwith specific fractions of CPU time shares. Generally tasks with similar latency profiles are\nexecuted on the same task queue. For instance web server tasks will be executed on a different queue\nthan the one that runs tasks for persisting data to the disk.\n\nWe re-use the same constructs that we use in the general async runtime execution model. The only\ndifference is that we explicitly care about the task queue in which a class of future's tasks are\nexecuted. 
In our case, we have the following 4 task queues:\n\n- Request router task queue\n- HTTP server request parser task queue\n- Partition replica controller task queue\n- Response poller task queue\n\nEach of these task queues can be assigned specific fractions of CPU time shares. `glommio` also\nprovides utilities for automatically deducing these CPU time shares based on their runtime latency\nprofiles.\n\nApart from this, `glommio` leverages the Linux 5.x [`io_uring`](https://kernel.dk/io_uring.pdf)\nAPI, which facilitates true asynchronous IO for both networking and disk interfaces. (Other `async`\nruntimes such as [`tokio`](https://docs.rs/tokio) make blocking system calls for disk IO operations\nin a thread-pool.)\n\n`io_uring` also has the advantage of being able to queue multiple system calls together and\nthen asynchronously wait for their completion by making a maximum of one context switch. It is also\npossible to avoid context switches altogether. This is achieved with a pair of ring buffers called\nthe submission queue and the completion queue. Once the queues are set up, the user can queue multiple\nsystem calls on the submission queue. The Linux kernel processes the system calls and places the\nresults in the completion queue. The user can then freely read the results from the\ncompletion queue. This entire process after setting up the queues doesn't require any additional\ncontext switch.\n\nRead more: \u003chttps://man.archlinux.org/man/io_uring.7.en\u003e\n\n`glommio` presents additional abstractions on top of `io_uring` in the form of an async runtime,\nwith support for networking, disk IO, channels, single threaded locks and more.\n\nRead more: \u003chttps://www.datadoghq.com/blog/engineering/introducing-glommio/\u003e\n\n## Testing\n\nYou may run tests with `cargo` as you would for any other crate. 
However, since `laminarmq` is\npoised to support multiple runtimes, some of them might require some additional setup before running\nthe tests.\n\nFor instance, the `glommio` async runtime requires an updated Linux kernel (at least 5.8) with\n`io_uring` support. `glommio` also requires at least 512 KiB of locked memory for `io_uring` to\nwork. (Note: 512 KiB is the minimum needed to spawn a single executor. Spawning multiple executors\nmay require you to raise the limit accordingly. I recommend 8192 KiB on an 8 GiB RAM machine.)\n\nFirst, check the current `memlock` limit:\n\n```sh\nulimit -l\n\n# 512 ## sample output\n```\n\nIf the `memlock` resource limit (rlimit) is less than 512 KiB, you can increase it as follows:\n\n```sh\nsudo vi /etc/security/limits.conf\n# add the following lines to the file (values are in KiB):\n*    hard    memlock        512\n*    soft    memlock        512\n```\n\nTo make the new limits effective, you need to log in to the machine again. Verify that the limits\nhave taken effect with `ulimit` as described above.\n\n\u003e (On old WSL versions, you might need to spawn a login shell every time for the limits to be\n\u003e reflected:\n\u003e\n\u003e ```sh\n\u003e su ${USER} -l\n\u003e ```\n\u003e\n\u003e The limits persist once inside the login shell. This is not necessary on the latest WSL2 version as\n\u003e of 22.12.2022)\n\nFinally, clone the repository and run the tests:\n\n```sh\ngit clone https://github.com/arindas/laminarmq.git\ncd laminarmq/\ncargo test\n```\n\n## Benchmarking\n\nSame pre-requisites as testing. 
Once the pre-requisites are satisfied you may\nrun benchmarks with `cargo` as usual:\n\n```sh\ngit clone https://github.com/arindas/laminarmq.git\ncd laminarmq/\ncargo bench\n```\n\nThe complete latest benchmark reports are available at \u003chttps://arindas.github.io/laminarmq/bench/latest/report\u003e.\n\nAll benchmarks in the reports have been run on a machine (HP Pavilion x360 Convertible 14-ba0xx) with:\n\n- 4 core CPU (Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz)\n- 8GB RAM (SK Hynix HMA81GS6AFR8N-UH DDR4 2133 MT/s)\n- 128GB SSD storage (SanDisk SD8SN8U-128G-1006)\n\n### Selected Benchmark Reports\n\nThis section presents some selected benchmark reports.\n\n\u003e **Note**: We use the following names for different record sizes:\n\u003e\n\u003e \u003ctable\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003cb\u003esize_name\u003c/b\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003cb\u003esize\u003c/b\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003cb\u003ecomments\u003c/b\u003e\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003etiny\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e12 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003enone\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003etweet\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e140 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003enone\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003ehalf_k\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e560 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e≈ 512 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        
\u003ctd\u003e\u003ccode\u003ek\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e1120 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e≈ 1024 bytes (1 KiB)\u003c/code\u003e\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003elinked_in_post\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e2940 bytes\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e≤ 3000 bytes (3 KB)\u003c/code\u003e\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e    \u003ctr\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003eblog\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e11760 bytes (11.76 KB)\u003c/code\u003e\u003c/td\u003e\n\u003e        \u003ctd\u003e\u003ccode\u003e4X linked_in_post\u003c/code\u003e\u003c/td\u003e\n\u003e    \u003c/tr\u003e\n\u003e \u003c/table\u003e\n\n#### `commit_log` write benchmark with 1KB messages\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://svg-add-bg-fn.vercel.app/?svg=https://arindas.github.io/laminarmq/bench/latest/commit_log_append_with_k_message/report/lines.svg\" alt=\"k-message-write-bench\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Comparing Time taken v/s Input size in bytes (lower is better) across storage back-ends\n\u003c/p\u003e\n\nView this benchmark report in more detail [here](https://arindas.github.io/laminarmq/bench/latest/commit_log_append_with_k_message/report/index.html)\n\nThis benchmark measures the time taken to write messages of size 1KB across different `commit_log` storage back-ends.\n\nWe also profile our implementation across different storage backends. 
Here's a\nprofile using the\n[`DmaStorage`](https://arindas.github.io/laminarmq/docs/laminarmq/storage/impls/glommio/storage/dma/struct.DmaStorage.html)\nbackend.\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://arindas.github.io/laminarmq/bench/latest/commit_log_append_with_k_message/glommio_dma_file_segmented_log/10000/profile/flamegraph.svg\"\u003e\n\u003cimg src=\"https://arindas.github.io/laminarmq/bench/latest/commit_log_append_with_k_message/glommio_dma_file_segmented_log/10000/profile/flamegraph.svg\" alt=\"flamegraph\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Flamegraph for 10,000 writes of 1KB messages on DmaStorage backend\n\u003c/p\u003e\n\nAs you can see, a lot of time is spent simply hashing the request bytes.\n\n#### `segmented_log` streaming read benchmark with 1KB messages\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://svg-add-bg-fn.vercel.app/?svg=https://arindas.github.io/laminarmq/bench/latest/segmented_log_read_stream_with_k_message/report/lines.svg\" alt=\"k-message-read-bench\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Comparing Time taken v/s Input size in bytes (lower is better) across storage back-ends\n\u003c/p\u003e\n\nView this benchmark report in more detail [here](https://arindas.github.io/laminarmq/bench/latest/segmented_log_read_stream_with_k_message/report/index.html)\n\nThis benchmark measures the time taken for streaming reads on messages of size\n1KB across different `segmented_log` storage back-ends.\n\nWe also profile our implementation across different storage backends. 
Here's a\nprofile using the\n[`DmaStorage`](https://arindas.github.io/laminarmq/docs/laminarmq/storage/impls/glommio/storage/dma/struct.DmaStorage.html)\nbackend.\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://arindas.github.io/laminarmq/bench/latest/segmented_log_read_stream_with_k_message/glommio_dma_file_segmented_log/10000/profile/flamegraph.svg\"\u003e\n\u003cimg src=\"https://arindas.github.io/laminarmq/bench/latest/segmented_log_read_stream_with_k_message/glommio_dma_file_segmented_log/10000/profile/flamegraph.svg\" alt=\"flamegraph\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cb\u003eFig:\u003c/b\u003e Flamegraph for 10,000 reads of 1KB messages on DmaStorage backend\n\u003c/p\u003e\n\nIn this case, more time is spent on system calls and I/O.\n\nThe remaining benchmark reports are available at \u003chttps://arindas.github.io/laminarmq/bench/latest/report\u003e.\n\n## License\n\n`laminarmq` is licensed under the MIT License. See\n[License](https://raw.githubusercontent.com/arindas/laminarmq/main/LICENSE) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farindas%2Flaminarmq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farindas%2Flaminarmq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farindas%2Flaminarmq/lists"}