{"id":13632854,"url":"https://github.com/spacejam/tla-rust","last_synced_at":"2025-04-04T17:07:37.909Z","repository":{"id":46240961,"uuid":"83734111","full_name":"spacejam/tla-rust","owner":"spacejam","description":"writing correct lock-free and distributed stateful systems in Rust, assisted by TLA+","archived":false,"fork":false,"pushed_at":"2017-05-23T11:03:58.000Z","size":745,"stargazers_count":1043,"open_issues_count":1,"forks_count":26,"subscribers_count":44,"default_branch":"master","last_synced_at":"2024-10-29T17:31:57.766Z","etag":null,"topics":["distributed","lock-free","model-checking","rust","tla"],"latest_commit_sha":null,"homepage":"","language":"TLA","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spacejam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-02T23:08:14.000Z","updated_at":"2024-10-23T00:41:37.000Z","dependencies_parsed_at":"2022-08-30T18:01:24.458Z","dependency_job_id":null,"html_url":"https://github.com/spacejam/tla-rust","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spacejam%2Ftla-rust","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spacejam%2Ftla-rust/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spacejam%2Ftla-rust/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spacejam%2Ftla-rust/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spacejam","download_url":"https://codeload.github.com/spacejam/tla-rust/tar.gz/refs/heads/master","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":246945338,"owners_count":20858931,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed","lock-free","model-checking","rust","tla"],"created_at":"2024-08-01T22:03:19.620Z","updated_at":"2025-04-04T17:07:37.887Z","avatar_url":"https://github.com/spacejam.png","language":"TLA","funding_links":[],"categories":["TLA","Tla"],"sub_categories":[],"readme":"# tla+rust → \u003cimg src=\"parrot.gif\" width=\"48\" height=\"48\" /\u003e\n\nStable stateful systems through modeling, linear types and simulation.\n\nI like to use things that wake me up at 4am as rarely as possible.\nUnfortunately, infrastructure vendors don't focus on reliability.\nEven if a company gives reliability lip service, it's unlikely that they\nuse techniques like modeling or simulation to create a rock-solid core.\nLet's just build an open-source distributed store that takes correctness\nseriously at the local storage, sharding, and distributed transactional layers.\n\nMy goal: verify core lock-free and distributed algorithms in use\nwith [rsdb](http://github.com/spacejam/rsdb) and\n[rasputin](http://github.com/disasters/rasputin) with TLA+. Write\nan implementation in Rust. Use quickcheck and abstracted RPC/clocks\nto simulate partitions and test correctness under failure conditions.\n\n##### table of contents\n1. 
[motivations for doing this at all](#motivations)\n  - [x] [what do the words \"simulate\" and \"model\" mean in this context?](#terminology)\n  - [x] [why use Rust?](#why-rust)\n  - [x] [why model?](#why-model)\n  - [x] [why simulate?](#why-simulate)\n2. [introductions to TLA+, PlusCal, quickcheck](#introductions)\n  - [x] [intro: specifying concurrent processes with pluscal](#here-we-go-jumping-into-pluscal)\n  - [ ] [useful primitives for modeling concurrent and distributed algorithms](#useful-primitives)\n3. lock-free algorithms for efficient local storage\n  - [ ] [lock-free ring buffer](#lock-free-ring-buffer)\n  - [ ] [lock-free list](#lock-free-list)\n  - [ ] [lock-free stack](#lock-free-stack)\n  - [ ] [lock-free radix tree](#lock-free-radix-tree)\n  - [ ] [lock-free IO buffer](#lock-free-io-buffer)\n  - [ ] [lock-free epoch-based garbage collector](#lock-free-epoch-based-gc)\n  - [ ] [lock-free pagecache](#lock-free-pagecache)\n  - [ ] [lock-free tree](#lock-free-tree)\n4. consensus within a shard\n  - [ ] [the harpoon consensus protocol](#harpoon-consensus)\n5. sharding operations\n  - [ ] [shard splitting](#shard-splitting)\n  - [ ] [shard merging](#shard-merging)\n6. distributed transactions\n  - [ ] [a lock-free distributed transaction protocol](#cross-shard-lock-free-transactions)\n\n# motivations\n## terminology\n*Simulation*, in this context, refers to writing tests that exercise RPC-related\ncode by simulating a buggy network over time, partitions and all. 
Many more\nfailures may be tested per unit of compute time using simulation compared to\nblack-box fault injection with something like Namazu, Jepsen, or Blockade.\n\n*Modeling*, in this context, refers to the use of the TLA+ model checker\nto ensure the correctness of our lock-free and distributed algorithms.\n\n## why rust?\nRust is a new systems programming language that emphasizes memory safety.\nIt is notable for its compiler, which is able to make several types of\ncommon memory corruption bugs (and attack vectors for exploits) impossible\nto create by default, without relying on GC. It is a Mozilla project,\nand as of this writing, it is starting to be included in their Firefox\nweb browser.\n\nIt uses an \"ownership\" system that ensures an object's\ndestructor will run exactly once, preventing double-frees, dangling pointers,\nvarious null pointer related bugs, etc...\nWhen an object is created inside a function's scope, it exists as the property\nof that scope. The object's lifetime is the same as the lifetime of the scope\nthat created it.\n\nWhen the lifetime of an object is over, the object's destructor is run.\nWhen you pass an object to a function as an argument, that object becomes\nthe property of the called function, and when the called function\nreturns, the objects in its possession will be destroyed unless the function\nis returning them. Objects returned from a function become the property\nof the calling scope.\n\nIn order to pass an object to several functions, you may instead pass a\nreference. By passing a reference, the object remains the property of the\ncurrent scope. It is possible to create references that imply sole ownership,\ncalled mutable references, which may be used to, you guessed it, mutate\nthe object being referred to. 
This is useful for using an object with\na function that will mutate it, without the object becoming\nthe property of that function, and allowing the object to outlive the\nmutating function.\nWhile only a single mutable reference may be created, infinite immutable\nreferences may be created, so long as they do not outlive the object\nthat the reference points to.\n\nRust does not use GC by default. However, it does have several\ncontainer types that rely on reference counting for preventing an\nobject's destructor from being called multiple times. These are useful\nfor sharing things with multiple scopes and multiple threads. These\nobjects are generally rare compared to the total number of objects\ncreated in a typical Rust program. The lack of GC for\nevery object may be a compelling feature for those creating\nhigh-performance systems. Many such systems are currently written\nin C and C++, which have a long track record of buggy and insecure\ncode, even when written by security-conscious life-long practitioners.\n\nRust has the potential to make high-performance, widely-deployed\nsystems much more secure and crash less frequently. This means web\nbrowsers, SSL libraries, operating systems, networking stacks,\ntoasters and many vital systems that are much harder to hack and more\nrobust against common bugs.\n\nFor databases, the memory safety benefits are wonderful, and I'm betting\non being able to achieve faster long-term iteration by not spending\nso much time chasing down memory-related bugs. However, it needs to be\nnoted that when creating lock-free high-performance algorithms, we\nare going to need to sidestep the safety guarantees of the compiler.\nOur goal is to create data structures that are mutated using atomic\ncompare-and-swap (CAS) operations by multiple threads simultaneously,\nand also supporting reads at the same time. We choose not to sacrifice\nperformance by using Mutexes. 
This means using Rust's\n`Box::into_raw`/`from_raw`, `AtomicPtr`, unsafe pointers and `mem::forget`.\nWe are giving up a significant benefit of Rust for certain very\nhigh-performance chunks of this system. In place of Rust's compiler,\nwe use the TLA+ model checker to gain confidence in the correctness\nof our system!\n\n## why model?\nTLA+ allows us to specify and verify algorithms in very few lines, compared to\nthe programming language that we will use to implement and test it.\nIt is a tool that is frequently mentioned by engineers of stateful\ndistributed systems, but it has been used by relatively few, and has\na reputation for being overkill. I believe that this reputation is\nunfounded for this type of work.\n\nMany systems are not well understood by their creators at the start of the\nproject, which leads to architectural strain as assumptions are invalidated\nand the project continues to grow over time.\nSmall projects are often cheaper to complete using this approach,\nas an incorrect initial assumption may have a lower long-term impact.\nStateful distributed systems tend to have significant costs associated\nwith unanticipated changes in architecture: reliability, iteration time,\nand performance can be expected to take hits. For our system, we will\nspecify the core algorithms before implementing them, which will\nallow us to catch mistakes before they result in bugs or outages.\n\n## why simulate?\nWe want to make sure that our implementation is robust against\nnetwork partitions, disk failures, NTP issues, etc...\nSo, why not run Namazu, Jepsen, or Blockade? They have great success with\nfinding bugs in databases! However, it is far slower to perform black-box\nfault injection than simulation. A simulator can artificially advance\nthe clocks of a system to induce a leader election, while a \"real\" cluster\nhas to wait real time to trigger certain logic. 
It also takes a lot of\ntime to deploy new code to a \"real\" cluster, and it is cumbersome to\nintrospect.\n\nSimulation is not a replacement for black-box testing.\nSimulation will be biased, and it's up to the implementor\nof the simulator to ensure that all sources of time, IPC, and other\ninteraction are sufficiently encapsulated by the artificial time and\ninteraction logic.\n\nSimulation can allow a contributor working on a more resource-constrained\nsystem to test locally, running through thousands or millions of\nfailure situations in the time that it takes to create the RPM/container\nthat is then fed to a black-box fault injection system. A CI/CD pipeline\ncan get far more test coverage per unit of compute time using simulation\nthan with black-box fault injection.\n\nBoth simulation and black-box fault injection can be constrained\nto complete in a certain amount of time, but simulation will\nlikely find a lot more bugs per unit of compute time. Simulation\ntests may be a reasonable thing to expect to pass for most pull\nrequests, since they can achieve a high bug:compute time ratio.\nHowever, black box fault injection is still important, and will\nprobably catch bugs arising from the bias of the simulation authors.\n\nWe will also use black-box testing, but we will spend less time talking about\nit due to its decent existing coverage.\n\n# introductions\n\nWe want to use TLA+ to model and find bugs in things like:\n\n* CAS operations on lists, ring buffers, and radix trees for lock-free local systems\n* paxos-like consensus for leadership, replication and shard management systems\n* lock-free distributed transactions\n\nDistributed and concurrent algorithms have many similarities, but there are\nsome key differences in the primitives that we build on in our work.\nConcurrent algorithms can rely on atomic CAS primitives, as achieving sequentially\nconsistent access semantics is fairly well understood and implemented\nat this point. 
The distributed systems world has many databases that provide strong\nordering semantics, but it doesn't have such a reliable, standard primitive\nas CAS that we can simply assume to be present. So we need to initially work\nin terms of the \"asynchronous communication model\" in which messages between\nany two processes can be reordered and arbitrarily delayed, or dropped\naltogether. After we have proved our own model for achieving consistency,\nwe will build on it in later higher-level models that describe particularly\ninteresting functionality such as lock-free distributed transactions.\n\nIn our TLA+ models, we can simply use a fairly short labeled block that performs\nthe duties of compare and swap (or another atomic operation) on shared state\nwhen describing a concurrent algorithm, but we will need to build a complete\nreplicated log primitive before we can work at a similar level of abstraction\nin our models of distributed algorithms.\n\nSo, let's learn how to describe some of our primitives and invariants!\n\n\n## here we go... jumping into pluscal\n\nThis is a summary of an example from\n[a wonderful primer on TLA+](https://www.learntla.com/introduction/example/)...\n\nThe first thing to know is that there are two languages in play: pluscal and TLA.\nWe test models using `tlc`, which understands most of TLA (not infinite sets, maybe\nother stuff). TLA started as a specification language, tlc came along later to\nactually test it, and pluscal is a simpler language that can be transpiled into\nTLA. Pluscal has two forms, `c` and `p`. 
They are functionally identical, but\n`c` form uses braces and `p` form uses `begin` and `end` statements that can be a\nlittle easier to spot errors with, in my opinion.\n\nWe're writing Pluscal in a TLA comment (block comments are written\nwith `(* \u003ccomment text\u003e *)`), and when we run a translator like `pcal2tla`\nit will insert TLA after the comment, in the same file.\n\n```tla\n------------------------------- MODULE pcal_intro -------------------------------\nEXTENDS Naturals, TLC\n\n(* --algorithm transfer\nvariables alice_account = 10, bob_account = 10,\n          account_total = alice_account + bob_account\n\nprocess TransProc \\in 1..2\n  variables money \\in 1..20;\nbegin\n  Transfer:\n    if alice_account \u003e= money then\n      A: alice_account := alice_account - money;\n      B: bob_account := bob_account + money;\n    end if;\nC: assert alice_account \u003e= 0;\nend process\n\nend algorithm *)\n\n\\* this is a TLA comment. pcal2tla will insert the transpiled TLA here\n\nMoneyInvariant == alice_account + bob_account = account_total\n\n=============================================================================\n```\n\nThis code specifies 3 global variables, `alice_account`, `bob_account`, `account_total`.\nIt specifies, using `process \u003cname\u003e \\in 1..2` that it will run in two concurrent processes.\nEach concurrent process has local state, `money`, which may take any initial value from\n1 to 20, inclusive.  It defines steps `Transfer`, `A`, `B` and `C` which are evaluated as\natomic units, although they will be tested against all possible interleavings of execution.\nAll possible values will be tested.\n\nLet's save the above example as `pcal_intro.tla`, transpile the pluscal comment to TLA,\nthen run it with tlc! (if you want to name it something else, update the MODULE\nspecification at the top)\n\n```\npcal2tla pcal_intro.tla\ntlc pcal_intro.tla\n```\n\nBOOM! 
This blows up because our transaction code sucks, big time:\n\n```\nThe first argument of Assert evaluated to FALSE; the second argument was:\n\"Failure of assertion at line 16, column 4.\"\nError: The behavior up to this point is:\nState 1: \u003cInitial predicate\u003e\n/\\ bob_account = 10\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = 10\n/\\ pc = \u003c\u003c\"Transfer\", \"Transfer\"\u003e\u003e\n/\\ account_total = 20\n\nState 2: \u003cAction line 35, col 19 to line 40, col 42 of module pcal_intro\u003e\n/\\ bob_account = 10\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = 10\n/\\ pc = \u003c\u003c\"A\", \"Transfer\"\u003e\u003e\n/\\ account_total = 20\n\nState 3: \u003cAction line 35, col 19 to line 40, col 42 of module pcal_intro\u003e\n/\\ bob_account = 10\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = 10\n/\\ pc = \u003c\u003c\"A\", \"A\"\u003e\u003e\n/\\ account_total = 20\n\nState 4: \u003cAction line 42, col 12 to line 45, col 63 of module pcal_intro\u003e\n/\\ bob_account = 10\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = 9\n/\\ pc = \u003c\u003c\"B\", \"A\"\u003e\u003e\n/\\ account_total = 20\n\nState 5: \u003cAction line 47, col 12 to line 50, col 65 of module pcal_intro\u003e\n/\\ bob_account = 11\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = 9\n/\\ pc = \u003c\u003c\"C\", \"A\"\u003e\u003e\n/\\ account_total = 20\n\nState 6: \u003cAction line 42, col 12 to line 45, col 63 of module pcal_intro\u003e\n/\\ bob_account = 11\n/\\ money = \u003c\u003c1, 10\u003e\u003e\n/\\ alice_account = -1\n/\\ pc = \u003c\u003c\"C\", \"B\"\u003e\u003e\n/\\ account_total = 20\n\nError: The error occurred when TLC was evaluating the nested\nexpressions at the following positions:\n0. Line 52, column 15 to line 52, column 28 in pcal_intro\n1. 
Line 53, column 15 to line 54, column 66 in pcal_intro\n\n\n9097 states generated, 6164 distinct states found, 999 states left on queue.\nThe depth of the complete state graph search is 7.\n```\n\nLooking at the trace that tlc outputs, it shows us how alice's account may become\nnegative. Because processes 1 and 2 execute the steps sequentially but with\ndifferent interleavings, the algorithm will check `alice_account \u003e= money`\nbefore trying to transfer it to bob. By the time one process subtracts the\nmoney from alice, however, the other process may have already done so. We can\nspecify that these steps and checks happen atomically by changing:\n\n```\n  Transfer:\n    if alice_account \u003e= money then\n      A: alice_account := alice_account - money;\n      B: bob_account := bob_account + money;\n    end if;\n```\n\nto\n\n```\n  Transfer:\n    if alice_account \u003e= money then\n      \\* remove the labels A: and B:\n      alice_account := alice_account - money;\n      bob_account := bob_account + money;\n    end if;\n```\n\nwhich means that the entire `Transfer` step is atomic. In reality, maybe this is done\nby punting this atomicity requirement to a database transaction. Re-running tlc should\nproduce no errors now, because both processes atomically check + deduct + add balances\nto the bank accounts without violating the assertion.\n\nThe invariant, `MoneyInvariant`, at the bottom is not actually being checked yet.\nInvariants are specified in TLA, not in the pluscal comment. They can be checked\nby creating a `pcal_intro.cfg` file (or replace the one auto-generated by pcal2tla)\nwith the following content:\n\n```\nSPECIFICATION Spec\nINVARIANT MoneyInvariant\n```\n\n## useful primitives\nSo, we've seen how to create labels, processes, and invariants. 
Here are some other\nuseful primitives:\n\nawait\nbags\nEXTENDS Naturals, FiniteSets, Sequences, Integers, TLC\n\nFor a more in-depth TLA+ introduction, refer to [the tutorial that this\nwas summarized from](http://www.learntla.com) and\n[the manual](http://lamport.azurewebsites.net/tla/p-manual.pdf).\n\n# lock-free algorithms for efficient local storage\n\nIn the interests of achieving a price-performance that is compelling,\nwe need to make this thing sympathetic to modern hardware. Check out\n[Dmitry's wonderful blog](http://www.1024cores.net/home/lock-free-algorithms)\nfor a fast overview of the important ideas in writing scalable code.\n\n## lock-free ring buffer\n\nThe ring buffer is at the heart of several systems in our local storage system.\nIt serves as the core of our concurrent persistent log IO buffer and the\nepoch-based garbage collector for our logical page ID allocator.\n\n## lock-free list\nThe list allows us to CAS a partial update to a page into a chain, avoiding\nthe work of rewriting the entire page. To read a page, we traverse its list\nuntil we learn about what we sought. Eventually, we need to compact the list\nof partial updates to improve locality, probably around 4-8.\n\n## lock-free stack\nThe stack allows us to maintain a free list of page identifiers. Our radix\ntree needs to be very densely populated to achieve a favorable data to\npointer ratio, and by reusing page identifiers after they are freed, we\nare able to keep it dense. Hence this stack. When we free a page, we push\nits identifier into this stack for reuse.\n\n## lock-free radix tree\nWe use a radix tree for maintaining our in-memory mapping from logical\npage ID to its list of partial updates. A well-built radix tree can\nachieve a .92 total size:data ratio when densely populated and using a\ncontiguous key range. This is way better than what we get with B+ trees,\nwhich max out between .5-.6. 
The downside is that with low density we\nget extremely poor data:pointer ratios with a radix tree.\n\n## lock-free IO buffer\nWe use a ring buffer to hold buffers for writing data onto the disk, along\nwith associated metadata about where on disk the buffer will end up.\nThis is fraught with peril. We need to avoid ABA problems in the CAS that\nclaims a particular buffer, and later relies on a particular log offset.\nWe also need to avoid creating a stall when all available buffers are\nclaimed, and a write depends on flushing the end of the buffer before the\nbeginning is free. Possible ways of avoiding this: fail reservation attempts\nwhen the buffer is full of claims, or support growing the buffer when necessary.\nBandaid: don't seal the entire buffer during commit of reservation.\n\n## lock-free epoch-based GC\nThe basic idea for epoch-based GC is that in our lock-free structures,\nwe may end up making certain data inaccessible via a CAS on a node somewhere,\nbut that doesn't mean that there isn't already some thread that is operating\non it. We use epochs to track when a structure is marked inaccessible, as\nwell as when threads begin and end operating on shared state. Before\nreading or mutating the shared state, a thread \"enrolls\" in an epoch.\nIf the thread makes some state inaccessible, it adds it to the\ncurrent epoch's free list. The current epoch may be later than the\nepoch that the thread initially enrolled in. The state is not dropped\nuntil there are no threads in epochs before or at the epoch where\nthe state was marked free. When a thread stops reading or mutating\nthe shared state, it leaves the epoch that it enrolled in.\n\n\n## lock-free pagecache\nMaintains a radix tree mapping from logical page ID to a list of page updates,\nterminated by a base page. Uses the epoch-based GC for safely making logical IDs\navailable in a stack. 
Facilitates atomic splits and merges of pages.\n\n## lock-free tree\nUses the pagecache to store B+ tree pages.\n\n# consensus within a shard\nWe use a consensus protocol as the basis of our replication across a shard.\nConsensus notes:\n\n1. support OLTP with small replication batch size\n1. support batch loading and analytical jobs with large replication batch size\n1. for max throughput with a single shard, send disparate 1/N of the batch to\n   each other node, and then have them all forward their chunk to everybody else\n1. but this adds complexity, and if each node has several shards, we are already\n   spreading the IO around, so we can just pick the latency-minimizing simple\n   broadcast where the leader sends full batches to all followers.\n1. TCP is already a replicated log, hint hint\n1. UDP may be nice for receiving acks, but it doesn't work in a surprising number of DCs\n\n## harpoon consensus\nSimilar to raft, but uses leader leases instead of a paxos register. The paxos\nregister preemptable election of raft is vulnerable to livelock in the not-unusual\ncase of a network partition between a leader and another node, which triggers a\ndueling candidate situation. Using leases allows us to make progress as long as a\nnode has connectivity with a majority of its quorum, regardless of interfering nodes.\nIn addition, a node that cannot reach a leader may subscribe to the replication log\nof any other node which has seen more successful log entries.\n\n# sharding operations\nSharding has these ideals:\n\n1. avoid unnecessary data movement (wait some time before replacing a failed node)\n2. if multiple nodes fail simultaneously, minimize chances of dataloss ([chainsets](https://github.com/rescrv/HyperDex/blob/8d8ca6781cdfa6b72869c466caa32f076576c43d/coordinator/replica_sets.cc#L71))\n3. 
minimize MTTR when a node fails (lots of shards per machine, reduce membership overlap)\n\nideals 2 and 3 are somewhat in tension, but there is a goldilocks zone.\n\nSharding introduces the question of \"who manages the mapping?\"\nThis is sometimes punted to an external consensus-backed system.\nWe will initially create this by punting the metadata problem to\nsuch a system. Eventually, we will go single-binary with something\nlike the following:\n\nIf we treat shard metadata as just another range, how do we prevent\nsplit brain?\n\nGeneral initialization and key metadata:\n\n1. nodes are configured with a set of \"seed nodes\" to initially connect to\n1. cluster is initialized when some node is explicitly given permission to do so, either\n   via argv, env var, conf file or admin REST api request\n1. the designated node creates an initial range in an underreplicated state\n1. the metadata range contains a mapping from range to current assigned members\n1. as this node learns of others via the seeds, it assigns peers to the initial range\n1. if the metadata range (or any other range) loses quorum, a particular minority survivor\n   can be manually chosen as a seed for fresh replication. the admin api can also trigger\n   backup dumps for a range, and restoration of a range from a backup file.\n1. nodes each maintain their own monotonic counters, and publish a few basic stats about\n   their ranges and utilization using a shared ORSWOT\n\n## shard splitting\nSplit algorithm:\n\n1. as operations happen in a range, we keep track of the max and min keys,\n   and keep a running average for the position between max and min of inserts.\n   We then choose a split point around there. If keys are always added to one end,\n   the split should occur at the end.\n1. record split intent in watched meta range at the desired point\n1. record the split intent in the replicated log for the range\n1. 
all members of the replica set split their metadata when they see\n   the split intent in their replicated log\n1. the half on the less dense side of the split point is migrated\n   by changing consensus participants one node at a time.\n1. once the two halves have a balanced placement, the split intent is removed\n\n## shard merging\nMerge algorithm:\n\n1. merge intent written to metadata range\n1. the smaller range is to move to the larger range's servers\n1. this direction is marked at the time of intent, to prevent flapping\n1. once the ranges are colocated, in the larger range's replicated log,\n   write a merge intent, which causes it to accept keys in the new range\n1. write a merge intent into the less frequently accessed range's replicated\n   log that causes it to redirect reads and writes to the larger range.\n1. update the metadata range to reflect the merge\n1. remove handlers and metadata for old range\n\n# distributed transactions\n## cross-shard lock-free transactions\nRelatively simple lock-free distributed transactions:\n\n1. read all involved data\n1. create txn object somewhere\n1. CAS all involved data to refer to the txn object and the conditionally mutated state\n1. CAS the txn object to successful\n1. (can crash here and the txn is still valid)\n1. CAS all affected data to replace the value, and remove the txn reference\n\nreaders at any point will CAS a txn object to aborted if they encounter an in-progress\ntxn on something they are reading. if the txn object is successful, the reader needs\nto CAS the object's conditionally mutated state to be the present state, and nuke the\ntxn reference, before continuing.\n\nThis can be relaxed to just intended writers, but then our isolation level goes from SSI\nto SI and we are vulnerable to write skew.\n\nInvariants:\n\n1. 
must never see any intermediate states, a transaction must be entirely\n   committed or entirely invisible.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspacejam%2Ftla-rust","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspacejam%2Ftla-rust","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspacejam%2Ftla-rust/lists"}