{"id":16420633,"url":"https://github.com/maxpoletaev/kivi","last_synced_at":"2025-03-21T03:33:10.621Z","repository":{"id":64754462,"uuid":"527207704","full_name":"maxpoletaev/kivi","owner":"maxpoletaev","description":"Dynamo-inspired distributed leader-less key-value database that has no unique features and no apparent reason to exist ","archived":false,"fork":false,"pushed_at":"2023-11-18T20:39:46.000Z","size":1128,"stargazers_count":42,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-17T20:39:03.162Z","etag":null,"topics":["database","distributed-systems","golang","gossip","key-value","key-value-store","leaderless","lsm-tree","masterless","replication","sstables"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxpoletaev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-21T13:00:28.000Z","updated_at":"2025-03-15T23:21:30.000Z","dependencies_parsed_at":"2023-02-10T23:31:05.966Z","dependency_job_id":"3e26505b-c5c9-42fc-a266-3d5e52c7ce04","html_url":"https://github.com/maxpoletaev/kivi","commit_stats":null,"previous_names":["maxpoletaev/kiwi"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxpoletaev%2Fkivi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxpoletaev%2Fkivi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxpoletaev%2Fkivi/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxpoletaev%2Fkivi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxpoletaev","download_url":"https://codeload.github.com/maxpoletaev/kivi/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244734070,"owners_count":20501014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","distributed-systems","golang","gossip","key-value","key-value-store","leaderless","lsm-tree","masterless","replication","sstables"],"created_at":"2024-10-11T07:28:36.746Z","updated_at":"2025-03-21T03:33:10.246Z","avatar_url":"https://github.com/maxpoletaev.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e \n   \u003cimg src=\"images/logo.png\" width=\"500\" alt=\"Kivi Logo\"\u003e\n   \u003cp\u003eDistributed key-value database for educational purposes\u003c/p\u003e\n   \u003cimg src=\"https://img.shields.io/badge/status-Alpha-yellow\" alt=\"\"\u003e\n   \u003cimg src=\"https://img.shields.io/badge/license-MIT-blue\" alt=\"\"\u003e\n\u003c/div\u003e\n\n---\n\nKivi falls into the category of Dynamo-style databases (like Cassandra and Riak), \nwhich are distributed databases that were initially designed for high availability\nand partition tolerance. 
The primary difference between Kivi and other databases \nin this category is its simplicity, offering an easily understood and modifiable \nimplementation of core distributed system concepts.\n\nThe main goal of this project is to provide myself and others with hands-on \nexperience in databases, distributed systems, and their underlying mechanics. \nAlthough the project is still in its early stages, it is already possible to run \na fully functional cluster of nodes and perform basic operations such as gets, \nputs, and deletes.\n\n## Key Properties\n\n * **Handcrafted**: The core functions should not rely on any external libraries.\n * **Leaderless**: The system should not have a single point of failure.\n * **Highly Available**: The system should be available even if some nodes fail.\n * **Replicated**: The system should replicate data across multiple nodes.\n * **Eventually Consistent**: The system should eventually converge to a consistent state.\n * **Partition Tolerant**: The system should be able to tolerate network partitions.\n * **Configurable**: The system should allow the consistency level to be configured for reads and writes.\n * **Conflict Resilient**: The system should be able to resolve conflicts in case of concurrent updates.\n * **Simple API**: The system should provide a simple API for storing and retrieving key-value pairs.\n * **Simple Deployment**: The system should be easy to configure and deploy.\n * **Simple Implementation**: The system should be easy to understand and modify.\n * **High Performance**: The system should be able to handle a large number of requests per second.\n\n## The Building Blocks\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/blocks.png\" alt=\"Building Blocks Diagram\"\u003e\n\u003c/p\u003e\n\n### Membership and Failure Detection\n\nThe membership layer is responsible for maintaining the list of nodes in the\ncluster and detecting failures. 
It uses a SWIM-like gossip protocol to exchange\ninformation about cluster members and their status.\n\nOn each algorithm iteration, a node randomly selects another node from the cluster\nand sends it an empty **ping** message. The receiving node responds with a \n**ping-ack** message containing a hash of its membership list. The sending node \nthen compares the hash with its own membership list and initiates \na **state-exchange** operation in case of a mismatch.\n\nDuring the **state-exchange**, the sending node sends its membership list to the\nreceiving node, which then merges it with its own list and sends back a new list\nof members. The sending node then merges the received list with its own and\nupdates its membership list accordingly.\n\nIf the **ping** request fails, the sending node asks two other random nodes\nfrom the cluster to ping the failed node on its behalf. If the node is still\nunresponsive, the sending node marks the receiving node as unhealthy. This\nstate will be propagated to other nodes in the cluster during the next algorithm\niteration.\n\n### Persistent Storage\n\nThe storage layer is responsible for storing the key-value pairs on disk. The \nstorage is based on log-structured merge trees (**LSM-Tree**), which are a type \nof external memory data structure allowing for efficient writes and providing \ndecent read performance. An LSM-tree is composed of two main components: the in-memory \nsorted map (**memtable**) and the on-disk sorted string tables (**SSTables**).\n\nThe memtable holds the most recent updates to the database. Each write to a \nmemtable is added to a **write-ahead log**, which is used to restore the contents \nof the memtable in case of a crash or restart. Once the size of the memtable \nexceeds a certain threshold, it is flushed to disk as an immutable **SSTable**\naccompanied by **sparse index** and **bloom filter** files. 
SSTables are \nstored in a level-based structure, where each level contains a set of SSTables.\n\n```\n$ tree data\n├── STATE\n├── mem-1679761553589453.wal\n├── sst-1679520869189745.L0.bloom\n├── sst-1679520869189745.L0.data\n├── sst-1679520869189745.L0.index\n├── sst-1679594643164184.L0.bloom\n├── sst-1679594643164184.L0.data\n├── sst-1679594643164184.L0.index\n├── sst-1679595355518826.L1.bloom\n├── sst-1679595355518826.L1.data\n└── sst-1679595355518826.L1.index\n```\n\nA background **compaction process** periodically merges the SSTables in each \nlevel into larger SSTables and moves them to the next level, removing the values\nthat were overwritten by newer updates. Each change to the tree state is recorded \nin the `STATE` file, which is used to restore the last known state of the tree\nand to identify the files that were merged and thus can be safely deleted during \nthe **garbage collection process**.\n\nReads are performed by first checking the memtable, and then the SSTables\nfrom newest to oldest. The SSTables are searched using the **bloom filter** to\nquickly skip the files that do not contain the requested key. The **sparse\nindex** is then used to find the approximate location of the key in the data file,\nand from there the data file is scanned linearly to find the exact location of\nthe key.\n\n### Replication and Consistency\n\nThe replication layer is responsible for coordinating reads and writes to\nmultiple nodes. It uses a quorum-based approach to ensure that the desired\nconsistency level is achieved. The replication layer is also responsible for\ndetecting and resolving conflicts in case of concurrent updates of the same key\nfrom different clients on different nodes.\n\nOnce a read or write request is received, the replication layer mirrors it to all\navailable nodes in the cluster. The request is considered successful if the desired\nnumber of nodes acknowledge it. The number of nodes depends on the configured \nconsistency level. 
For example, if the write consistency level is set to `Quorum`, \nthe majority of nodes (2 of 3, or 3 of 5) must confirm that the write operation was \nsuccessful.\n\nSince the writes can be performed on any node in the cluster, it is possible that \nthe same key may be updated on multiple nodes. A conflict occurs when two or more\nnodes have different values for the same key. The conflict resolution strategy \nrelies on **version vectors** to determine the causal order of updates. In case \nthere is a clear dependency that one update happened before the other, the update \nwith the lower version is discarded. In case there is no clear relationship between \nthe updates, the server returns a list of all conflicting values and leaves it up\nto the client. The client can then choose to perform conflict resolution using\na different content-aware strategy, such as last-write-wins, or use conflict-free \ndata structures (CRDTs).\n\n### Conflict-free Data Types\n\nKivi supports a few basic conflict-free data types for which the conflict resolution\nis not required, allowing Kivi to be used as a more traditional key-value store. The\nfollowing data types are currently supported:\n\n * **LWW-Register**: A last-write-wins register with timestamp-based conflict resolution\n * **Set**: A set of strings that supports adding and removing elements without conflicts\n\nNote that these data types have some overhead associated with storing additional\nmetadata required for conflict resolution. For example, the LWW-Register stores \nthe timestamp of the last update, and the Set needs to keep track of tombstones \nfor deleted elements.\n\n## Running a Local Cluster\n\nThe `docker-compose.yaml` contains a minimal configuration of a cluster of\nfive replicas. To run it, use:\n\n 1. `make image`\n 2. `docker compose up`\n\nWith the default consistency level, you need the majority of nodes (3 out of 5)\nto be available to perform reads and writes. 
A failure can be simulated by\nkilling one or two of the containers with `docker kill`.\n\n## References\n\nThe following resources, papers and books were the main source of inspiration for\nthis project:\n\n * [SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf)\n * [Dynamo: Amazon's Highly Available Key-value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)\n * [Introduction to Reliable and Secure Distributed Programming](https://www.distributedprogramming.net)\n * [Distributed Computing: Principles, Algorithms, and Systems](https://www.cs.uic.edu/~ajayk/DCS-Book)\n * [The Art of Multiprocessor Programming](https://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy-ebook/dp/B08HQ7XNLD)\n * [Designing Data-Intensive Applications](https://dataintensive.net/)\n * [MIT 6.824: Distributed Systems](https://pdos.csail.mit.edu/6.824/)\n * [Project Voldemort](https://www.project-voldemort.com/voldemort/)\n ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxpoletaev%2Fkivi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxpoletaev%2Fkivi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxpoletaev%2Fkivi/lists"}