{"id":13569344,"url":"https://github.com/tomfran/LSM-Tree","last_synced_at":"2025-04-04T05:32:08.865Z","repository":{"id":167691270,"uuid":"641105582","full_name":"tomfran/LSM-Tree","owner":"tomfran","description":"Log-Structured Merge Tree Java implementation","archived":false,"fork":false,"pushed_at":"2024-05-11T17:48:44.000Z","size":1070,"stargazers_count":71,"open_issues_count":0,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-02T14:06:15.630Z","etag":null,"topics":["benchmarking","bloom-filter","database","java","jmh-benchmarks","lsm-tree","skiplist","sstable"],"latest_commit_sha":null,"homepage":"https://medium.com/@tomfran/log-structured-merge-tree-a79241c959e3","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomfran.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-15T19:42:14.000Z","updated_at":"2024-07-25T14:40:23.000Z","dependencies_parsed_at":"2023-10-21T18:29:55.808Z","dependency_job_id":"99f50bab-0f2d-4bfe-b869-b26ee56fdbf6","html_url":"https://github.com/tomfran/LSM-Tree","commit_stats":{"total_commits":49,"total_committers":1,"mean_commits":49.0,"dds":0.0,"last_synced_commit":"fd0d85b4bf7ea727b88e6add4038436b52db4be4"},"previous_names":["tomfran/lsm-tree"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomfran%2FLSM-Tree","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomfran%2FLSM-Tree/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/tomfran%2FLSM-Tree/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomfran%2FLSM-Tree/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomfran","download_url":"https://codeload.github.com/tomfran/LSM-Tree/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223100121,"owners_count":17087387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","bloom-filter","database","java","jmh-benchmarks","lsm-tree","skiplist","sstable"],"created_at":"2024-08-01T14:00:38.831Z","updated_at":"2024-11-05T01:32:10.657Z","avatar_url":"https://github.com/tomfran.png","language":"Java","readme":"# LSM tree\n\nAn implementation of the Log-Structured Merge Tree (LSM tree) data structure in Java.\n\nHere you can find a [Medium article](https://medium.com/@tomfran/log-structured-merge-tree-a79241c959e3) about this project.\n\n**Table of Contents**\n\n1. [Architecture](#architecture)\n    1. [SSTable](#sstable)\n    2. [Skip-List](#skip-list)\n    3. [Tree](#tree)\n2. [Benchmarks](#benchmarks)\n    1. [SSTable](#sstable-1)\n    2. [Skip-List](#skip-list-1)\n    3. [Tree](#tree-1)\n3. [Possible improvements](#possible-improvements)\n4. 
[References](#references)\n\n### Console\n\nTo interact with a toy tree you can use `./gradlew run -q` to spawn a console.\n\n```\n\n  |      __|   \\  |           __ __|              \n  |    \\__ \\  |\\/ |   ____|      |   _| -_)   -_) \n ____| ____/ _|  _|             _| _| \\___| \\___| \n\nCommands:\n  - s/set  \u003ckey\u003e \u003cvalue\u003e : insert a key-value pair;\n  - r/rgn  \u003cstart\u003e \u003cend\u003e : insert this range of numeric keys with random values;\n  - g/get  \u003ckey\u003e         : get a key value;\n  - d/del  \u003ckey\u003e         : delete a key;\n  - p/prt                : print current tree status;\n  - e/exit               : stop the console;\n  - h/help               : show this message.\n\n\u003e \n```\n\n# Architecture\n\nAn overview of the architecture: SSTables are the disk-resident portion of the database, Skip Lists serve\nas in-memory buffers, and the combination of the two yields the insertion, lookup and deletion primitives.\n\n## SSTable\n\nA Sorted String Table (SSTable) is a collection of files modelling key-value pairs in sorted order by key.\nIt serves as persistent storage for the LSM tree.\n\n**Components**\n\n- _Data_: key-value pairs in sorted order by key, stored in a file;\n- _Sparse index_: an index containing a subset of the keys, each with the offset of the corresponding key-value pair in the data file;\n- _Bloom filter_: a [probabilistic data structure](https://en.wikipedia.org/wiki/Bloom_filter) used to test whether a\n  key is in the SSTable.\n\n**Key lookup**\n\nThe basic idea is to use the sparse index to find the key-value pair in the data file.\nThe steps are:\n\n1. Use the Bloom filter to test whether the key might be in the table;\n2. If the key might be present, use binary search on the index to find the greatest indexed key that is not larger than the target;\n3. Scan the data from the position found in the previous step to find the key-value pair. 
The search\n   can stop as soon as we see a key greater than the one we are looking for, or when we reach the end of the table.\n\nThe search is as lazy as possible, meaning that we read the minimum amount of data from disk:\nfor instance, if the next key's length is smaller than that of the key we are looking for, we can skip the whole key-value pair.\n\n**Persistence**\n\nA table is persisted to disk when it is created. A base filename is defined, and three files are present:\n\n- `\u003cbase_filename\u003e.data`: data file;\n- `\u003cbase_filename\u003e.index`: index file;\n- `\u003cbase_filename\u003e.bloom`: Bloom filter file.\n\nData format:\n\n- `n`: number of key-value pairs;\n- `\u003ckey_len_1, value_len_1, key_1, value_1, ... key_n, value_n\u003e`: key-value pairs.\n\nIndex format:\n\n- `s`: number of entries in the whole table;\n- `n`: number of entries in the index;\n- `o_1, o_2 - o_1, ..., o_n - o_n-1`: offsets of the key-value pairs in the data file, skipping\n  the first one;\n- `s_1, s_2, ..., s_n`: number of keys remaining after each sparse index entry, used to stop a search early;\n- `\u003ckey_len_1, key_1, ... 
key_len_n, key_n\u003e`: keys in the index.\n\nFilter format:\n\n- `m`: number of bits in the Bloom filter;\n- `k`: number of hash functions;\n- `n`: size of the underlying long array;\n- `b_1, b_2, ..., b_n`: bits of the Bloom filter.\n\nTo save space, all integers are stored\nin [variable-length encoding](https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html),\nand offsets in the index are stored as [deltas](https://en.wikipedia.org/wiki/Delta_encoding).\n\n## Skip-List\n\nA [skip-list](https://en.wikipedia.org/wiki/Skip_list) is a probabilistic data structure that allows fast search,\ninsertion and deletion of elements in a sorted sequence.\n\nIn the LSM tree, it is used as an in-memory data structure to store key-value pairs in sorted order by key.\nOnce the skip-list reaches a certain size, it is flushed to disk as an SSTable.\n\n**Operation details**\n\nThe idea of a skip list is similar to that of a classic linked list: we have nodes with forward pointers, but also levels. We\ncan think of a level as a fast lane between nodes. By carefully constructing levels at insertion time, searches become faster, as they can\nuse higher levels to skip unwanted nodes.\n\nGiven `n` elements, a skip list has `log(n)` levels, the first level containing all the elements.\nMoving up one level cuts the number of elements roughly in half.\n\nTo locate an element, we start from the top level and move forward until we find an element greater than the one\nwe are looking for. Then we move down to the next level and repeat the process until we find the element.\n\nInsertions, deletions, and updates are done by first locating the element, then performing\nthe operation on the node. 
All of them have an average time complexity of `O(log(n))`.\n\n## Tree\n\nHaving defined SSTables and Skip Lists, we can obtain the final structure as a combination of the two.\nThe main idea is to use the latter as an in-memory buffer, while the former efficiently stores flushed\nbuffers.\n\n**Insertion**\n\nEach insert goes directly to a Memtable, which is a Skip List under the hood, so the response time is quite fast.\nOnce a size threshold is exceeded, the mutable structure is made immutable by appending it to the _immutable\nmemtables LIFO list_ and is replaced with a new mutable one.\n\nThe immutable memtable list is asynchronously consumed by a background thread, which takes the next available\nlist and creates a disk-resident SSTable from its content.\n\n**Lookup**\n\nWhile looking for a key, we proceed as follows:\n\n1. Look into the in-memory buffer; if the key was recently written, it is likely here; if not present, continue;\n2. Look into the immutable memtables list, iterating from the most recent to the oldest; if not present, continue;\n3. Look into the disk tables, iterating from the most recent to the oldest; if not present, return null.\n\n**Deletions**\n\nTo delete a key, we do not need to remove all its copies from the on-disk tables; we just need a special\nvalue called a _tombstone_. Hence a deletion is the same as an insertion, but with the value set to null. While looking for\na key, if we encounter a null value we simply return null as a result.\n\n**SSTable Compaction**\n\nThe most expensive operation while looking for a key is certainly the disk search, and this is why Bloom filters are\ncrucial for negative\nlookups on SSTables. 
But no Bloom filter can save us if too many tables are available to search, hence we need\n_compaction_.\n\nWhen flushing a Memtable, we create an SSTable of level zero.\nWhen the first level reaches a certain threshold, all its tables are merged with \nthe subsequent level in a sorted run.\n\nA sorted run is a procedure in which we merge SSTables into multiple tables. The result \nis a sequence of non-intersecting SSTs; more details can be found in the Medium article.\n\nThis check is made periodically on all levels to ensure a level does not grow too much.\nLevel and SST sizes increase by a factor of 1.75 at each step.\n\n# Benchmarks\n\nI am using [JMH](https://openjdk.java.net/projects/code-tools/jmh/) to run benchmarks;\nthe results are obtained on a base-model M3 Pro MacBook Pro.\n\nTo run them use `./gradlew jmh`.\n\n**SSTable**\n\n- Negative access: the key is not present in the table, hence the Bloom filter will likely stop the search;\n- Random access: the key is present in the table; the keys are requested in random order.\n\n```\n\nBenchmark                                       Mode  Cnt        Score        Error  Units\nc.t.l.sstable.SSTableBenchmark.negativeAccess  thrpt    5  3316202.976 ±  32851.546  ops/s\nc.t.l.sstable.SSTableBenchmark.randomAccess    thrpt    5     7989.945 ±     40.689  ops/s\n\n```\n\n**Bloom filter**\n\n- Add: add keys to a 1M-key Bloom filter with a 0.01 false-positive rate;\n- Contains: test whether the keys are present in the Bloom filter.\n\n```\nBenchmark                                       Mode  Cnt        Score        Error  Units\nc.t.l.bloom.BloomFilterBenchmark.add        thrpt    5  10870782.166 ± 151949.254  ops/s\nc.t.l.bloom.BloomFilterBenchmark.contains   thrpt    5  11061776.096 ±  16752.915  ops/s\n```\n\n**Skip-List**\n\n- Get: get keys from a 100k-key skip-list;\n- Add/Remove: add and remove keys from a 100k-key skip-list.\n\n```\nBenchmark                                       Mode  Cnt        Score
       Error  Units\nc.t.l.memtable.SkipListBenchmark.addRemove  thrpt    5   1066479.961 ±  70216.252  ops/s\nc.t.l.memtable.SkipListBenchmark.get        thrpt    5   1280680.984 ±  42235.970  ops/s\n```\n\n**Tree**\n\n- Get: get elements from a tree with 1M keys;\n- Add: add 1M distinct elements to a tree with a memtable size of 2^18.\n\n```\nBenchmark                                       Mode  Cnt        Score        Error  Units\nc.t.l.tree.LSMTreeAddBenchmark.add          thrpt    5    722278.306 ±  30802.444  ops/s\nc.t.l.tree.LSMTreeGetBenchmark.get          thrpt    5     20098.919 ±    240.244  ops/s\n```\n\n## Possible improvements\n\nThere is certainly room for improvement on this project:\n\n- [ ] Blocked Bloom filters: a variant of the classic array-like Bloom filter that is more cache-efficient;\n- [ ] Search fingers in the Skip list: the idea is to keep a pointer to the last search, and start from there with\n   subsequent queries;\n- [x] Proper level compaction in the LSM tree;\n- [ ] Write-ahead log for insertions; without one, a crash makes all the in-memory writes disappear;\n- [ ] Proper recovery: handle crashes and reboots, using existing SSTables and the write-ahead log.\n\nI don't currently have the time to do all of this; perhaps the first two points will be handled in the future.\n\n## References\n\n- [Database Internals](https://www.databass.dev/) by Alex Petrov, specifically the chapters about Log-Structured Storage and\n  File Formats;\n- [A Skip List Cookbook](https://api.drum.lib.umd.edu/server/api/core/bitstreams/17176ef8-8330-4a6c-8b75-4cd18c570bec/content)\n  by William Pugh.\n\n_If you found this useful or interesting, do not hesitate to ask clarifying questions or get in 
touch!_\n","funding_links":[],"categories":["Java"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomfran%2FLSM-Tree","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomfran%2FLSM-Tree","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomfran%2FLSM-Tree/lists"}