Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/elucherini/db-toys
Toy database implementations for a deep study
https://github.com/elucherini/db-toys
Last synced: 14 days ago
JSON representation
Toy database implementations for a deep study
- Host: GitHub
- URL: https://github.com/elucherini/db-toys
- Owner: elucherini
- Created: 2024-11-22T19:33:58.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-24T03:52:12.000Z (about 2 months ago)
- Last Synced: 2024-11-24T04:26:02.844Z (about 2 months ago)
- Language: Python
- Size: 1.95 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# db-toys
Toy database implementations for a deep study of *Designing Data-Intensive Applications* by Martin Kleppman.## Implemented
### Log-structured storage engines
- Baseline in-memory, which stores a list of entries and searches linearly through them.
- Baseline with CSV-like IO, which appends entries to a file and loads the whole file into memory to search linearly through them.
- Indexed with random access-like IO, which keeps an index of offsets where each entry is written and uses it to retrieve them efficiently.## Key findings
### List of single-key dictionaries vs single dictionary with multiple keys
~~Performance evaluations highlighted the memory efficiency of the baseline in-memory log-structured engine! It turns out
that a list of single-key dictionary in Python is much more compact than a single dictionary with the same number of
element -- less than twice the size, to be precise. The constant-time lookup that a single dictionary provides really
comes at a cost, which is probably mostly impacted by the large hash table and collision-handling mechanisms.~~Turns out that `sys.getsizeof()` was the wrong tool for this, since it only calculates the pointer overhead. So yes,
there is more pointer overhead in a large dictionary as opposed to a list of single-key dictionaries with the same
contents; that is not unexpected. The total memory consumption, obtained with `pympler.asizeof()`, is about one third
that of the baseline in-memory storage engine. It makes more sense that storing key-value pairs be less efficient than
storing key-offset pairs.### Handling IO
The most realistic way to simulate a storage engine is to not close log files after each operation. This ensures the IO
operations, which introduce significant overhead, are kept to a minimum.### A new interface is needed
My first interfaces were:
- Base storage engine, with get and set methods.
- Base IO handler mixin, with read and append to log methods.Going beyond simple baseline engines, it became clear these are not sufficient.
1. I want to keep these toy DBs simple, but with a degree of realism. IO operations should mimic real implementations. At the very least, I should add open/close methods.
2. As the engines become more stateful, the problem arises where that state should be kept. This will be an evolving issue and whatever interfaces I come up with might not be fully generalizable to all DBs I will implement.### Bitcask
An easy sanity check is to run Bitcask with only one segment and no compacting: it should have comparable memory
consumption to our simple indexed log-structured engine. The write and read latency are also surprisingly similar,
despite the higher overhead for Bitcask.