{"id":17968395,"url":"https://github.com/twixes/emdrive","last_synced_at":"2025-03-25T09:30:49.383Z","repository":{"id":37177299,"uuid":"359614722","full_name":"Twixes/emdrive","owner":"Twixes","description":"💫  Fast similarity search DBMS","archived":false,"fork":false,"pushed_at":"2023-01-20T23:25:50.000Z","size":292,"stargazers_count":10,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-20T00:41:00.279Z","etag":null,"topics":["database","rust","similarity-search"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Twixes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-19T22:27:36.000Z","updated_at":"2024-09-06T11:55:32.000Z","dependencies_parsed_at":"2023-02-12T06:45:19.170Z","dependency_job_id":null,"html_url":"https://github.com/Twixes/emdrive","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Twixes%2Femdrive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Twixes%2Femdrive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Twixes%2Femdrive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Twixes%2Femdrive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Twixes","download_url":"https://codeload.github.com/Twixes/emdrive/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245435045,"owners_count
":20614817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","rust","similarity-search"],"created_at":"2024-10-29T14:21:14.327Z","updated_at":"2025-03-25T09:30:49.365Z","avatar_url":"https://github.com/Twixes.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Emdrive\n\nDatabase management system for fast similarity search within metric spaces, written in Rust.\n\n### Data types\n\n| Name | Description | Size on disk | Value bounds |\n| --- | --- | --- | -- |\n| `UINT8` | unsigned 8-bit integer | 1 byte | ≥ 0 and \u003c 2⁸ |\n| `UINT16` | unsigned 16-bit integer | 2 bytes | ≥ 0 and \u003c 2¹⁶ |\n| `UINT32` | unsigned 32-bit integer | 4 bytes | ≥ 0 and \u003c 2³² |\n| `UINT64` | unsigned 64-bit integer | 8 bytes | ≥ 0 and \u003c 2⁶⁴ |\n| `UINT128` | unsigned 128-bit integer | 16 bytes | ≥ 0 and \u003c 2¹²⁸ |\n| `BOOL` | boolean value | 1 byte | either `TRUE` (non-zero) or `FALSE` (zero) |\n| `TIMESTAMP` | number of microseconds [since Unix epoch](https://en.wikipedia.org/wiki/Unix_time), saved in a signed 64-bit integer | 8 bytes | ≥ 2⁶³ µs before Unix epoch and \u003c 2⁶³ µs after Unix epoch (around 292 000 years in either direction) |\n| `UUID` | UUID-like value | 16 bytes | any sequence of 128 bits |\n| `STRING(n)` | UTF-8 string | 2+n bytes | ≤ `n` characters, where `n` ≤ 2048 |\n\nEmdrive types are **non-nullable by default**. They can made so simply by wrapping them in `NULLABLE()`. 
For instance, a nullable string of maximum length 20 is `NULLABLE(STRING(20))`.\n\n### Indexes\n\n| Name | Category | Description | Data types | Supported operators |\n| --- | --- | --- | --- | --- |\n| `btree` | general | [B+ tree](https://en.wikipedia.org/wiki/B+_tree) | all | `=` (equality) |\n| `emtree` | metric | [EM-tree](http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F8-12-22/paper_web.pdf) | depending on chosen metric | `@` (distance) |\n\n### Metrics\n\n| Name | Description | Column types |\n| --- | --- | --- |\n| `hamming` | [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance) | `UINT*` |\n\n### Story\n\nLet's imagine you're running an image search engine. As a fan of geese, you called it Gaggle.  \nBeing a search engine operator, you run a bot which crawls pages on the internet.\nEvery time the bot sees an image, it computes a [perceptual hash](https://en.wikipedia.org/wiki/Perceptual_hashing)\nof it and saves it, along with some other metadata, to an Emdrive instance.\n\nWe'll be using database `gaggle`. A relevant table schema here may be:\n\n```SQL\nCREATE TABLE photos_seen (\n    hash UINT8 METRIC KEY USING emtree(hamming),\n    url STRING(2048) PRIMARY KEY,\n    width UINT32,\n    height UINT32,\n    seen_at TIMESTAMP\n);\n```\n\n\u003e Note that column `hash` is marked with `METRIC KEY USING emtree(hamming)`!  \nWhile a primary key is B+ tree-based and allows for quick general lookups of rows, it's useless for distance queries.\nAn EM-tree-based metric key does the job very well though. In this case, as we're comparing perceptual hashes in integer form, Hamming distance\nis the most relevant metric.\n\nOh, your bot has just seen a new image! Let's register it:\n\n```SQL\nINSERT INTO photos_seen (hash, url, width, height, seen_at)\nVALUES (0b11001111, 'https://twixes.com/a.png', 1280, 820, '2077-01-01T21:37');\n```\n\nNow, look, a user just uploaded their image to see similar occurrences of it from the internet. 
The search engine\ncalculated that image's hash to be `0b00001011` (binary representation of decimal `11`).  \nLet's check that against Emdrive. We'll be using the `@` distance operator, which always returns a number\nand is exclusively supported for `METRIC KEY` columns.\n\n```SQL\nSELECT url, hash @ 0b00001011 AS distance FROM photos_seen WHERE distance \u003c 4;\n```\n\nIt's a match! The image we saved previously has a similar hash, and we can now show it in search results.\n\n| `url`                        | `distance` |\n| ---------------------------- | ---------- |\n| `\"https://twixes.com/a.png\"` | `3`        |\n\n### Data storage\n\n```bash\n$EMDRIVE_DATA_DIRECTORY # /var/lib/emdrive/data by default\n   └── gaggle/ # database\n      └── photos_seen/ # table\n         └── 0 # core table data\n```\n\nEvery table has a `data` file containing all its, well, data. Such `data` files are made up of pages.\n\n### Launch configuration\n\nThe following launch configuration settings are available for Emdrive instances.\nThey are applied on instance launch from environment variables in the format `EMDRIVE_${SETTING_NAME_UPPERCASE}`\n(e.g. setting `data_directory` is set with variable `EMDRIVE_DATA_DIRECTORY`).\nIf a setting's environment variable is not set, its default value will be used.\n\n| Name | Type | Default value | Description |\n| --- | --- | --- | --- |\n| `data_directory` | `STRING` | `\"/var/lib/emdrive/data\"` | Location of all data, including system tables |\n| `http_listen_host` | `STRING` | `\"127.0.0.1\"` | Host on which the HTTP server will listen |\n| `http_listen_port` | `UINT16` | `8824` | Port on which the HTTP server will listen |\n\n### Search\n\n### SQL\n\n### HTTP interface\n\n## Benchmarks\n\n| Postgres | MySQL | ClickHouse | ⚡️ Emdrive |\n| --- | --- | --- | --- |\n\n### Autogenerated IDs\n\nEmdrive has no serial or auto-increment data type. For entity IDs, [ULID](https://github.com/ulid/spec) is the recommended solution in Emdrive. 
It's UUID-like, meaning it fits into the `UUID` data type, and can be generated with function `ULID()`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwixes%2Femdrive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwixes%2Femdrive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwixes%2Femdrive/lists"}