{"id":13468982,"url":"https://github.com/cswinter/LocustDB","last_synced_at":"2025-03-26T05:31:30.469Z","repository":{"id":40477426,"uuid":"132361668","full_name":"cswinter/LocustDB","owner":"cswinter","description":"Blazingly fast analytics database that will rapidly devour all of your data.","archived":false,"fork":false,"pushed_at":"2024-08-19T02:11:08.000Z","size":3676,"stargazers_count":1618,"open_issues_count":13,"forks_count":72,"subscribers_count":44,"default_branch":"master","last_synced_at":"2024-10-29T22:56:54.777Z","etag":null,"topics":["analytics","database","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cswinter.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-06T17:38:27.000Z","updated_at":"2024-10-29T13:13:33.000Z","dependencies_parsed_at":"2023-01-24T06:16:28.388Z","dependency_job_id":"89f1ad45-49ca-48d8-b3bc-dfcc9bba611b","html_url":"https://github.com/cswinter/LocustDB","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cswinter%2FLocustDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cswinter%2FLocustDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cswinter%2FLocustDB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cswinter%2FLocustDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/cswinter","download_url":"https://codeload.github.com/cswinter/LocustDB/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245597408,"owners_count":20641869,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","database","rust"],"created_at":"2024-07-31T15:01:23.444Z","updated_at":"2025-03-26T05:31:28.832Z","avatar_url":"https://github.com/cswinter.png","language":"Rust","readme":"# LocustDB\n\n[![Build Status][bi]][bl] [![Crates.io][ci]][cl] [![Gitter][gi]][gl]\n\n[bi]: https://github.com/cswinter/LocustDB/workflows/Test/badge.svg\n[bl]: https://github.com/cswinter/LocustDB/actions\n\n[ci]: https://img.shields.io/crates/v/locustdb.svg\n[cl]: https://crates.io/crates/locustdb/\n\n[gi]: https://badges.gitter.im/LocustDB/Lobby.svg\n[gl]: https://gitter.im/LocustDB/Lobby\n\nAn experimental analytics database aiming to set a new standard for query performance and storage efficiency on commodity hardware.\nSee [How to Analyze Billions of Records per Second on a Single Desktop PC][blogpost] and [How to Read 100s of Millions of Records per Second from a Single Disk][blogpost-2] for an overview of current capabilities.\n\n## Usage\n\nDownload the [latest binary release][latest-release], which can be run from the command line on most x64 Linux systems, including Windows Subsystem for Linux. 
For example, to load the file `test_data/nyc-taxi.csv.gz` in this repository and start the repl, run:\n\n```Bash\n./locustdb --load test_data/nyc-taxi.csv.gz --trips\n```\n\nWhen loading `.csv` or `.csv.gz` files with `--load`, the first line of each file is assumed to be a header containing the names for all columns. The type of each column will be derived automatically, but this might break for columns that contain a mixture of numbers/strings/empty entries.\n\nTo persist data to disk in LocustDB's internal storage format (which allows fast queries from disk after the initial load), specify the storage location with `--db-path`.\nWhen creating/opening a persistent database, LocustDB will open a lot of files and might crash if the limit on the number of open files is too low.\nOn Linux, you can check the current limit with `ulimit -n` and set a new limit with e.g. `ulimit -n 4096`.\n\nThe `--trips` flag will configure the ingestion schema for loading the 1.46 billion taxi ride dataset, which can be downloaded [here][nyc-taxi-trips].\n\nFor additional usage info, invoke with `--help`:\n\n```Bash\n$ ./locustdb --help\nLocustDB 0.2.1\nClemens Winter \u003cclemenswinter1@gmail.com\u003e\nMassively parallel, high performance analytics database that will rapidly devour all of your data.\n\nUSAGE:\n    locustdb [FLAGS] [OPTIONS]\n\nFLAGS:\n    -h, --help             Prints help information\n        --mem-lz4          Keep data cached in memory lz4 encoded. 
Decreases memory usage and query speeds.\n        --reduced-trips    Set ingestion schema for select set of columns from nyc taxi ride dataset\n        --seq-disk-read    Improves performance on HDD, can hurt performance on SSD.\n        --trips            Set ingestion schema for nyc taxi ride dataset\n    -V, --version          Prints version information\n\nOPTIONS:\n        --db-path \u003cPATH\u003e           Path to data directory\n        --load \u003cFILES\u003e             Load .csv or .csv.gz files into the database\n        --mem-limit-tables \u003cGB\u003e    Limit for in-memory size of tables in GiB [default: 8]\n        --partition-size \u003cROWS\u003e    Number of rows per partition when loading new data [default: 65536]\n        --readahead \u003cMB\u003e           How much data to load at a time when reading from disk during queries in MiB\n                                   [default: 256]\n        --schema \u003cSCHEMA\u003e          Comma separated list specifying the types and (optionally) names of all columns in\n                                   files specified by `--load` option.\n                                   Valid types: `s`, `string`, `i`, `integer`, `ns` (nullable string), `ni` (nullable\n                                   integer)\n                                   Example schema without column names: `int,string,string,string,int`\n                                   Example schema with column names: `name:s,age:i,country:s`\n        --table \u003cNAME\u003e             Name for the table populated with --load [default: default]\n        --threads \u003cINTEGER\u003e        Number of worker threads. 
[default: number of cores (12)]\n```\n\n## Goals\nA vision for LocustDB.\n\n### Fast\nQuery performance for analytics workloads is best-in-class on commodity hardware, both for data cached in memory and for data read from disk.\n\n### Cost-efficient\nLocustDB automatically achieves spectacular compression ratios, has minimal indexing overhead, and requires fewer machines than any other system to store the same amount of data. The trade-off between performance and storage efficiency is configurable.\n\n### Low latency\nNew data is available for queries within seconds.\n\n### Scalable\nLocustDB scales seamlessly from a single machine to large clusters.\n\n### Flexible and easy to use\nLocustDB should be usable with minimal configuration or schema-setup as:\n- a highly available distributed analytics system continuously ingesting data and executing queries\n- a commandline tool/repl for loading and analysing data from CSV files\n- an embedded database/query engine included in other Rust programs via cargo\n\n\n## Non-goals\nUntil LocustDB is production ready, these are distractions at best, if not wholly incompatible with the main goals.\n\n### Strong consistency and durability guarantees\n- small amounts of data may be lost during ingestion\n- when a node is unavailable, queries may return incomplete results\n- results returned by queries may not represent a consistent snapshot\n\n### High QPS\nLocustDB does not efficiently execute queries inserting or operating on small amounts of data.\n\n### Full SQL support\n- All data is append only and can only be deleted/expired in bulk.\n- LocustDB does not support queries that cannot be evaluated independently by each node (large joins, complex subqueries, precise set sizes, precise top n).\n\n### Support for cost-inefficient or specialised hardware\nLocustDB does not run on GPUs.\n\n\n## Compiling from source\n\n1. Install Rust: [rustup.rs][rustup]\n2. 
Clone the repository\n\n```Bash\ngit clone https://github.com/cswinter/LocustDB.git\ncd LocustDB\n```\n\n3. Compile with `--release` for optimal performance:\n\n```Bash\ncargo run --release --bin repl -- --load test_data/nyc-taxi.csv.gz --reduced-trips\n```\n\n### Running tests or benchmarks\n\n`cargo test`\n\n`cargo bench`\n\n\n[nyc-taxi-trips]: https://www.dropbox.com/sh/4xm5vf1stnf7a0h/AADRRVLsqqzUNWEPzcKnGN_Pa?dl=0\n[blogpost]: https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/\n[blogpost-2]: https://clemenswinter.com/2018/08/13/how-read-100s-of-millions-of-records-per-second-from-a-single-disk/\n[rustup]: https://rustup.rs/\n[latest-release]: https://github.com/cswinter/LocustDB/releases/download/v0.1.0-alpha/locustdb-0.1.0-alpha-x64-linux.0-alpha\n","funding_links":[],"categories":["Rust","Columnar Databases","analytics"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcswinter%2FLocustDB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcswinter%2FLocustDB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcswinter%2FLocustDB/lists"}