
# LocustDB

[![Build Status][bi]][bl] [![Crates.io][ci]][cl] [![Gitter][gi]][gl]

[bi]: https://github.com/cswinter/LocustDB/workflows/Test/badge.svg
[bl]: https://github.com/cswinter/LocustDB/actions

[ci]: https://img.shields.io/crates/v/locustdb.svg
[cl]: https://crates.io/crates/locustdb/

[gi]: https://badges.gitter.im/LocustDB/Lobby.svg
[gl]: https://gitter.im/LocustDB/Lobby

An experimental analytics database aiming to set a new standard for query performance and storage efficiency on commodity hardware.
See [How to Analyze Billions of Records per Second on a Single Desktop PC][blogpost] and [How to Read 100s of Millions of Records per Second from a Single Disk][blogpost-2] for an overview of current capabilities.

## Usage

Download the [latest binary release][latest-release], which can be run from the command line on most x64 Linux systems, including Windows Subsystem for Linux. For example, to load the file `test_data/nyc-taxi.csv.gz` in this repository and start the repl, run:

```Bash
./locustdb --load test_data/nyc-taxi.csv.gz --trips
```
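Once the repl is up, you can type queries in LocustDB's SQL-like dialect at the prompt. For example (the prompt and the `trips` table name are illustrative assumptions):

```Bash
# Typed at the repl prompt; table and column names assume the --trips schema.
SELECT passenger_count, count(0) FROM trips;
```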

When loading `.csv` or `.csv.gz` files with `--load`, the first line of each file is assumed to be a header containing the names for all columns. The type of each column will be derived automatically, but this might break for columns that contain a mixture of numbers/strings/empty entries.
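For example, given a small file like the following (hypothetical contents), `name` and `country` would be inferred as string columns, while the empty entry in `age` would presumably make it a nullable integer column:

```Bash
# Hypothetical input file illustrating the header line and type inference.
cat > people.csv <<'EOF'
name,age,country
Alice,30,CH
Bob,,US
EOF
```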

To persist data to disk in LocustDB's internal storage format (which allows fast queries from disk after the initial load), specify the storage location with `--db-path`.
When creating or opening a persistent database, LocustDB opens a large number of files and may crash if the limit on the number of open files is too low.
On Linux, you can check the current limit with `ulimit -n` and raise it with e.g. `ulimit -n 4096`.
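For example (the data directory path is illustrative), the first invocation ingests and persists the data, and the second reopens the database from disk without reloading the CSV:

```Bash
# First run: load the CSV and persist it in LocustDB's internal format.
./locustdb --load test_data/nyc-taxi.csv.gz --trips --db-path ./rides-db
# Later runs: open the existing database directly from disk.
./locustdb --db-path ./rides-db
```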

The `--trips` flag will configure the ingestion schema for loading the 1.46 billion taxi ride dataset which can be downloaded [here][nyc-taxi-trips].

For additional usage info, invoke with `--help`:

```Bash
$ ./locustdb --help
LocustDB 0.2.1
Clemens Winter
Massively parallel, high performance analytics database that will rapidly devour all of your data.

USAGE:
    locustdb [FLAGS] [OPTIONS]

FLAGS:
    -h, --help             Prints help information
        --mem-lz4          Keep data cached in memory lz4 encoded. Decreases memory usage and query speeds.
        --reduced-trips    Set ingestion schema for select set of columns from nyc taxi ride dataset
        --seq-disk-read    Improves performance on HDD, can hurt performance on SSD.
        --trips            Set ingestion schema for nyc taxi ride dataset
    -V, --version          Prints version information

OPTIONS:
        --db-path             Path to data directory
        --load                Load .csv or .csv.gz files into the database
        --mem-limit-tables    Limit for in-memory size of tables in GiB [default: 8]
        --partition-size      Number of rows per partition when loading new data [default: 65536]
        --readahead           How much data to load at a time when reading from disk during queries in MiB [default: 256]
        --schema              Comma separated list specifying the types and (optionally) names of all columns in files specified by `--load` option.
                              Valid types: `s`, `string`, `i`, `integer`, `ns` (nullable string), `ni` (nullable integer)
                              Example schema without column names: `int,string,string,string,int`
                              Example schema with column names: `name:s,age:i,country:s`
        --table               Name for the table populated with --load [default: default]
        --threads             Number of worker threads. [default: number of cores (12)]
```
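As a concrete example (the file and table names are hypothetical, reusing `people.csv` from above), the following loads a three-column CSV while overriding type inference:

```Bash
# Declare name as string, age as nullable integer, country as string.
./locustdb --load people.csv --schema name:s,age:ni,country:s --table people
```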

## Goals
A vision for LocustDB.

### Fast
Query performance for analytics workloads is best-in-class on commodity hardware, both for data cached in memory and for data read from disk.

### Cost-efficient
LocustDB automatically achieves spectacular compression ratios, has minimal indexing overhead, and requires fewer machines than any other system to store the same amount of data. The trade-off between performance and storage efficiency is configurable.

### Low latency
New data is available for queries within seconds.

### Scalable
LocustDB scales seamlessly from a single machine to large clusters.

### Flexible and easy to use
LocustDB should be usable with minimal configuration or schema setup as:
- a highly available distributed analytics system continuously ingesting data and executing queries
- a commandline tool/repl for loading and analysing data from CSV files
- an embedded database/query engine included in other Rust programs via cargo (see the sketch below)
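
A minimal sketch of what embedded use could look like. The API names here (`LocustDB::memory_only`, `load_csv`, `run_query`) are illustrative assumptions, not a stable, documented interface:

```rust
// Hypothetical sketch: constructor and method names are assumed for
// illustration and may not match the crate's actual, evolving API.
use locustdb::LocustDB;

fn main() {
    let db = LocustDB::memory_only();          // in-memory instance, no --db-path
    db.load_csv("test_data/nyc-taxi.csv.gz");  // ingest a CSV file
    let result = db.run_query("SELECT passenger_count, count(0) FROM default;");
    println!("{:?}", result);                  // inspect the query result
}
```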

## Non-goals
Until LocustDB is production-ready, these are distractions at best, if not wholly incompatible with the main goals.

### Strong consistency and durability guarantees
- small amounts of data may be lost during ingestion
- when a node is unavailable, queries may return incomplete results
- results returned by queries may not represent a consistent snapshot

### High QPS
LocustDB does not efficiently execute queries inserting or operating on small amounts of data.

### Full SQL support
- All data is append only and can only be deleted/expired in bulk.
- LocustDB does not support queries that cannot be evaluated independently by each node (large joins, complex subqueries, precise set sizes, precise top n).

### Support for cost-inefficient or specialised hardware
LocustDB does not run on GPUs.

## Compiling from source

1. Install Rust: [rustup.rs][rustup]
2. Clone the repository

```Bash
git clone https://github.com/cswinter/LocustDB.git
cd LocustDB
```

3. Compile with `--release` for optimal performance:

```Bash
cargo run --release --bin repl -- --load test_data/nyc-taxi.csv.gz --reduced-trips
```

### Running tests or benchmarks

`cargo test`

`cargo bench`

[nyc-taxi-trips]: https://www.dropbox.com/sh/4xm5vf1stnf7a0h/AADRRVLsqqzUNWEPzcKnGN_Pa?dl=0
[blogpost]: https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/
[blogpost-2]: https://clemenswinter.com/2018/08/13/how-read-100s-of-millions-of-records-per-second-from-a-single-disk/
[rustup]: https://rustup.rs/
[latest-release]: https://github.com/cswinter/LocustDB/releases/tag/v0.1.0-alpha