https://github.com/themankindproject/keplordb

A columnar log engine optimized for high-throughput ingestion of structured, append-only, time-ordered events.
append-only avx2 columnar-storage database db embeddable iot log-engine logging mmap rust simd telemetry time-series wal

# KeplorDB

[![crates.io](https://img.shields.io/crates/v/keplordb.svg)](https://crates.io/crates/keplordb)
[![docs.rs](https://docs.rs/keplordb/badge.svg)](https://docs.rs/keplordb)
[![CI](https://github.com/themankindproject/keplordb/actions/workflows/ci.yml/badge.svg)](https://github.com/themankindproject/keplordb/actions)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](./LICENSE)

A columnar append-only log engine written in Rust, purpose-built for high-throughput structured event ingestion. It targets LLM observability, HTTP access logs, IoT telemetry, and payment ledgers: any workload that is append-only and time-ordered.

## Why

Existing options are either too heavy (ClickHouse, TimescaleDB) or too general (SQLite, RocksDB). **KeplorDB is an embeddable library** with no server process, no SQL parser, and no background threads. Declare your schema with a derive macro, open the engine, append events, query columns.

## Install

```toml
[dependencies]
keplordb = "0.1"
```

Or via git (pre-crates.io):

```toml
[dependencies]
keplordb = { git = "https://github.com/themankindproject/keplordb" }
```

Requires Rust **1.82** or newer. The workspace ships two crates: `keplordb` (the engine) and `keplordb-macros` (the `#[derive(Schema)]` proc-macro, re-exported from the main crate). Depending on `keplordb` alone pulls in both.

## Quick Start — typed schema derive

```rust
use keplordb::{Engine, Schema};

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user: String,
    #[dim(rollup)] pub org: String,
    #[dim] pub model: String,
    #[counter] pub input_tokens: u32,
    #[counter] pub output_tokens: u32,
    #[label] pub region: String,
}

fn main() -> Result<(), keplordb::DbError> {
    // `default_config` prefills bloom_dim + rollup_dims from the schema.
    let engine: Engine<3, 2, 1> =
        Engine::open(AiCall::default_config("/tmp/logs".into()))?;

    // Fluent typed builder — no positional indexing.
    engine.append(
        &AiCall::new(ts_ns())
            .user("alice")
            .org("acme")
            .model("gpt-4o")
            .input_tokens(1_200)
            .output_tokens(850)
            .region("us-east")
            .metric(5_000_000) // cost in nanodollars
            .status(200)
            .into_log_event(),
    )?;

    // Named filter — maps each setter to the right dim internally.
    let alice = engine.aggregate(
        &AiCall::filter().user("alice").into_filter(),
    )?;
    println!("alice events: {}, cost sum: {}", alice.event_count, alice.metric);

    engine.flush()?;
    Ok(())
}

fn ts_ns() -> i64 {
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_nanos() as i64
}
```

The raw positional `LogEvent` + `QueryFilter` API is still available for dynamic / runtime-decided schemas — see [`examples/raw_positional.rs`](crates/keplordb/examples/raw_positional.rs).

## Features

### Ergonomics

- **Typed schema derive** — `#[derive(keplordb::Schema)]` generates typed event + filter builders. No positional `dims[0]` indexing anywhere in user code
- **Compile-time validation** — wrong counter type, two bloom dims, missing `schema_id`, or caps exceeded all rejected by the macro with spans pointing at your source
- **Schema ID guard** — segment headers record the active schema; opening a data directory with a mismatched `schema_id` fails fast with `DbError::Corrupt`

### Storage

- **Columnar segments** — queries touch only the columns they need; aggregations scan contiguous arrays via mmap
- **Zero-copy reads** — u32/u16 columns read directly from mmap'd segment files via `zerocopy`; no deserialization
- **Pre-decompressed i64 cache** — `ts_ns` + `metric` delta-decoded once at segment open and reused across readers
- **Intern table cache** — string → u16 resolve table decompressed once per segment via `OnceLock`; 170× faster filtered aggregate after warmup
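The delta decoding mentioned for the i64 cache can be illustrated with a minimal, std-only sketch. This is not KeplorDB's on-disk encoding (which also applies zstd); `delta_encode` and `delta_decode` are hypothetical helper names used only to show why a sorted `ts_ns` column turns into small, compressible values.

```rust
/// Delta-encode a column: the first value is stored as-is (relative to 0),
/// every later value is stored as the difference from its predecessor.
/// Hypothetical helper, not part of the KeplorDB API.
fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len());
    let mut prev = 0i64;
    for &v in values {
        out.push(v.wrapping_sub(prev));
        prev = v;
    }
    out
}

/// Inverse of `delta_encode`: a running sum restores the original column.
fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut acc = 0i64;
    for &d in deltas {
        acc = acc.wrapping_add(d);
        out.push(acc);
    }
    out
}
```

Decoding is a sequential running sum, which is presumably why doing it once at segment open and sharing the result across readers pays off.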

### Indexing

- **Bloom filter skip** — per-segment bloom on the primary dimension; skip entire files on mismatch
- **Zone maps per-dimension** — min/max per 256-row chunk, built with AVX2 SIMD min/max during rotation; chunks that can't match a filter are skipped before any column access
- **Status bitmap index** — per-value compressed bitmaps for O(1) status lookups; full-scan fallback
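Zone-map pruning can be sketched in scalar Rust as follows. This is an illustration under assumptions, not the engine's code: the column is taken to hold interned `u16` dimension IDs, the 256-row chunk size comes from the bullet above, and `Zone`, `build_zone_map`, and `count_matches` are hypothetical names. The real implementation is described as using AVX2 for the min/max pass.

```rust
/// Rows per zone-map chunk, as stated in the feature list.
const CHUNK_ROWS: usize = 256;

/// Min/max summary of one chunk of a dimension column (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Zone {
    min: u16,
    max: u16,
}

/// Build one `Zone` per 256-row chunk (scalar version).
fn build_zone_map(column: &[u16]) -> Vec<Zone> {
    column
        .chunks(CHUNK_ROWS)
        .map(|chunk| {
            let mut zone = Zone { min: u16::MAX, max: u16::MIN };
            for &v in chunk {
                zone.min = zone.min.min(v);
                zone.max = zone.max.max(v);
            }
            zone
        })
        .collect()
}

/// Count rows equal to `target`, skipping chunks whose [min, max] range
/// cannot contain it. Returns (matching rows, chunks actually scanned).
fn count_matches(column: &[u16], zones: &[Zone], target: u16) -> (usize, usize) {
    let mut matches = 0;
    let mut chunks_scanned = 0;
    for (chunk, zone) in column.chunks(CHUNK_ROWS).zip(zones) {
        if target < zone.min || target > zone.max {
            continue; // pruned without touching the column data
        }
        chunks_scanned += 1;
        matches += chunk.iter().filter(|&&v| v == target).count();
    }
    (matches, chunks_scanned)
}
```

Pruning is most effective when values cluster by chunk; a dimension whose IDs are spread uniformly across every chunk gains little from min/max bounds.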

### Throughput

- **AVX2 SIMD aggregation** — vectorized sum, count, filtered-sum using 256-bit registers with hardware prefetching and scalar fallback
- **Sharded WAL, batch-routed** — each `append_batch` claims one shard via `fetch_add`; N concurrent writers → N shards, zero contention in the common case
- **Rayon parallel scan** — cross-segment aggregates fan out across cores
- **`query_recent` global merge** — candidates sorted by `max_ts` descending, per-segment results pooled + merged by `ts_ns` descending, with early termination once the kth-best ts exceeds any remaining segment's `max_ts`
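The batch-routed sharding idea can be sketched with a single atomic counter. This is a minimal illustration assuming round-robin assignment modulo the shard count; `ShardRouter` and `claim` are hypothetical names, and the engine's actual routing may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out WAL shard indices round-robin (hypothetical type).
struct ShardRouter {
    next: AtomicUsize,
    shard_count: usize,
}

impl ShardRouter {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        ShardRouter { next: AtomicUsize::new(0), shard_count }
    }

    /// Claim a shard for one batch. `fetch_add` is a single atomic
    /// read-modify-write, so concurrent callers get distinct tickets
    /// without taking a lock.
    fn claim(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.shard_count
    }
}
```

With as many shards as concurrent writers, consecutive tickets land on different shards, which matches the "zero contention in the common case" claim; two batches only share a shard when tickets wrap around.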

### Durability

- **CRC32-framed WAL** — every record checksummed; partial frames detected and recovered up to the last complete frame on replay
- **Crash-safe rotation** — three-phase `rename → write segment → unlink`; orphaned `*.wal.rotating` files replayed on next open
- **Tunable fsync** — batched per `wal_sync_interval` (default 64 events) or `wal_sync_bytes` (default 256 KB). Set both to `1` for zero-loss, or to `u32::MAX` and `u64::MAX` respectively for best-effort
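To make the CRC32 framing concrete, here is a std-only sketch of framed records and a replay loop that stops at the last complete frame. The frame layout (`len: u32 LE`, `crc: u32 LE`, payload) is an assumption for illustration, not KeplorDB's documented WAL format, and the bitwise `crc32` stands in for the `crc32fast` crate the engine depends on.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Append one frame: [len u32 LE][crc32 u32 LE][payload] (assumed layout).
fn encode_frame(payload: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(payload);
}

fn read_u32_le(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Replay frames in order, stopping at the first torn or corrupt frame.
/// Everything before that point is returned intact.
fn replay(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32_le(log, pos) as usize;
        let crc = read_u32_le(log, pos + 4);
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(end) if end <= log.len() => end,
            _ => break, // partial frame: payload was cut off
        };
        let payload = &log[start..end];
        if crc32(payload) != crc {
            break; // checksum mismatch: treat as end of valid log
        }
        records.push(payload.to_vec());
        pos = end;
    }
    records
}
```

A torn write at the tail therefore costs only the incomplete record; earlier frames remain recoverable.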

### Operations

- **Segment-level GC** — `engine.gc(cutoff)` drops segments whose `max_ts` is below the threshold. No compaction, no write amplification, no read pause
- **Lock-free reads** — segment manifest + tombstones behind `ArcSwap`; one atomic load per query
- **Embeddable** — `Engine::open()` in your Rust binary. No TCP, no SQL, no external service
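Segment-level GC as described above amounts to a filter over segment metadata. The sketch below assumes a hypothetical `SegmentMeta` record and a free function `gc`; it only models the selection rule (`max_ts` below the cutoff), not file deletion or the engine's real `gc` return type.

```rust
/// Minimal stand-in for a segment manifest entry (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct SegmentMeta {
    name: String,
    max_ts: i64,
}

/// Remove every segment whose newest event is older than `cutoff_ts_ns`
/// and return the dropped entries. Segments are dropped whole, so nothing
/// is rewritten: a segment that straddles the cutoff is kept in full.
fn gc(segments: &mut Vec<SegmentMeta>, cutoff_ts_ns: i64) -> Vec<SegmentMeta> {
    let (keep, dropped): (Vec<SegmentMeta>, Vec<SegmentMeta>) =
        std::mem::take(segments)
            .into_iter()
            .partition(|s| s.max_ts >= cutoff_ts_ns);
    *segments = keep;
    dropped
}
```

The trade-off is granularity: retention is enforced per segment rather than per row, which is what avoids compaction and write amplification.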

## Performance

Measured with Criterion over **1 million events in 10 segments** on an Intel i5-1135G7 (4c/8t). Appends are WAL-durable; reads run against real on-disk column data via mmap + AVX2 SIMD.

### Write path

| Operation | Latency | Throughput |
|---|---|---|
| `append_batch` · 4096 events | **4.76 ms** | 860K ev/s |
| `append_batch` · 1024 events | 873 µs | 1.17M ev/s |
| concurrent · 8t × 1024 events | **1.24 ms** | **6.6M ev/s** |
| WAL memory-only | 352 µs | 2.9M ev/s |
| rotation · 1 shard · 1024 | 10.1 ms | n/a (includes compress + fsync) |

### Read path

| Operation | Latency | Throughput |
|---|---|---|
| `aggregate` · no filter | **756 µs** | 1.3G ev/s |
| `aggregate` · user filter | **254 µs** | 3.9G ev/s |
| `aggregate` · time range | 274 µs | 3.7G ev/s |
| `aggregate` · user + time | **105 µs** | **9.5G ev/s** |
| `query_recent` · 100 | **37 µs** | — |
| `query_recent` · 1000 | 456 µs | — |
| `query_recent` · user · 100 | 70 µs | — |

### Rollup (in-memory)

| Operation | Latency |
|---|---|
| `query_rollups` · single user · day | 7.5 µs |
| `query_rollups` · all buckets · day | 24 µs |

Run `cargo bench --workspace` to reproduce on your hardware.

## Use cases

The typed schema derive maps to any append-only, time-ordered workload. Example schemas per domain:

### LLM observability

```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 1)]
pub struct LlmCall {
    #[dim(bloom, rollup)] pub user: String,
    #[dim(rollup)] pub api_key: String,
    #[dim] pub model: String,
    #[dim] pub provider: String,
    #[counter] pub input_tokens: u32,
    #[counter] pub output_tokens: u32,
    #[counter] pub cache_tokens: u32,
    #[label] pub region: String,
}
```

### HTTP access logs

```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 2)]
pub struct HttpLog {
    #[dim(bloom)] pub client_ip: String,
    #[dim] pub method: String,
    #[dim] pub route: String,
    #[counter] pub bytes: u32,
}
```

### IoT telemetry

```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 3)]
pub struct Reading {
    #[dim(bloom)] pub device_id: String,
    #[dim] pub sensor: String,
}
// metric = sensor reading (fixed-point)
```

### Payment ledger

```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 4)]
pub struct Txn {
    #[dim(bloom, rollup)] pub merchant: String,
    #[dim(rollup)] pub customer: String,
    #[dim] pub currency: String,
    #[dim] pub method: String,
    #[counter] pub items: u32,
    #[counter] pub tax: u32,
}
// metric = amount in cents
```

See the full set under [`crates/keplordb/examples/`](crates/keplordb/examples/):

- [`typed_schema.rs`](crates/keplordb/examples/typed_schema.rs) — end-to-end typed API
- [`raw_positional.rs`](crates/keplordb/examples/raw_positional.rs) — dynamic schemas via the positional API
- [`concurrent_writes.rs`](crates/keplordb/examples/concurrent_writes.rs) — 8-thread producers with `Arc`
- [`pagination.rs`](crates/keplordb/examples/pagination.rs) — cursor-based pagination through `query_recent`
- [`crash_recovery.rs`](crates/keplordb/examples/crash_recovery.rs) — WAL replay on restart

## Architecture

```
Write path
──────────
append_batch() ──► fetch_add → shard N ──► WAL (CRC32 + fsync/interval)
                                            │
                                            ▼
                                           rotate (3-phase, crash-safe)
                                            │
                                            ▼
                                           Segment .kseg

Read path
─────────
query() ──► ArcSwap index load
             │
             ▼
            candidate segments sorted by max_ts desc
             │
             ▼
            per-segment scan (AVX2 SIMD + zone-map prune)
             │
             ▼
            pool → sort ts_ns desc → truncate(limit)
             │
             ▼
            intern resolve via OnceLock cache
```
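The read path's pool-and-merge step with early termination can be sketched as below. It is a simplified model under assumptions: each segment is reduced to a hypothetical `Segment` holding its ascending timestamps and `max_ts`, filters are omitted, and the function returns how many segments were scanned so the early exit is observable.

```rust
/// Simplified segment: ascending timestamps plus the cached maximum
/// (hypothetical type; real segments are columnar files).
struct Segment {
    max_ts: i64,
    ts: Vec<i64>,
}

/// Return the `limit` most recent timestamps across all segments,
/// newest first, plus the number of segments actually scanned.
fn query_recent(segments: &[Segment], limit: usize) -> (Vec<i64>, usize) {
    // Visit candidates newest-first by their max_ts.
    let mut order: Vec<&Segment> = segments.iter().collect();
    order.sort_by(|a, b| b.max_ts.cmp(&a.max_ts));

    let mut pool: Vec<i64> = Vec::new();
    let mut scanned = 0;
    for seg in order {
        if limit == 0 {
            break;
        }
        // Early termination: the kth-best result is already newer than
        // anything this segment (or any later one) could contribute.
        if pool.len() == limit && pool[limit - 1] > seg.max_ts {
            break;
        }
        scanned += 1;
        // Only the newest `limit` rows of a segment can matter.
        pool.extend(seg.ts.iter().rev().take(limit));
        pool.sort_by(|a, b| b.cmp(a));
        pool.truncate(limit);
    }
    (pool, scanned)
}
```

Because candidates are ordered by `max_ts` descending, checking the next segment is enough to rule out all remaining ones.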

### Segment file layout (`.kseg`)

```
┌──────────────────────────────┐
│ header         256 B         │ schema_id verified on open
├──────────────────────────────┤
│ i64 block      zstd+δ        │ delta-encoded ts_ns + metric
│ u32 cols                     │ latency + counters
│ u16 cols                     │ status + flags + dims + id + labels
├──────────────────────────────┤
│ bloom filter   128 B         │ primary dim skip
│ status bitmap  zstd          │ per-value compressed bitmaps
│ zone maps      raw           │ min/max per 256-row chunk × D
│ intern table   zstd          │ cached via OnceLock after first access
└──────────────────────────────┘
```

### `LogEvent` schema

`D` dims, `C` counters, `L` labels — chosen for you by `#[derive(Schema)]` based on field tags.

| Field | Type | Description |
|---|---|---|
| `id` | `String` | Unique event identifier. Interned per segment for fast point lookup |
| `ts_ns` | `i64` | Nanosecond timestamp. Sorted, binary-searchable |
| `metric` | `i64` | Primary signed metric — cost, duration, value |
| `counters[0..C]` | `u32` | Unsigned counters |
| `latency_ms` | `u32` | Primary latency |
| `latency_detail_ms` | `u32` | Secondary latency breakdown |
| `status` | `u16` | Status code. Bitmap-indexed |
| `flags` | `EventFlags` | 16 boolean bitflags (newtype) |
| `dims[0..D]` | `String` | Indexed, filterable dimensions. Interned per segment. Zone-mapped |
| `labels[0..L]` | `String` | Free-form string labels. Stored, not indexed |

Caps: `D ≤ 256`, `C ≤ 64`, `L ≤ 64` — enforced at compile time by the derive.
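Per-segment interning of `id` and `dims` strings into `u16` slots might look roughly like this. The sketch uses `std::collections::HashMap` rather than the arena-backed `hashbrown` table the engine lists as a dependency, and `InternTable`, `intern`, and `resolve` are hypothetical names.

```rust
use std::collections::HashMap;

/// Maps strings to dense u16 IDs and back (hypothetical type).
struct InternTable {
    ids: HashMap<String, u16>,
    strings: Vec<String>,
}

impl InternTable {
    fn new() -> Self {
        InternTable { ids: HashMap::new(), strings: Vec::new() }
    }

    /// Return the ID for `s`, assigning the next free one on first sight.
    /// Returns `None` once all 65,536 u16 slots are taken.
    fn intern(&mut self, s: &str) -> Option<u16> {
        if let Some(&id) = self.ids.get(s) {
            return Some(id);
        }
        if self.strings.len() > u16::MAX as usize {
            return None;
        }
        let id = self.strings.len() as u16;
        self.ids.insert(s.to_string(), id);
        self.strings.push(s.to_string());
        Some(id)
    }

    /// Reverse lookup used when turning stored columns back into strings.
    fn resolve(&self, id: u16) -> Option<&str> {
        self.strings.get(id as usize).map(|s| s.as_str())
    }
}
```

Storing a fixed-width ID per row is what lets dimension columns be read zero-copy and zone-mapped, with strings resolved only for rows that survive filtering.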

## API reference

```rust
// Lifecycle
let engine = Engine::open(config)?;
engine.flush()?;

// Write
engine.append(&event)?;
engine.append_batch(&events)?;

// Read
let events = engine.query_recent(&filter, limit)?;
let totals = engine.aggregate(&filter)?;
let rollups = engine.query_rollups(from_day, to_day, &dim_filters)?;
let event = engine.get_event("event-id")?;

// Delete (tombstone) + GC
engine.delete_event("event-id")?;
let stats = engine.gc(cutoff_ts_ns)?;
```

## Durability

`Engine::append` / `append_batch` writes the event into the on-disk WAL and returns once the bytes hit the kernel's page cache. The WAL is **fsync'd in batches**. Two knobs on `EngineConfig`:

| Field | Default | Meaning |
|---|---|---|
| `wal_sync_interval` | `64` | fsync after this many events (per shard) |
| `wal_sync_bytes` | `262_144` | fsync when buffered bytes cross this threshold (256 KB) |

Consequence: on a hard crash, you can lose **up to one full sync interval per shard** — at defaults up to `wal_sync_interval × wal_shard_count` events. Three profiles:

- **Zero-loss** — `wal_sync_interval = 1`, `wal_sync_bytes = 1`. Every append fsyncs.
- **Balanced (default)** — 64 events / 256 KB.
- **Best-effort** — `wal_sync_interval = u32::MAX`, `wal_sync_bytes = u64::MAX`. fsync only at rotation.

On clean shutdown, call `engine.flush()` to rotate the WAL into a segment.
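How the two thresholds interact can be modeled with a small per-shard counter. This is a sketch of the policy as described in the table, not the engine's code: `SyncPolicy` and `record_append` are hypothetical, the field types follow the `u32::MAX` / `u64::MAX` values in the best-effort profile, and treating a threshold as reached with `>=` is an assumption.

```rust
/// Tracks unsynced work for one WAL shard (hypothetical type).
struct SyncPolicy {
    wal_sync_interval: u32,
    wal_sync_bytes: u64,
    pending_events: u32,
    pending_bytes: u64,
}

impl SyncPolicy {
    fn new(wal_sync_interval: u32, wal_sync_bytes: u64) -> Self {
        SyncPolicy {
            wal_sync_interval,
            wal_sync_bytes,
            pending_events: 0,
            pending_bytes: 0,
        }
    }

    /// Record one appended event of `bytes` length. Returns true when
    /// either threshold is reached, meaning the caller should fsync now;
    /// the pending counters are reset in that case.
    fn record_append(&mut self, bytes: u64) -> bool {
        self.pending_events = self.pending_events.saturating_add(1);
        self.pending_bytes = self.pending_bytes.saturating_add(bytes);
        let due = self.pending_events >= self.wal_sync_interval
            || self.pending_bytes >= self.wal_sync_bytes;
        if due {
            self.pending_events = 0;
            self.pending_bytes = 0;
        }
        due
    }
}
```

With `SyncPolicy::new(1, 1)` every append reports a sync as due, which corresponds to the zero-loss profile; with the defaults, whichever threshold is hit first triggers the fsync.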

## Production readiness

KeplorDB is a **pre-1.0 release** — API and on-disk format may change before 1.0. Before adopting for production, read [`PRODUCTION.md`](PRODUCTION.md) for the explicit list of known gaps (schema evolution, observability hooks, fuzzing, sanitizer CI, replication, backup/restore, size-based GC).

Short version: ship it today for embedded-in-process structured log storage where the operator owns both sides. Defer for externally-sourced input, regulated workloads, multi-node deployments, or anything needing per-row durability.

## Dependencies

| Crate | Purpose |
|---|---|
| `zstd` | Compression for the i64 block, intern table, status bitmap |
| `zerocopy` | Zero-copy column reads for u32/u16 columns |
| `memmap2` | Memory-mapped segment files |
| `thiserror` | Error type derivation |
| `rustc-hash` | FxHash for the bloom filter |
| `hashbrown` | Arena-backed string intern table |
| `mimalloc` | Global allocator |
| `rayon` | Parallel cross-segment aggregate |
| `crc32fast` | WAL record integrity checksums |
| `arc-swap` | Lock-free segment index + tombstone reads |
| `keplordb-macros` | `#[derive(Schema)]` proc-macro (re-exported) |

The proc-macro crate pulls `syn 2`, `quote`, `proc-macro2` as build-time dependencies only.

## License

[Apache-2.0](LICENSE)