https://github.com/themankindproject/keplordb
A columnar log engine optimized for high-throughput ingestion of structured, append-only, time-ordered events.
- Host: GitHub
- URL: https://github.com/themankindproject/keplordb
- Owner: themankindproject
- License: apache-2.0
- Created: 2026-04-19T13:40:54.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-28T18:17:38.000Z (about 2 months ago)
- Last Synced: 2026-04-28T20:19:37.382Z (about 2 months ago)
- Topics: append-only, avx2, columnar-storage, database, db, embeddable, iot, log-engine, logging, mmap, rust, simd, telemetry, time-series, wal
- Language: HTML
- Homepage: https://keplordb.pages.dev/
- Size: 970 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# KeplorDB
[crates.io](https://crates.io/crates/keplordb)
[docs.rs](https://docs.rs/keplordb)
[CI](https://github.com/themankindproject/keplordb/actions)
[License](./LICENSE)
A columnar, append-only log engine written in Rust, purpose-built for high-throughput ingestion of structured events. It targets any workload that is append-only and time-ordered: LLM observability, HTTP access logs, IoT telemetry, payment ledgers.
## Why
Existing options are either too heavy (ClickHouse, TimescaleDB) or too general (SQLite, RocksDB). **KeplorDB is an embeddable library** with no server process, no SQL parser, and no background threads. Declare your schema with a derive macro, open the engine, append events, query columns.
## Install
```toml
[dependencies]
keplordb = "0.1"
```
Or via git (pre-crates.io):
```toml
[dependencies]
keplordb = { git = "https://github.com/themankindproject/keplordb" }
```
Requires Rust **1.82** or newer. The workspace ships two crates: `keplordb` (the engine) and `keplordb-macros` (the `#[derive(Schema)]` proc-macro, re-exported from the main crate). Depending on `keplordb` alone pulls in both.
## Quick Start — typed schema derive
```rust
use keplordb::{Engine, Schema};

#[derive(Schema)]
#[keplordb(id = 1)]
pub struct AiCall {
    #[dim(bloom, rollup)] pub user: String,
    #[dim(rollup)]        pub org: String,
    #[dim]                pub model: String,
    #[counter]            pub input_tokens: u32,
    #[counter]            pub output_tokens: u32,
    #[label]              pub region: String,
}

fn main() -> Result<(), keplordb::DbError> {
    // `default_config` prefills bloom_dim + rollup_dims from the schema.
    let engine: Engine<3, 2, 1> =
        Engine::open(AiCall::default_config("/tmp/logs".into()))?;

    // Fluent typed builder; no positional indexing.
    engine.append(
        &AiCall::new(ts_ns())
            .user("alice")
            .org("acme")
            .model("gpt-4o")
            .input_tokens(1_200)
            .output_tokens(850)
            .region("us-east")
            .metric(5_000_000) // cost in nanodollars
            .status(200)
            .into_log_event(),
    )?;

    // Named filter: maps each setter to the right dim internally.
    let alice = engine.aggregate(
        &AiCall::filter().user("alice").into_filter(),
    )?;
    println!("alice events: {}, cost sum: {}", alice.event_count, alice.metric);

    engine.flush()?;
    Ok(())
}

// Current wall-clock time as nanoseconds since the Unix epoch.
fn ts_ns() -> i64 {
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_nanos() as i64
}
```
The raw positional `LogEvent` + `QueryFilter` API is still available for dynamic / runtime-decided schemas — see [`examples/raw_positional.rs`](crates/keplordb/examples/raw_positional.rs).
## Features
### Ergonomics
- **Typed schema derive** — `#[derive(keplordb::Schema)]` generates typed event + filter builders. No positional `dims[0]` indexing anywhere in user code
- **Compile-time validation** — wrong counter type, two bloom dims, missing `schema_id`, or caps exceeded all rejected by the macro with spans pointing at your source
- **Schema ID guard** — segment headers record the active schema; opening a data directory with a mismatched `schema_id` fails fast with `DbError::Corrupt`
### Storage
- **Columnar segments** — queries touch only the columns they need; aggregations scan contiguous arrays via mmap
- **Zero-copy reads** — u32/u16 columns read directly from mmap'd segment files via `zerocopy`; no deserialization
- **Pre-decompressed i64 cache** — `ts_ns` + `metric` delta-decoded once at segment open and reused across readers
- **Intern table cache** — string → u16 resolve table decompressed once per segment via `OnceLock`; 170× faster filtered aggregate after warmup
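
The i64 cache described above depends on delta coding: sorted timestamps differ by small amounts, so storing differences gives the compressor long runs of small numbers. The following is a minimal, std-only sketch of that idea. The function names `delta_encode` and `delta_decode` are illustrative and not part of KeplorDB's API, and the zstd stage that the engine applies on top is left out.

```rust
/// Delta-encode a column: the first output is the first value itself,
/// every later output is the difference from the previous value.
/// Wrapping arithmetic keeps the round trip exact even at the i64 extremes.
fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (wrapping) sum restores the column.
fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

Decoding is a single sequential pass, which is presumably why it pays to do it once per segment and share the result across readers rather than repeat it per query.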
### Indexing
- **Bloom filter skip** — per-segment bloom on the primary dimension; skip entire files on mismatch
- **Zone maps per-dimension** — min/max per 256-row chunk, built with AVX2 SIMD min/max during rotation; chunks that can't match a filter are skipped before any column access
- **Status bitmap index** — per-value compressed bitmaps for O(1) status lookups; full-scan fallback
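
To make the zone-map bullet concrete, here is a small sketch of chunk pruning for an equality filter. It assumes the dimension has already been interned to `u16` ids and uses plain scalar min/max instead of AVX2. `Zone`, `build_zone_map`, and `count_eq` are hypothetical names chosen for this example, not the engine's internals.

```rust
const CHUNK_ROWS: usize = 256;

/// Min/max of one 256-row chunk of an interned (u16) dimension column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Zone {
    min: u16,
    max: u16,
}

/// Build one zone per chunk. `chunks()` never yields an empty slice,
/// so the unwraps cannot fail.
fn build_zone_map(column: &[u16]) -> Vec<Zone> {
    column
        .chunks(CHUNK_ROWS)
        .map(|c| Zone {
            min: *c.iter().min().unwrap(),
            max: *c.iter().max().unwrap(),
        })
        .collect()
}

/// Count rows equal to `target`, skipping every chunk whose [min, max]
/// range cannot contain it. Returns (matches, chunks_scanned).
fn count_eq(column: &[u16], zones: &[Zone], target: u16) -> (usize, usize) {
    let mut matches = 0;
    let mut scanned = 0;
    for (chunk, zone) in column.chunks(CHUNK_ROWS).zip(zones) {
        if target < zone.min || target > zone.max {
            continue; // pruned without touching the column data
        }
        scanned += 1;
        matches += chunk.iter().filter(|&&v| v == target).count();
    }
    (matches, scanned)
}
```

Pruning of this kind works best when equal values cluster together in time; a value spread evenly over every chunk defeats it, and that is the case where a per-segment bloom filter or a bitmap index is the better tool.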
### Throughput
- **AVX2 SIMD aggregation** — vectorized sum, count, filtered-sum using 256-bit registers with hardware prefetching and scalar fallback
- **Sharded WAL, batch-routed** — each `append_batch` claims one shard via `fetch_add`; N concurrent writers → N shards, zero contention in the common case
- **Rayon parallel scan** — cross-segment aggregates fan out across cores
- **`query_recent` global merge**: candidates are sorted by `max_ts` descending and per-segment results are pooled and merged by `ts_ns` descending; the scan terminates early once the kth-best `ts_ns` exceeds the `max_ts` of every remaining segment
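
The batch-routed sharding above can be pictured as a round-robin counter: one atomic increment picks a shard for the whole batch, so writers only contend when they land on the same shard. This sketch is an assumption about the general shape, not KeplorDB's code. `ShardedLog` is a hypothetical type, and a `Mutex<Vec<u64>>` stands in for a real WAL shard.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a sharded WAL: each shard is an independent buffer.
struct ShardedLog {
    next: AtomicUsize,
    shards: Vec<Mutex<Vec<u64>>>,
}

impl ShardedLog {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        ShardedLog {
            next: AtomicUsize::new(0),
            shards: (0..shard_count).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    /// Claim one shard for the whole batch with a single `fetch_add`,
    /// then append under that shard's lock only. Returns the shard index.
    fn append_batch(&self, batch: &[u64]) -> usize {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.shards.len();
        self.shards[idx].lock().unwrap().extend_from_slice(batch);
        idx
    }
}
```

Wrapping a `ShardedLog` in an `Arc` and calling `append_batch` from several threads spreads consecutive batches over consecutive shards. Relaxed ordering is enough for the counter because it only distributes work; the per-shard lock provides the synchronization for the data itself.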
### Durability
- **CRC32-framed WAL** — every record checksummed; partial frames detected and recovered up to the last complete frame on replay
- **Crash-safe rotation** — three-phase `rename → write segment → unlink`; orphaned `*.wal.rotating` files replayed on next open
- **Tunable fsync**: batched per `wal_sync_interval` (default 64 events) or `wal_sync_bytes` (default 256 KB). Set both to `1` for zero-loss, or to `u32::MAX` and `u64::MAX` respectively for best-effort
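
As an illustration of checksummed framing and recovery "up to the last complete frame", the sketch below writes length-prefixed, CRC32-tagged records into a byte buffer and replays them until it meets a torn or corrupt frame. The frame layout (`len`, `crc`, `payload`) is an assumption made for this example; KeplorDB's actual on-disk format is not documented here. The bitwise CRC-32 is a slow, dependency-free stand-in for `crc32fast`.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn read_u32_le(buf: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([buf[at], buf[at + 1], buf[at + 2], buf[at + 3]])
}

/// Assumed frame layout: [len: u32 LE][crc32(payload): u32 LE][payload].
fn write_frame(wal: &mut Vec<u8>, payload: &[u8]) {
    wal.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    wal.extend_from_slice(&crc32(payload).to_le_bytes());
    wal.extend_from_slice(payload);
}

/// Replay frames in order, stopping at the first truncated or corrupt one.
fn replay(wal: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut pos = 0;
    while wal.len() - pos >= 8 {
        let len = read_u32_le(wal, pos) as usize;
        let crc = read_u32_le(wal, pos + 4);
        let start = pos + 8;
        // `get` returns None when the payload runs past the end (torn write).
        let payload = match wal.get(start..start.saturating_add(len)) {
            Some(p) => p,
            None => break,
        };
        if crc32(payload) != crc {
            break; // bit rot or a partially written payload
        }
        out.push(payload);
        pos = start + len;
    }
    out
}
```

Everything before the first bad frame is returned and everything after it is ignored, which matches the recovery behaviour the bullet describes for a crash in the middle of a write.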
### Operations
- **Segment-level GC** — `engine.gc(cutoff)` drops segments whose `max_ts` is below the threshold. No compaction, no write amplification, no read pause
- **Lock-free reads** — segment manifest + tombstones behind `ArcSwap`; one atomic load per query
- **Embeddable** — `Engine::open()` in your Rust binary. No TCP, no SQL, no external service
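
Segment-level GC reduces to a filter over segment metadata, which is why it needs no compaction. A minimal sketch, assuming a hypothetical `SegmentMeta` record (the engine's real manifest type is not shown in this README):

```rust
/// Hypothetical per-segment metadata for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct SegmentMeta {
    path: String,
    min_ts: i64,
    max_ts: i64,
}

/// Split the manifest into (kept, dropped). A segment is dropped only when
/// every event in it is older than the cutoff, i.e. `max_ts < cutoff_ts_ns`.
fn gc(manifest: Vec<SegmentMeta>, cutoff_ts_ns: i64) -> (Vec<SegmentMeta>, Vec<SegmentMeta>) {
    manifest
        .into_iter()
        .partition(|s| s.max_ts >= cutoff_ts_ns)
}
```

In this sketch a segment that straddles the cutoff is kept whole, so some events older than the cutoff survive until their segment ages out entirely. That is the price of never rewriting segment files.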
## Performance
Measured with Criterion over **1 million events in 10 segments** on an Intel i5-1135G7 (4c/8t). Appends are WAL-durable; reads run against real on-disk column data via mmap + AVX2 SIMD.
### Write path
| Operation | Latency | Throughput |
|---|---|---|
| `append_batch` · 4096 events | **4.76 ms** | 860K ev/s |
| `append_batch` · 1024 events | 873 µs | 1.17M ev/s |
| concurrent · 8t × 1024 events | **1.24 ms** | **6.6M ev/s** |
| WAL memory-only | 352 µs | 2.9M ev/s |
| rotation · 1 shard · 1024 | 10.1 ms | n/a (includes compress + fsync) |
### Read path
| Operation | Latency | Throughput |
|---|---|---|
| `aggregate` · no filter | **756 µs** | 1.3G ev/s |
| `aggregate` · user filter | **254 µs** | 3.9G ev/s |
| `aggregate` · time range | 274 µs | 3.7G ev/s |
| `aggregate` · user + time | **105 µs** | **9.5G ev/s** |
| `query_recent` · 100 | **37 µs** | — |
| `query_recent` · 1000 | 456 µs | — |
| `query_recent` · user · 100 | 70 µs | — |
### Rollup (in-memory)
| Operation | Latency |
|---|---|
| `query_rollups` · single user · day | 7.5 µs |
| `query_rollups` · all buckets · day | 24 µs |
Run `cargo bench --workspace` to reproduce on your hardware.
## Use cases
The typed schema derive maps to any append-only, time-ordered workload. Example schemas per domain:
### LLM observability
```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 1)]
pub struct LlmCall {
    #[dim(bloom, rollup)] pub user: String,
    #[dim(rollup)]        pub api_key: String,
    #[dim]                pub model: String,
    #[dim]                pub provider: String,
    #[counter]            pub input_tokens: u32,
    #[counter]            pub output_tokens: u32,
    #[counter]            pub cache_tokens: u32,
    #[label]              pub region: String,
}
```
### HTTP access logs
```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 2)]
pub struct HttpLog {
    #[dim(bloom)] pub client_ip: String,
    #[dim]        pub method: String,
    #[dim]        pub route: String,
    #[counter]    pub bytes: u32,
}
```
### IoT telemetry
```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 3)]
pub struct Reading {
    #[dim(bloom)] pub device_id: String,
    #[dim]        pub sensor: String,
}
// metric = sensor reading (fixed-point)
```
### Payment ledger
```rust
#[derive(keplordb::Schema)]
#[keplordb(id = 4)]
pub struct Txn {
    #[dim(bloom, rollup)] pub merchant: String,
    #[dim(rollup)]        pub customer: String,
    #[dim]                pub currency: String,
    #[dim]                pub method: String,
    #[counter]            pub items: u32,
    #[counter]            pub tax: u32,
}
// metric = amount in cents
```
See the full set under [`crates/keplordb/examples/`](crates/keplordb/examples/):
- [`typed_schema.rs`](crates/keplordb/examples/typed_schema.rs) — end-to-end typed API
- [`raw_positional.rs`](crates/keplordb/examples/raw_positional.rs) — dynamic schemas via the positional API
- [`concurrent_writes.rs`](crates/keplordb/examples/concurrent_writes.rs) — 8-thread producers with `Arc`
- [`pagination.rs`](crates/keplordb/examples/pagination.rs) — cursor-based pagination through `query_recent`
- [`crash_recovery.rs`](crates/keplordb/examples/crash_recovery.rs) — WAL replay on restart
## Architecture
```
Write path
──────────
append_batch() ──► fetch_add → shard N ──► WAL (CRC32 + fsync/interval)
│
▼
rotate (3-phase, crash-safe)
│
▼
Segment .kseg
Read path
─────────
query() ──► ArcSwap index load
│
▼
candidate segments sorted by max_ts desc
│
▼
per-segment scan (AVX2 SIMD + zone-map prune)
│
▼
pool → sort ts_ns desc → truncate(limit)
│
▼
intern resolve via OnceLock cache
```
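
The read path in the diagram (sort candidates by `max_ts`, pool, sort, truncate) admits an early exit, sketched below on bare timestamps. `Segment` and this free-standing `query_recent` are hypothetical simplifications for illustration. They ignore filters, zone maps, and intern resolution, and they are not the engine's implementation.

```rust
/// Hypothetical in-memory segment: timestamps only, plus the cached maximum.
struct Segment {
    max_ts: i64,
    ts_ns: Vec<i64>,
}

/// Return the `limit` newest timestamps across all segments, newest first,
/// together with the number of segments that were actually scanned.
fn query_recent(segments: &mut [Segment], limit: usize) -> (Vec<i64>, usize) {
    if limit == 0 {
        return (Vec::new(), 0);
    }
    // Visit the segments holding the newest data first.
    segments.sort_by(|a, b| b.max_ts.cmp(&a.max_ts));

    let mut pool: Vec<i64> = Vec::new();
    let mut scanned = 0;
    for seg in segments.iter() {
        // If the pool is full and its oldest entry is already newer than
        // anything this segment (and so every later one) can hold, stop.
        if pool.len() == limit && pool.last().map_or(false, |&kth| kth > seg.max_ts) {
            break;
        }
        scanned += 1;
        pool.extend_from_slice(&seg.ts_ns);
        pool.sort_by(|a, b| b.cmp(a)); // ts_ns descending
        pool.truncate(limit);
    }
    (pool, scanned)
}
```

Because the segments are visited in `max_ts` order, comparing against the next segment alone is enough: every later segment has an equal or smaller `max_ts`.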
### Segment file layout (`.kseg`)
```
┌──────────────────────────────┐
│ header          256 B        │ schema_id verified on open
├──────────────────────────────┤
│ i64 block       zstd+δ       │ delta-encoded ts_ns + metric
│ u32 cols        raw          │ latency + counters
│ u16 cols        raw          │ status + flags + dims + id + labels
├──────────────────────────────┤
│ bloom filter    128 B        │ primary dim skip
│ status bitmap   zstd         │ per-value compressed bitmaps
│ zone maps       raw          │ min/max per 256-row chunk × D
│ intern table    zstd         │ cached via OnceLock after first access
└──────────────────────────────┘
```
### `LogEvent` schema
`D` dims, `C` counters, `L` labels — chosen for you by `#[derive(Schema)]` based on field tags.
| Field | Type | Description |
|---|---|---|
| `id` | `String` | Unique event identifier. Interned per segment for fast point lookup |
| `ts_ns` | `i64` | Nanosecond timestamp. Sorted, binary-searchable |
| `metric` | `i64` | Primary signed metric — cost, duration, value |
| `counters[0..C]` | `u32` | Unsigned counters |
| `latency_ms` | `u32` | Primary latency |
| `latency_detail_ms` | `u32` | Secondary latency breakdown |
| `status` | `u16` | Status code. Bitmap-indexed |
| `flags` | `EventFlags` | 16 boolean bitflags (newtype) |
| `dims[0..D]` | `String` | Indexed, filterable dimensions. Interned per segment. Zone-mapped |
| `labels[0..L]` | `String` | Free-form string labels. Stored, not indexed |
Caps: `D ≤ 256`, `C ≤ 64`, `L ≤ 64` — enforced at compile time by the derive.
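
Several fields in the table are "interned per segment", meaning each distinct string is stored once and rows carry a small integer id instead. A minimal sketch of such a table follows, using std's `HashMap` rather than the arena-backed `hashbrown` table the crate lists as a dependency. `InternTable` and its methods are hypothetical names for this example.

```rust
use std::collections::HashMap;

/// Per-segment string interner: each distinct string gets a dense u16 id.
#[derive(Default)]
struct InternTable {
    ids: HashMap<String, u16>,
    strings: Vec<String>,
}

impl InternTable {
    /// Return the id for `s`, assigning the next free id on first sight.
    /// Returns None once all 65,536 u16 ids are in use.
    fn intern(&mut self, s: &str) -> Option<u16> {
        if let Some(&id) = self.ids.get(s) {
            return Some(id);
        }
        if self.strings.len() > u16::MAX as usize {
            return None;
        }
        let id = self.strings.len() as u16;
        self.ids.insert(s.to_string(), id);
        self.strings.push(s.to_string());
        Some(id)
    }

    /// Map an id back to its string, e.g. when materializing query results.
    fn resolve(&self, id: u16) -> Option<&str> {
        self.strings.get(id as usize).map(|s| s.as_str())
    }
}
```

With ids in place, an equality filter on a dimension becomes a comparison of `u16` values across a contiguous column, and the strings themselves are only touched when results are returned.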
## API reference
```rust
// Lifecycle
let engine = Engine::open(config)?;
engine.flush()?;
// Write
engine.append(&event)?;
engine.append_batch(&events)?;
// Read
let events = engine.query_recent(&filter, limit)?;
let totals = engine.aggregate(&filter)?;
let rollups = engine.query_rollups(from_day, to_day, &dim_filters)?;
let event = engine.get_event("event-id")?;
// Delete (tombstone) + GC
engine.delete_event("event-id")?;
let stats = engine.gc(cutoff_ts_ns)?;
```
## Durability
`Engine::append` / `append_batch` writes the event into the on-disk WAL and returns once the bytes hit the kernel's page cache. The WAL is **fsync'd in batches**. Two knobs on `EngineConfig`:
| Field | Default | Meaning |
|---|---|---|
| `wal_sync_interval` | `64` | fsync after this many events (per shard) |
| `wal_sync_bytes` | `262_144` | fsync when buffered bytes cross this threshold (256 KB) |
Consequence: on a hard crash, you can lose **up to one full sync interval per shard** — at defaults up to `wal_sync_interval × wal_shard_count` events. Three profiles:
- **Zero-loss** — `wal_sync_interval = 1`, `wal_sync_bytes = 1`. Every append fsyncs.
- **Balanced (default)** — 64 events / 256 KB.
- **Best-effort** — `wal_sync_interval = u32::MAX`, `wal_sync_bytes = u64::MAX`. fsync only at rotation.
On clean shutdown call `engine.flush()` to rotate the WAL into a segment.
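
The loss bound quoted in this section can be turned into a quick estimate. The helper below is a rough sketch rather than anything the crate provides. It assumes that whichever trigger fires first (event count or buffered bytes) bounds the unsynced events per shard, and it takes an average event size as an extra, hypothetical input.

```rust
/// Rough upper bound on events that may be un-fsynced at crash time.
/// `avg_event_bytes` is an assumption supplied by the caller.
fn max_events_at_risk(
    wal_sync_interval: u32,
    wal_sync_bytes: u64,
    wal_shard_count: u32,
    avg_event_bytes: u64,
) -> u64 {
    assert!(avg_event_bytes > 0);
    // Events needed to reach the byte threshold, rounded up.
    let by_bytes = wal_sync_bytes / avg_event_bytes
        + (wal_sync_bytes % avg_event_bytes != 0) as u64;
    // Whichever trigger fires first limits what a single shard can lose.
    let per_shard = u64::from(wal_sync_interval).min(by_bytes);
    per_shard * u64::from(wal_shard_count)
}
```

With the default knobs, a hypothetical 8 shards, and 100-byte events, the count trigger dominates and the bound is 64 × 8 = 512 events. With 8 KiB events the byte trigger fires after 32 events, which halves the bound to 256.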
## Production readiness
KeplorDB is a **pre-1.0 release** — API and on-disk format may change before 1.0. Before adopting for production, read [`PRODUCTION.md`](PRODUCTION.md) for the explicit list of known gaps (schema evolution, observability hooks, fuzzing, sanitizer CI, replication, backup/restore, size-based GC).
Short version: ship it today for embedded-in-process structured log storage where the operator owns both sides. Defer for externally-sourced input, regulated workloads, multi-node deployments, or anything needing per-row durability.
## Dependencies
| Crate | Purpose |
|---|---|
| `zstd` | Compression for the i64 block, intern table, status bitmap |
| `zerocopy` | Zero-copy column reads for u32/u16 columns |
| `memmap2` | Memory-mapped segment files |
| `thiserror` | Error type derivation |
| `rustc-hash` | FxHash for the bloom filter |
| `hashbrown` | Arena-backed string intern table |
| `mimalloc` | Global allocator |
| `rayon` | Parallel cross-segment aggregate |
| `crc32fast` | WAL record integrity checksums |
| `arc-swap` | Lock-free segment index + tombstone reads |
| `keplordb-macros` | `#[derive(Schema)]` proc-macro (re-exported) |
The proc-macro crate pulls `syn 2`, `quote`, `proc-macro2` as build-time dependencies only.
## License
[Apache-2.0](LICENSE)