https://github.com/quackscience/rawduck
Experimental RawMergeTree-like Extension for DuckDB
- Host: GitHub
- URL: https://github.com/quackscience/rawduck
- Owner: quackscience
- License: mit
- Created: 2026-06-10T16:10:37.000Z (23 days ago)
- Default Branch: main
- Last Pushed: 2026-06-10T22:08:41.000Z (23 days ago)
- Last Synced: 2026-06-10T22:15:12.704Z (23 days ago)
- Topics: duckdb, duckdb-extension, duckdb-json, faster-json, json, rawmergetree, rawtree, unstructured-data
- Language: C++
- Homepage: https://query.farm
- Size: 63.5 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
# RawDuck
**Schema-less JSON analytics for DuckDB, RawMergeTree style**
RawDuck brings the RawMergeTree *"ingest first, schema later"* model to DuckDB: point raw JSON,
NDJSON files, or OTLP telemetry at tables that don't exist yet — RawDuck creates them, types them,
flattens nested objects into real columns, transforms and evolves the schema as the data changes.
### ⚡ Benefits
No `CREATE TABLE`, no schema declarations, no `json_extract` at query time. Because data lands
shredded into native typed columns instead of opaque JSON strings, analytical queries run
**45–265× faster** and the data takes **40% less space on disk** than with the JSON-column approach (see the benchmark below).
### ⚙️ Under the hood
RawDuck delivers a complete engine rather than a parser. Ingestion is transactional
(`BEGIN`/`ROLLBACK`), pipelined, and multi-threaded, and it goes through DuckDB's own catalog and
storage APIs. The optimizer observes the workload and adapts: it physically re-sorts tables by the
columns queries actually filter on (incrementally, MergeTree-parts style) and answers recurring
aggregations from projections.
## Usage
Attach a store and `INSERT` raw JSON — tables, typed columns, and schema all emerge from the data:
```sql
ATTACH 'rawduck:store.db' AS raw;
-- no table 'events' exists yet
INSERT INTO raw.ingest.events VALUES
('{"id": 1, "action": "click", "ts": "2024-01-15T10:30:00", "user": {"name": "alice", "plan": "pro"}}'),
('{"id": 2, "action": "view", "ts": "2024-01-15T10:31:00", "user": {"name": "bob"}}');
DESCRIBE raw.events;
-- id BIGINT, action VARCHAR, ts TIMESTAMP, user.name VARCHAR, user.plan VARCHAR
SELECT "user.name", count(*) FROM raw.events GROUP BY 1;
```
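The dotted column names in the `DESCRIBE` output above (`user.name`, `user.plan`) come from flattening nested objects. The following is a minimal Python sketch of that idea, for illustration only: it is not RawDuck's C++ implementation, and the `flatten` helper is a hypothetical name.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse so {"user": {"name": ...}} becomes "user.name"
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

rows = [
    json.loads('{"id": 1, "action": "click", "user": {"name": "alice", "plan": "pro"}}'),
    json.loads('{"id": 2, "action": "view", "user": {"name": "bob"}}'),
]
flat_rows = [flatten(r) for r in rows]

# The column set is the union of all paths seen, in first-seen order.
columns = []
for r in flat_rows:
    for name in r:
        if name not in columns:
            columns.append(name)

# Keys missing from a row read as None, the analogue of SQL NULL.
table = [[r.get(c) for c in columns] for r in flat_rows]
```

The second row has no `user.plan`, so that cell is `None` rather than the row being rejected.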
The `ingest` schema accepts any SQL source through a fully parallel zero-copy sink
(**6.1M rows/s** on narrow JSON; a 956 MB heterogeneous NDJSON file in ~6 s):
```sql
INSERT INTO raw.ingest.events SELECT json FROM read_json('events.ndjson',
format='newline_delimited', records='false', columns={json: 'JSON'});
CALL raw_ingest_file('raw.events', 'events.ndjson.gz'); -- or the one-call file loader
```
Ingest a different shape and the table follows the data: new keys become columns, conflicting
types widen, missing keys read as `NULL` — nothing is ever dropped. And RawMergeTree tables stay
regular DuckDB tables, so every statement and tool works at native speed:
```sql
UPDATE raw.events SET "user.plan" = 'enterprise' WHERE id = 1;
CREATE TABLE raw.daily AS SELECT date_trunc('day', ts) AS day, count(*) FROM raw.events GROUP BY 1;
```
For ingestion outside a RawDuck store (the default in-memory catalog, DuckLake, the async buffer),
`CALL raw_ingest('table', payload)` is the equivalent with the same engine underneath. All RawDuck
commands are table functions: invoke them with `CALL`, or use `SELECT ... FROM fn(...)` when you
want to project or filter their result columns.
## Benchmark: one hour of GitHub, three ways
Real [GH Archive](https://www.gharchive.org/) data — **247,199 GitHub events, 956 MB of NDJSON,
wildly heterogeneous payloads** (the dataset RawBench uses). One `INSERT` shredded it into
**914 typed columns**, schema evolution included. The baseline is the standard DuckDB JSON
extension pattern: a single `JSON` column queried with `->>` paths.
Same machine (Apple Silicon, DuckDB v1.5.3), best of 3:
| RawBench-style query | JSON column (`->>`) | RawDuck typed columns | speedup |
|---|---:|---:|---:|
| count by event type | 231 ms | 1 ms | **231×** |
| top repos by pushes | 268 ms | 3 ms | **89×** |
| distinct repos per actor | 457 ms | 10 ms | **46×** |
| sum of push payload sizes | 265 ms | 1 ms | **265×** |
| events per minute | 236 ms | 3 ms | **79×** |
| *all five combined* | *1.46 s* | *18 ms* | **~80×** |

| Metric | JSON column | RawDuck |
|---|---:|---:|
| ingest (full hour, 956 MB) | 1.4 s | **~6 s** |
| storage on disk | 1.05 GB | **627 MB** |
Ingestion is fully parallel (zero-copy parse from source vectors, multi-threaded appends,
drain-free schema evolution): the pipeline sustains **~6.1M rows/s** on narrow JSON and lands the
heterogeneous 956 MB hour in ~6 s — a one-time cost a few times that of loading opaque JSON
strings, in exchange for every later query being 45–265× faster and the data 40% smaller on disk.
```sql
INSERT INTO raw.ingest.gh_events SELECT json::VARCHAR FROM read_json(...); -- ~6s, 914 columns
SELECT type, count(*) FROM raw.gh_events GROUP BY type ORDER BY 2 DESC; -- 1 ms
SELECT "repo.name", count(*) AS pushes FROM raw.gh_events
WHERE type = 'PushEvent' GROUP BY 1 ORDER BY pushes DESC LIMIT 10; -- 3 ms
```
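The per-query speedups and the storage saving follow directly from the timings and sizes in the tables above. A short Python check of that arithmetic (the numbers are copied from the tables; 1.05 GB is taken as 1050 MB, which is an assumption about the unit):

```python
# (query, JSON-column ms, RawDuck ms) copied from the benchmark table
timings = [
    ("count by event type", 231, 1),
    ("top repos by pushes", 268, 3),
    ("distinct repos per actor", 457, 10),
    ("sum of push payload sizes", 265, 1),
    ("events per minute", 236, 3),
]

# Per-query speedup, rounded to the nearest whole factor
speedups = {name: round(json_ms / raw_ms) for name, json_ms, raw_ms in timings}

# Combined figure: total time on each side, then the ratio
total_json_ms = sum(json_ms for _, json_ms, _ in timings)
total_raw_ms = sum(raw_ms for _, _, raw_ms in timings)
combined_speedup = total_json_ms / total_raw_ms

# Storage: 627 MB against 1.05 GB (assumed to be 1050 MB)
storage_saving = 1 - 627 / 1050
```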
## Functions
| Function | Kind | Description |
|---|---|---|
| `INSERT INTO <store>.ingest.<table> ...` | SQL | The primary lane: any VALUES or SELECT source streams through a parallel zero-copy sink into `<table>`, auto-creation and evolution included. |
| `raw_ingest(table, payload)` | table | Schema-less ingest: auto-creates the table, adds new columns, widens conflicting types, appends — natively, inside your transaction. Accepts a JSON array, a single object, scalars, or NDJSON. Returns `(table, created, columns_added, columns_widened, rows, errors)`. |
| `raw_ingest_file(table, path, batch_size := 30000)` | table | Streaming ingest of NDJSON files (gzip auto-detected, any DuckDB filesystem) in bounded-memory batches, evolving the schema between batches. The whole file is one atomic operation. |
| `raw_records(payload)` | table | Parse + infer + flatten a JSON payload into typed rows without touching any table. |
| `raw_stats()` | table | Observed usage statistics per column: pushed-down filters and GROUP BY keys, collected automatically by an optimizer hook. |
| `raw_optimize(table)` | table | RawMergeTree adaptive layout: physically reorders the table by its hottest columns. Incremental: append-only growth since the last optimize sorts only the new tail into a fresh sorted run (`mode` = `full` / `incremental` / `noop`). |
| `raw_transforms()` / `raw_transform_define(name, path)` | table / scalar | List and register ingest-time transforms; definitions compose with `read_json`, tables, or any query. |
| `raw_stats_save(catalog?)` / `raw_stats_load(catalog?)` | table | Persist observed statistics into a store (`__rawduck_stats` table) and merge them back after restart. |
| `raw_projections()` | table | The projection advisor: GROUP BY shapes queries actually run, with observation counts and materialization status. |
| `raw_project(table)` | table | RawMergeTree auto-projections: materializes the hottest observed aggregation as a lightweight `<table>__proj` summary table. |
| `raw_serve(host, port, token)` / `raw_serve_stop()` | table | Start/stop the in-process HTTP API (see below). |
| `raw_serve_grpc(host, port, token)` / `raw_serve_grpc_stop()` | table | Start/stop the OTLP/gRPC collector (opt-in build, see Building). |
| `raw_flush()` | table | Synchronously drain the async-insert buffers. |
| `raw_type(json)` | scalar | Concrete type of a JSON value (RawMergeTree's `dynamicType()`): `Null`, `Bool`, `Int64`, `UInt64`, `Double`, `String`, `Array`, `Object`. |
| `raw_infer(json)` | scalar | The DuckDB type RawDuck assigns to a value, e.g. `BIGINT`, `DOUBLE[]`, or the flattened layout for objects: `OBJECT(a BIGINT, b.c VARCHAR)`. |
All ingest functions accept `transform := '...'`, `explode := '...'` and `ignore_errors := true`.
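To make the `raw_ingest_file` description concrete, here is a Python sketch of bounded-memory NDJSON batching with gzip detection and error counting. It models the behavior described in the table, not RawDuck's code: detecting gzip by its magic bytes is an assumption about how auto-detection could work, and `open_maybe_gzip` and `ndjson_batches` are hypothetical names.

```python
import gzip
import json

def open_maybe_gzip(path):
    """Open a text stream; files starting with the gzip magic bytes are decompressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding="utf-8")

def ndjson_batches(path, batch_size=30000, ignore_errors=False):
    """Yield (rows, errors) pairs, holding at most batch_size parsed rows at a time."""
    batch, errors = [], 0
    with open_maybe_gzip(path) as stream:
        for line in stream:
            line = line.strip()
            if not line:
                continue
            try:
                batch.append(json.loads(line))
            except json.JSONDecodeError:
                if not ignore_errors:
                    raise
                errors += 1  # skipped line is counted instead of aborting
            if len(batch) >= batch_size:
                yield batch, errors
                batch, errors = [], 0
    if batch or errors:
        yield batch, errors  # final partial batch
```

A consumer would evolve the schema between the yielded batches. Making the whole file one atomic operation, as the table describes, additionally requires wrapping the loop in a single transaction, which this sketch leaves out.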
## Asynchronous inserts
By default every `raw_ingest` call parses, evolves the schema, appends, and commits before
returning — callers immediately see their rows. Under many concurrent writers issuing small
payloads, that means one transaction per call. Asynchronous mode trades immediate visibility for
throughput: calls enqueue the payload into a per-table buffer and return instantly, and a
background flusher ingests each buffer as a single batch.
```sql
SET rawduck_async_insert = true;
CALL raw_ingest('events', '[{"id": 1, "action": "click"}]'); -- returns immediately, rows = 0
CALL raw_ingest('events', '[{"id": 2, "action": "view"}]');
-- a buffer flushes when it exceeds the size threshold or its oldest entry exceeds the age
-- threshold; force it when you need the data now:
CALL raw_flush();
-- ┌─────────┬──────┐
-- │ targets │ rows │
-- │ 1 │ 2 │
-- └─────────┴──────┘
SELECT count(*) FROM events; -- 2
```
| Setting / function | Default | Meaning |
|---|---|---|
| `rawduck_async_insert` | `false` | Enable buffered ingestion for `raw_ingest` / `raw_ingest_file`. |
| `rawduck_async_max_data_size` | `1048576` | Flush a table's buffer once it holds this many bytes. |
| `rawduck_async_busy_timeout_ms` | `200` | Flush a buffer once its oldest payload is this old. |
| `raw_flush()` | — | Drain all buffers synchronously; returns `(targets, rows)`. |
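The two flush triggers in the table (buffer size and age of the oldest payload) can be modeled in a few lines of Python. This is a sketch of the policy only: the `AsyncBuffer` class and its method names are hypothetical, and the real flusher runs in the background inside the extension.

```python
import time

class AsyncBuffer:
    """Per-table payload buffer that flushes on size or age, whichever comes first."""

    def __init__(self, max_data_size=1048576, busy_timeout_ms=200, clock=time.monotonic):
        self.max_data_size = max_data_size      # mirrors rawduck_async_max_data_size
        self.busy_timeout_ms = busy_timeout_ms  # mirrors rawduck_async_busy_timeout_ms
        self.clock = clock
        self.payloads = []
        self.size = 0
        self.oldest = None  # enqueue time of the oldest buffered payload

    def enqueue(self, payload):
        """Buffer a payload and return immediately, without ingesting it."""
        if not self.payloads:
            self.oldest = self.clock()
        self.payloads.append(payload)
        self.size += len(payload.encode("utf-8"))

    def should_flush(self):
        if not self.payloads:
            return False
        age_ms = (self.clock() - self.oldest) * 1000
        return self.size >= self.max_data_size or age_ms >= self.busy_timeout_ms

    def drain(self):
        """Hand the whole buffer over as one batch and reset the counters."""
        batch, self.payloads, self.size, self.oldest = self.payloads, [], 0, None
        return batch
```

The injectable `clock` is there so the age trigger can be exercised without sleeping. `drain()` plays the role of `raw_flush()` for a single table.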
Semantics to know before enabling it:
- Buffered payloads commit in the flusher's own transactions — a `ROLLBACK` in the calling
session does not un-enqueue them, and a failed background flush drops that batch.
- Data buffered for less than the age threshold is lost if the database closes first; call
`raw_flush()` before shutdown.
- The HTTP and gRPC servers ingest asynchronously by default (their clients are exactly the
many-small-writers case, and a single flusher also serializes schema evolution instead of
letting per-request transactions race on it). Start them with `async := false` to make every
request its own synchronous transaction.
## HTTP API
RawDuck can serve an in-process HTTP API for ingestion and querying:
```sql
CALL raw_serve(host := '127.0.0.1', port := 9999, token := 'rt_secret');
CALL raw_serve_stop();
```
```sh
curl -X POST localhost:9999/v1/tables/events -H "Authorization: Bearer rt_secret" \
-d '[{"action":"click","user":"alice","value":42}]'
# {"table":"events","inserted":1,"created":true,"columns_added":3,"errors":0}
curl -X POST localhost:9999/v1/query -H "Authorization: Bearer rt_secret" \
-d '{"sql":"SELECT action, count(*) FROM events GROUP BY action"}'
# {"meta":[...],"data":[["click",1]],"rows":1,"statistics":{"elapsed":0.0016}}
```
| Endpoint | Behavior |
|---|---|
| `GET /health` | `{"status":"ok"}` |
| `POST /v1/query` | `{"sql": "..."}` → `meta` / `data` / `rows` / `statistics` |
| `GET /v1/tables`, `GET /v1/tables/{t}` | list tables / describe schema |
| `POST /v1/tables/{t}` | schema-less ingest (`?transform=`, `?explode=`, `?ignore_errors=true`) |
| `DELETE /v1/tables/{t}` | drop table |
| `POST /otlp/v1/{traces,logs,metrics}` | OTLP/HTTP ingest (JSON or protobuf bodies) with envelope unwrapping and spec-shaped `partialSuccess` responses |
Requests run on their own connections/transactions; a bearer `token` guards everything except
`/health`; CORS is enabled for browser clients; gzip is supported both ways (request bodies with
`Content-Encoding: gzip`, compressed responses for `Accept-Encoding: gzip` clients). Binds to
localhost by default.
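For clients that prefer code over `curl`, the ingest call can be built with Python's standard library alone. This sketch uses the endpoint, port, and token from the examples above; the `Content-Type: application/json` header is an assumption (the `curl` example does not set it), and `build_ingest_request` and `send` are hypothetical helper names.

```python
import gzip
import json
import urllib.request

def build_ingest_request(base_url, table, rows, token, compress=False):
    """Build a POST to /v1/tables/{table} carrying a JSON array of rows."""
    body = json.dumps(rows).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",  # assumption, see the note above
    }
    if compress:
        # Request bodies may be gzip-compressed with Content-Encoding: gzip
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return urllib.request.Request(
        f"{base_url}/v1/tables/{table}", data=body, headers=headers, method="POST"
    )

def send(request):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

req = build_ingest_request(
    "http://localhost:9999",
    "events",
    [{"action": "click", "user": "alice", "value": 42}],
    token="rt_secret",
)
# send(req) performs the call once raw_serve() is running
```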
### OpenTelemetry SDKs
The OTLP routes follow the standard signal paths and accept both wire encodings — `http/protobuf`
(the SDK default) and `http/json` — so SDKs only need the endpoint base:
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:9999/otlp
export OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer rt_secret"
```
Signals land in `otel_traces`, `otel_logs`, and `otel_metrics` by default; route them to custom
tables with the `x-rawduck-traces-table`, `x-rawduck-logs-table`, or `x-rawduck-metrics-table`
headers (the generic `x-rawduck-table` also works). Responses are OTLP-conformant in the request's
encoding: an empty `partialSuccess` on full acceptance, signal-specific rejected counts otherwise.
Both encodings produce identical columns — trace/span ids stored as hex, enum fields as integers,
`*UnixNano` timestamps as `BIGINT` — so mixed fleets of exporters share tables cleanly.
### OTLP/gRPC
Builds made with `make release RAWDUCK_ENABLE_GRPC=1` (see Building) also serve the standard
OpenTelemetry collector services natively:
```sql
CALL raw_serve_grpc(port := 4317, token := 'rt_secret'); -- TraceService/LogsService/MetricsService
CALL raw_serve_grpc_stop();
```
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer rt_secret"
```
Requests are converted through protobuf's canonical OTLP/JSON mapping and flow through the same
native ingestion path as the HTTP routes, with the same `x-rawduck-*-table` routing (sent as gRPC
metadata) and `partialSuccess` semantics. On builds without gRPC, the functions report how to
enable support; OTLP/HTTP (both encodings) stays fully functional.
## ATTACH: RawMergeTree stores
```sql
ATTACH 'rawduck:store.db' AS raw;
```
A RawDuck store is a native DuckDB database under a RawDuck-typed catalog: everything DuckDB can
do — joins, window functions, updates, exports, other extensions — works on RawMergeTree tables
transparently and at full native speed, while the store identifies itself for RawDuck's ingestion
and adaptive-layout machinery. Stores persist and reattach like any database file.
Two kinds of `INSERT` coexist: typed inserts into the real tables behave exactly like DuckDB
(fixed columns, binder-validated), while inserts into the virtual `ingest` schema take raw JSON
payloads and handle creation and evolution. Both run in your transaction.
## Transforms
RawDuck reshapes envelope-style telemetry at ingest time: one row per nested event,
with the wrapper's fields merged into each row.
```sql
-- {"owner":"123","logGroup":"/aws/lambda/api","logEvents":[{"id":"1","message":"started"},...]}
CALL raw_ingest('logs', payload, transform := 'cloudwatch-logs');
-- one row per log event, with owner and logGroup columns on each
-- the generic form works for any envelope shape
CALL raw_ingest('events', payload, explode := 'batch.items');
```
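The explode step can be pictured as walking a dotted path to an array and copying the surrounding wrapper fields onto every element. A Python sketch under stated assumptions: only scalar siblings along the path are merged, event fields win on a name clash, and the second log event in the sample is made up for illustration. RawDuck's actual merge rules may differ.

```python
def explode(payload, path):
    """Return one row per element at `path`, merged with the enclosing wrapper fields."""
    wrapper = {}
    node = payload
    for key in path.split("."):
        # Keep scalar siblings at each level; they describe every nested event.
        for name, value in node.items():
            if name != key and not isinstance(value, (dict, list)):
                wrapper[name] = value
        node = node[key]
    # Event fields win on a name clash (a choice made for this sketch).
    return [{**wrapper, **event} for event in node]

payload = {
    "owner": "123",
    "logGroup": "/aws/lambda/api",
    "logEvents": [
        {"id": "1", "message": "started"},
        {"id": "2", "message": "finished"},  # illustrative sample event
    ],
}
rows = explode(payload, "logEvents")
```

With a path like `batch.items`, the same function also picks up scalar fields that sit next to `items` inside `batch`.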
Transforms also apply to the INSERT lane through a session setting (a transform name or a
dotted explode path):
```sql
SET rawduck_insert_transform = 'otlp-traces';
INSERT INTO raw.ingest.spans SELECT json FROM read_json('traces.ndjson', ...);
RESET rawduck_insert_transform;
```
Built-in transforms: `cloudwatch-logs`, `cloudtrail`, `firehose`, `otlp-traces`, `otlp-logs`,
`otlp-metrics` (multi-level envelopes like `resourceSpans[].scopeSpans[].spans[]` are unwrapped
with resource/scope fields merged into every row). Transforms are user-extensible — definitions
are data, so they load from files or tables like anything else in DuckDB:
```sql
SELECT raw_transform_define('my-batch', 'data.items'); -- one-off
SELECT raw_transform_define(name, explode) FROM read_json('transforms.json'); -- from a file
SELECT raw_transform_define(name, explode) FROM raw.transform_config; -- from a table
CALL raw_transforms(); -- list them all
```
Dirty NDJSON streams can be ingested with `ignore_errors := true`; skipped lines are counted in the `errors` column.
## The type lattice
Per JSON path, RawDuck infers the narrowest type that holds everything seen, widening monotonically
as data arrives (existing columns are `ALTER`ed in place, never rewritten destructively):
```
BOOLEAN  BIGINT ──> DOUBLE     DATE ──> TIMESTAMP
│        │          │          │          │
└────────┴────> VARCHAR <──────┴──────────┘   (scalar conflicts)

object vs scalar, mixed arrays, arrays of objects ──> JSON   (structural conflicts)
```
- integers out of `BIGINT` range degrade to `DOUBLE`
- ISO `DATE` / `TIMESTAMP` strings are sniffed into temporal columns
- homogeneous scalar arrays become typed `LIST`s (`BIGINT[]`, nested `BIGINT[][]`, …)
- nothing is ever dropped: structurally conflicting values are preserved verbatim as `JSON`
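The lattice above amounts to a "least upper bound" function over column types. The following Python sketch implements the scalar part as drawn. It is a simplification: objects and arrays are lumped together as structural values (no flattening, no typed `LIST`s), date and timestamp sniffing uses two plain regular expressions, and the names `infer`, `widen`, and `column_type` are hypothetical.

```python
import re

SCALARS = {"BOOLEAN", "BIGINT", "DOUBLE", "DATE", "TIMESTAMP", "VARCHAR"}
PROMOTIONS = {("BIGINT", "DOUBLE"), ("DATE", "TIMESTAMP")}  # narrower -> wider
BIGINT_MIN, BIGINT_MAX = -2**63, 2**63 - 1

def infer(value):
    """Map one parsed JSON value to a lattice type (None means no evidence yet)."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        # Integers outside the BIGINT range degrade to DOUBLE
        return "BIGINT" if BIGINT_MIN <= value <= BIGINT_MAX else "DOUBLE"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            return "DATE"
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(\.\d+)?", value):
            return "TIMESTAMP"
        return "VARCHAR"
    return "JSON"  # objects and arrays: structural in this sketch

def widen(a, b):
    """Least upper bound of two lattice types."""
    if a is None or a == b:
        return b
    if b is None:
        return a
    if a in SCALARS and b in SCALARS:
        if (a, b) in PROMOTIONS:
            return b
        if (b, a) in PROMOTIONS:
            return a
        return "VARCHAR"  # scalar conflict
    return "JSON"  # structural conflict

def column_type(values):
    """Fold widen() over every value seen for one JSON path."""
    result = None
    for value in values:
        result = widen(result, infer(value))
    return result
```

Because `widen` only ever moves up the lattice, the result does not depend on the order in which values arrive.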
## Adaptive layout from observed workloads
An optimizer hook records every filter DuckDB pushes into a table scan and every GROUP BY
column set. `raw_optimize` turns filter/group usage into a physical sort order
(RawMergeTree adaptive primary keys); `raw_project` materializes the hottest aggregation
as a summary table (RawMergeTree lightweight projections):
```sql
SELECT count(*) FROM gh_events WHERE type = 'PushEvent';
SELECT sum("payload.size") FROM gh_events WHERE type = 'PushEvent' AND "repo.id" > 700000000;
CALL raw_stats();
-- gh_events | type | 2
-- gh_events | repo.id | 1
CALL raw_optimize('gh_events');
-- gh_events | "type", "repo.id" | 247199
SELECT type, count(*) FROM gh_events GROUP BY type; -- observed by the advisor
CALL raw_project('gh_events');
-- gh_events | gh_events__proj | type | 15 -- pre-aggregated summary table
SET rawduck_use_projections = true;
SELECT type, count(*) FROM gh_events GROUP BY type; -- now answered from the projection
```
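The step from `raw_stats()` counts to the sort order reported by `raw_optimize` can be sketched as a frequency ranking. This Python model reproduces the example above (`type` observed twice, `repo.id` once); ranking purely by count and the `max_columns` cap are assumptions, and `sort_key_from_usage` is a hypothetical name.

```python
from collections import Counter

def sort_key_from_usage(observations, table, max_columns=3):
    """Rank a table's columns by how often queries filtered or grouped on them."""
    counts = Counter(col for tbl, col in observations if tbl == table)
    # most_common sorts by count, highest first; ties keep first-seen order
    return [col for col, _ in counts.most_common(max_columns)]

observed = [
    ("gh_events", "type"),     # WHERE type = 'PushEvent'
    ("gh_events", "type"),     # second query, same filter
    ("gh_events", "repo.id"),  # AND "repo.id" > 700000000
]
sort_key = sort_key_from_usage(observed, "gh_events")
```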
With `rawduck_use_projections` enabled (off by default), eligible `count(*)` aggregations are
rewritten onto fresh projections transparently — result types and values are identical, and a
physical-row-count staleness token guarantees a changed base table always falls back to a full
scan. Intended for append-only analytics; in-place `UPDATE`s of group columns require re-running
`raw_project`. Statistics persist across sessions with `raw_stats_save('store')` /
`raw_stats_load('store')`.
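The staleness rule can be illustrated with a small Python model: a projection remembers the base table's row count at build time and is used only while that count is unchanged. The class and function names are hypothetical, and in-memory lists stand in for tables.

```python
def count_by(rows, group_col):
    """Full scan: count(*) per distinct value of group_col."""
    counts = {}
    for row in rows:
        counts[row[group_col]] = counts.get(row[group_col], 0) + 1
    return counts

class Projection:
    """Pre-aggregated count(*) per group, valid only for the row count it was built from."""

    def __init__(self, rows, group_col):
        self.group_col = group_col
        self.built_from_rows = len(rows)  # staleness token
        self.counts = count_by(rows, group_col)

def grouped_count(rows, group_col, projection=None):
    """Answer from the projection when it is fresh, otherwise scan the base rows."""
    fresh = (
        projection is not None
        and projection.group_col == group_col
        and projection.built_from_rows == len(rows)
    )
    if fresh:
        return projection.counts, "projection"
    return count_by(rows, group_col), "scan"
```

A row-count token detects appends and deletes but not an in-place change to a group column, which is consistent with the note that such `UPDATE`s require re-running `raw_project`.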
## DuckLake as a backend
For non-native catalogs RawDuck falls back to catalog-level SQL, so schema-less ingestion also
works against [DuckLake](https://ducklake.select) — straight into a lakehouse with snapshots and
schema evolution tracked in the metadata:
```sql
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://bucket/raw');
CALL raw_ingest('lake.main.events', payload);
```
Catalogs that cannot rewrite columns with expressions (DuckLake rejects `ALTER ... USING`)
degrade gracefully: RawDuck keeps the existing column type and converts incoming values instead.
## Building
```sh
git submodule update --init
GEN=ninja make release
```
OTLP/HTTP protobuf decoding is **on by default**: builds pick up protobuf from the vcpkg
manifest's default `protobuf` feature (or a system package locally) and skip it gracefully when
unavailable (wasm builds, or `make release RAWDUCK_DISABLE_OTLP_PROTOBUF=1`); without it, protobuf
bodies get a 415 pointing at `http/json`.
The OTLP/gRPC server is **opt-in at build time** (it pulls the full gRPC stack, which
significantly lengthens builds): `make release RAWDUCK_ENABLE_GRPC=1` enables it, using the
`grpc` vcpkg manifest feature in CI or system gRPC/protobuf locally. Default builds skip it —
OTLP/HTTP stays fully functional and `raw_serve_grpc()` explains how to enable support. The flags
are cached per build directory; run `make clean` when toggling them. Wasm builds never include protobuf or gRPC support.
Artifacts:
```sh
./build/release/duckdb # shell with rawduck linked in
./build/release/test/unittest # test runner
./build/release/extension/rawduck/rawduck.duckdb_extension # loadable extension
```
## Tests
```sh
make test
```
The sqllogictests in `test/sql/` cover all standard JSON types, nested flattening, NDJSON, type
widening, schema evolution, structural conflicts, streaming file ingestion, multi-threaded appends, transforms, projections,
error-tolerant ingestion, RawDuck stores (`ATTACH 'rawduck:...'`), transactional rollback,
predicate statistics + adaptive reordering, and DuckLake catalogs (`test/sql/ducklake.test`).
## Status
All RawMergeTree concepts are implemented: schema-less evolving ingestion
(native, transactional, pipelined, multi-threaded), adaptive physical layout from observed
predicates with incremental re-sorting, the projection advisor with automatic aggregate rewriting,
extensible ingest-time transforms, persisted statistics, RawDuck stores, DuckLake fallback, and
an in-process HTTP API for ingestion and querying.
See [BENCHMARK.md](BENCHMARK.md) to reproduce the numbers and [AGENTS.md](AGENTS.md) for the
design guide.
---
Based on the [DuckDB extension template](https://github.com/duckdb/extension-template).
JSON parsing via DuckDB's vendored [yyjson](https://github.com/ibireme/yyjson).