{"id":50919051,"url":"https://github.com/quackscience/rawduck","last_synced_at":"2026-06-16T18:01:23.963Z","repository":{"id":363904657,"uuid":"1265270899","full_name":"quackscience/rawduck","owner":"quackscience","description":"Experimental RawMergeTree-like Extension for DuckDB","archived":false,"fork":false,"pushed_at":"2026-06-10T22:08:41.000Z","size":65,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T22:15:12.704Z","etag":null,"topics":["duckdb","duckdb-extension","duckdb-json","faster-json","json","rawmergetree","rawtree","unstructured-data"],"latest_commit_sha":null,"homepage":"https://query.farm","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quackscience.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-10T16:10:37.000Z","updated_at":"2026-06-10T22:08:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/quackscience/rawduck","commit_stats":null,"previous_names":["quackscience/rawduck"],"tags_count":null,"template":false,"template_full_name":"duckdb/extension-template","purl":"pkg:github/quackscience/rawduck","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quackscience%2Frawduck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quackscience%2Frawduck/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/quackscience%2Frawduck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quackscience%2Frawduck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quackscience","download_url":"https://codeload.github.com/quackscience/rawduck/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quackscience%2Frawduck/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34417416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duckdb","duckdb-extension","duckdb-json","faster-json","json","rawmergetree","rawtree","unstructured-data"],"created_at":"2026-06-16T18:01:23.011Z","updated_at":"2026-06-16T18:01:23.957Z","avatar_url":"https://github.com/quackscience.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg width=\"120\" alt=\"rawduck\" src=\"https://github.com/user-attachments/assets/e44ee764-7639-433c-b904-03a2d4ee38e2\" /\u003e \n\n# RawDuck\n\n**Schema-less JSON analytics for DuckDB, RawMergeTree style**\n\nRawDuck brings the RawMergeTree *\"ingest first, schema later\"* model to DuckDB: point raw JSON,\nNDJSON files, or OTLP telemetry at tables that don't exist yet — RawDuck creates 
them, types them,\nflattens nested objects into real columns, transforms and evolves the schema as the data changes. \n\n### ⚡ Benefits\nNo `CREATE TABLE`, no schema declarations, no `json_extract` at query time. Because data lands\nshredded into native typed columns instead of opaque JSON strings, analytical queries run\n**45–265× faster** on **40% smaller** storage than the JSON-column approach (see benchmark).\n\n### ⚙️ Under the hood\nRawDuck delivers a complete engine rather than a parser: ingestion is transactional, pipelined, and\nmulti-threaded through DuckDB's own catalog and storage APIs (`BEGIN`/`ROLLBACK`), and the optimizer \nobserves the workload and adapts — physically re-sorting tables by the columns queries actually \nfilter on (incrementally, MergeTree-parts style) and answering recurring aggregations from projections. \n\n## Usage\n\nAttach a store and `INSERT` raw JSON — tables, typed columns, and schema all emerge from the data:\n\n```sql\nATTACH 'rawduck:store.db' AS raw;\n\n-- no table 'events' exists yet\nINSERT INTO raw.ingest.events VALUES\n    ('{\"id\": 1, \"action\": \"click\", \"ts\": \"2024-01-15T10:30:00\", \"user\": {\"name\": \"alice\", \"plan\": \"pro\"}}'),\n    ('{\"id\": 2, \"action\": \"view\",  \"ts\": \"2024-01-15T10:31:00\", \"user\": {\"name\": \"bob\"}}');\n\nDESCRIBE raw.events;\n-- id BIGINT, action VARCHAR, ts TIMESTAMP, user.name VARCHAR, user.plan VARCHAR\n\nSELECT \"user.name\", count(*) FROM raw.events GROUP BY 1;\n```\n\nThe `ingest` schema accepts any SQL source through a fully parallel zero-copy sink\n(**6.1M rows/s** on narrow JSON; a 956 MB heterogeneous NDJSON file in ~6 s):\n\n```sql\nINSERT INTO raw.ingest.events SELECT json FROM read_json('events.ndjson',\n    format='newline_delimited', records='false', columns={json: 'JSON'});\n\nCALL raw_ingest_file('raw.events', 'events.ndjson.gz');   -- or the one-call file loader\n```\n\nIngest a different shape and the table follows the data: new keys become columns, 
conflicting\ntypes widen, missing keys read as `NULL` — nothing is ever dropped. And RawMergeTree tables stay\nregular DuckDB tables, so every statement and tool works at native speed:\n\n```sql\nUPDATE raw.events SET \"user.plan\" = 'enterprise' WHERE id = 1;\nCREATE TABLE raw.daily AS SELECT date_trunc('day', ts) AS day, count(*) FROM raw.events GROUP BY 1;\n```\n\nFor ingestion outside a RawDuck store (the default in-memory catalog, DuckLake, the async buffer),\n`CALL raw_ingest('table', payload)` is the equivalent with the same engine underneath. All RawDuck\ncommands are table functions: invoke them with `CALL`, or use `SELECT ... FROM fn(...)` when you\nwant to project or filter their result columns.\n\n## Benchmark: one hour of GitHub, three ways\n\nReal [GH Archive](https://www.gharchive.org/) data — **247,199 GitHub events, 956 MB of NDJSON,\nwildly heterogeneous payloads** (the dataset RawBench uses). One `INSERT` shredded it into\n**914 typed columns**, schema evolution included. 
The baseline is the standard DuckDB JSON\nextension pattern: a single `JSON` column queried with `-\u003e\u003e` paths.\n\nSame machine (Apple Silicon, DuckDB v1.5.3), best of 3:\n\n| RawBench-style query | JSON column (`-\u003e\u003e`) | RawDuck typed columns | speedup |\n|---|---:|---:|---:|\n| count by event type | 231 ms | 1 ms | **231×** |\n| top repos by pushes | 268 ms | 3 ms | **89×** |\n| distinct repos per actor | 457 ms | 10 ms | **46×** |\n| sum of push payload sizes | 265 ms | 1 ms | **265×** |\n| events per minute | 236 ms | 3 ms | **79×** |\n| *all five combined* | *1.46 s* | *18 ms* | **~80×** |\n\n| | JSON column | RawDuck |\n|---|---:|---:|\n| ingest (full hour, 956 MB) | 1.4 s | **~6 s** |\n| storage on disk | 1.05 GB | **627 MB** |\n\nIngestion is fully parallel (zero-copy parse from source vectors, multi-threaded appends,\ndrain-free schema evolution): the pipeline sustains **~6.1M rows/s** on narrow JSON and lands the\nheterogeneous 956 MB hour in ~6 s — a one-time cost a few times that of loading opaque JSON\nstrings, in exchange for every later query being 45–265× faster and the data 40% smaller on disk.\n\n```sql\nINSERT INTO raw.ingest.gh_events SELECT json::VARCHAR FROM read_json(...);  -- ~6s, 914 columns\n\nSELECT type, count(*) FROM raw.gh_events GROUP BY type ORDER BY 2 DESC;      -- 1 ms\nSELECT \"repo.name\", count(*) AS pushes FROM raw.gh_events\nWHERE type = 'PushEvent' GROUP BY 1 ORDER BY pushes DESC LIMIT 10;           -- 3 ms\n```\n\n## Functions\n\n| Function | Kind | Description |\n|---|---|---|\n| `INSERT INTO \u003cstore\u003e.ingest.\u003ctable\u003e ...` | SQL | The primary lane: any VALUES or SELECT source streams through a parallel zero-copy sink into `\u003ctable\u003e`, auto-creation and evolution included. |\n| `raw_ingest(table, payload)` | table | Schema-less ingest: auto-creates the table, adds new columns, widens conflicting types, appends — natively, inside your transaction. 
Accepts a JSON array, a single object, scalars, or NDJSON. Returns `(table, created, columns_added, columns_widened, rows, errors)`. |\n| `raw_ingest_file(table, path, batch_size := 30000)` | table | Streaming ingest of NDJSON files (gzip auto-detected, any DuckDB filesystem) in bounded-memory batches, evolving the schema between batches. The whole file is one atomic operation. |\n| `raw_records(payload)` | table | Parse + infer + flatten a JSON payload into typed rows without touching any table. |\n| `raw_stats()` | table | Observed usage statistics per column: pushed-down filters and GROUP BY keys, collected automatically by an optimizer hook. |\n| `raw_optimize(table)` | table | RawMergeTree adaptive layout: physically reorders the table by its hottest columns. Incremental: append-only growth since the last optimize sorts only the new tail into a fresh sorted run (`mode` = `full` / `incremental` / `noop`). |\n| `raw_transforms()` / `raw_transform_define(name, path)` | table / scalar | List and register ingest-time transforms; definitions compose with `read_json`, tables, or any query. |\n| `raw_stats_save(catalog?)` / `raw_stats_load(catalog?)` | table | Persist observed statistics into a store (`__rawduck_stats` table) and merge them back after restart. |\n| `raw_projections()` | table | The projection advisor: GROUP BY shapes queries actually run, with observation counts and materialization status. |\n| `raw_project(table)` | table | RawMergeTree auto-projections: materializes the hottest observed aggregation as a lightweight `\u003ctable\u003e__proj` summary table. |\n| `raw_serve(host, port, token)` / `raw_serve_stop()` | table | Start/stop the in-process HTTP API (see below). |\n| `raw_serve_grpc(host, port, token)` / `raw_serve_grpc_stop()` | table | Start/stop the OTLP/gRPC collector (opt-in build, see Building). |\n| `raw_flush()` | table | Synchronously drain the async-insert buffers. 
|\n| `raw_type(json)` | scalar | Concrete type of a JSON value (RawMergeTree's `dynamicType()`): `Null`, `Bool`, `Int64`, `UInt64`, `Double`, `String`, `Array`, `Object`. |\n| `raw_infer(json)` | scalar | The DuckDB type RawDuck assigns to a value, e.g. `BIGINT`, `DOUBLE[]`, or the flattened layout for objects: `OBJECT(a BIGINT, b.c VARCHAR)`. |\n\nAll ingest functions accept `transform := '...'`, `explode := '...'` and `ignore_errors := true`.\n\n## Asynchronous inserts\n\nBy default every `raw_ingest` call parses, evolves the schema, appends, and commits before\nreturning — callers immediately see their rows. Under many concurrent writers issuing small\npayloads, that means one transaction per call. Asynchronous mode trades immediate visibility for\nthroughput: calls enqueue the payload into a per-table buffer and return instantly, and a\nbackground flusher ingests each buffer as a single batch.\n\n```sql\nSET rawduck_async_insert = true;\n\nCALL raw_ingest('events', '[{\"id\": 1, \"action\": \"click\"}]');  -- returns immediately, rows = 0\nCALL raw_ingest('events', '[{\"id\": 2, \"action\": \"view\"}]');\n\n-- a buffer flushes when it exceeds the size threshold or its oldest entry exceeds the age\n-- threshold; force it when you need the data now:\nCALL raw_flush();\n-- ┌─────────┬──────┐\n-- │ targets │ rows │\n-- │       1 │    2 │\n-- └─────────┴──────┘\n\nSELECT count(*) FROM events;   -- 2\n```\n\n| Setting / function | Default | Meaning |\n|---|---|---|\n| `rawduck_async_insert` | `false` | Enable buffered ingestion for `raw_ingest` / `raw_ingest_file`. |\n| `rawduck_async_max_data_size` | `1048576` | Flush a table's buffer once it holds this many bytes. |\n| `rawduck_async_busy_timeout_ms` | `200` | Flush a buffer once its oldest payload is this old. |\n| `raw_flush()` | — | Drain all buffers synchronously; returns `(targets, rows)`. 
|\n\nSemantics to know before enabling it:\n\n- Buffered payloads commit in the flusher's own transactions — a `ROLLBACK` in the calling\n  session does not un-enqueue them, and a failed background flush drops that batch.\n- Data buffered for less than the age threshold is lost if the database closes first; call\n  `raw_flush()` before shutdown.\n- The HTTP and gRPC servers ingest asynchronously by default (their clients are exactly the\n  many-small-writers case, and a single flusher also serializes schema evolution instead of\n  letting per-request transactions race on it). Start them with `async := false` to make every\n  request its own synchronous transaction.\n\n## HTTP API\n\nRawDuck can serve an in-process HTTP API for ingestion and querying:\n\n```sql\nCALL raw_serve(host := '127.0.0.1', port := 9999, token := 'rt_secret');\nCALL raw_serve_stop();\n```\n\n```sh\ncurl -X POST localhost:9999/v1/tables/events -H \"Authorization: Bearer rt_secret\" \\\n     -d '[{\"action\":\"click\",\"user\":\"alice\",\"value\":42}]'\n# {\"table\":\"events\",\"inserted\":1,\"created\":true,\"columns_added\":3,\"errors\":0}\n\ncurl -X POST localhost:9999/v1/query -H \"Authorization: Bearer rt_secret\" \\\n     -d '{\"sql\":\"SELECT action, count(*) FROM events GROUP BY action\"}'\n# {\"meta\":[...],\"data\":[[\"click\",1]],\"rows\":1,\"statistics\":{\"elapsed\":0.0016}}\n```\n\n| Endpoint | Behavior |\n|---|---|\n| `GET /health` | `{\"status\":\"ok\"}` |\n| `POST /v1/query` | `{\"sql\": \"...\"}` → `meta` / `data` / `rows` / `statistics` |\n| `GET /v1/tables`, `GET /v1/tables/{t}` | list tables / describe schema |\n| `POST /v1/tables/{t}` | schema-less ingest (`?transform=`, `?explode=`, `?ignore_errors=true`) |\n| `DELETE /v1/tables/{t}` | drop table |\n| `POST /otlp/v1/{traces,logs,metrics}` | OTLP/HTTP ingest (JSON or protobuf bodies) with envelope unwrapping and spec-shaped `partialSuccess` responses |\n\nRequests run on their own connections/transactions; a bearer `token` 
guards everything except\n`/health`; CORS is enabled for browser clients; gzip is supported both ways (request bodies with\n`Content-Encoding: gzip`, compressed responses for `Accept-Encoding: gzip` clients). Binds to\nlocalhost by default.\n\n### OpenTelemetry SDKs\n\nThe OTLP routes follow the standard signal paths and accept both wire encodings — `http/protobuf`\n(the SDK default) and `http/json` — so SDKs only need the endpoint base:\n\n```sh\nexport OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:9999/otlp\nexport OTEL_EXPORTER_OTLP_HEADERS=\"authorization=Bearer rt_secret\"\n```\n\nSignals land in `otel_traces`, `otel_logs`, and `otel_metrics` by default; route them to custom\ntables with the `x-rawduck-traces-table`, `x-rawduck-logs-table`, or `x-rawduck-metrics-table`\nheaders (the generic `x-rawduck-table` also works). Responses are OTLP-conformant in the request's\nencoding: an empty `partialSuccess` on full acceptance, signal-specific rejected counts otherwise.\nBoth encodings produce identical columns — trace/span ids stored as hex, enum fields as integers,\n`*UnixNano` timestamps as `BIGINT` — so mixed fleets of exporters share tables cleanly.\n\n### OTLP/gRPC\n\nBuilds made with `make release RAWDUCK_ENABLE_GRPC=1` (see Building) also serve the standard\nOpenTelemetry collector services natively:\n\n```sql\nCALL raw_serve_grpc(port := 4317, token := 'rt_secret');   -- TraceService/LogsService/MetricsService\nCALL raw_serve_grpc_stop();\n```\n\n```sh\nexport OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317\nexport OTEL_EXPORTER_OTLP_PROTOCOL=grpc\nexport OTEL_EXPORTER_OTLP_HEADERS=\"authorization=Bearer rt_secret\"\n```\n\nRequests are converted through protobuf's canonical OTLP/JSON mapping and flow through the same\nnative ingestion path as the HTTP routes, with the same `x-rawduck-*-table` routing (sent as gRPC\nmetadata) and `partialSuccess` semantics. 
On builds without gRPC the functions explain themselves;\nOTLP/HTTP (both encodings) stays fully functional.\n\n## ATTACH: RawMergeTree stores\n\n```sql\nATTACH 'rawduck:store.db' AS raw;\n```\n\nA RawDuck store is a native DuckDB database under a RawDuck-typed catalog: everything DuckDB can\ndo — joins, window functions, updates, exports, other extensions — works on RawMergeTree tables\ntransparently and at full native speed, while the store identifies itself for RawDuck's ingestion\nand adaptive-layout machinery. Stores persist and reattach like any database file.\n\nTwo kinds of `INSERT` coexist: typed inserts into the real tables behave exactly like DuckDB\n(fixed columns, binder-validated), while inserts into the virtual `ingest` schema take raw JSON\npayloads and handle creation and evolution. Both run in your transaction.\n\n## Transforms\n\nRawDuck reshapes envelope-style telemetry at ingest time: one row per nested event,\nwith the wrapper's fields merged into each row.\n\n```sql\n-- {\"owner\":\"123\",\"logGroup\":\"/aws/lambda/api\",\"logEvents\":[{\"id\":\"1\",\"message\":\"started\"},...]}\nCALL raw_ingest('logs', payload, transform := 'cloudwatch-logs');\n-- one row per log event, with owner and logGroup columns on each\n\n-- the generic form works for any envelope shape\nCALL raw_ingest('events', payload, explode := 'batch.items');\n```\n\nTransforms also apply to the INSERT lane through a session setting (a transform name or a\ndotted explode path):\n\n```sql\nSET rawduck_insert_transform = 'otlp-traces';\nINSERT INTO raw.ingest.spans SELECT json FROM read_json('traces.ndjson', ...);\nRESET rawduck_insert_transform;\n```\n\nBuilt-in transforms: `cloudwatch-logs`, `cloudtrail`, `firehose`, `otlp-traces`, `otlp-logs`,\n`otlp-metrics` (multi-level envelopes like `resourceSpans[].scopeSpans[].spans[]` are unwrapped\nwith resource/scope fields merged into every row). 
Transforms are user-extensible — definitions\nare data, so they load from files or tables like anything else in DuckDB:\n\n```sql\nSELECT raw_transform_define('my-batch', 'data.items');                          -- one-off\nSELECT raw_transform_define(name, explode) FROM read_json('transforms.json');   -- from a file\nSELECT raw_transform_define(name, explode) FROM raw.transform_config;           -- from a table\nCALL raw_transforms();                                                 -- list them all\n```\n\nDirty NDJSON streams can be ingested with `ignore_errors := true`; skipped lines are counted in the `errors` column.\n\n## The type lattice\n\nPer JSON path, RawDuck infers the narrowest type that holds everything seen, widening monotonically\nas data arrives (existing columns are `ALTER`ed in place, never rewritten destructively):\n\n```\n            BOOLEAN   BIGINT ──\u003e DOUBLE     DATE ──\u003e TIMESTAMP\n                │        │          │          │          │\n                └────────┴────\u003e VARCHAR \u003c──────┴──────────┘      (scalar conflicts)\n\n            object vs scalar, mixed arrays, arrays of objects ──\u003e JSON   (structural conflicts)\n```\n\n- integers out of `BIGINT` range degrade to `DOUBLE`\n- ISO `DATE` / `TIMESTAMP` strings are sniffed into temporal columns\n- homogeneous scalar arrays become typed `LIST`s (`BIGINT[]`, nested `BIGINT[][]`, …)\n- nothing is ever dropped: structurally conflicting values are preserved verbatim as `JSON`\n\n## Adaptive layout from observed workloads\n\nAn optimizer hook records every filter DuckDB pushes into a table scan and every GROUP BY\ncolumn set. 
`raw_optimize` turns filter/group usage into a physical sort order\n(RawMergeTree adaptive primary keys); `raw_project` materializes the hottest aggregation\nas a summary table (RawMergeTree lightweight projections):\n\n```sql\nSELECT count(*) FROM gh_events WHERE type = 'PushEvent';\nSELECT sum(\"payload.size\") FROM gh_events WHERE type = 'PushEvent' AND \"repo.id\" \u003e 700000000;\n\nCALL raw_stats();\n-- gh_events | type    | 2\n-- gh_events | repo.id | 1\n\nCALL raw_optimize('gh_events');\n-- gh_events | \"type\", \"repo.id\" | 247199\n\nSELECT type, count(*) FROM gh_events GROUP BY type;        -- observed by the advisor\nCALL raw_project('gh_events');\n-- gh_events | gh_events__proj | type | 15                 -- pre-aggregated summary table\n\nSET rawduck_use_projections = true;\nSELECT type, count(*) FROM gh_events GROUP BY type;        -- now answered from the projection\n```\n\nWith `rawduck_use_projections` enabled (off by default), eligible `count(*)` aggregations are\nrewritten onto fresh projections transparently — result types and values are identical, and a\nphysical-row-count staleness token guarantees a changed base table always falls back to a full\nscan. Intended for append-only analytics; in-place `UPDATE`s of group columns require re-running\n`raw_project`. Statistics persist across sessions with `raw_stats_save('store')` /\n`raw_stats_load('store')`.\n\n## DuckLake as a backend\n\nFor non-native catalogs RawDuck falls back to catalog-level SQL, so schema-less ingestion also\nworks against [DuckLake](https://ducklake.select) — straight into a lakehouse with snapshots and\nschema evolution tracked in the metadata:\n\n```sql\nATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 's3://bucket/raw');\nCALL raw_ingest('lake.main.events', payload);\n```\n\nCatalogs that cannot rewrite columns with expressions (DuckLake rejects `ALTER ... 
USING`)\ndegrade gracefully: RawDuck keeps the existing column type and converts incoming values instead.\n\n## Building\n\n```sh\ngit submodule update --init\nGEN=ninja make release\n```\n\nOTLP/HTTP protobuf decoding is **on by default**: builds pick up protobuf from the vcpkg\nmanifest's default `protobuf` feature (or a system package locally) and skip it gracefully when\nunavailable (wasm builds, or `make release RAWDUCK_DISABLE_OTLP_PROTOBUF=1`); without it, protobuf\nbodies get a 415 pointing at `http/json`.\n\nThe OTLP/gRPC server is **opt-in at build time** (it pulls the full gRPC stack, which\nsignificantly lengthens builds): `make release RAWDUCK_ENABLE_GRPC=1` enables it, using the\n`grpc` vcpkg manifest feature in CI or system gRPC/protobuf locally. Default builds skip it —\nOTLP/HTTP stays fully functional and `raw_serve_grpc()` explains how to enable support. The flags\nare cached per build directory; run `make clean` when toggling them. wasm builds never include them.\n\nArtifacts:\n\n```sh\n./build/release/duckdb                                          # shell with rawduck linked in\n./build/release/test/unittest                                   # test runner\n./build/release/extension/rawduck/rawduck.duckdb_extension      # loadable extension\n```\n\n## Tests\n\n```sh\nmake test\n```\n\nThe sqllogictests in `test/sql/` cover all standard JSON types, nested flattening, NDJSON, type\nwidening, schema evolution, structural conflicts, streaming file ingestion, multi-threaded appends, transforms, projections,\nerror-tolerant ingestion, RawDuck stores (`ATTACH 'rawduck:...'`), transactional rollback,\npredicate statistics + adaptive reordering, and DuckLake catalogs (`test/sql/ducklake.test`).\n\n## Status\n\nAll RawMergeTree concepts are implemented: schema-less evolving ingestion\n(native, transactional, pipelined, multi-threaded), adaptive physical layout from observed\npredicates with incremental re-sorting, the projection advisor with automatic 
aggregate rewriting,\nextensible ingest-time transforms, persisted statistics, RawDuck stores, DuckLake fallback, and\nan in-process HTTP API for ingestion and querying.\n\nSee [BENCHMARK.md](BENCHMARK.md) to reproduce the numbers and [AGENTS.md](AGENTS.md) for the\ndesign guide.\n\n---\n\nBased on the [DuckDB extension template](https://github.com/duckdb/extension-template).\nJSON parsing via DuckDB's vendored [yyjson](https://github.com/ibireme/yyjson).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquackscience%2Frawduck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquackscience%2Frawduck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquackscience%2Frawduck/lists"}