{"id":50618928,"url":"https://github.com/jeroenflvr/datapress","last_synced_at":"2026-06-06T09:01:26.810Z","repository":{"id":361597174,"uuid":"1247303707","full_name":"jeroenflvr/datapress","owner":"jeroenflvr","description":"A config-driven Rust server that publishes Parquet and Delta datasets as fast, typed HTTP APIs from local disk or object storage, with interchangeable DuckDB or Arrow+DataFusion backends, JSON and Arrow IPC output, and production-ready features like auth, metrics, and hot reloads.","archived":false,"fork":false,"pushed_at":"2026-06-05T05:45:04.000Z","size":2262,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-05T07:27:30.318Z","etag":null,"topics":["api","arrow","arrow-ipc","authz","datafusion","deltalake","duckdb","http","in-memory","parquet","s3","sql"],"latest_commit_sha":null,"homepage":"https://docs.datap-rs.org","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jeroenflvr.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T06:19:14.000Z","updated_at":"2026-06-05T05:45:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jeroenflvr/datapress","commit_stats":null,"previous_names":["jeroenflvr/fast-api","jeroenflvr/datapress"],"tags_count":46,"template":false,"template_full_name":null,"purl":"pkg:gith
ub/jeroenflvr/datapress","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeroenflvr%2Fdatapress","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeroenflvr%2Fdatapress/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeroenflvr%2Fdatapress/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeroenflvr%2Fdatapress/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jeroenflvr","download_url":"https://codeload.github.com/jeroenflvr/datapress/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jeroenflvr%2Fdatapress/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33975476,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","arrow","arrow-ipc","authz","datafusion","deltalake","duckdb","http","in-memory","parquet","s3","sql"],"created_at":"2026-06-06T09:01:26.231Z","updated_at":"2026-06-06T09:01:26.802Z","avatar_url":"https://github.com/jeroenflvr.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Rust](https://img.shields.io/badge/built%20with-Rust-orange?logo=rust)\n![DuckDB](https://img.shields.io/badge/b
ackend-DuckDB-yellow?logo=duckdb)\n![DataFusion](https://img.shields.io/badge/backend-DataFusion-blue?logo=apache)![actix](https://img.shields.io/badge/backend-actix-orange?logo=actix)\n\n# datap-rs\n\nA Rust **Cargo workspace** that exposes one or more **Parquet / Delta\ndatasets** over a JSON HTTP API. The same surface area is implemented twice —\nonce on top of **DuckDB**, once on top of **Apache Arrow + DataFusion** — so\nyou can A/B the engines under identical workloads. A Python wheel\n(`datap-rs`, built with maturin + PyO3) bundles both engines and lets you\nconfigure and launch the server from Python.\n\n**[Overview presentation → datap-rs.org](https://datap-rs.org)** ·\n[Documentation](https://docs.datap-rs.org)\n\n- Built on [actix-web](https://actix.rs/) 4\n- Datasets declared in a single [`datasets.toml`](datasets.toml) (Rust\n  binaries) or programmatically (Python wrapper)\n- Dynamic schema inference at startup (no hard-coded columns)\n- Identical request/response shapes across both backends\n\n---\n\n## Quick start\n\nFor testing, we're using this [kaggle US accidents 2016-2023](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) dataset.\n\n\n```bash\n# 1. Put a parquet file somewhere (or point the config at an existing one).\nls data/accidents.parquet\n\n# 2. Edit datasets.toml — see the example shipped in this repo.\n\n# 3. Run a backend.\ntask run:duckdb        # or: task run:datafusion\n\n# 4. 
Talk to it.\ncurl http://localhost:8080/api/v1/datasets\n```\n\n`Taskfile.yml` wraps the typical `cargo build --release -p …` invocations;\nsee [`task --list`](Taskfile.yml) for the full menu.\n\n### Install the prebuilt binary\n\nThe quickest way to get the unified `datapress` binary (both backends bundled,\nselected at runtime via `server.backend`) without a Rust toolchain:\n\n```bash\n# Linux / macOS\ncurl -LsSf https://datap-rs.org/install.sh | sh\n\n# Windows (PowerShell)\npowershell -ExecutionPolicy ByPass -c \"irm https://datap-rs.org/install.ps1 | iex\"\n\n# Homebrew (macOS / Linux)\nbrew install jeroenflvr/tap/datapress\n\n# winget (Windows)\nwinget install datap-rs.DataPress\n```\n\nThe install scripts drop the binary in a per-user directory (`~/.local/bin` on\nUnix, `%LOCALAPPDATA%\\datapress\\bin` on Windows) and tell you how to add it to\nyour `PATH`. See [packaging/](packaging/) for details and release automation.\n\nPrefer cargo? Install from crates.io:\n\n```bash\ncargo install datapress        # both DuckDB + DataFusion\ndatapress                      # reads ./datasets.toml (or $DATASETS_CONFIG)\n```\n\nFor a slimmer single-backend build, or to opt into the docs / Swagger /\nmetrics / auth features:\n\n```bash\ncargo install datapress --no-default-features --features duckdb\ncargo install datapress --features swagger,auth,metrics\n```\n\nThe installed binary resolves its config from (first match wins)\n`--config \u003cFILE\u003e`, `$DATAPRESS_CONFIG_FILE`, `./datasets.toml`, then\n`$HOME/datasets.toml`. 
Generate a starter template with `datapress init`\n(writes `datasets.toml.template` to a directory, or `$HOME` when omitted):\n\n```bash\ndatapress init                 # ~/datasets.toml.template\ncp ~/datasets.toml.template ~/datasets.toml   # then edit and run `datapress`\n```\n\n### From Python\n\nThe same server can be configured and launched from Python via the\n`datapress` wheel (one wheel, both engines bundled):\n\n```python\nimport asyncio\nfrom datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig\n\nasync def main():\n    ds = DatasetConfig(\n        name=\"accidents\",\n        source=\"data/accidents.parquet\",\n        format=\"parquet\",   # or \"delta\"\n        mode=\"auto\",        # index policy: \"auto\" | \"none\" | \"list\"\n    )\n    cfg = DataPressConfig(backend=\"duckdb\", listen=\"0.0.0.0\", port=8000, workers=8)\n    server = DataPress(cfg, datasets=[ds])\n    await server.run()      # blocks until SIGINT\n\nasyncio.run(main())\n```\n\nBuild the wheel with `task py:develop` (uses `uv` + `maturin`).\n\n---\n\n## The two backends\n\n| Aspect              | `datapress-duckdb`                             | `datapress-datafusion`                               |\n|---------------------|------------------------------------------------|------------------------------------------------------|\n| Engine              | DuckDB (embedded C++)                          | Arrow compute + DataFusion (pure Rust)               |\n| Storage             | DuckDB in-memory table per dataset             | One contiguous `RecordBatch` per dataset             |\n| Concurrency model   | Connection pool, blocking → `web::block`       | Async-native, multi-threaded `MemTable` partitions   |\n| Predicate execution | DuckDB optimiser + parallel hash/vector ops    | Equality index → SIMD scan → DataFusion SQL          |\n| Indexes             | Native DuckDB internals (zone maps, etc.)      
| Per-dataset eq-index built at startup (configurable) |\n| Memory profile      | DuckDB's own buffer manager                    | Whole dataset resident in RAM                        |\n| Binary size         | Bundled DuckDB ≈ tens of MB                    | Lean — pure Rust                                     |\n| Startup time        | Fast (just `read_parquet`)                     | Slower — reads all rows + builds eq-index            |\n| Best at             | Heterogeneous SQL, joins, aggregations         | Dense filter scans, low-latency point lookups        |\n\n### When to pick which\n\n- **DuckDB** is the right default. It handles arbitrary SQL well, has a\n  battle-tested optimiser, manages memory itself, and starts up in\n  milliseconds because it lazily reads parquet pages on demand.\n- **DataFusion** shines when:\n  - the dataset fits comfortably in RAM,\n  - you query the same columns repeatedly with equality/`IN` predicates\n    (the in-process equality index turns those into O(1) lookups), and\n  - you want a single static binary without a vendored C++ runtime.\n\nThe HTTP API is identical, so the practical comparison is \"throughput and\np99 on your queries\" — see [`TEST_Q.md`](TEST_Q.md) for a benchmark suite.\n\n---\n\n## Configuration: `datasets.toml`\n\nEvery instance reads this file at startup. 
One `[server]` block plus one\n`[[dataset]]` entry per table you want to expose.\n\n```toml\n[server]\nbackend = \"datafusion\"   # \"datafusion\" (default) | \"duckdb\"\nlisten  = \"127.0.0.1\"    # default; set to \"0.0.0.0\" to expose\nport    = 8080\n# workers = 8            # omit for one worker per CPU\n# compress = true        # negotiate gzip/brotli/zstd via Accept-Encoding (default)\n# max_body_bytes     = 1048576  # 413 above this; default 1 MiB\n# max_page_size      = 100000   # clamp query page_size above this\n# request_timeout_ms = 30000    # 504 above this; 0 disables; default 30s\n# shutdown_timeout_secs = 30    # SIGTERM/SIGINT grace period, in seconds\n\n# DuckDB backend only: enable the experimental Quack remote protocol.\n# [server.quack]\n# enabled = false\n# uri = \"quack:localhost\"\n# token = \"change-me\"\n# read_only = true\n\n[[dataset]]\nname = \"accidents\"                    # used in the URL: /api/datasets/accidents/...\n\n  [dataset.source]\n  kind     = \"parquet\"                # \"parquet\" | \"delta\"\n  location = \"data/accidents.parquet\" # file, directory of *.parquet, or s3://…\n\n  # Optional — DataFusion only. DuckDB ignores this block.\n  [dataset.index]\n  mode             = \"auto\"           # \"auto\" | \"none\" | \"list\"\n  columns          = []               # required when mode = \"list\"\n  max_cardinality  = 100000           # used by \"auto\" to skip wide cols\n```\n\n### Server\n\n| Field     | Default       | Notes                                                                                          |\n|-----------|---------------|------------------------------------------------------------------------------------------------|\n| `backend` | `datafusion`  | Informational hint; logged at startup. Each binary always runs as its own backend regardless of this value. |\n| `listen`  | `127.0.0.1`   | Loopback by default — the service is **not** exposed on a network interface unless you opt in. 
|\n| `port`    | `8080`        |                                                                                                |\n| `workers` | *(unset)*     | Actix worker threads. Unset = one per CPU.                                                     |\n| `prefix`  | `\"\"`          | URL path prefix mounted in front of every route (e.g. `\"/datapress\"`) — useful behind a reverse proxy that passes the path through unchanged. Must start with `/` and not end with `/`. |\n| `compress`           | `true`     | Negotiate response compression via `Accept-Encoding` (gzip / brotli / zstd). Disable when sitting behind a proxy that compresses for you. |\n| `max_body_bytes`     | `1048576`  | Maximum accepted JSON request body, in bytes. Bigger bodies are rejected with `413 Payload Too Large`. |\n| `max_page_size`      | `100000`   | Maximum rows returned by one `/query` page. Larger `page_size` values are clamped. |\n| `request_timeout_ms` | `30000`    | Per-request handler timeout, in milliseconds. Long-running handlers are cancelled and the client gets `504 Gateway Timeout`. `0` disables the timeout. |\n| `shutdown_timeout_secs` | `30`     | Grace period for in-flight requests after the process receives `SIGTERM` / `SIGINT`, in seconds. The listening socket is closed immediately; existing connections then have up to this many seconds to finish before workers are force-stopped. |\n\nDuckDB builds can also opt into `[server.quack]`, DuckDB's experimental\nremote protocol server. Keep it disabled unless you intentionally want\nDuckDB clients to attach/query this process directly. It binds to\n`quack:localhost` by default, uses token authentication, and DataPress\ninstalls a read-only authorization hook by default.\n\nThe server exposes three probe endpoints. `/healthz` and `/readyz` are\nmounted at the bare host root (regardless of `prefix`) so orchestrators\ndon't need to know how the service is exposed. 
`/health` lives under\n`prefix` and is intended for in-app health checks.\n\n| Route      | Status                                                                 | Body                                                                       |\n|------------|------------------------------------------------------------------------|----------------------------------------------------------------------------|\n| `/healthz` | Liveness — always `200` while the process is running.                  | `{\"status\":\"ok\"}`                                                          |\n| `/readyz`  | Readiness — `200` once at least one dataset is registered, `503` otherwise. | `{\"status\":\"ready\",\"datasets\":N}` / `{\"status\":\"not ready\",\"reason\":\"no datasets registered\"}` |\n| `/version` | Build / version metadata — always `200`.                              | `{\"name\":\"datapress-core\",\"version\":\"x.y.z\",\"backend\":\"DuckDB\\|DataFusion\",\"profile\":\"debug\\|release\", ...}` |\n| `{prefix}/health` | App-level liveness — always `200`.                             | `{\"status\":\"ok\"}`                                                          |\n\n`/healthz` does not touch the backend, so it stays `200` even while the\ndataset registry is still loading at startup. Use `/readyz` to gate\ntraffic until the server is actually able to serve queries.\n\n`/version` also includes optional fields populated from build-time env\nvars when set: `git_sha` (`DATAPRESS_GIT_SHA`), `build_time`\n(`DATAPRESS_BUILD_TIME`, ISO-8601), and `target`\n(`DATAPRESS_TARGET`, e.g. `aarch64-apple-darwin`). Unset vars are\nomitted from the JSON. 
Example:\n\n```bash\nDATAPRESS_GIT_SHA=$(git rev-parse --short HEAD) \\\nDATAPRESS_BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \\\nDATAPRESS_TARGET=$(rustc -vV | awk '/host:/ {print $2}') \\\n  cargo build --release -p datapress-duckdb\n```\n\n### Online documentation\n\nDataPress can embed two browsable sources of documentation into the\nbinary itself:\n\n- An [MkDocs Material](https://squidfunk.github.io/mkdocs-material/)\n  site (the one you are reading) at `[docs].path` (default `/mkdocs`).\n- An interactive [Swagger UI](https://swagger.io/tools/swagger-ui/)\n  with a hand-written OpenAPI spec at `[swagger].path` (default\n  `/docs`). The raw spec is also exposed at `\u003cpath\u003e/openapi.json`.\n\nBoth are opt-in at build time (so wheels stay slim when you don't\nwant them) and **enabled by default at runtime** once compiled in —\nset `enabled = false` to disable in prod.\n\n1. Build the MkDocs site (only needed for the `docs` feature):\n\n   ```bash\n   task docs:build\n   ```\n\n2. Build the backend with one or both features:\n\n   ```bash\n   cargo build --release -p datapress-duckdb --features docs,swagger\n   ```\n\n3. Tweak in `datasets.toml` if you want to relocate or disable either:\n\n   ```toml\n   [docs]\n   enabled = true        # default: true\n   path    = \"/mkdocs\"   # default: /mkdocs\n\n   [swagger]\n   enabled = true        # default: true (set to false in prod)\n   path    = \"/docs\"     # default: /docs\n   ```\n\nBoth `path` values must start with `/`, not end with `/`, not collide\nwith `/api`, `/api/v1`, `/health{z,}`, `/readyz`, or `/version`, and\nmust differ from each other. When the binary is built without the\nrelevant feature but the TOML enables it, the server logs a warning at\nstartup and continues without that surface.\n\n### Authentication (OIDC / OAuth2)\n\nBuild with `--features auth` to enable JWT bearer enforcement against\nany OpenID-Connect issuer (Entra ID, Auth0, Keycloak, Okta, …). 
When\nenabled, the server fetches the issuer's JWKS at startup, refreshes it\nin the background, and validates `Authorization: Bearer \u003cjwt\u003e` headers\nagainst the configured issuer, audience, algorithms, and scopes.\n\n```toml\n[auth]\nenabled         = true\nissuer          = \"https://login.microsoftonline.com/\u003ctenant-id\u003e/v2.0\"\naudience        = \"api://datapress\"\nalgorithms      = [\"RS256\"]\nread_scopes     = [\"datasets:read\"]\nreload_scopes   = [\"datasets:reload\"]\nanonymous_read  = false      # set true to keep read endpoints public\ntenant_claim    = \"/tid\"     # JSON-pointer into the JWT claims\nallowed_tenants = [\"\u003ctenant-id\u003e\"]\nadmin_token_fallback = true  # keep X-Admin-Token working in parallel\n```\n\nHealth probes (`/healthz`, `/readyz`, `/version`) stay unauthenticated\nso load balancers keep working. The legacy `X-Admin-Token` header keeps\nworking for `POST .../reload` as long as `admin_token_fallback = true`.\n\nTo turn the Swagger UI itself into an SSO client, add an `[swagger.oauth2]`\nblock — it gets rendered as an `OpenIdConnect` security scheme with PKCE.\n\n### Source\n\n`[dataset.source]` is a tagged enum.\n\n| `kind`    | `location`                                          | Notes                                                                                  |\n|-----------|-----------------------------------------------------|----------------------------------------------------------------------------------------|\n| `parquet` | a `.parquet` file                                   | Read as-is.                                                                            |\n| `parquet` | a directory                                         | Every `*.parquet` inside (sorted, non-recursive). No glob patterns.                    |\n| `parquet` | `s3://bucket/key.parquet` or `s3://bucket/prefix/`  | Requires a `[dataset.s3]` block. DuckDB autoloads `httpfs`.                            
|\n| `delta`   | a local directory                                   | Pointed at the table root (the dir containing `_delta_log/`).                          |\n| `delta`   | `s3://bucket/path/to/table`                         | Requires `[dataset.s3]`. DuckDB autoloads `delta`; DataFusion uses the `deltalake` crate. |\n\n#### S3 / S3-compatible storage\n\n```toml\n[[dataset]]\nname = \"events\"\n\n  [dataset.source]\n  kind     = \"parquet\"           # or \"delta\"\n  location = \"s3://events/2025/*.parquet\"\n\n  [dataset.s3]\n  region            = \"us-east-1\"\n  endpoint          = \"http://localhost:9000\"  # omit for AWS\n  addressing_style  = \"path\"                   # \"virtual\" (default) | \"path\"\n  allow_http        = true                     # only for non-https endpoints\n```\n\n| Field              | Default       | Notes                                                                          |\n|--------------------|---------------|--------------------------------------------------------------------------------|\n| `region`           | `us-east-1`   | Falls back to `AWS_REGION` env, then `us-east-1`.                              |\n| `endpoint`         | *(unset)*     | Custom S3 endpoint (MinIO, R2, Wasabi, Backblaze, …).                          |\n| `addressing_style` | `virtual`     | `virtual` = `https://bucket.host`, `path` = `https://host/bucket` (MinIO).     |\n| `allow_http`       | `false`       | Must be `true` if `endpoint` is `http://…`.                                    |\n| `partitioning`     | `auto`        | Hive partition discovery: `auto`, `hive` (force on), `none` (force off).        |\n| `endpoint_bucket_in_host` | `auto` | Fold the bucket into the endpoint host: `auto` (follows `addressing_style`), `true`, `false`. |\n| `access_key_id`, `secret_access_key`, `session_token` | *(unset)* | Inline creds. Discouraged for prod — use env vars instead. |\n\n**Credential precedence** (highest → lowest):\n\n1. 
Per-dataset env vars: `${PREFIX}_AWS_ACCESS_KEY_ID`, `${PREFIX}_AWS_SECRET_ACCESS_KEY`, `${PREFIX}_AWS_SESSION_TOKEN`, `${PREFIX}_AWS_REGION`.\n   `PREFIX` is the dataset name uppercased with every non-alphanumeric character mapped to `_` (e.g. `accidents` → `ACCIDENTS_AWS_…`, `my-bucket` → `MY_BUCKET_AWS_…`).\n2. Inline `[dataset.s3]` keys.\n3. Plain `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_REGION`.\n4. The backend's default credential chain (`~/.aws/credentials`, IMDS, etc.).\n\n\u003e **Python:** the `S3Config` binding also accepts a `credentials_provider`: a zero-argument callable returning an `HMACKeyPair`. It is invoked once when `DataPress(...)` is constructed, the result is cached indefinitely, and it overrides any inline `access_key_id` / `secret_access_key`. See the [Python S3 docs](https://docs.datap-rs.org/python/config/#s3config).\n\n\n\u003e When `kind = \"delta\"` and `location` is an `s3://…` URL, both backends fully materialise the table at startup. There is no incremental scan path; switch to `parquet` if you need on-demand page reads.\n\n### Equality-index policy (DataFusion only)\n\nThe DataFusion backend builds an in-memory `value -\u003e [row ids]` map at\nstartup so that `eq` / `in` predicates resolve in O(1).\n\n| `mode`   | Behaviour                                                              |\n|----------|------------------------------------------------------------------------|\n| `auto`   | Index every column whose distinct count stays below `max_cardinality`. |\n| `none`   | Skip the index entirely; every query goes through DataFusion SQL.      |\n| `list`   | Index only the named `columns`. Useful for huge datasets.              |\n\nOverride the config path with `DATASETS_CONFIG=/path/to/file.toml`.\n\n## HTTP API\n\nFour read routes plus an admin-only `reload` route, on both backends:\n\n### API versioning\n\nThe canonical paths live under `/api/v1/...`. 
The un-versioned\n`/api/...` paths continue to work as a **legacy alias** for v1, so\nexisting clients keep running. To upgrade, replace `/api/` with\n`/api/v1/` in your URLs — nothing else changes.\n\n```text\nPOST /api/v1/datasets/accidents/query      # canonical (recommended)\nPOST /api/datasets/accidents/query         # legacy alias, still v1\n```\n\nWhen a breaking schema change is introduced, it will ship as `/api/v2`\nin a sibling module ([crates/core/src/handlers/v1.rs](crates/core/src/handlers/v1.rs))\nand v1 will stay mounted alongside it for a deprecation window.\n\n### `GET /api/v1/datasets`\n\n```json\n{ \"datasets\": [ { \"name\": \"accidents\", \"columns\": 47 } ] }\n```\n\n### `GET /api/v1/datasets/{name}/schema`\n\nReturns the inferred columns plus a sample row so a client can see what\nvalues look like without issuing a query.\n\n```json\n{\n  \"name\": \"accidents\",\n  \"columns\": [\n    { \"name\": \"ID\",       \"logical\": \"utf8\", \"sql_type\": \"VARCHAR\",   \"nullable\": false },\n    { \"name\": \"Severity\", \"logical\": \"int\",  \"sql_type\": \"INTEGER\",   \"nullable\": true  },\n    { \"name\": \"Start_Time\", \"logical\": \"temporal\", \"sql_type\": \"TIMESTAMP\", \"nullable\": true }\n  ],\n  \"sample\": { \"ID\": \"A-1\", \"Severity\": 2, \"Start_Time\": \"2016-02-08 05:46:00\", ... }\n}\n```\n\n`logical` values: `bool | int | float | utf8 | temporal | other`. Temporal\ncolumns are returned as strings.\n\n### `POST /api/v1/datasets/{name}/query`\n\n```json\n{\n  \"columns\":   [\"ID\", \"City\", \"State\", \"Severity\"],\n  \"predicates\": [\n    { \"col\": \"State\",    \"op\": \"eq\",  \"val\": \"TX\" },\n    { \"col\": \"Severity\", \"op\": \"gte\", \"val\": 3   }\n  ],\n  \"order_by\": [\n    { \"col\": \"Severity\", \"dir\": \"desc\" },\n    { \"col\": \"ID\" }\n  ],\n  \"limit\":     1000,\n  \"page\":      1,\n  \"page_size\": 50\n}\n```\n\nResponse:\n\n```json\n{ \"data\": [ { ... }, ... 
], \"page\": 1, \"page_size\": 50 }\n```\n\n#### Request fields\n\n| Field        | Type                | Default | Notes                                  |\n|--------------|---------------------|---------|----------------------------------------|\n| `columns`    | `string[]`          | `[]`    | Empty = all columns.                   |\n| `predicates` | `Predicate[]`       | `[]`    | ANDed together.                        |\n| `order_by`   | `OrderBy[]`         | `[]`    | `{ col, dir? }`; `dir` is `asc` (default) or `desc`, case-insensitive. When `group_by` is set, `col` must be a group column or aggregation alias. |\n| `group_by`   | `string[]`          | `[]`    | Columns to group by. When set, `columns` is ignored. Empty `aggregations` implies `[{ op: \"count\" }]`. |\n| `aggregations` | `Aggregation[]`   | `[]`    | `{ col?, op, alias? }`; `op` is `count\\|sum\\|avg\\|min\\|max`. `col` may be omitted only for `count` (= `COUNT(*)`). Requires `group_by`. |\n| `distinct`   | `bool`              | `false` | Dedup the projected columns. Mutually exclusive with `group_by` / `aggregations`. |\n| `limit`      | `int \u003e= 0` or null  | `null`  | Hard cap on total rows across all pages. `null` = unlimited. |\n| `page`       | `int \u003e= 1`          | `1`     | 1-based.                               |\n| `page_size`  | `int \u003e= 1`               | `1000`   | Clamped to `server.max_page_size` (`100_000` by default). 
|\n\n#### Predicate shape\n\n```json\n{ \"col\": \"\u003ccolumn\u003e\", \"op\": \"\u003coperator\u003e\", \"val\": \u003cjson value | array | omitted\u003e }\n```\n\n| `op`           | `val`                  | Meaning                              |\n|----------------|------------------------|--------------------------------------|\n| `eq`           | scalar                 | `col = val`                          |\n| `neq`          | scalar                 | `col \u003c\u003e val`                         |\n| `gt` / `gte`   | number / string        | `col \u003e val` / `col \u003e= val`           |\n| `lt` / `lte`   | number / string        | `col \u003c val` / `col \u003c= val`           |\n| `like`         | string with `%` / `_`  | SQL `LIKE`                           |\n| `ilike`        | string with `%` / `_`  | Case-insensitive `LIKE`              |\n| `in`           | non-empty array        | `col IN (v1, v2, …)`                 |\n| `is_null`      | omit                   | `col IS NULL`                        |\n| `is_not_null`  | omit                   | `col IS NOT NULL`                    |\n\nColumn names are looked up case-insensitively against the inferred schema\nand quoted automatically, so `Temperature(F)` and similar identifiers work.\n\n#### Response format — JSON or Arrow IPC\n\n`/query` can return its result set in two wire formats. 
Same body, same\npredicates, same pagination — only the response encoding differs.\n\n| Aspect              | JSON (default)                                       | Arrow IPC stream                                                                 |\n|---------------------|------------------------------------------------------|----------------------------------------------------------------------------------|\n| Content-Type        | `application/json`                                   | `application/vnd.apache.arrow.stream`                                            |\n| How to ask          | nothing — it's the default                           | `Accept: application/vnd.apache.arrow.stream` **or** `?format=arrow` on the URL  |\n| Shape               | Array of row objects (`[{...}, {...}, ...]`)         | Self-describing stream: 1 schema message + N `RecordBatch` messages + EOS        |\n| Layout              | Row-oriented; column names repeated on every row     | Columnar; one contiguous buffer per column per batch                             |\n| Types preserved     | Scalars become JSON (`int`/`float`/`bool`/`string`); temporals stringified to ISO-8601 | Native Arrow types — `Int32`, `Timestamp(ns)`, `Decimal128`, dictionary, etc. 
retained end-to-end |\n| Page metadata       | In the body (just the rows, no envelope)             | In headers: `X-Page`, `X-Page-Size`                                              |\n| Empty result        | `[]`                                                 | Valid stream with the schema message only, zero batches                          |\n| Compression         | Big win — JSON is text                               | Smaller starting point; gzip/zstd still help on wide / repetitive cols, brotli usually skipped |\n| Client cost         | `json.loads` + per-row dict construction             | `pyarrow.ipc.open_stream(...).read_all()` → zero-copy `pyarrow.Table`            |\n| Best for            | Small responses, browsers, ad-hoc `curl`, dashboards | Bulk data into Polars / pandas / DuckDB-on-the-client, ML feature pipelines      |\n\n**When to pick which.** Use JSON when the consumer is JavaScript, the\nresponse is small (\u003c~10k rows), or you're poking at the API by hand.\nUse Arrow IPC when you're moving result pages into a dataframe library,\nthe schema has non-string types you want preserved, or page sizes are\nlarge enough that JSON parse time shows up in profiles.\n\n```bash\n# JSON (default)\ncurl -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{ \"predicates\": [{ \"col\": \"State\", \"op\": \"eq\", \"val\": \"TX\" }] }'\n\n# Arrow IPC — via Accept header\ncurl -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -H 'Accept: application/vnd.apache.arrow.stream' \\\n  --output result.arrow \\\n  -d '{ \"predicates\": [{ \"col\": \"State\", \"op\": \"eq\", \"val\": \"TX\" }] }'\n\n# Arrow IPC — via query string (handy when you can't set headers)\ncurl -X POST 'http://localhost:8080/api/v1/datasets/accidents/query?format=arrow' \\\n  -H 'Content-Type: application/json' \\\n  --output result.arrow \\\n  -d '{ \"predicates\": [{ 
\"col\": \"State\", \"op\": \"eq\", \"val\": \"TX\" }] }'\n```\n\n```python\nimport requests, pyarrow.ipc as ipc\nr = requests.post(url, json=req, headers={\"Accept\": \"application/vnd.apache.arrow.stream\"})\ntable = ipc.open_stream(r.content).read_all()  # → pyarrow.Table\npage  = int(r.headers[\"X-Page\"])\nsize  = int(r.headers[\"X-Page-Size\"])\n```\n\nSupported on **both** backends — DuckDB streams batches out via its\nnative `query_arrow` API, DataFusion uses its Arrow plan directly.\nThe `Compress` middleware still applies. `count`, `schema`, and the\ndataset-listing endpoints are JSON-only.\n\n#### Grouping / aggregation\n\nWhen `group_by` is non-empty the SELECT list is derived from the group\ncolumns plus each aggregation's output alias — the top-level `columns`\nfield is ignored. Supported ops: `count`, `sum`, `avg`, `min`, `max`\n(case-insensitive). `col` may be omitted only for `count` (= `COUNT(*)`).\nIf `aggregations` is omitted an implicit `COUNT(*) AS count` is added.\n\n```bash\ncurl -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"group_by\": [\"State\"],\n    \"aggregations\": [\n      { \"op\":  \"count\" },\n      { \"col\": \"Severity\", \"op\": \"avg\", \"alias\": \"avg_sev\" }\n    ],\n    \"order_by\": [{ \"col\": \"count\", \"dir\": \"desc\" }],\n    \"page_size\": 10\n  }'\n# → { \"data\": [ { \"State\": \"CA\", \"count\": 1741433, \"avg_sev\": 2.21 }, ... ], ... }\n```\n\n`aggregations` without `group_by` returns `400`. `order_by` keys must\nreference a group column or an aggregation alias (no arbitrary dataset\ncolumns — they are not in scope after `GROUP BY`). Grouped queries always\ngo through the SQL engine; no in-memory fast path applies.\n\n#### Distinct rows\n\n`distinct: true` deduplicates on the projected columns. 
Useful for\nbuilding dropdowns / facet lists.\n\n```bash\ncurl -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"columns\":  [\"State\"],\n    \"distinct\": true,\n    \"order_by\": [{ \"col\": \"State\" }],\n    \"page_size\": 100\n  }'\n# → { \"data\": [ { \"State\": \"AL\" }, { \"State\": \"AR\" }, ... ], ... }\n```\n\nMutually exclusive with `group_by` / `aggregations` (returns `400` if\ncombined). Also bypasses the in-memory fast paths.\n\n### `POST /api/v1/datasets/{name}/count`\n\nReturns the number of rows matching `predicates`. Same predicate shape as\n`/query`; only the `predicates` field is read. Empty body counts every row.\n\n```bash\ncurl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \\\n  -H 'Content-Type: application/json' -d '{}'\n# → { \"count\": 7728394 }\n\ncurl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"predicates\": [\n      { \"col\": \"State\",    \"op\": \"eq\",  \"val\": \"TX\" },\n      { \"col\": \"Severity\", \"op\": \"gte\", \"val\": 3   }\n    ]\n  }'\n# → { \"count\": 187423 }\n```\n\nOn materialised DataFusion datasets the no-predicate path is O(1) (uses the\nresident chunk metadata, no scan); indexable predicates short-circuit\nthrough the equality index. Otherwise it runs `SELECT COUNT(*) … WHERE …`\nthrough the engine.\n\n### `POST /api/v1/datasets/{name}/reload` *(admin)*\n\nRebuilds the dataset from its configured `source` and publishes the new\ncontents without a server restart. Running queries finish against a\nconsistent old snapshot; later queries see the new data. If the rebuild\nfails, the previously published dataset stays live.\n\nRequires `X-Admin-Token: $ADMIN_TOKEN`. **If `ADMIN_TOKEN` is unset the\nendpoint is disabled** — the secure default. 
The comparison is\nconstant-time.\n\n```bash\ncurl -s -X POST \\\n  -H \"X-Admin-Token: $ADMIN_TOKEN\" \\\n  http://localhost:8080/api/v1/datasets/accidents/reload\n# { \"dataset\": \"accidents\", \"rows\": 7728394, \"elapsed_ms\": 1842 }\n```\n\n| Status | Body                                          | Meaning                                              |\n|--------|-----------------------------------------------|------------------------------------------------------|\n| `200`  | `{ dataset, rows, elapsed_ms }`               | New data live.                                       |\n| `403`  | `{ \"error\": \"forbidden: …\" }`                 | Token missing/wrong, or `ADMIN_TOKEN` not set.       |\n| `404`  | `{ \"error\": \"not found: dataset: …\" }`        | No such dataset in `datasets.toml`.                  |\n| `500`  | `{ \"error\": \"internal error: …\" }`            | Parquet read failed — old data stays live.           |\n\nConcurrent reloads of the **same** dataset are serialised (per-name mutex);\nreloads of **different** datasets run in parallel.\n\n#### Backend-specific reload semantics\n\n- **DataFusion** uses a service-level double buffer. The backend builds a\n  fresh `DatasetState` off to the side (parquet/Delta read, Arrow\n  `RecordBatch` chunks, equality indexes, partition metadata), registers\n  the new provider, then publishes it with an `ArcSwap` snapshot update.\n  Queries that already captured the old `Arc` keep running; later queries\n  see the new state. The old buffers are dropped once the last reader\n  releases its reference. Trade-off: for materialised datasets, peak RSS\n  can approach roughly twice the dataset size plus index overhead during\n  reload.\n- **DuckDB** delegates publication to the database engine. Reload runs\n  `CREATE OR REPLACE TABLE ... 
AS SELECT ...` against the dataset source.\n  DuckDB treats that as an ACID transaction over the table/catalog\n  replacement: if the source read or table creation fails, the existing\n  table remains live; if it succeeds, later queries see the replacement\n  atomically. In-flight queries continue against the snapshot they started\n  with through DuckDB's transaction/MVCC semantics. DataPress then\n  refreshes only the small cached schema and row-count metadata.\n\nThe HTTP contract is the same for both backends: clients observe either\nthe old dataset or the new dataset, never a partially loaded one. The\nresource profile differs: DataFusion owns the Arrow buffers in process;\nDuckDB relies on DuckDB's storage engine and buffer manager.\n\n---\n\n\n## Examples\n\n```bash\n# Discovery\ncurl -s http://localhost:8080/api/v1/datasets | jq\ncurl -s http://localhost:8080/api/v1/datasets/accidents/schema | jq\n\n# Equality + range\ncurl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"columns\": [\"ID\",\"Severity\",\"City\",\"State\",\"Start_Time\"],\n    \"predicates\": [\n      { \"col\": \"State\",    \"op\": \"eq\",  \"val\": \"TX\" },\n      { \"col\": \"Severity\", \"op\": \"gte\", \"val\": 3 }\n    ],\n    \"page\": 1, \"page_size\": 5\n  }' | jq\n\n# Substring + numeric range\ncurl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"predicates\": [\n      { \"col\": \"Description\",    \"op\": \"ilike\", \"val\": \"%fog%\" },\n      { \"col\": \"Temperature(F)\", \"op\": \"lt\",    \"val\": 32 }\n    ],\n    \"page_size\": 10\n  }' | jq\n\n# IN list\ncurl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"predicates\": [\n      { \"col\": \"State\", \"op\": \"in\", \"val\": [\"NY\",\"NJ\",\"CT\"] }\n    ]\n  }' | jq\n```\n\nFor a deeper 
benchmark catalogue (light load + CPU/memory stress tests), see\n[`TEST_Q.md`](TEST_Q.md).\n\n---\n\n## Project layout\n\n```\nCargo.toml                          # workspace manifest\npyproject.toml                      # maturin / PyO3 build\ncrates/\n├── core/                           # datapress-core: config, schema, errors, admin\n│   └── src/\n│       ├── admin.rs                # X-Admin-Token verification (constant-time)\n│       ├── config.rs               # datasets.toml parsing + validation\n│       ├── schema.rs               # backend-agnostic schema model\n│       ├── models.rs               # Predicate / QueryRequest\n│       └── errors.rs               # AppError + actix ResponseError\n├── duckdb/                         # datapress-duckdb\n│   └── src/\n│       ├── lib.rs                  # pub async fn serve(cfg) -\u003e io::Result\u003c()\u003e\n│       ├── db.rs                   # Registry: pool + schemas + reload\n│       ├── repository.rs           # DatasetRepository (SQL builder)\n│       ├── handlers.rs             # actix routes\n│       └── bin/datapress-duckdb.rs # entrypoint binary\n├── datafusion/                     # datapress-datafusion\n│   └── src/\n│       ├── lib.rs                  # pub async fn serve(cfg) -\u003e io::Result\u003c()\u003e\n│       ├── store.rs                # Store: RecordBatch + eq-index + reload\n│       ├── handlers.rs             # actix routes\n│       └── bin/datapress-datafusion.rs\n└── python/                         # datapress (Python wheel, cdylib)\n    └── src/lib.rs                  # PyO3 bindings — DataPress, DataPressConfig, ...\n```\n\nCore re-exports compile without any backend; each backend crate adds the\nfeature flag it needs on `datapress-core`. 
The Python crate depends on both\nbackends, so the wheel can dispatch between them at runtime based on\n`DataPressConfig(backend=...)`.\n\n---\n\n## Build flags\n\n```bash\n# DuckDB only\ncargo build --release -p datapress-duckdb\n\n# DataFusion only\ncargo build --release -p datapress-datafusion\n\n# Both Rust binaries\ntask build\n\n# Python wheel (compiles both backends into one extension)\ntask py:develop     # editable install into ./.venv (uses uv + maturin)\ntask py:build       # release wheel into ./target/wheels/\n```\n\nRelease builds use LTO + `codegen-units = 1` (see `[profile.release]` in\n`Cargo.toml`). Expect noticeably longer link times in exchange for tighter\ninner loops.\n\n---\n\n## Environment variables\n\n| Variable          | Default          | Purpose                                                                          |\n|-------------------|------------------|----------------------------------------------------------------------------------|\n| `DATASETS_CONFIG` | `datasets.toml`  | Path to the dataset registry file.                                               |\n| `ADMIN_TOKEN`     | *(unset)*        | Enables `POST /api/v1/datasets/{name}/reload`. Unset = admin endpoints disabled. |\n| `DB_POOL_SIZE`    | `num_cpus`       | DuckDB connection pool size (DuckDB only).                                       |\n| `RUST_LOG`        | `info`           | Standard `env_logger` filter.                                                    |\n| `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` | *(unset)* | Fallback S3 credentials used by any dataset that doesn't override them. |\n| `AWS_REGION`      | `us-east-1`      | Fallback S3 region.                                                              |\n| `${PREFIX}_AWS_*` | *(unset)*        | Per-dataset overrides for the four `AWS_*` vars above. See \"Credential precedence\" under `[dataset.s3]`. 
|\n\nBind address, port, worker count and backend selection live in `[server]`\nin `datasets.toml`, not in env vars.\n\n---\n\n## Status / non-goals\n\n- No authentication or rate-limiting on query routes; put this behind your\n  own gateway. The `reload` admin route is gated by a shared-secret header\n  (`X-Admin-Token`) and disabled unless `ADMIN_TOKEN` is set.\n- No write path: Parquet and Delta sources are read-only. The only mutation is\n  reloading a dataset from its configured `source` via the admin route.\n- No cursor pagination; pagination is plain `OFFSET / LIMIT`, so deep\n  pages get expensive (see `H5` in `TEST_Q.md`). `ORDER BY` is supported via\n  the `order_by` field, but sorted queries always go through the SQL engine\n  (no in-memory fast path).\n- DataFusion backend keeps the whole dataset in memory. DuckDB does not.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeroenflvr%2Fdatapress","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjeroenflvr%2Fdatapress","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjeroenflvr%2Fdatapress/lists"}