https://github.com/jeroenflvr/datapress

A config-driven Rust server that publishes Parquet and Delta datasets as fast, typed HTTP APIs from local disk or object storage, with interchangeable DuckDB or Arrow+DataFusion backends, JSON and Arrow IPC output, and production-ready features like auth, metrics, and hot reloads.
Owner: jeroenflvr
License: mit
Created: 2026-05-23T06:19:14.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-06-05T05:45:04.000Z (about 2 months ago)
Last Synced: 2026-06-05T07:27:30.318Z (about 2 months ago)
Topics: api, arrow, arrow-ipc, authz, datafusion, deltalake, duckdb, http, in-memory, parquet, s3, sql
Language: Rust
Homepage: https://docs.datap-rs.org
Size: 2.16 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md

![Rust](https://img.shields.io/badge/built%20with-Rust-orange?logo=rust) ![DuckDB](https://img.shields.io/badge/backend-DuckDB-yellow?logo=duckdb) ![DataFusion](https://img.shields.io/badge/backend-DataFusion-blue?logo=apache) ![actix](https://img.shields.io/badge/backend-actix-orange?logo=actix)

# datap-rs

A Rust **Cargo workspace** that exposes one or more **Parquet / Delta
datasets** over a JSON HTTP API. The same surface area is implemented twice,
once on top of **DuckDB** and once on top of **Apache Arrow + DataFusion**, so
you can A/B the engines under identical workloads. A Python wheel
(`datap-rs`, built with maturin + PyO3) bundles both engines and lets you
configure and launch the server from Python.

**[Overview presentation → datap-rs.org](https://datap-rs.org)** · [Documentation](https://docs.datap-rs.org)

- Built on [actix-web](https://actix.rs/) 4
- Datasets declared in a single [`datasets.toml`](datasets.toml) (Rust
  binaries) or programmatically (Python wrapper)
- Dynamic schema inference at startup (no hard-coded columns)
- Identical request/response shapes across both backends

---

## Quick start

For testing, the examples use the [Kaggle US Accidents (2016-2023)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) dataset.

```bash

# 1. Put a parquet file somewhere (or point the config at an existing one).

ls data/accidents.parquet

# 2. Edit datasets.toml — see the example shipped in this repo.

# 3. Run a backend.

task run:duckdb        # or: task run:datafusion

# 4. Talk to it.

curl http://localhost:8080/api/v1/datasets

```

`Taskfile.yml` wraps the typical `cargo build --release -p …` invocations;
see [`task --list`](Taskfile.yml) for the full menu.

### Install the prebuilt binary

The quickest way to get the unified `datapress` binary (both backends bundled,
selected at runtime via `server.backend`) without a Rust toolchain:

```bash

# Linux / macOS

curl -LsSf https://datap-rs.org/install.sh | sh

# Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install.ps1 | iex"

# Homebrew (macOS / Linux)

brew install jeroenflvr/tap/datapress

# winget (Windows)

winget install datap-rs.DataPress

```

The install scripts drop the binary in a per-user directory (`~/.local/bin` on
Unix, `%LOCALAPPDATA%\datapress\bin` on Windows) and tell you how to add it to
your `PATH`. See [packaging/](packaging/) for details and release automation.

Prefer cargo? Install from crates.io:

```bash

cargo install datapress        # both DuckDB + DataFusion

datapress                      # reads ./datasets.toml (or $DATASETS_CONFIG)

```

For a slimmer single-backend build, or to opt into the docs / Swagger /
metrics / auth features:

```bash

cargo install datapress --no-default-features --features duckdb

cargo install datapress --features swagger,auth,metrics

```

The installed binary resolves its config from the following locations (first
match wins): `--config <path>`, `$DATAPRESS_CONFIG_FILE`, `./datasets.toml`,
then `$HOME/datasets.toml`. Generate a starter template with `datapress init`
(writes `datasets.toml.template` to the given directory, or to `$HOME` when
none is given):

```bash

datapress init                 # ~/datasets.toml.template

cp ~/datasets.toml.template ~/datasets.toml   # then edit and run `datapress`

```
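The lookup order described above (explicit flag, environment variable, working directory, home directory) can be sketched as a small first-match-wins function. This is an illustration only: `resolve_config` and its parameters are hypothetical names, not part of DataPress, and the sketch assumes that an explicit flag or environment value is taken as given while the two default locations only count when the file exists.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config(
    cli_path: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
) -> Optional[Path]:
    """Return the first config path that matches, or None (hypothetical helper)."""
    # 1. An explicit --config value wins outright.
    if cli_path:
        return Path(cli_path)
    # 2. Then the environment variable named in the text.
    env_path = env.get("DATAPRESS_CONFIG_FILE")
    if env_path:
        return Path(env_path)
    # 3. and 4. The two default locations; assumed to count only if present.
    for candidate in (cwd / "datasets.toml", home / "datasets.toml"):
        if candidate.is_file():
            return candidate
    return None
```

Passing the environment and directories in as arguments, rather than reading `os.environ` directly, keeps the precedence logic easy to exercise in isolation.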

### From Python

The same server can be configured and launched from Python via the
`datap-rs` wheel (one wheel, both engines bundled):

```python

import asyncio

from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig

async def main():

    ds = DatasetConfig(

        name="accidents",

        source="data/accidents.parquet",

        format="parquet",   # or "delta"

        mode="auto",        # index policy: "auto" | "none" | "list"

    )

    cfg = DataPressConfig(backend="duckdb", listen="0.0.0.0", port=8000, workers=8)

    server = DataPress(cfg, datasets=[ds])

    await server.run()      # blocks until SIGINT

asyncio.run(main())

```

Build the wheel with `task py:develop` (uses `uv` + `maturin`).

---

## The two backends

| Aspect              | `datapress-duckdb`                             | `datapress-datafusion`                               |
|---------------------|------------------------------------------------|------------------------------------------------------|
| Engine              | DuckDB (embedded C++)                          | Arrow compute + DataFusion (pure Rust)               |
| Storage             | DuckDB in-memory table per dataset             | One contiguous `RecordBatch` per dataset             |
| Concurrency model   | Connection pool, blocking → `web::block`       | Async-native, multi-threaded `MemTable` partitions   |
| Predicate execution | DuckDB optimiser + parallel hash/vector ops    | Equality index → SIMD scan → DataFusion SQL          |
| Indexes             | Native DuckDB internals (zone maps, etc.)      | Per-dataset eq-index built at startup (configurable) |
| Memory profile      | DuckDB's own buffer manager                    | Whole dataset resident in RAM                        |
| Binary size         | Bundled DuckDB ≈ tens of MB                    | Lean (pure Rust)                                     |
| Startup time        | Fast (just `read_parquet`)                     | Slower: reads all rows + builds eq-index             |
| Best at             | Heterogeneous SQL, joins, aggregations         | Dense filter scans, low-latency point lookups        |

### When to pick which

- **DuckDB** is the right default. It handles arbitrary SQL well, has a
  battle-tested optimiser, manages memory itself, and starts up in
  milliseconds because it lazily reads parquet pages on demand.
- **DataFusion** shines when:
  - the dataset fits comfortably in RAM,
  - you query the same columns repeatedly with equality/`IN` predicates
    (the in-process equality index turns those into O(1) lookups), and
  - you want a single static binary without a vendored C++ runtime.

The HTTP API is identical, so the practical comparison is "throughput and
p99 on your queries"; see [`TEST_Q.md`](TEST_Q.md) for a benchmark suite.

---

## Configuration: `datasets.toml`

Every instance reads this file at startup. One `[server]` block plus one
`[[dataset]]` entry per table you want to expose.

```toml

[server]

backend = "datafusion"   # "datafusion" (default) | "duckdb"

listen  = "127.0.0.1"    # default; set to "0.0.0.0" to expose

port    = 8080

# workers = 8            # omit for one worker per CPU

# compress = true        # negotiate gzip/brotli/zstd via Accept-Encoding (default)

# max_body_bytes     = 1048576  # 413 above this; default 1 MiB

# max_page_size      = 100000   # clamp query page_size above this

# request_timeout_ms = 30000    # 504 above this; 0 disables; default 30s

# shutdown_timeout_secs = 30    # SIGTERM/SIGINT grace period, in seconds

# DuckDB backend only: enable the experimental Quack remote protocol.

# [server.quack]

# enabled = false

# uri = "quack:localhost"

# token = "change-me"

# read_only = true

[[dataset]]

name = "accidents"                    # used in the URL: /api/v1/datasets/accidents/...

  [dataset.source]

  kind     = "parquet"                # "parquet" | "delta"

  location = "data/accidents.parquet" # file, directory of *.parquet, or s3://…

  # Optional — DataFusion only. DuckDB ignores this block.

  [dataset.index]

  mode             = "auto"           # "auto" | "none" | "list"

  columns          = []               # required when mode = "list"

  max_cardinality  = 100000           # used by "auto" to skip wide cols

```

### Server

| Field                   | Default      | Notes |
|-------------------------|--------------|-------|
| `backend`               | `datafusion` | Selects the engine in the unified `datapress` binary. The single-backend binaries (`datapress-duckdb`, `datapress-datafusion`) treat it as an informational hint that is logged at startup; each always runs as its own backend regardless of this value. |
| `listen`                | `127.0.0.1`  | Loopback by default: the service is **not** exposed on a network interface unless you opt in. |
| `port`                  | `8080`       |       |
| `workers`               | *(unset)*    | Actix worker threads. Unset = one per CPU. |
| `prefix`                | `""`         | URL path prefix mounted in front of every route (e.g. `"/datapress"`); useful behind a reverse proxy that passes the path through unchanged. Must start with `/` and not end with `/`. |
| `compress`              | `true`       | Negotiate response compression via `Accept-Encoding` (gzip / brotli / zstd). Disable when sitting behind a proxy that compresses for you. |
| `max_body_bytes`        | `1048576`    | Maximum accepted JSON request body, in bytes. Bigger bodies are rejected with `413 Payload Too Large`. |
| `max_page_size`         | `100000`     | Maximum rows returned by one `/query` page. Larger `page_size` values are clamped. |
| `request_timeout_ms`    | `30000`      | Per-request handler timeout, in milliseconds. Long-running handlers are cancelled and the client gets `504 Gateway Timeout`. `0` disables the timeout. |
| `shutdown_timeout_secs` | `30`         | Grace period for in-flight requests after the process receives `SIGTERM` / `SIGINT`, in seconds. The listening socket is closed immediately; existing connections then have up to this many seconds to finish before workers are force-stopped. |

DuckDB builds can also opt into `[server.quack]`, DuckDB's experimental
remote protocol server. Keep it disabled unless you intentionally want
DuckDB clients to attach/query this process directly. It binds to
`quack:localhost` by default, uses token authentication, and DataPress
installs a read-only authorization hook by default.

The server exposes three probe endpoints, plus `/version` for build
metadata. `/healthz` and `/readyz` are mounted at the bare host root
(regardless of `prefix`) so orchestrators don't need to know how the service
is exposed. `/health` lives under `prefix` and is intended for in-app health
checks.

| Route             | Status                                                                      | Body |
|-------------------|-----------------------------------------------------------------------------|------|
| `/healthz`        | Liveness: always `200` while the process is running.                        | `{"status":"ok"}` |
| `/readyz`         | Readiness: `200` once at least one dataset is registered, `503` otherwise.  | `{"status":"ready","datasets":N}` / `{"status":"not ready","reason":"no datasets registered"}` |
| `/version`        | Build / version metadata: always `200`.                                     | `{"name":"datapress-core","version":"x.y.z","backend":"DuckDB\|DataFusion","profile":"debug\|release", ...}` |
| `{prefix}/health` | App-level liveness: always `200`.                                           | `{"status":"ok"}` |

`/healthz` does not touch the backend, so it stays `200` even while the
dataset registry is still loading at startup. Use `/readyz` to gate
traffic until the server is actually able to serve queries.
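A deploy script or test harness can gate on `/readyz` in the way the paragraph above describes. The following is a minimal sketch, not part of DataPress: `wait_until_ready`, `http_fetch`, the retry counts, and the base URL are all illustrative assumptions. The fetcher is injected so the polling logic does not depend on a particular HTTP client.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A fetcher maps a URL path to (status_code, body_text).
Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(path: str, base: str = "http://localhost:8080") -> Tuple[int, str]:
    """Fetch one path with urllib; non-2xx responses are returned, not raised."""
    try:
        with urllib.request.urlopen(base + path, timeout=5) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:   # e.g. 503 while not ready
        return err.code, err.read().decode()
    except urllib.error.URLError:           # server not listening yet
        return 0, ""


def wait_until_ready(fetch: Fetch, attempts: int = 30, delay_s: float = 1.0) -> int:
    """Poll /readyz until it returns 200; return the reported dataset count."""
    for attempt in range(attempts):
        status, body = fetch("/readyz")
        if status == 200:
            return json.loads(body)["datasets"]
        # Anything else (503, connection refused) means "not yet": retry.
        if attempt < attempts - 1:
            time.sleep(delay_s)
    raise TimeoutError(f"/readyz did not return 200 after {attempts} attempts")
```

Calling `wait_until_ready(http_fetch)` before sending the first query avoids racing the dataset load; `/healthz` would not help here because it reports `200` as soon as the process is up.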

`/version` also includes optional fields populated from build-time env
vars when set: `git_sha` (`DATAPRESS_GIT_SHA`), `build_time`
(`DATAPRESS_BUILD_TIME`, ISO-8601), and `target`
(`DATAPRESS_TARGET`, e.g. `aarch64-apple-darwin`). Unset vars are
omitted from the JSON. Example:

```bash
DATAPRESS_GIT_SHA=$(git rev-parse --short HEAD) \
DATAPRESS_BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
DATAPRESS_TARGET=$(rustc -vV | awk '/host:/ {print $2}') \
  cargo build --release -p datapress-duckdb
```

### Online documentation

DataPress can embed two browsable sources of documentation into the
binary itself:

- An [MkDocs Material](https://squidfunk.github.io/mkdocs-material/)
  site (the one you are reading) at `[docs].path` (default `/mkdocs`).
- An interactive [Swagger UI](https://swagger.io/tools/swagger-ui/)
  with a hand-written OpenAPI spec at `[swagger].path` (default
  `/docs`). The raw spec is also exposed at `/openapi.json`.

Both are opt-in at build time (so wheels stay slim when you don't
want them) and **enabled by default at runtime** once compiled in;
set `enabled = false` to disable in prod.

1. Build the MkDocs site (only needed for the `docs` feature):

   ```bash

   task docs:build

   ```

2. Build the backend with one or both features:

   ```bash

   cargo build --release -p datapress-duckdb --features docs,swagger

   ```

3. Tweak in `datasets.toml` if you want to relocate or disable either:

   ```toml

   [docs]

   enabled = true        # default: true

   path    = "/mkdocs"   # default: /mkdocs

   [swagger]

   enabled = true        # default: true (set to false in prod)

   path    = "/docs"     # default: /docs

   ```

Both `path` values must start with `/`, not end with `/`, not collide
with `/api`, `/api/v1`, `/health{z,}`, `/readyz`, or `/version`, and
must differ from each other. When the binary is built without the
relevant feature but the TOML enables it, the server logs a warning at
startup and continues without that surface.
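The path rules above are easy to check before deploying a config. The sketch below is illustrative only: `validate_doc_paths` is a hypothetical helper, and it treats "collide" as an exact match with a reserved route, which is an assumption (the server may also reject paths nested under those routes).

```python
from typing import List

# Reserved routes named in the text; `/health{z,}` expands to two entries.
RESERVED = {"/api", "/api/v1", "/health", "/healthz", "/readyz", "/version"}


def validate_doc_paths(docs_path: str, swagger_path: str) -> List[str]:
    """Return rule violations; an empty list means both paths pass."""
    errors: List[str] = []
    for label, path in (("docs", docs_path), ("swagger", swagger_path)):
        if not path.startswith("/"):
            errors.append(f"{label}: must start with '/'")
        if path.endswith("/"):
            errors.append(f"{label}: must not end with '/'")
        if path in RESERVED:  # assumption: exact-match collision only
            errors.append(f"{label}: collides with reserved route {path}")
    if docs_path == swagger_path:
        errors.append("docs and swagger paths must differ")
    return errors
```

With the documented defaults, `validate_doc_paths("/mkdocs", "/docs")` returns an empty list.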

### Authentication (OIDC / OAuth2)

Build with `--features auth` to enable JWT bearer enforcement against
any OpenID-Connect issuer (Entra ID, Auth0, Keycloak, Okta, …). When
enabled, the server fetches the issuer's JWKS at startup, refreshes it
in the background, and validates `Authorization: Bearer <token>` headers
against the configured issuer, audience, algorithms, and scopes.

```toml

[auth]

enabled         = true

issuer          = "https://login.microsoftonline.com/<tenant-id>/v2.0"

audience        = "api://datapress"

algorithms      = ["RS256"]

read_scopes     = ["datasets:read"]

reload_scopes   = ["datasets:reload"]

anonymous_read  = false      # set true to keep read endpoints public

tenant_claim    = "/tid"     # JSON-pointer into the JWT claims

allowed_tenants = ["<tenant-id>"]

admin_token_fallback = true  # keep X-Admin-Token working in parallel

```

Health probes (`/healthz`, `/readyz`, `/version`) stay unauthenticated
so load balancers keep working. The legacy `X-Admin-Token` header keeps
working for `POST .../reload` as long as `admin_token_fallback = true`.

To turn the Swagger UI itself into an SSO client, add a `[swagger.oauth2]`
block; it gets rendered as an `OpenIdConnect` security scheme with PKCE.

### Source

`[dataset.source]` is a tagged enum.

| `kind`    | `location`                                          | Notes                                                                                  |
|-----------|-----------------------------------------------------|----------------------------------------------------------------------------------------|
| `parquet` | a `.parquet` file                                   | Read as-is.                                                                            |
| `parquet` | a directory                                         | Every `*.parquet` inside (sorted, non-recursive). No glob patterns.                    |
| `parquet` | `s3://bucket/key.parquet` or `s3://bucket/prefix/`  | Requires a `[dataset.s3]` block. DuckDB autoloads `httpfs`.                            |
| `delta`   | a local directory                                   | Pointed at the table root (the dir containing `_delta_log/`).                          |
| `delta`   | `s3://bucket/path/to/table`                         | Requires `[dataset.s3]`. DuckDB autoloads `delta`; DataFusion uses the `deltalake` crate. |

#### S3 / S3-compatible storage

```toml

[[dataset]]

name = "events"

  [dataset.source]

  kind     = "parquet"           # or "delta"

  location = "s3://events/2025/*.parquet"

  [dataset.s3]

  region            = "us-east-1"

  endpoint          = "http://localhost:9000"  # omit for AWS

  addressing_style  = "path"                   # "virtual" (default) | "path"

  allow_http        = true                     # only for non-https endpoints

```

| Field                     | Default     | Notes                                                                          |
|---------------------------|-------------|--------------------------------------------------------------------------------|
| `region`                  | `us-east-1` | Falls back to `AWS_REGION` env, then `us-east-1`.                              |
| `endpoint`                | *(unset)*   | Custom S3 endpoint (MinIO, R2, Wasabi, Backblaze, …).                          |
| `addressing_style`        | `virtual`   | `virtual` = `https://bucket.host`, `path` = `https://host/bucket` (MinIO).     |
| `allow_http`              | `false`     | Must be `true` if `endpoint` is `http://…`.                                    |
| `partitioning`            | `auto`      | Hive partition discovery: `auto`, `hive` (force on), `none` (force off).       |
| `endpoint_bucket_in_host` | `auto`      | Fold the bucket into the endpoint host: `auto` (follows `addressing_style`), `true`, `false`. |
| `access_key_id`, `secret_access_key`, `session_token` | *(unset)* | Inline creds. Discouraged for prod; use env vars instead. |

**Credential precedence** (highest → lowest):

1. Per-dataset env vars: `${PREFIX}_AWS_ACCESS_KEY_ID`, `${PREFIX}_AWS_SECRET_ACCESS_KEY`, `${PREFIX}_AWS_SESSION_TOKEN`, `${PREFIX}_AWS_REGION`.

   `PREFIX` is the dataset name uppercased with every non-alphanumeric character mapped to `_` (e.g. `accidents` → `ACCIDENTS_AWS_…`, `my-bucket` → `MY_BUCKET_AWS_…`).

2. Inline `[dataset.s3]` keys.

3. Plain `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_REGION`.

4. The backend's default credential chain (`~/.aws/credentials`, IMDS, etc.).
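The prefix rule and the precedence order can be sketched in a few lines. This is an illustration under stated assumptions, not DataPress code: `env_prefix` and `resolve_access_key_id` are hypothetical names, "alphanumeric" is taken to mean ASCII letters and digits, and empty strings are treated as unset.

```python
import re
from typing import Mapping, Optional


def env_prefix(dataset_name: str) -> str:
    """Uppercase the name and map every non-alphanumeric character to '_'."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^A-Za-z0-9]", "_", dataset_name).upper()


def resolve_access_key_id(
    dataset_name: str,
    inline: Mapping[str, str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Walk precedence levels 1 to 3 for one credential field."""
    candidates = (
        env.get(f"{env_prefix(dataset_name)}_AWS_ACCESS_KEY_ID"),  # 1. per-dataset env
        inline.get("access_key_id"),                               # 2. inline [dataset.s3]
        env.get("AWS_ACCESS_KEY_ID"),                              # 3. plain env
    )
    for value in candidates:
        if value:
            return value
    # 4. None means "defer to the backend's default credential chain".
    return None
```

The same walk would apply to the secret key, session token, and region; only the variable suffix changes.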

> **Python:** the `S3Config` binding also accepts a `credentials_provider` — a zero-argument callable returning an `HMACKeyPair`. It is invoked once when `DataPress(...)` is constructed, the result is cached indefinitely, and it overrides any inline `access_key_id` / `secret_access_key`. See the [Python S3 docs](https://docs.datap-rs.org/python/config/#s3config).

> When `kind = "delta"` and `location` is an `s3://…` URL, both backends fully materialise the table at startup. There is no incremental scan path — switch to `parquet` if you need on-demand page reads.

### Equality-index policy (DataFusion only)

The DataFusion backend builds an in-memory `value -> [row ids]` map at
startup so that `eq` / `in` predicates resolve in O(1).

| `mode`   | Behaviour                                                              |
|----------|------------------------------------------------------------------------|
| `auto`   | Index every column whose distinct count stays below `max_cardinality`. |
| `none`   | Skip the index entirely: every query goes through DataFusion SQL.      |
| `list`   | Index only the named `columns`. Useful for huge datasets.              |
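To make the `value -> [row ids]` idea concrete, here is a toy Python version of an equality index with a cardinality cut-off. It is a conceptual sketch only: the real backend is written in Rust over Arrow arrays, the function names are hypothetical, and treating the `max_cardinality` limit as inclusive is an assumption.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Sequence

EqIndex = Dict[Any, List[int]]


def build_eq_index(values: Sequence[Any], max_cardinality: int) -> Optional[EqIndex]:
    """Map each distinct value to its row ids; give up past max_cardinality."""
    index: EqIndex = defaultdict(list)
    for row_id, value in enumerate(values):
        if value not in index and len(index) >= max_cardinality:
            return None  # too many distinct values: "auto" would skip this column
        index[value].append(row_id)
    return dict(index)


def lookup_in(index: EqIndex, wanted: Sequence[Any]) -> List[int]:
    """Resolve `col IN (...)` with one dict probe per value; return sorted row ids."""
    row_ids: List[int] = []
    for value in wanted:
        row_ids.extend(index.get(value, []))
    return sorted(set(row_ids))
```

An `eq` predicate is the single-value case of `lookup_in`. The cost of a probe does not depend on the number of rows, which is why the index pays off for repeated point lookups and why wide, high-cardinality columns are better left to the SQL path.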

Override the config path with `DATASETS_CONFIG=/path/to/file.toml`.

## HTTP API

Four read routes plus an admin-only reload route, identical on both backends.

### API versioning

The canonical paths live under `/api/v1/...`. The un-versioned
`/api/...` paths continue to work as a **legacy alias** for v1, so
existing clients keep running. To upgrade, replace `/api/` with
`/api/v1/` in your URLs; nothing else changes.

```text

POST /api/v1/datasets/accidents/query      # canonical (recommended)

POST /api/datasets/accidents/query         # legacy alias, still v1

```

When a breaking schema change is introduced, it will ship as `/api/v2`
in a sibling module next to
[crates/core/src/handlers/v1.rs](crates/core/src/handlers/v1.rs),
and v1 will stay mounted alongside it for a deprecation window.

### `GET /api/v1/datasets`

```json

{ "datasets": [ { "name": "accidents", "columns": 47 } ] }

```

### `GET /api/v1/datasets/{name}/schema`

Returns the inferred columns plus a sample row so a client can see what
values look like without issuing a query.

```json

{

  "name": "accidents",

  "columns": [

    { "name": "ID",       "logical": "utf8", "sql_type": "VARCHAR",   "nullable": false },

    { "name": "Severity", "logical": "int",  "sql_type": "INTEGER",   "nullable": true  },

    { "name": "Start_Time", "logical": "temporal", "sql_type": "TIMESTAMP", "nullable": true }

  ],

  "sample": { "ID": "A-1", "Severity": 2, "Start_Time": "2016-02-08 05:46:00", ... }

}

```

`logical` values: `bool | int | float | utf8 | temporal | other`. Temporal
columns are returned as strings.

### `POST /api/v1/datasets/{name}/query`

```json

{

  "columns":   ["ID", "City", "State", "Severity"],

  "predicates": [

    { "col": "State",    "op": "eq",  "val": "TX" },

    { "col": "Severity", "op": "gte", "val": 3   }

  ],

  "order_by": [

    { "col": "Severity", "dir": "desc" },

    { "col": "ID" }

  ],

  "limit":     1000,

  "page":      1,

  "page_size": 50

}

```

Response:

```json

{ "data": [ { ... }, ... ], "page": 1, "page_size": 50 }

```

#### Request fields

| Field          | Type               | Default | Notes                                  |
|----------------|--------------------|---------|----------------------------------------|
| `columns`      | `string[]`         | `[]`    | Empty = all columns.                   |
| `predicates`   | `Predicate[]`      | `[]`    | ANDed together.                        |
| `order_by`     | `OrderBy[]`        | `[]`    | `{ col, dir? }`; `dir` is `asc` (default) or `desc`, case-insensitive. When `group_by` is set, `col` must be a group column or aggregation alias. |
| `group_by`     | `string[]`         | `[]`    | Columns to group by. When set, `columns` is ignored. Empty `aggregations` implies `[{ op: "count" }]`. |
| `aggregations` | `Aggregation[]`    | `[]`    | `{ col?, op, alias? }`; `op` is `count\|sum\|avg\|min\|max`. `col` may be omitted only for `count` (= `COUNT(*)`). Requires `group_by`. |
| `distinct`     | `bool`             | `false` | Dedup the projected columns. Mutually exclusive with `group_by` / `aggregations`. |
| `limit`        | `int >= 0` or null | `null`  | Hard cap on total rows across all pages. `null` = unlimited. |
| `page`         | `int >= 1`         | `1`     | 1-based.                               |
| `page_size`    | `int >= 1`         | `1000`  | Clamped to `server.max_page_size` (`100_000` by default). |
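One way to read the interaction between `limit`, `page`, and `page_size` is as an offset plus a row count for each page. The sketch below encodes that reading; it is an interpretation of the table, not the server's code, and `page_window` is a hypothetical name. In particular, it assumes the offset is computed from the clamped page size and that `limit` simply truncates the last page.

```python
from typing import Optional, Tuple


def page_window(
    page: int = 1,
    page_size: int = 1000,
    limit: Optional[int] = None,
    max_page_size: int = 100_000,
) -> Tuple[int, int]:
    """Return (offset, row_count) for one page under the documented defaults."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    if limit is not None and limit < 0:
        raise ValueError("limit must be >= 0 or None")
    size = min(page_size, max_page_size)   # oversized pages are clamped, not rejected
    offset = (page - 1) * size             # pages are 1-based
    if limit is None:
        return offset, size
    remaining = max(0, limit - offset)     # rows still allowed under the hard cap
    return offset, min(size, remaining)
```

For the example request above (`limit` 1000, `page_size` 50), pages 1 through 20 each yield up to 50 rows and page 21 yields none.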

#### Predicate shape

```json

{ "col": "<column>", "op": "<op>", "val": <value> }

```

| `op`           | `val`                  | Meaning                              |
|----------------|------------------------|--------------------------------------|
| `eq`           | scalar                 | `col = val`                          |
| `neq`          | scalar                 | `col <> val`                         |
| `gt` / `gte`   | number / string        | `col > val` / `col >= val`           |
| `lt` / `lte`   | number / string        | `col < val` / `col <= val`           |
| `like`         | string with `%` / `_`  | SQL `LIKE`                           |
| `ilike`        | string with `%` / `_`  | Case-insensitive `LIKE`              |
| `in`           | non-empty array        | `col IN (v1, v2, …)`                 |
| `is_null`      | omit                   | `col IS NULL`                        |
| `is_not_null`  | omit                   | `col IS NOT NULL`                    |

Column names are looked up case-insensitively against the inferred schema
and quoted automatically, so `Temperature(F)` and similar identifiers work.
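The mapping from predicates to SQL can be illustrated with a short translator. This is a sketch of the idea rather than the server's implementation: `build_where` and `quote_ident` are hypothetical names, the `?` bind placeholders follow the Python DB-API style, and using `ILIKE` for the case-insensitive form is an assumption about the SQL dialect.

```python
from typing import Any, Dict, List, Sequence, Tuple

BINARY_OPS = {
    "eq": "=", "neq": "<>", "gt": ">", "gte": ">=",
    "lt": "<", "lte": "<=", "like": "LIKE", "ilike": "ILIKE",
}
NULL_OPS = {"is_null": "IS NULL", "is_not_null": "IS NOT NULL"}


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_where(
    predicates: Sequence[Dict[str, Any]], schema: Sequence[str]
) -> Tuple[str, List[Any]]:
    """Translate predicates into a parameterised WHERE clause plus bind values."""
    by_lower = {col.lower(): col for col in schema}  # case-insensitive lookup
    clauses: List[str] = []
    params: List[Any] = []
    for pred in predicates:
        name = str(pred.get("col", ""))
        if name.lower() not in by_lower:
            raise ValueError(f"unknown column: {name!r}")
        col = quote_ident(by_lower[name.lower()])
        op = pred.get("op")
        if op in NULL_OPS:
            clauses.append(f"{col} {NULL_OPS[op]}")   # no value expected
        elif op == "in":
            values = pred.get("val")
            if not isinstance(values, list) or not values:
                raise ValueError("'in' needs a non-empty array")
            clauses.append(f"{col} IN ({', '.join('?' for _ in values)})")
            params.extend(values)
        elif op in BINARY_OPS:
            clauses.append(f"{col} {BINARY_OPS[op]} ?")
            params.append(pred["val"])
        else:
            raise ValueError(f"unsupported op: {op!r}")
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return where, params
```

Values travel as bind parameters rather than being spliced into the SQL text, and identifiers are resolved against the schema before quoting, so a request can only ever name columns that exist.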

#### Response format — JSON or Arrow IPC

`/query` can return its result set in two wire formats. Same body, same
predicates, same pagination; only the response encoding differs.

| Aspect              | JSON (default)                                       | Arrow IPC stream                                                                 |
|---------------------|------------------------------------------------------|----------------------------------------------------------------------------------|
| Content-Type        | `application/json`                                   | `application/vnd.apache.arrow.stream`                                            |
| How to ask          | nothing (it's the default)                           | `Accept: application/vnd.apache.arrow.stream` **or** `?format=arrow` on the URL  |
| Shape               | Array of row objects (`[{...}, {...}, ...]`)         | Self-describing stream: 1 schema message + N `RecordBatch` messages + EOS        |
| Layout              | Row-oriented; column names repeated on every row     | Columnar; one contiguous buffer per column per batch                             |
| Types preserved     | Scalars become JSON (`int`/`float`/`bool`/`string`); temporals stringified to ISO-8601 | Native Arrow types (`Int32`, `Timestamp(ns)`, `Decimal128`, dictionary, etc.) retained end-to-end |
| Page metadata       | In the body (just the rows, no envelope)             | In headers: `X-Page`, `X-Page-Size`                                              |
| Empty result        | `[]`                                                 | Valid stream with the schema message only, zero batches                          |
| Compression         | Big win: JSON is text                                | Smaller starting point; gzip/zstd still help on wide / repetitive cols, brotli usually skipped |
| Client cost         | `json.loads` + per-row dict construction             | `pyarrow.ipc.open_stream(...).read_all()` → zero-copy `pyarrow.Table`            |
| Best for            | Small responses, browsers, ad-hoc `curl`, dashboards | Bulk data into Polars / pandas / DuckDB-on-the-client, ML feature pipelines      |

**When to pick which.** Use JSON when the consumer is JavaScript, the
response is small (under roughly 10k rows), or you're poking at the API by
hand. Use Arrow IPC when you're moving result pages into a dataframe library,
the schema has non-string types you want preserved, or page sizes are
large enough that JSON parse time shows up in profiles.

```bash
# JSON (default)
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC via Accept header
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/vnd.apache.arrow.stream' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC via query string (handy when you can't set headers)
curl -X POST 'http://localhost:8080/api/v1/datasets/accidents/query?format=arrow' \
  -H 'Content-Type: application/json' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'
```

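```python
import pyarrow.ipc as ipc
import requests

# Same endpoint and request body as the curl examples above.
url = "http://localhost:8080/api/v1/datasets/accidents/query"
req = {"predicates": [{"col": "State", "op": "eq", "val": "TX"}]}

r = requests.post(url, json=req, headers={"Accept": "application/vnd.apache.arrow.stream"})
r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body as Arrow

table = ipc.open_stream(r.content).read_all()  # → pyarrow.Table
page  = int(r.headers["X-Page"])               # page metadata travels in headers
size  = int(r.headers["X-Page-Size"])
```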

Supported on **both** backends: DuckDB streams batches out via its
native `query_arrow` API, DataFusion uses its Arrow plan directly.
The `Compress` middleware still applies. `count`, `schema`, and the
dataset-listing endpoints are JSON-only.

#### Grouping / aggregation

When `group_by` is non-empty the SELECT list is derived from the group
columns plus each aggregation's output alias; the top-level `columns`
field is ignored. Supported ops: `count`, `sum`, `avg`, `min`, `max`
(case-insensitive). `col` may be omitted only for `count` (= `COUNT(*)`).
If `aggregations` is omitted an implicit `COUNT(*) AS count` is added.

```bash
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "group_by": ["State"],
    "aggregations": [
      { "op":  "count" },
      { "col": "Severity", "op": "avg", "alias": "avg_sev" }
    ],
    "order_by": [{ "col": "count", "dir": "desc" }],
    "page_size": 10
  }'
# → { "data": [ { "State": "CA", "count": 1741433, "avg_sev": 2.21 }, ... ], ... }
```

`aggregations` without `group_by` returns `400`. `order_by` keys must
reference a group column or an aggregation alias (arbitrary dataset
columns are not in scope after `GROUP BY`). Grouped queries always
go through the SQL engine; no in-memory fast path applies.

#### Distinct rows

`distinct: true` deduplicates on the projected columns. Useful for
building dropdowns / facet lists.

```bash
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns":  ["State"],
    "distinct": true,
    "order_by": [{ "col": "State" }],
    "page_size": 100
  }'
# → { "data": [ { "State": "AL" }, { "State": "AR" }, ... ], ... }
```

Mutually exclusive with `group_by` / `aggregations` (returns `400` if
combined). Also bypasses the in-memory fast paths.
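A client can catch the documented `400` cases before sending a request. The checker below is a sketch that covers only the rules stated in this section (aggregations need `group_by`, `distinct` excludes both, `col` is optional only for `count`); `validate_query` is a hypothetical helper, and it does not check `order_by` because the default alias names for un-aliased aggregations are not specified here.

```python
from typing import Any, Dict, List

AGG_OPS = {"count", "sum", "avg", "min", "max"}


def validate_query(body: Dict[str, Any]) -> List[str]:
    """Return reasons the request would be rejected; empty means it looks valid."""
    group_by = body.get("group_by") or []
    aggregations = body.get("aggregations") or []
    errors: List[str] = []
    if aggregations and not group_by:
        errors.append("aggregations requires group_by")
    if body.get("distinct") and (group_by or aggregations):
        errors.append("distinct cannot be combined with group_by / aggregations")
    for agg in aggregations:
        op = str(agg.get("op", "")).lower()  # ops are case-insensitive
        if op not in AGG_OPS:
            errors.append(f"unsupported aggregation op: {agg.get('op')!r}")
        elif op != "count" and not agg.get("col"):
            errors.append(f"{op} requires col")  # only count may omit col
    return errors
```

Running this on the two request bodies shown in this section returns an empty list for each.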

### `POST /api/v1/datasets/{name}/count`

Returns the number of rows matching `predicates`. Same predicate shape as
`/query`; only the `predicates` field is read. Empty body counts every row.

```bash
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }

curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3   }
    ]
  }'
# → { "count": 187423 }
```

On materialised DataFusion datasets the no-predicate path is O(1) (uses the
resident chunk metadata, no scan); indexable predicates short-circuit
through the equality index. Otherwise it runs `SELECT COUNT(*) … WHERE …`
through the engine.

### `POST /api/v1/datasets/{name}/reload` *(admin)*

Rebuilds the dataset from its configured `source` and publishes the new
contents without a server restart. Running queries finish against a
consistent old snapshot; later queries see the new data. If the rebuild
fails, the previously published dataset stays live.

Requires `X-Admin-Token: $ADMIN_TOKEN`. **If `ADMIN_TOKEN` is unset the
endpoint is disabled**, which is the secure default. The comparison is
constant-time.

```bash
curl -s -X POST \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  http://localhost:8080/api/v1/datasets/accidents/reload
# → { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }
```
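The token gate described above can be sketched in dependency-free Rust. This is an illustration under assumptions, not the code in `admin.rs`: `constant_time_eq` and `admin_allowed` are hypothetical names, and treating an empty `ADMIN_TOKEN` the same as an unset one is an assumption of this sketch. Production code would normally rely on a vetted crate such as `subtle` rather than a hand-rolled loop.

```rust
/// Compares two byte strings without returning early at the first
/// mismatching byte, so timing does not reveal how many leading bytes
/// matched. The length check is not constant-time, so only the token
/// length can leak.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences instead of branching
    }
    diff == 0
}

/// Hypothetical gate: `configured` is the value of `ADMIN_TOKEN` (None when
/// unset), `presented` is the `X-Admin-Token` header (None when missing).
pub fn admin_allowed(configured: Option<&str>, presented: Option<&str>) -> bool {
    match (configured, presented) {
        // Assumption: an empty ADMIN_TOKEN is treated like an unset one.
        (Some(expected), Some(got)) if !expected.is_empty() => {
            constant_time_eq(expected.as_bytes(), got.as_bytes())
        }
        // Unset token or missing header: the endpoint stays closed.
        _ => false,
    }
}
```

A handler would map `false` to the `403` response, whether the token was wrong, missing, or never configured.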

| Status | Body                                          | Meaning                                              |
|--------|-----------------------------------------------|------------------------------------------------------|
| `200`  | `{ dataset, rows, elapsed_ms }`               | New data live.                                       |
| `403`  | `{ "error": "forbidden: …" }`                 | Token missing/wrong, or `ADMIN_TOKEN` not set.       |
| `404`  | `{ "error": "not found: dataset: …" }`        | No such dataset in `datasets.toml`.                  |
| `500`  | `{ "error": "internal error: …" }`            | Parquet read failed; old data stays live.            |

Concurrent reloads of the **same** dataset are serialised (per-name mutex);
reloads of **different** datasets run in parallel.
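A per-name mutex of this kind might look like the sketch below. `ReloadLocks` and its methods are hypothetical names, and the sketch uses `std::sync::Mutex` to stay dependency-free. An async server would more likely use an async-aware mutex, so treat this as an illustration of the locking shape only.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical registry holding one lock per dataset name.
#[derive(Default)]
pub struct ReloadLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl ReloadLocks {
    /// Returns the lock for `name`, creating it on first use. The outer map
    /// lock is held only for the lookup, never while a reload runs.
    pub fn lock_for(&self, name: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(name.to_string()).or_default().clone()
    }

    /// Runs `reload` while holding the per-dataset lock, so two reloads of
    /// the same dataset run one after the other, while reloads of other
    /// datasets are not blocked.
    pub fn with_lock<T>(&self, name: &str, reload: impl FnOnce() -> T) -> T {
        let lock = self.lock_for(name);
        let _guard = lock.lock().unwrap();
        reload()
    }
}
```

Because the map lock is released before the reload starts, a slow reload of one dataset never delays the lookup for another.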

#### Backend-specific reload semantics

- **DataFusion** uses a service-level double buffer. The backend builds a
  fresh `DatasetState` off to the side (parquet/Delta read, Arrow
  `RecordBatch` chunks, equality indexes, partition metadata), registers
  the new provider, then publishes it with an `ArcSwap` snapshot update.
  Queries that already captured the old `Arc` keep running; later queries
  see the new state. The old buffers are dropped once the last reader
  releases its reference. Trade-off: for materialised datasets, peak RSS
  can approach roughly twice the dataset size plus index overhead during
  reload.

- **DuckDB** delegates publication to the database engine. Reload runs
  `CREATE OR REPLACE TABLE ... AS SELECT ...` against the dataset source.
  DuckDB treats that as an ACID transaction over the table/catalog
  replacement: if the source read or table creation fails, the existing
  table remains live; if it succeeds, later queries see the replacement
  atomically. In-flight queries continue against the snapshot they started
  with through DuckDB's transaction/MVCC semantics. DataPress then
  refreshes only the small cached schema and row-count metadata.

The HTTP contract is the same for both backends: clients observe either
the old dataset or the new dataset, never a partially loaded one. The
resource profile differs: the DataFusion backend owns the Arrow buffers in
process, while the DuckDB backend relies on DuckDB's storage engine and
buffer manager.
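The double-buffer publication on the DataFusion side can be illustrated with a small sketch. The text names `ArcSwap`; to stay dependency-free this sketch substitutes `RwLock<Arc<_>>` from the standard library, which gives the same old-or-new visibility but briefly blocks readers during the swap. `Published` and the single-field `DatasetState` are hypothetical stand-ins, not the crate's types.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the per-dataset state (chunks, indexes, ...).
#[derive(Debug)]
pub struct DatasetState {
    pub rows: usize,
}

/// Holds the currently published snapshot.
pub struct Published {
    current: RwLock<Arc<DatasetState>>,
}

impl Published {
    pub fn new(initial: DatasetState) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// A query captures the snapshot once and keeps using that `Arc`,
    /// even if a reload publishes a newer state in the meantime.
    pub fn snapshot(&self) -> Arc<DatasetState> {
        self.current.read().unwrap().clone()
    }

    /// Builds the new state off to the side and publishes it only on
    /// success. On failure the previously published state stays live.
    pub fn reload<E>(
        &self,
        build: impl FnOnce() -> Result<DatasetState, E>,
    ) -> Result<usize, E> {
        // Old and new state coexist from here until the last reader of the
        // old snapshot drops its Arc, which is where peak memory occurs.
        let fresh = Arc::new(build()?);
        let rows = fresh.rows;
        *self.current.write().unwrap() = fresh; // pointer swap only
        Ok(rows)
    }
}
```

A reader that called `snapshot()` before the swap still sees the old row count afterwards, which matches the "old or new, never partial" contract.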

---

## Examples

```bash
# Discovery
curl -s http://localhost:8080/api/v1/datasets | jq
curl -s http://localhost:8080/api/v1/datasets/accidents/schema | jq

# Equality + range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns": ["ID","Severity","City","State","Start_Time"],
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3 }
    ],
    "page": 1, "page_size": 5
  }' | jq

# Substring + numeric range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "Description",    "op": "ilike", "val": "%fog%" },
      { "col": "Temperature(F)", "op": "lt",    "val": 32 }
    ],
    "page_size": 10
  }' | jq

# IN list
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State", "op": "in", "val": ["NY","NJ","CT"] }
    ]
  }' | jq
```

For a deeper benchmark catalogue (light load + CPU/memory stress tests), see
[`TEST_Q.md`](TEST_Q.md).

---

## Project layout

```
Cargo.toml                          # workspace manifest
pyproject.toml                      # maturin / PyO3 build
crates/
├── core/                           # datapress-core: config, schema, errors, admin
│   └── src/
│       ├── admin.rs                # X-Admin-Token verification (constant-time)
│       ├── config.rs               # datasets.toml parsing + validation
│       ├── schema.rs               # backend-agnostic schema model
│       ├── models.rs               # Predicate / QueryRequest
│       └── errors.rs               # AppError + actix ResponseError
├── duckdb/                         # datapress-duckdb
│   └── src/
│       ├── lib.rs                  # pub async fn serve(cfg) -> io::Result<()>
│       ├── db.rs                   # Registry: pool + schemas + reload
│       ├── repository.rs           # DatasetRepository (SQL builder)
│       ├── handlers.rs             # actix routes
│       └── bin/datapress-duckdb.rs # entrypoint binary
├── datafusion/                     # datapress-datafusion
│   └── src/
│       ├── lib.rs                  # pub async fn serve(cfg) -> io::Result<()>
│       ├── store.rs                # Store: RecordBatch + eq-index + reload
│       ├── handlers.rs             # actix routes
│       └── bin/datapress-datafusion.rs
└── python/                         # datapress (Python wheel, cdylib)
    └── src/lib.rs                  # PyO3 bindings: DataPress, DataPressConfig, ...
```

Core re-exports compile without any backend; each backend crate adds the
feature flag it needs on `datapress-core`. The Python crate depends on both
backends, so the wheel can dispatch between them at runtime based on
`DataPressConfig(backend=...)`.

---

## Build flags

```bash
# DuckDB only
cargo build --release -p datapress-duckdb

# DataFusion only
cargo build --release -p datapress-datafusion

# Both Rust binaries
task build

# Python wheel (compiles both backends into one extension)
task py:develop     # editable install into ./.venv (uses uv + maturin)
task py:build       # release wheel into ./target/wheels/
```

Release builds use LTO + `codegen-units = 1` (see `[profile.release]` in
`Cargo.toml`). Expect noticeably longer link times in exchange for tighter
inner loops.

---

## Environment variables

| Variable          | Default          | Purpose                                                                          |
|-------------------|------------------|----------------------------------------------------------------------------------|
| `DATASETS_CONFIG` | `datasets.toml`  | Path to the dataset registry file.                                               |
| `ADMIN_TOKEN`     | *(unset)*        | Enables `POST /api/v1/datasets/{name}/reload`. Unset = admin endpoints disabled. |
| `DB_POOL_SIZE`    | `num_cpus`       | DuckDB connection pool size (DuckDB only).                                       |
| `RUST_LOG`        | `info`           | Standard `env_logger` filter.                                                    |
| `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` | *(unset)* | Fallback S3 credentials used by any dataset that doesn't override them. |
| `AWS_REGION`      | `us-east-1`      | Fallback S3 region.                                                              |
| `${PREFIX}_AWS_*` | *(unset)*        | Per-dataset overrides for the four `AWS_*` vars above. See "Credential precedence" under `[dataset.s3]`. |

Bind address, port, worker count and backend selection live in `[server]`
in `datasets.toml`, not in env vars.
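The override-then-fallback behaviour of the `AWS_*` variables can be sketched as follows. This covers only the environment-variable layer from the table. The full "Credential precedence" rules under `[dataset.s3]` may include further sources that are not modelled here. `resolve_aws`, `resolve_region` and the example prefix are hypothetical names chosen for illustration.

```rust
/// Resolves one `AWS_*` setting for a dataset: the per-dataset
/// `${PREFIX}_AWS_*` variable wins, then the global `AWS_*` variable.
/// `lookup` abstracts the environment so the function is easy to test;
/// pass `|k| std::env::var(k).ok()` to read the real process environment.
pub fn resolve_aws(
    lookup: impl Fn(&str) -> Option<String>,
    prefix: Option<&str>,
    suffix: &str, // e.g. "REGION" or "ACCESS_KEY_ID"
) -> Option<String> {
    let global = format!("AWS_{suffix}");
    prefix
        .and_then(|p| lookup(&format!("{p}_{global}")))
        .or_else(|| lookup(&global))
}

/// The region additionally falls back to the documented default.
pub fn resolve_region(
    lookup: impl Fn(&str) -> Option<String>,
    prefix: Option<&str>,
) -> String {
    resolve_aws(lookup, prefix, "REGION").unwrap_or_else(|| "us-east-1".to_string())
}
```

With a hypothetical prefix `SALES`, `SALES_AWS_REGION` would take priority over `AWS_REGION`, and `us-east-1` applies only when neither is set.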

---

## Status / non-goals

- No authentication or rate-limiting on query routes; put this behind your
  own gateway. The `reload` admin route is gated by a shared-secret header
  (`X-Admin-Token`) and disabled unless `ADMIN_TOKEN` is set.
- No write path: dataset sources are read-only. The only mutation is
  reloading a dataset from its configured source via the admin route.
- No cursor pagination: pagination is plain `OFFSET / LIMIT`, so deep
  pages get expensive (see `H5` in `TEST_Q.md`). `ORDER BY` is supported via
  the `order_by` field, but sorted queries always go through the SQL engine
  (no in-memory fast path).
- The DataFusion backend keeps the whole dataset in memory. The DuckDB
  backend does not.
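To see why deep pages get expensive under plain `OFFSET / LIMIT`, the sketch below translates `page` and `page_size` into a limit and offset. It assumes pages are 1-based, as the `"page": 1` example suggests. The function names are hypothetical and the cost figure is only a rough lower bound.

```rust
/// Translates a 1-based `page` and a `page_size` into `(limit, offset)`.
/// Assumption: page 0 is treated like page 1.
pub fn limit_offset(page: u64, page_size: u64) -> (u64, u64) {
    let offset = page.saturating_sub(1).saturating_mul(page_size);
    (page_size, offset)
}

/// Rough lower bound on the rows an engine must produce for one page:
/// the `offset` rows it skips plus the `limit` rows it returns.
pub fn rows_produced(page: u64, page_size: u64) -> u64 {
    let (limit, offset) = limit_offset(page, page_size);
    offset.saturating_add(limit)
}
```

Under these assumptions, page 1000 with a page size of 100 means skipping 99,900 rows to return 100, which is the cost a cursor-based scheme would avoid.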