https://github.com/jeroenflvr/datapress
A config-driven Rust server that publishes Parquet and Delta datasets as fast, typed HTTP APIs from local disk or object storage, with interchangeable DuckDB or Arrow+DataFusion backends, JSON and Arrow IPC output, and production-ready features like auth, metrics, and hot reloads.
api arrow arrow-ipc authz datafusion deltalake duckdb http in-memory parquet s3 sql
- Host: GitHub
- URL: https://github.com/jeroenflvr/datapress
- Owner: jeroenflvr
- License: mit
- Created: 2026-05-23T06:19:14.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-06-05T05:45:04.000Z (19 days ago)
- Last Synced: 2026-06-05T07:27:30.318Z (19 days ago)
- Topics: api, arrow, arrow-ipc, authz, datafusion, deltalake, duckdb, http, in-memory, parquet, s3, sql
- Language: Rust
- Homepage: https://docs.datap-rs.org
- Size: 2.16 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md



# datap-rs
A Rust **Cargo workspace** that exposes one or more **Parquet / Delta
datasets** over a JSON HTTP API. The same surface area is implemented twice —
once on top of **DuckDB**, once on top of **Apache Arrow + DataFusion** — so
you can A/B the engines under identical workloads. A Python wheel
(`datap-rs`, built with maturin + PyO3) bundles both engines and lets you
configure and launch the server from Python.
**[Overview presentation → datap-rs.org](https://datap-rs.org)** ·
[Documentation](https://docs.datap-rs.org)
- Built on [actix-web](https://actix.rs/) 4
- Datasets declared in a single [`datasets.toml`](datasets.toml) (Rust
binaries) or programmatically (Python wrapper)
- Dynamic schema inference at startup (no hard-coded columns)
- Identical request/response shapes across both backends
---
## Quick start
For testing, we use the [Kaggle US Accidents (2016-2023)](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) dataset.
```bash
# 1. Put a parquet file somewhere (or point the config at an existing one).
ls data/accidents.parquet
# 2. Edit datasets.toml — see the example shipped in this repo.
# 3. Run a backend.
task run:duckdb # or: task run:datafusion
# 4. Talk to it.
curl http://localhost:8080/api/v1/datasets
```
`Taskfile.yml` wraps the typical `cargo build --release -p …` invocations;
see [`task --list`](Taskfile.yml) for the full menu.
### Install the prebuilt binary
The quickest way to get the unified `datapress` binary (both backends bundled,
selected at runtime via `server.backend`) without a Rust toolchain:
```bash
# Linux / macOS
curl -LsSf https://datap-rs.org/install.sh | sh
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install.ps1 | iex"
# Homebrew (macOS / Linux)
brew install jeroenflvr/tap/datapress
# winget (Windows)
winget install datap-rs.DataPress
```
The install scripts drop the binary in a per-user directory (`~/.local/bin` on
Unix, `%LOCALAPPDATA%\datapress\bin` on Windows) and tell you how to add it to
your `PATH`. See [packaging/](packaging/) for details and release automation.
Prefer cargo? Install from crates.io:
```bash
cargo install datapress # both DuckDB + DataFusion
datapress # reads ./datasets.toml (or $DATASETS_CONFIG)
```
For a slimmer single-backend build, or to opt into the docs / Swagger /
metrics / auth features:
```bash
cargo install datapress --no-default-features --features duckdb
cargo install datapress --features swagger,auth,metrics
```
The installed binary resolves its config from (first match wins)
`--config <path>`, `$DATAPRESS_CONFIG_FILE`, `./datasets.toml`, then
`$HOME/datasets.toml`. Generate a starter template with `datapress init`
(writes `datasets.toml.template` to the given directory, or `$HOME` when omitted):
```bash
datapress init # ~/datasets.toml.template
cp ~/datasets.toml.template ~/datasets.toml # then edit and run `datapress`
```
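The lookup order above can be pictured with a small sketch. This is not DataPress code: `resolve_config` and its parameters are hypothetical names, and it assumes that an explicit `--config` value or environment variable is taken as-is while the two fallback locations are only used when the file exists.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config(cli_path: Optional[str], env: Mapping[str, str],
                   cwd: Path, home: Path) -> Optional[Path]:
    """Return the first config candidate that applies, or None."""
    # 1. An explicit --config value always wins.
    if cli_path:
        return Path(cli_path)
    # 2. Then the environment variable.
    env_path = env.get("DATAPRESS_CONFIG_FILE")
    if env_path:
        return Path(env_path)
    # 3. Then ./datasets.toml, then $HOME/datasets.toml (assumed: only if present).
    for candidate in (cwd / "datasets.toml", home / "datasets.toml"):
        if candidate.is_file():
            return candidate
    return None
```

Called as `resolve_config(args.config, os.environ, Path.cwd(), Path.home())`, it returns the path a first-match-wins lookup would settle on.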
### From Python
The same server can be configured and launched from Python via the
`datapress` wheel (one wheel, both engines bundled):
```python
import asyncio

from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig


async def main():
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",  # or "delta"
        mode="auto",       # index policy: "auto" | "none" | "list"
    )
    cfg = DataPressConfig(backend="duckdb", listen="0.0.0.0", port=8000, workers=8)
    server = DataPress(cfg, datasets=[ds])
    await server.run()  # blocks until SIGINT


asyncio.run(main())
```
Build the wheel with `task py:develop` (uses `uv` + `maturin`).
---
## The two backends
| Aspect | `datapress-duckdb` | `datapress-datafusion` |
|---------------------|------------------------------------------------|------------------------------------------------------|
| Engine | DuckDB (embedded C++) | Arrow compute + DataFusion (pure Rust) |
| Storage | DuckDB in-memory table per dataset | One contiguous `RecordBatch` per dataset |
| Concurrency model | Connection pool, blocking → `web::block` | Async-native, multi-threaded `MemTable` partitions |
| Predicate execution | DuckDB optimiser + parallel hash/vector ops | Equality index → SIMD scan → DataFusion SQL |
| Indexes | Native DuckDB internals (zone maps, etc.) | Per-dataset eq-index built at startup (configurable) |
| Memory profile | DuckDB's own buffer manager | Whole dataset resident in RAM |
| Binary size | Bundled DuckDB ≈ tens of MB | Lean — pure Rust |
| Startup time | Fast (just `read_parquet`) | Slower — reads all rows + builds eq-index |
| Best at | Heterogeneous SQL, joins, aggregations | Dense filter scans, low-latency point lookups |
### When to pick which
- **DuckDB** is the right default. It handles arbitrary SQL well, has a
battle-tested optimiser, manages memory itself, and starts up in
milliseconds because it lazily reads parquet pages on demand.
- **DataFusion** shines when:
- the dataset fits comfortably in RAM,
- you query the same columns repeatedly with equality/`IN` predicates
(the in-process equality index turns those into O(1) lookups), and
- you want a single static binary without a vendored C++ runtime.
The HTTP API is identical, so the practical comparison is "throughput and
p99 on your queries" — see [`TEST_Q.md`](TEST_Q.md) for a benchmark suite.
---
## Configuration: `datasets.toml`
Every instance reads this file at startup. One `[server]` block plus one
`[[dataset]]` entry per table you want to expose.
```toml
[server]
backend = "datafusion" # "datafusion" (default) | "duckdb"
listen = "127.0.0.1" # default; set to "0.0.0.0" to expose
port = 8080
# workers = 8 # omit for one worker per CPU
# compress = true # negotiate gzip/brotli/zstd via Accept-Encoding (default)
# max_body_bytes = 1048576 # 413 above this; default 1 MiB
# max_page_size = 100000 # clamp query page_size above this
# request_timeout_ms = 30000 # 504 above this; 0 disables; default 30s
# shutdown_timeout_secs = 30 # SIGTERM/SIGINT grace period, in seconds
# DuckDB backend only: enable the experimental Quack remote protocol.
# [server.quack]
# enabled = false
# uri = "quack:localhost"
# token = "change-me"
# read_only = true
[[dataset]]
name = "accidents" # used in the URL: /api/v1/datasets/accidents/...
[dataset.source]
kind = "parquet" # "parquet" | "delta"
location = "data/accidents.parquet" # file, directory of *.parquet, or s3://…
# Optional — DataFusion only. DuckDB ignores this block.
[dataset.index]
mode = "auto" # "auto" | "none" | "list"
columns = [] # required when mode = "list"
max_cardinality = 100000 # used by "auto" to skip wide cols
```
### Server
| Field | Default | Notes |
|-----------|---------------|------------------------------------------------------------------------------------------------|
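| `backend` | `datafusion` | Selects the engine in the unified `datapress` binary (both backends bundled). For the per-backend binaries (`datapress-duckdb`, `datapress-datafusion`) it is only an informational hint logged at startup; each of those always runs as its own backend regardless of this value. |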
| `listen` | `127.0.0.1` | Loopback by default — the service is **not** exposed on a network interface unless you opt in. |
| `port` | `8080` | |
| `workers` | *(unset)* | Actix worker threads. Unset = one per CPU. |
| `prefix` | `""` | URL path prefix mounted in front of every route (e.g. `"/datapress"`) — useful behind a reverse proxy that passes the path through unchanged. Must start with `/` and not end with `/`. |
| `compress` | `true` | Negotiate response compression via `Accept-Encoding` (gzip / brotli / zstd). Disable when sitting behind a proxy that compresses for you. |
| `max_body_bytes` | `1048576` | Maximum accepted JSON request body, in bytes. Bigger bodies are rejected with `413 Payload Too Large`. |
| `max_page_size` | `100000` | Maximum rows returned by one `/query` page. Larger `page_size` values are clamped. |
| `request_timeout_ms` | `30000` | Per-request handler timeout, in milliseconds. Long-running handlers are cancelled and the client gets `504 Gateway Timeout`. `0` disables the timeout. |
| `shutdown_timeout_secs` | `30` | Grace period for in-flight requests after the process receives `SIGTERM` / `SIGINT`, in seconds. The listening socket is closed immediately; existing connections then have up to this many seconds to finish before workers are force-stopped. |
DuckDB builds can also opt into `[server.quack]`, DuckDB's experimental
remote protocol server. Keep it disabled unless you intentionally want
DuckDB clients to attach/query this process directly. It binds to
`quack:localhost` by default, uses token authentication, and DataPress
installs a read-only authorization hook by default.
The server exposes three probe endpoints plus a `/version` metadata
endpoint. `/healthz` and `/readyz` are mounted at the bare host root
(regardless of `prefix`) so orchestrators don't need to know how the
service is exposed. `/health` lives under `prefix` and is intended for
in-app health checks.
| Route | Purpose / status | Body |
|------------|------------------------------------------------------------------------|----------------------------------------------------------------------------|
| `/healthz` | Liveness — always `200` while the process is running. | `{"status":"ok"}` |
| `/readyz` | Readiness — `200` once at least one dataset is registered, `503` otherwise. | `{"status":"ready","datasets":N}` / `{"status":"not ready","reason":"no datasets registered"}` |
| `/version` | Build / version metadata — always `200`. | `{"name":"datapress-core","version":"x.y.z","backend":"DuckDB\|DataFusion","profile":"debug\|release", ...}` |
| `{prefix}/health` | App-level liveness — always `200`. | `{"status":"ok"}` |
`/healthz` does not touch the backend, so it stays `200` even while the
dataset registry is still loading at startup. Use `/readyz` to gate
traffic until the server is actually able to serve queries.
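A deployment script might gate traffic on `/readyz` along these lines. This is a minimal sketch using only the standard library; `http_status` and `wait_until_ready` are hypothetical helper names, and the retry count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # e.g. 503 while no dataset is registered
    except OSError:
        return 0          # connection refused, timeout, DNS failure


def wait_until_ready(probe: Callable[[], int], attempts: int = 30,
                     delay_s: float = 1.0) -> bool:
    """Poll `probe` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay_s)
    return False
```

Typical use would be `wait_until_ready(lambda: http_status("http://localhost:8080/readyz"))`. Passing the probe in as a callable keeps the polling loop testable without a running server.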
`/version` also includes optional fields populated from build-time env
vars when set: `git_sha` (`DATAPRESS_GIT_SHA`), `build_time`
(`DATAPRESS_BUILD_TIME`, ISO-8601), and `target`
(`DATAPRESS_TARGET`, e.g. `aarch64-apple-darwin`). Unset vars are
omitted from the JSON. Example:
```bash
DATAPRESS_GIT_SHA=$(git rev-parse --short HEAD) \
DATAPRESS_BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
DATAPRESS_TARGET=$(rustc -vV | awk '/host:/ {print $2}') \
cargo build --release -p datapress-duckdb
```
### Online documentation
DataPress can embed two browsable sources of documentation into the
binary itself:
- An [MkDocs Material](https://squidfunk.github.io/mkdocs-material/)
site (the one you are reading) at `[docs].path` (default `/mkdocs`).
- An interactive [Swagger UI](https://swagger.io/tools/swagger-ui/)
with a hand-written OpenAPI spec at `[swagger].path` (default
`/docs`). The raw spec is also exposed at `/openapi.json`.
Both are opt-in at build time (so wheels stay slim when you don't
want them) and **enabled by default at runtime** once compiled in —
set `enabled = false` to disable in prod.
1. Build the MkDocs site (only needed for the `docs` feature):
```bash
task docs:build
```
2. Build the backend with one or both features:
```bash
cargo build --release -p datapress-duckdb --features docs,swagger
```
3. Tweak in `datasets.toml` if you want to relocate or disable either:
```toml
[docs]
enabled = true # default: true
path = "/mkdocs" # default: /mkdocs
[swagger]
enabled = true # default: true (set to false in prod)
path = "/docs" # default: /docs
```
Both `path` values must start with `/`, not end with `/`, not collide
with `/api`, `/api/v1`, `/health{z,}`, `/readyz`, or `/version`, and
must differ from each other. When the binary is built without the
relevant feature but the TOML enables it, the server logs a warning at
startup and continues without that surface.
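The path rules above are easy to check before deploying a config. The sketch below is illustrative only: `validate_ui_paths` is a hypothetical helper, and it assumes "collide" means an exact match against the reserved routes (the server may also reject nested prefixes).

```python
RESERVED = ("/api", "/api/v1", "/health", "/healthz", "/readyz", "/version")


def validate_ui_paths(docs_path: str, swagger_path: str) -> list[str]:
    """Return a list of rule violations; an empty list means both paths pass."""
    errors = []
    for label, path in (("docs", docs_path), ("swagger", swagger_path)):
        if not path.startswith("/"):
            errors.append(f"{label}: must start with '/'")
        if path.endswith("/"):
            errors.append(f"{label}: must not end with '/'")
        if path in RESERVED:
            errors.append(f"{label}: collides with {path}")
    if docs_path == swagger_path:
        errors.append("docs and swagger paths must differ")
    return errors
```

With the defaults, `validate_ui_paths("/mkdocs", "/docs")` returns an empty list.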
### Authentication (OIDC / OAuth2)
Build with `--features auth` to enable JWT bearer enforcement against
any OpenID-Connect issuer (Entra ID, Auth0, Keycloak, Okta, …). When
enabled, the server fetches the issuer's JWKS at startup, refreshes it
in the background, and validates `Authorization: Bearer <token>` headers
against the configured issuer, audience, algorithms, and scopes.
```toml
[auth]
enabled = true
issuer = "https://login.microsoftonline.com/<tenant-id>/v2.0"
audience = "api://datapress"
algorithms = ["RS256"]
read_scopes = ["datasets:read"]
reload_scopes = ["datasets:reload"]
anonymous_read = false # set true to keep read endpoints public
tenant_claim = "/tid" # JSON-pointer into the JWT claims
allowed_tenants = ["<tenant-id>"]
admin_token_fallback = true # keep X-Admin-Token working in parallel
```
Health probes (`/healthz`, `/readyz`, `/version`) stay unauthenticated
so load balancers keep working. The legacy `X-Admin-Token` header keeps
working for `POST .../reload` as long as `admin_token_fallback = true`.
To turn the Swagger UI itself into an SSO client, add a `[swagger.oauth2]`
block — it gets rendered as an `OpenIdConnect` security scheme with PKCE.
### Source
`[dataset.source]` is a tagged enum.
| `kind` | `location` | Notes |
|-----------|-----------------------------------------------------|----------------------------------------------------------------------------------------|
| `parquet` | a `.parquet` file | Read as-is. |
| `parquet` | a directory | Every `*.parquet` inside (sorted, non-recursive). No glob patterns. |
| `parquet` | `s3://bucket/key.parquet` or `s3://bucket/prefix/` | Requires a `[dataset.s3]` block. DuckDB autoloads `httpfs`. |
| `delta` | a local directory | Pointed at the table root (the dir containing `_delta_log/`). |
| `delta` | `s3://bucket/path/to/table` | Requires `[dataset.s3]`. DuckDB autoloads `delta`; DataFusion uses the `deltalake` crate. |
#### S3 / S3-compatible storage
```toml
[[dataset]]
name = "events"
[dataset.source]
kind = "parquet" # or "delta"
location = "s3://events/2025/" # a key (s3://bucket/key.parquet) or a prefix
[dataset.s3]
region = "us-east-1"
endpoint = "http://localhost:9000" # omit for AWS
addressing_style = "path" # "virtual" (default) | "path"
allow_http = true # only for non-https endpoints
```
| Field | Default | Notes |
|--------------------|---------------|--------------------------------------------------------------------------------|
| `region` | `us-east-1` | Falls back to `AWS_REGION` env, then `us-east-1`. |
| `endpoint` | *(unset)* | Custom S3 endpoint (MinIO, R2, Wasabi, Backblaze, …). |
| `addressing_style` | `virtual` | `virtual` = `https://bucket.host`, `path` = `https://host/bucket` (MinIO). |
| `allow_http` | `false` | Must be `true` if `endpoint` is `http://…`. |
| `partitioning` | `auto` | Hive partition discovery: `auto`, `hive` (force on), `none` (force off). |
| `endpoint_bucket_in_host` | `auto` | Fold the bucket into the endpoint host: `auto` (follows `addressing_style`), `true`, `false`. |
| `access_key_id`, `secret_access_key`, `session_token` | *(unset)* | Inline creds. Discouraged for prod — use env vars instead. |
**Credential precedence** (highest → lowest):
1. Per-dataset env vars: `${PREFIX}_AWS_ACCESS_KEY_ID`, `${PREFIX}_AWS_SECRET_ACCESS_KEY`, `${PREFIX}_AWS_SESSION_TOKEN`, `${PREFIX}_AWS_REGION`.
`PREFIX` is the dataset name uppercased with every non-alphanumeric character mapped to `_` (e.g. `accidents` → `ACCIDENTS_AWS_…`, `my-bucket` → `MY_BUCKET_AWS_…`).
2. Inline `[dataset.s3]` keys.
3. Plain `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_REGION`.
4. The backend's default credential chain (`~/.aws/credentials`, IMDS, etc.).
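The prefix rule and the first three precedence steps can be sketched as follows. The helper names are hypothetical, the regex assumes ASCII dataset names, and returning `None` stands in for step 4 (the backend's default credential chain), which is not modelled here.

```python
import re
from typing import Mapping, Optional


def env_prefix(dataset_name: str) -> str:
    """Uppercase the name and map every non-alphanumeric character to '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", dataset_name).upper()


def resolve_setting(key: str, dataset_name: str, inline: Mapping[str, str],
                    env: Mapping[str, str]) -> Optional[str]:
    """Resolve one AWS_* setting using steps 1-3 of the precedence list.

    Returns None when nothing matches, i.e. when the backend's default
    credential chain (step 4) would take over.
    """
    inline_key = key.removeprefix("AWS_").lower()   # AWS_REGION -> region
    candidates = (
        env.get(f"{env_prefix(dataset_name)}_{key}"),  # 1. per-dataset env var
        inline.get(inline_key),                        # 2. inline [dataset.s3] key
        env.get(key),                                  # 3. plain AWS_* env var
    )
    return next((value for value in candidates if value), None)
```

For a dataset named `my-bucket`, `env_prefix` yields `MY_BUCKET`, so `MY_BUCKET_AWS_REGION` would override both the inline `region` and a plain `AWS_REGION`.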
> **Python:** the `S3Config` binding also accepts a `credentials_provider` — a zero-argument callable returning an `HMACKeyPair`. It is invoked once when `DataPress(...)` is constructed, the result is cached indefinitely, and it overrides any inline `access_key_id` / `secret_access_key`. See the [Python S3 docs](https://docs.datap-rs.org/python/config/#s3config).
> When `kind = "delta"` and `location` is an `s3://…` URL, both backends fully materialise the table at startup. There is no incremental scan path — switch to `parquet` if you need on-demand page reads.
### Equality-index policy (DataFusion only)
The DataFusion backend builds an in-memory `value -> [row ids]` map at
startup so that `eq` / `in` predicates resolve in O(1).
| `mode` | Behaviour |
|----------|------------------------------------------------------------------------|
| `auto` | Index every column whose distinct count stays below `max_cardinality`. |
| `none` | Skip the index entirely — every query goes through DataFusion SQL. |
| `list` | Index only the named `columns`. Useful for huge datasets. |
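To make the idea concrete, here is a toy version of a `value -> [row ids]` index in plain Python. It is a sketch of the technique, not the backend's Rust implementation: the function names are hypothetical, and treating "stays below `max_cardinality`" as a strict comparison is an assumption.

```python
from collections import defaultdict
from typing import Any, Optional, Sequence


def build_eq_index(values: Sequence[Any],
                   max_cardinality: Optional[int] = None) -> Optional[dict]:
    """Map each distinct value of one column to the row ids holding it.

    Returns None as soon as the distinct count reaches `max_cardinality`,
    which is how this sketch models "auto" skipping wide columns.
    """
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
        if max_cardinality is not None and len(index) >= max_cardinality:
            return None
    return dict(index)


def lookup_in(index: dict, wanted: Sequence[Any]) -> list[int]:
    """Resolve an `in` predicate: one hash lookup per value, no row scan."""
    rows = [row_id for value in wanted for row_id in index.get(value, [])]
    return sorted(rows)
```

An `eq` predicate is the single-value case of `lookup_in`. The cost depends on the number of requested values and matching rows, not on the size of the dataset.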
Override the config path with `DATASETS_CONFIG=/path/to/file.toml`.
## HTTP API
Five routes, identical on both backends: dataset listing, schema, query, count, and the admin-only reload.
### API versioning
The canonical paths live under `/api/v1/...`. The un-versioned
`/api/...` paths continue to work as a **legacy alias** for v1, so
existing clients keep running. To upgrade, replace `/api/` with
`/api/v1/` in your URLs — nothing else changes.
```text
POST /api/v1/datasets/accidents/query # canonical (recommended)
POST /api/datasets/accidents/query # legacy alias, still v1
```
When a breaking schema change is introduced, it will ship as `/api/v2`
in a module sibling to [crates/core/src/handlers/v1.rs](crates/core/src/handlers/v1.rs),
and v1 will stay mounted alongside it for a deprecation window.
### `GET /api/v1/datasets`
```json
{ "datasets": [ { "name": "accidents", "columns": 47 } ] }
```
### `GET /api/v1/datasets/{name}/schema`
Returns the inferred columns plus a sample row so a client can see what
values look like without issuing a query.
```json
{
"name": "accidents",
"columns": [
{ "name": "ID", "logical": "utf8", "sql_type": "VARCHAR", "nullable": false },
{ "name": "Severity", "logical": "int", "sql_type": "INTEGER", "nullable": true },
{ "name": "Start_Time", "logical": "temporal", "sql_type": "TIMESTAMP", "nullable": true }
],
"sample": { "ID": "A-1", "Severity": 2, "Start_Time": "2016-02-08 05:46:00", ... }
}
```
`logical` values: `bool | int | float | utf8 | temporal | other`. Temporal
columns are returned as strings.
### `POST /api/v1/datasets/{name}/query`
```json
{
"columns": ["ID", "City", "State", "Severity"],
"predicates": [
{ "col": "State", "op": "eq", "val": "TX" },
{ "col": "Severity", "op": "gte", "val": 3 }
],
"order_by": [
{ "col": "Severity", "dir": "desc" },
{ "col": "ID" }
],
"limit": 1000,
"page": 1,
"page_size": 50
}
```
Response:
```json
{ "data": [ { ... }, ... ], "page": 1, "page_size": 50 }
```
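A client that wants every matching row has to walk the pages itself. The generator below is one way to do that, under the assumption that a page shorter than `page_size` marks the end of the result set; `iter_rows` and `fetch_page` are hypothetical names, and `fetch_page(page, page_size)` is expected to return the decoded JSON body shown above.

```python
from typing import Callable, Iterator


def iter_rows(fetch_page: Callable[[int, int], dict],
              page_size: int = 1000) -> Iterator[dict]:
    """Yield rows page by page until a short or empty page signals the end."""
    page = 1
    while True:
        body = fetch_page(page, page_size)
        rows = body["data"]
        yield from rows
        if len(rows) < page_size:   # assumed end-of-results signal
            return
        page += 1
```

With `requests`, `fetch_page` could be a small function that posts `{"page": page, "page_size": page_size, ...}` to the `/query` route and returns `response.json()`.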
#### Request fields
| Field | Type | Default | Notes |
|--------------|---------------------|---------|----------------------------------------|
| `columns` | `string[]` | `[]` | Empty = all columns. |
| `predicates` | `Predicate[]` | `[]` | ANDed together. |
| `order_by` | `OrderBy[]` | `[]` | `{ col, dir? }`; `dir` is `asc` (default) or `desc`, case-insensitive. When `group_by` is set, `col` must be a group column or aggregation alias. |
| `group_by` | `string[]` | `[]` | Columns to group by. When set, `columns` is ignored. Empty `aggregations` implies `[{ op: "count" }]`. |
| `aggregations` | `Aggregation[]` | `[]` | `{ col?, op, alias? }`; `op` is `count\|sum\|avg\|min\|max`. `col` may be omitted only for `count` (= `COUNT(*)`). Requires `group_by`. |
| `distinct` | `bool` | `false` | Dedup the projected columns. Mutually exclusive with `group_by` / `aggregations`. |
| `limit` | `int >= 0` or null | `null` | Hard cap on total rows across all pages. `null` = unlimited. |
| `page` | `int >= 1` | `1` | 1-based. |
| `page_size` | `int >= 1` | `1000` | Clamped to `server.max_page_size` (`100_000` by default). |
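How `page`, `page_size`, and `limit` might combine into an `OFFSET / LIMIT` window is sketched below. This is an interpretation of the table, not the server's code: `page_window` is a hypothetical helper, and the exact interaction between `limit` and the last page is an assumption.

```python
from typing import Optional


def page_window(page: int, page_size: int, limit: Optional[int] = None,
                max_page_size: int = 100_000) -> tuple[int, int]:
    """Return (offset, row_count) for one page of an OFFSET / LIMIT query."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    size = min(page_size, max_page_size)   # clamp oversized pages
    offset = (page - 1) * size
    if limit is not None:
        # Never return rows beyond the total cap across all pages.
        size = max(0, min(size, limit - offset))
    return offset, size
```

For the request above (`limit: 1000`, `page_size: 50`), page 1 maps to `(0, 50)` and page 21 to `(1000, 0)`, an empty page.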
#### Predicate shape
```json
{ "col": "<column>", "op": "<op>", "val": <value> }
```
| `op` | `val` | Meaning |
|----------------|------------------------|--------------------------------------|
| `eq` | scalar | `col = val` |
| `neq` | scalar | `col <> val` |
| `gt` / `gte` | number / string | `col > val` / `col >= val` |
| `lt` / `lte` | number / string | `col < val` / `col <= val` |
| `like` | string with `%` / `_` | SQL `LIKE` |
| `ilike` | string with `%` / `_` | Case-insensitive `LIKE` |
| `in` | non-empty array | `col IN (v1, v2, …)` |
| `is_null` | omit | `col IS NULL` |
| `is_not_null` | omit | `col IS NOT NULL` |
Column names are looked up case-insensitively against the inferred schema
and quoted automatically, so `Temperature(F)` and similar identifiers work.
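The mapping from predicates to SQL can be pictured with a short sketch. It is not the project's SQL builder: `build_where` is a hypothetical function, and the use of `?` bind parameters and double-quoted identifiers is an assumption about one reasonable way to implement the table above.

```python
from typing import Any

BINARY_OPS = {"eq": "=", "neq": "<>", "gt": ">", "gte": ">=", "lt": "<",
              "lte": "<=", "like": "LIKE", "ilike": "ILIKE"}
NULL_OPS = {"is_null": "IS NULL", "is_not_null": "IS NOT NULL"}


def build_where(predicates: list[dict], schema: list[str]) -> tuple[str, list[Any]]:
    """Turn predicates into a parameterised WHERE clause plus its bind values."""
    by_lower = {name.lower(): name for name in schema}   # case-insensitive lookup
    clauses, params = [], []
    for pred in predicates:
        name = by_lower.get(pred["col"].lower())
        if name is None:
            raise ValueError(f"unknown column: {pred['col']}")
        ident = '"' + name.replace('"', '""') + '"'      # quote the identifier
        op = pred["op"]
        if op in NULL_OPS:
            clauses.append(f"{ident} {NULL_OPS[op]}")
        elif op == "in":
            values = pred["val"]
            if not isinstance(values, list) or not values:
                raise ValueError("'in' needs a non-empty array")
            clauses.append(f"{ident} IN ({', '.join('?' * len(values))})")
            params.extend(values)
        elif op in BINARY_OPS:
            clauses.append(f"{ident} {BINARY_OPS[op]} ?")
            params.append(pred["val"])
        else:
            raise ValueError(f"unsupported op: {op}")
    return " AND ".join(clauses), params                 # predicates are ANDed
```

Keeping values in a separate parameter list means user input never has to be spliced into the SQL text. Only identifiers are interpolated, and those come from the inferred schema.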
#### Response format — JSON or Arrow IPC
`/query` can return its result set in two wire formats. Same body, same
predicates, same pagination — only the response encoding differs.
| Aspect | JSON (default) | Arrow IPC stream |
|---------------------|------------------------------------------------------|----------------------------------------------------------------------------------|
| Content-Type | `application/json` | `application/vnd.apache.arrow.stream` |
| How to ask | nothing — it's the default | `Accept: application/vnd.apache.arrow.stream` **or** `?format=arrow` on the URL |
| Shape | Envelope object whose `data` field is an array of row objects (`{"data": [{...}, ...], ...}`) | Self-describing stream: 1 schema message + N `RecordBatch` messages + EOS |
| Layout | Row-oriented; column names repeated on every row | Columnar; one contiguous buffer per column per batch |
| Types preserved | Scalars become JSON (`int`/`float`/`bool`/`string`); temporals stringified to ISO-8601 | Native Arrow types — `Int32`, `Timestamp(ns)`, `Decimal128`, dictionary, etc. retained end-to-end |
| Page metadata | In the body, next to `data`: `page`, `page_size` | In headers: `X-Page`, `X-Page-Size` |
| Empty result | `"data": []` | Valid stream with the schema message only, zero batches |
| Compression | Big win — JSON is text | Smaller starting point; gzip/zstd still help on wide / repetitive cols, brotli usually skipped |
| Client cost | `json.loads` + per-row dict construction | `pyarrow.ipc.open_stream(...).read_all()` → zero-copy `pyarrow.Table` |
| Best for | Small responses, browsers, ad-hoc `curl`, dashboards | Bulk data into Polars / pandas / DuckDB-on-the-client, ML feature pipelines |
**When to pick which.** Use JSON when the consumer is JavaScript, the
response is small (<~10k rows), or you're poking at the API by hand.
Use Arrow IPC when you're moving result pages into a dataframe library,
the schema has non-string types you want preserved, or page sizes are
large enough that JSON parse time shows up in profiles.
```bash
# JSON (default)
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'
# Arrow IPC — via Accept header
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-H 'Accept: application/vnd.apache.arrow.stream' \
--output result.arrow \
-d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'
# Arrow IPC — via query string (handy when you can't set headers)
curl -X POST 'http://localhost:8080/api/v1/datasets/accidents/query?format=arrow' \
-H 'Content-Type: application/json' \
--output result.arrow \
-d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'
```
```python
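import requests
import pyarrow.ipc as ipc

url = "http://localhost:8080/api/v1/datasets/accidents/query"
req = {"predicates": [{"col": "State", "op": "eq", "val": "TX"}]}

r = requests.post(url, json=req, headers={"Accept": "application/vnd.apache.arrow.stream"})
r.raise_for_status()  # fail loudly on 4xx / 5xx instead of parsing an error body
table = ipc.open_stream(r.content).read_all() # → pyarrow.Table
page = int(r.headers["X-Page"])
size = int(r.headers["X-Page-Size"])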
```
Supported on **both** backends — DuckDB streams batches out via its
native `query_arrow` API, DataFusion uses its Arrow plan directly.
The `Compress` middleware still applies. `count`, `schema`, and the
dataset-listing endpoints are JSON-only.
#### Grouping / aggregation
When `group_by` is non-empty the SELECT list is derived from the group
columns plus each aggregation's output alias — the top-level `columns`
field is ignored. Supported ops: `count`, `sum`, `avg`, `min`, `max`
(case-insensitive). `col` may be omitted only for `count` (= `COUNT(*)`).
If `aggregations` is omitted an implicit `COUNT(*) AS count` is added.
```bash
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"group_by": ["State"],
"aggregations": [
{ "op": "count" },
{ "col": "Severity", "op": "avg", "alias": "avg_sev" }
],
"order_by": [{ "col": "count", "dir": "desc" }],
"page_size": 10
}'
# → { "data": [ { "State": "CA", "count": 1741433, "avg_sev": 2.21 }, ... ], ... }
```
`aggregations` without `group_by` returns `400`. `order_by` keys must
reference a group column or an aggregation alias (no arbitrary dataset
columns — they are not in scope after `GROUP BY`). Grouped queries always
go through the SQL engine; no in-memory fast path applies.
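The validation rules for grouped queries can be summarised in a few lines. The sketch below is a client-side illustration, not the server's validator: the function names are hypothetical, `ValueError` stands in for a `400` response, and falling back to the op name when `alias` is missing is an assumption that only matches the documented `COUNT(*) AS count` case.

```python
AGG_OPS = {"count", "sum", "avg", "min", "max"}


def grouped_output_columns(req: dict) -> list[str]:
    """Validate a request and return the column names its rows will carry."""
    group_by = req.get("group_by", [])
    aggs = req.get("aggregations", [])
    if aggs and not group_by:
        raise ValueError("aggregations require group_by")
    if req.get("distinct") and (group_by or aggs):
        raise ValueError("distinct cannot be combined with group_by / aggregations")
    if not group_by:
        return list(req.get("columns", []))
    aggs = aggs or [{"op": "count"}]       # implicit COUNT(*) AS count
    out = list(group_by)                   # top-level `columns` is ignored
    for agg in aggs:
        op = agg["op"].lower()             # ops are case-insensitive
        if op not in AGG_OPS:
            raise ValueError(f"unsupported op: {op}")
        if "col" not in agg and op != "count":
            raise ValueError(f"{op} needs a col")
        out.append(agg.get("alias", op))   # fallback alias = op name (assumption)
    for key in req.get("order_by", []):
        if key["col"] not in out:
            raise ValueError(f"order_by column not in scope: {key['col']}")
    return out


def is_valid(req: dict) -> bool:
    """True when the request passes the checks above."""
    try:
        grouped_output_columns(req)
        return True
    except ValueError:
        return False
```

For the `curl` example above, the function returns `["State", "count", "avg_sev"]`, which matches the keys of each row in the sample response.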
#### Distinct rows
`distinct: true` deduplicates on the projected columns. Useful for
building dropdowns / facet lists.
```bash
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"columns": ["State"],
"distinct": true,
"order_by": [{ "col": "State" }],
"page_size": 100
}'
# → { "data": [ { "State": "AL" }, { "State": "AR" }, ... ], ... }
```
Mutually exclusive with `group_by` / `aggregations` (returns `400` if
combined). Also bypasses the in-memory fast paths.
### `POST /api/v1/datasets/{name}/count`
Returns the number of rows matching `predicates`. Same predicate shape as
`/query`; only the `predicates` field is read. Empty body counts every row.
```bash
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
-H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
-H 'Content-Type: application/json' \
-d '{
"predicates": [
{ "col": "State", "op": "eq", "val": "TX" },
{ "col": "Severity", "op": "gte", "val": 3 }
]
}'
# → { "count": 187423 }
```
On materialised DataFusion datasets the no-predicate path is O(1) (uses the
resident chunk metadata, no scan); indexable predicates short-circuit
through the equality index. Otherwise it runs `SELECT COUNT(*) … WHERE …`
through the engine.
### `POST /api/v1/datasets/{name}/reload` *(admin)*
Rebuilds the dataset from its configured `source` and publishes the new
contents without a server restart. Running queries finish against a
consistent old snapshot; later queries see the new data. If the rebuild
fails, the previously published dataset stays live.
Requires `X-Admin-Token: $ADMIN_TOKEN`. **If `ADMIN_TOKEN` is unset the
endpoint is disabled** — the secure default. The comparison is
constant-time.
```bash
curl -s -X POST \
-H "X-Admin-Token: $ADMIN_TOKEN" \
http://localhost:8080/api/v1/datasets/accidents/reload
# → { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }
```
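The gate described above (disabled when `ADMIN_TOKEN` is unset, constant-time comparison otherwise) looks roughly like this in Python. The real check lives in the Rust core crate; `reload_allowed` is a hypothetical name used only for illustration.

```python
import hmac
from typing import Optional


def reload_allowed(header_value: Optional[str], admin_token: Optional[str]) -> bool:
    """Decide whether a reload request may proceed (False maps to 403)."""
    if not admin_token:        # ADMIN_TOKEN unset: endpoint disabled
        return False
    if header_value is None:   # X-Admin-Token header missing
        return False
    # compare_digest runs in time independent of where the inputs differ.
    return hmac.compare_digest(header_value.encode(), admin_token.encode())
```

A plain `==` would return as soon as the first differing byte is found, which can leak how much of a guessed token is correct. That is the reason for the constant-time comparison.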
| Status | Body | Meaning |
|--------|-----------------------------------------------|------------------------------------------------------|
| `200` | `{ dataset, rows, elapsed_ms }` | New data live. |
| `403` | `{ "error": "forbidden: …" }` | Token missing/wrong, or `ADMIN_TOKEN` not set. |
| `404` | `{ "error": "not found: dataset: …" }` | No such dataset in `datasets.toml`. |
| `500` | `{ "error": "internal error: …" }` | Source read (Parquet / Delta) failed; old data stays live. |
Concurrent reloads of the **same** dataset are serialised (per-name mutex);
reloads of **different** datasets run in parallel.
#### Backend-specific reload semantics
- **DataFusion** uses a service-level double buffer. The backend builds a
fresh `DatasetState` off to the side (parquet/Delta read, Arrow
`RecordBatch` chunks, equality indexes, partition metadata), registers
the new provider, then publishes it with an `ArcSwap` snapshot update.
Queries that already captured the old `Arc` keep running; later queries
see the new state. The old buffers are dropped once the last reader
releases its reference. Trade-off: for materialised datasets, peak RSS
can approach roughly twice the dataset size plus index overhead during
reload.
- **DuckDB** delegates publication to the database engine. Reload runs
`CREATE OR REPLACE TABLE ... AS SELECT ...` against the dataset source.
DuckDB treats that as an ACID transaction over the table/catalog
replacement: if the source read or table creation fails, the existing
table remains live; if it succeeds, later queries see the replacement
atomically. In-flight queries continue against the snapshot they started
with through DuckDB's transaction/MVCC semantics. DataPress then
refreshes only the small cached schema and row-count metadata.
The HTTP contract is the same for both backends: clients observe either
the old dataset or the new dataset, never a partially loaded one. The
resource profile differs: DataFusion owns the Arrow buffers in process;
DuckDB relies on DuckDB's storage engine and buffer manager.
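The "old or new, never partial" contract can be illustrated with a small Python analogue of the double-buffer approach. This is only a sketch of the pattern: `DatasetSlot` is a hypothetical class, it relies on attribute assignment being a single reference swap in CPython, and it stands in for the `ArcSwap`-based publication described for the DataFusion backend.

```python
import threading
from typing import Callable


class DatasetSlot:
    """Holds the currently published snapshot of one dataset."""

    def __init__(self, initial: tuple):
        self._current = initial                # immutable snapshot
        self._reload_lock = threading.Lock()   # serialises reloads of this dataset

    def snapshot(self) -> tuple:
        """Readers capture a reference once and keep using it."""
        return self._current

    def reload(self, build: Callable[[], tuple]) -> bool:
        """Build a fresh snapshot off to the side, then publish it in one step."""
        with self._reload_lock:
            try:
                fresh = build()                # may be slow or may fail
            except Exception:
                return False                   # old snapshot stays live
            self._current = fresh              # single reference swap
            return True
```

A query that called `snapshot()` before the swap keeps working on the old tuple, and the old data is freed once the last such reference goes away. One `DatasetSlot` per dataset gives the per-name serialisation while leaving different datasets free to reload in parallel.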
---
## Examples
```bash
# Discovery
curl -s http://localhost:8080/api/v1/datasets | jq
curl -s http://localhost:8080/api/v1/datasets/accidents/schema | jq
# Equality + range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"columns": ["ID","Severity","City","State","Start_Time"],
"predicates": [
{ "col": "State", "op": "eq", "val": "TX" },
{ "col": "Severity", "op": "gte", "val": 3 }
],
"page": 1, "page_size": 5
}' | jq
# Substring + numeric range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"predicates": [
{ "col": "Description", "op": "ilike", "val": "%fog%" },
{ "col": "Temperature(F)", "op": "lt", "val": 32 }
],
"page_size": 10
}' | jq
# IN list
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
-H 'Content-Type: application/json' \
-d '{
"predicates": [
{ "col": "State", "op": "in", "val": ["NY","NJ","CT"] }
]
}' | jq
```
For a deeper benchmark catalogue (light load + CPU/memory stress tests), see
[`TEST_Q.md`](TEST_Q.md).
---
## Project layout
```
Cargo.toml                           # workspace manifest
pyproject.toml                       # maturin / PyO3 build
crates/
├── core/                            # datapress-core: config, schema, errors, admin
│   └── src/
│       ├── admin.rs                 # X-Admin-Token verification (constant-time)
│       ├── config.rs                # datasets.toml parsing + validation
│       ├── schema.rs                # backend-agnostic schema model
│       ├── models.rs                # Predicate / QueryRequest
│       └── errors.rs                # AppError + actix ResponseError
├── duckdb/                          # datapress-duckdb
│   └── src/
│       ├── lib.rs                   # pub async fn serve(cfg) -> io::Result<()>
│       ├── db.rs                    # Registry: pool + schemas + reload
│       ├── repository.rs            # DatasetRepository (SQL builder)
│       ├── handlers.rs              # actix routes
│       └── bin/datapress-duckdb.rs  # entrypoint binary
├── datafusion/                      # datapress-datafusion
│   └── src/
│       ├── lib.rs                   # pub async fn serve(cfg) -> io::Result<()>
│       ├── store.rs                 # Store: RecordBatch + eq-index + reload
│       ├── handlers.rs              # actix routes
│       └── bin/datapress-datafusion.rs
└── python/                          # datapress (Python wheel, cdylib)
    └── src/lib.rs                   # PyO3 bindings — DataPress, DataPressConfig, ...
```
Core re-exports compile without any backend; each backend crate adds the
feature flag it needs on `datapress-core`. The Python crate depends on both
backends, so the wheel can dispatch between them at runtime based on
`DataPressConfig(backend=...)`.
---
## Build flags
```bash
# DuckDB only
cargo build --release -p datapress-duckdb
# DataFusion only
cargo build --release -p datapress-datafusion
# Both Rust binaries
task build
# Python wheel (compiles both backends into one extension)
task py:develop # editable install into ./.venv (uses uv + maturin)
task py:build # release wheel into ./target/wheels/
```
Release builds use LTO + `codegen-units = 1` (see `[profile.release]` in
`Cargo.toml`). Expect noticeably longer link times in exchange for tighter
inner loops.
---
## Environment variables
| Variable | Default | Purpose |
|-------------------|------------------|----------------------------------------------------------------------------------|
| `DATASETS_CONFIG` | `datasets.toml` | Path to the dataset registry file. |
| `ADMIN_TOKEN` | *(unset)* | Enables `POST /api/v1/datasets/{name}/reload`. Unset = admin endpoints disabled. |
| `DB_POOL_SIZE` | `num_cpus` | DuckDB connection pool size (DuckDB only). |
| `RUST_LOG` | `info` | Standard `env_logger` filter. |
| `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` | *(unset)* | Fallback S3 credentials used by any dataset that doesn't override them. |
| `AWS_REGION` | `us-east-1` | Fallback S3 region. |
| `${PREFIX}_AWS_*` | *(unset)* | Per-dataset overrides for the four `AWS_*` vars above. See "Credential precedence" under `[dataset.s3]`. |
Bind address, port, worker count and backend selection live in `[server]`
in `datasets.toml`, not in env vars.
---
## Status / non-goals
- No authentication or rate limiting on query routes by default. Put the
service behind your own gateway, or build with `--features auth` for JWT
bearer enforcement (see "Authentication (OIDC / OAuth2)"). The `reload`
admin route is gated by a shared-secret header (`X-Admin-Token`) and
disabled unless `ADMIN_TOKEN` is set.
- No write path: dataset sources are read-only. The only mutation is
reloading a dataset from its configured source via the admin route.
- No cursor pagination — pagination is plain `OFFSET / LIMIT`, so deep
pages get expensive (see `H5` in `TEST_Q.md`). `ORDER BY` is supported via
the `order_by` field, but sorted queries always go through the SQL engine
(no in-memory fast path).
- DataFusion backend keeps the whole dataset in memory. DuckDB does not.