{"id":50784314,"url":"https://github.com/av/openpuffer","last_synced_at":"2026-06-12T06:07:20.163Z","repository":{"id":362268441,"uuid":"1257649406","full_name":"av/openpuffer","owner":"av","description":"Stateless S3-backed vector and FTS search (turbopuffer-compatible API)","archived":false,"fork":false,"pushed_at":"2026-06-10T21:19:12.000Z","size":1415,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T23:11:35.998Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/av.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-06-02T22:04:09.000Z","updated_at":"2026-06-10T21:19:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/av/openpuffer","commit_stats":null,"previous_names":["av/openpuffer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/av/openpuffer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/av%2Fopenpuffer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/av%2Fopenpuffer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/av%2Fopenpuffer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/av%2Fopenpuffer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/av","download_url":"https://codeload.github.com/av/openpuffer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/av%2Fopenpuffer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34231270,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-12T06:07:19.436Z","updated_at":"2026-06-12T06:07:20.155Z","avatar_url":"https://github.com/av.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# openpuffer\n\n[![CI](https://github.com/av/openpuffer/actions/workflows/ci.yml/badge.svg)](https://github.com/av/openpuffer/actions/workflows/ci.yml)\n[![version](https://img.shields.io/badge/version-0.2.0-blue)](CHANGELOG.md)\n\nStateless vector and full-text search server backed by **S3-compatible object storage**. 
HTTP API is compatible with [turbopuffer](https://turbopuffer.com/docs) core write/query paths; the **on-disk architecture** follows [turbopuffer’s WAL + index model](https://turbopuffer.com/docs/architecture), not a per-document JSON store.\n\n## How it compares to turbopuffer\n\n| Area | turbopuffer | openpuffer (v1) |\n|------|-------------|-----------------|\n| **Durable layout** | WAL + index segments on object storage | Same: `meta.json`, `wal/{seq}.bin`, `index/*` under `openpuffer/{ns}/` |\n| **Write ACK** | After durable WAL commit | Group-commit buffer → one WAL PUT + `meta.json` CAS per batch |\n| **Indexing** | Async SPFresh-style ANN + FTS | Async background indexer: BM25 FTS, k-means centroids/clusters, attribute filter index |\n| **Query** | Indexed candidates + unindexed WAL tail | `strong` (default); `eventual` skips WAL tail + catch-up on pinned views (sub-10ms warm path) |\n| **Cache** | NVMe + in-process warm | Optional `--cache-dir` disk mirror + `POST …/warm` view pin |\n| **Scale / polish** | Production multi-tenant | Single binary, MinIO integration tests; simplified ANN (two-level k-means, no SPFresh hierarchy) |\n| **API surface** | Full product API | Core write/query/metadata/export/warm; no billing portal or CMEK, and not all v2 edge cases |\n\n**Honest gaps:** no managed cloud, no cross-region replication, ANN is a simplified two-level k-means probe (not production SPFresh), throughput is ~1 WAL commit/s/namespace by default, and filter/FTS merges are simpler than turbopuffer at scale.\n\n**Full comparison (implemented vs missing, when to use which):** [docs/COMPARISON.md](docs/COMPARISON.md).\n\nDesign detail: [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).\n\n## Architecture (high level)\n\n```\n                    ┌─────────────┐\n  POST /v2/...      
│ write       │  group-commit buffer (time / batch size)\n  (upsert/delete)   │ buffer      │\n                    └──────┬──────┘\n                           │ durable ACK\n                           ▼\n              S3: openpuffer/{ns}/\n              ├── meta.json          ← index_cursor, wal_commit_seq, schema\n              ├── wal/\n              │   ├── 00000001.bin   ← [0x01][bincode WalEntry][crc32 LE]\n              │   └── snapshot.bin   ← compaction snapshot (optional)\n              └── index/             ← async indexer\n                  ├── fts-*.bin\n                  ├── {field}/centroids-l0.bin + centroids-l1-*.bin + clusters-*.bin\n                  └── filter-*.bin\n\n  POST /v2/.../query\n       │\n       ├─ load meta + index segments (disk cache if warm)\n       ├─ ANN (L0/L1 probe) / BM25 / hybrid candidate generation\n       ├─ apply filters (intersect before score)\n       └─ score unindexed WAL tail (strong) → top_k\n```\n\nWAL replay verifies CRC on v1 segments; corrupt segments use [`fail` or `skip`](#wal-corrupt-policy) (default `fail`). 
Legacy segments without the `0x01` prefix remain readable.\n\n**Consistency:** writes are visible after `wal_commit_seq` advances; queries under `consistency: \"strong\"` also scan WAL entries with `seq \u003e index_cursor` until the indexer catches up.\n\n## Features\n\n- WAL-backed writes with strong consistency before ACK\n- Background indexer (FTS BM25, vector ANN clusters, attribute filters)\n- Vector ANN, BM25 FTS, hybrid `rank_by` (`Sum` / `Product`)\n- Query filters (`Eq`, `In`, `And`, …), `delete_by_filter`, `patch_by_filter`, `patch_rows`\n- Namespace export at `wal_commit_seq`, warm-cache endpoint\n- Single static binary — no sidecar databases\n\n## Quickstart (local dev with Docker)\n\nRequires [Docker](https://docs.docker.com/get-docker/) for MinIO.\n\n```bash\n./scripts/dev-up.sh      # MinIO on :9000, bucket openpuffer-dev\n./scripts/dev-serve.sh   # build + serve on :8080\n```\n\n**SPFresh v3 index + cold S3 path** (probed cluster fetch, no disk cache):\n\n```bash\nexport OPENPUFFER_ANN_VERSION=3          # v3 index layout at build (default 2)\nexport OPENPUFFER_CACHE_DIR=\"\"           # cold query: S3-only index load (same as --cache-dir \"\")\nexport OPENPUFFER_COLD_S3_CONCURRENCY=32 # parallel GETs per cold sub-batch (default 32)\n# optional: OPENPUFFER_ANN_RERANK=1      # exact re-rank over probed clusters (higher recall, larger candidate pool)\n\n./scripts/dev-serve.sh\n```\n\nAfter upserts and `index_cursor == wal_commit_seq`, vector queries report `performance.storage_roundtrips`, `cold_s3_keys_fetched`, and `ann_probed_clusters`. 
See [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for probe/cold tuning.\n\n**Benchmarks / performance (MinIO vs turbopuffer scaling):** [benchmarks/OP_VS_TPUF.md](benchmarks/OP_VS_TPUF.md) (one-page verdict); full report — [docs/reports/BENCHMARK_VS_TURBOPUFFER_SCALING_2026-06-04.md](docs/reports/BENCHMARK_VS_TURBOPUFFER_SCALING_2026-06-04.md); reproduce with `make bench-compare-tpuf` (offline gate: `./scripts/verify-op-scaling-comparison.sh`).\n\n### Large-dataset comparison harness\n\nApples-to-apples **openpuffer vs turbopuffer** on a shared synthetic workload (L1 default: **100k × 128-dim**). Workloads, result JSON, and operator scripts: [benchmarks/README.md](benchmarks/README.md). Program plan: [docs/PLAN_LARGE_DATASET_BENCHMARK.md](docs/PLAN_LARGE_DATASET_BENCHMARK.md).\n\n**Operator handoff (offline harness complete, live G3–G5 pending):** [docs/reports/LARGE_DATASET_HARNESS_HANDOFF.md](docs/reports/LARGE_DATASET_HARNESS_HANDOFF.md).\n\n**Milestone tag** `large-dataset-harness-v1` (`c7f66a3`, annotated 2026-06-04) — offline harness only; measured `large-aws-l1.json` / `tpuf-l1.json` still require EC2 + AWS S3 + `TURBOPUFFER_API_KEY`:\n\n```bash\ngit fetch --tags\ngit checkout large-dataset-harness-v1   # or: git show large-dataset-harness-v1\n```\n\n**Makefile targets** (same as `./scripts/verify-large-benchmark-program.sh` and operator preflights):\n\n| Target | Purpose |\n|--------|---------|\n| `make bench-verify` | Offline harness gate — pytest, schemas, L1–L3 dry-runs, `@spec` facts (CI dispatch) |\n| `make bench-dry-run` | Harness dry-run only — no pytest/cargo/facts; no cloud spend |\n| `make bench-g2-minio` | Optional G2 MinIO correctness gates (Docker; slow) |\n| `make bench-preflight` | G3+G4+overlap preflights (offline default) |\n\n```bash\nmake bench-verify                                    # before any cloud spend\nmake bench-verify VERIFY_FLAGS=\"--with-g2\"           # + MinIO G2 (Docker parity with CI)\nmake bench-dry-run                       
            # L1–L3 script dry-runs only\nmake bench-g2-minio                                  # G2 only (faster iteration)\nmake bench-preflight                                 # offline cost/deps/overlap checks\nmake bench-preflight PREFLIGHT_FLAGS=\"--live --tier l1\"   # EC2 live preflight\n```\n\nEC2 live runs, artifact `git add -f` policy, and post-live `@spec` activation: [benchmarks/README.md](benchmarks/README.md), [benchmarks/OPERATOR_RUNBOOK_QUICK.md](benchmarks/OPERATOR_RUNBOOK_QUICK.md).\n\n**Recall API** (ANN vs exhaustive on indexed namespace):\n\n```bash\ncurl -s -X POST \"http://127.0.0.1:8080/v1/namespaces/my-ns/recall\" \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"num\": 5, \"top_k\": 10, \"vector_field\": \"embedding\"}'\n# → {\"avg_recall\":0.9,\"avg_ann_count\":...,\"avg_exhaustive_count\":...}\n```\n\nSmoke test:\n\n```bash\ncurl -s http://127.0.0.1:8080/health\ncurl -s http://127.0.0.1:8080/v1/ready\ncurl -s \"http://127.0.0.1:8080/health?deep=1\"\n```\n\nMinIO console: http://127.0.0.1:9001 (`minioadmin` / `minioadmin`). 
Stop storage with `docker compose down` from the repo root.\n\n## Build\n\n```bash\ncargo build --release\n```\n\n## Run\n\nCreate your bucket first (or use `./scripts/dev-up.sh`), then:\n\n```bash\nopenpuffer serve \\\n  --listen 0.0.0.0:8080 \\\n  --s3-endpoint http://127.0.0.1:9000 \\\n  --s3-bucket openpuffer-dev \\\n  --s3-access-key minioadmin \\\n  --s3-secret-key minioadmin\n```\n\n### Configuration\n\n| Flag / env | Purpose |\n|------------|---------|\n| `--s3-endpoint`, `OPENPUFFER_S3_ENDPOINT` | S3 API URL |\n| `--s3-bucket`, `OPENPUFFER_S3_BUCKET` | Bucket name |\n| `--s3-region`, `OPENPUFFER_S3_REGION` | Region (default `us-east-1`) |\n| `--s3-access-key` / `--s3-secret-key` | Credentials |\n| `--cache-dir`, `OPENPUFFER_CACHE_DIR` | Index segment disk cache (default `/tmp/openpuffer-cache`; `\"\"` = memory-only / **cold S3 path**) |\n| `OPENPUFFER_COLD_MAX_KEYS_PER_ROUND` | Max S3 keys per cold round sub-batch (default 128) |\n| `OPENPUFFER_COLD_S3_CONCURRENCY` | In-flight parallel `GetObject` per cold sub-batch (default 32) |\n| `OPENPUFFER_WRITE_MAX_DELAY_MS` | Group-commit delay (default 1000) |\n| `OPENPUFFER_WRITE_MAX_BATCH_OPS` | Max ops per WAL batch (default 512) |\n| `OPENPUFFER_MAX_PINNED_NAMESPACES` | In-process warm view LRU (default 32) |\n| `--wal-corrupt-policy`, `OPENPUFFER_WAL_CORRUPT_POLICY` | WAL replay on CRC mismatch: `fail` (default) or `skip` — see [WAL corrupt policy](#wal-corrupt-policy) |\n| `--ann-version`, `OPENPUFFER_ANN_VERSION` | Index format: `2` (default) or `3` (SPFresh routing + L2 splits) |\n| `--ann-coarse-probe` / `--ann-fine-probe`, `OPENPUFFER_ANN_COARSE_PROBE` / `OPENPUFFER_ANN_FINE_PROBE` | ANN L0/L1 probe counts at index build (defaults 4 / 2) |\n| `--ann-rerank`, `OPENPUFFER_ANN_RERANK` | Exact re-rank over probed ANN pool (`1`/`true`; default off) |\n\n### Prometheus metrics\n\nBuild with `--features metrics` (`cargo build --release --features metrics`). 
The server exposes **`GET /metrics`** (Prometheus text).\n\nCold/ANN counters (increment on probed cold loads and vector ANN queries):\n\n| Metric | Meaning |\n|--------|---------|\n| `openpuffer_cold_s3_keys_fetched` | S3 object keys fetched on cold batch plans (each parallel sub-batch key counts once) |\n| `openpuffer_ann_probed_clusters` | Cluster segments selected by ANN probe planning per vector query |\n| `openpuffer_ann_probe_clamp_total` | Times on-disk probe widths were clamped to `OPENPUFFER_ANN_MAX_PROBE_CLUSTERS` at query time |\n\nAlso exported: `openpuffer_wal_commits_total`, `openpuffer_index_lag_segments`, `openpuffer_s3_get_total`, `openpuffer_query_duration_seconds`, `openpuffer_cold_query_duration_seconds`.\n\nPer-query JSON (`POST …/query`) reports the same cold signals as `performance.cold_s3_keys_fetched` and `performance.ann_probed_clusters` without the metrics feature.\n\n### Debug endpoints (integration builds)\n\nBuilt with `--features integration` (default for `./scripts/run-integration-s3.sh`). Not for production exposure.\n\n| Method | Path | Body | Purpose |\n|--------|------|------|---------|\n| GET | `/v1/debug/cache-stats` | — | `s3_get_count` from segment cache |\n| POST | `/v1/debug/cache-stats/reset` | — | Reset `s3_get_count` |\n| POST | `/v1/debug/namespaces/{name}/cold-plan` | Same JSON as query (`rank_by`, optional `consistency`) | Preview [`plan_cold_query`](src/s3_batch.rs): per-round key counts, `storage_roundtrips` estimate, per-field probe plan — **does not run the query** (fetches `meta.json` and L0 only when vector probes are present) |\n\nExample (cold cache, indexed namespace):\n\n```bash\ncurl -sS -X POST \"http://127.0.0.1:8080/v1/debug/namespaces/my-ns/cold-plan\" \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"rank_by\":[\"vector\",\"ANN\",\"embedding\",[1,0,0]],\"consistency\":\"eventual\"}' | jq .\n```\n\n### WAL corrupt policy\n\nv1 WAL segments on S3 use `[0x01][bincode WalEntry][crc32 LE]`. 
On replay, openpuffer verifies the CRC over the payload. If a segment is truncated, tampered, or has a bad checksum:\n\n| Policy | Flag / env | Behavior |\n|--------|------------|----------|\n| **`fail`** (default) | `--wal-corrupt-policy fail` or `OPENPUFFER_WAL_CORRUPT_POLICY=fail` | Namespace load aborts; queries return **500** with a turbopuffer-style `{\"error\":\"…\",\"status\":\"error\"}` mentioning corrupt WAL |\n| **`skip`** | `--wal-corrupt-policy skip` or `OPENPUFFER_WAL_CORRUPT_POLICY=skip` | Log the corrupt segment and **continue** replay; earlier segments stay applied; the corrupt segment’s writes are invisible |\n\nSet the policy at **process start** (`openpuffer serve`). Legacy WAL blobs without the `0x01` version byte still replay (no CRC on those segments).\n\n**Example (recovery after partial upload):**\n\n```bash\nopenpuffer serve \\\n  --wal-corrupt-policy skip \\\n  --s3-endpoint http://127.0.0.1:9000 \\\n  --s3-bucket openpuffer-dev \\\n  ...\n```\n\nIntegration coverage: `corrupt_wal_segment_on_minio_fail_and_skip_policies` in `tests/integration_s3.rs` (flips one CRC byte on S3, asserts fail → 500 and skip → doc from seq 1 only).\n\n### Operations guide\n\n1. **Cold start** — point at bucket; first write creates `meta.json` + `wal/00000001.bin`.\n2. **Indexing lag** — check `GET /v1/namespaces/{name}`: `index_cursor` should reach `wal_commit_seq`. Queries still return recent writes via WAL tail under strong consistency.\n3. **Warm a hot namespace** — `POST /v1/namespaces/{name}/warm` prefetches index objects and pins an in-memory view (fewer S3 round-trips on the same process).\n4. **Export** — `GET /v1/namespaces/{name}/export?limit=10000\u0026last_id=…` (or POST with JSON body) for a consistent snapshot at `wal_commit_seq`.\n5. **Multi-instance** — any number of stateless `serve` processes can share one bucket; per-namespace writes serialize via S3 CAS + in-process commit lock.\n6. 
**Restart** — no local durable state required; replay WAL from S3 on first query.\n\n## API\n\n| Method | Path | Purpose |\n|--------|------|---------|\n| GET | `/health` | Liveness (always `ok` unless `?deep=1`) |\n| GET | `/v1/ready` | Traffic readiness — S3 configured and reachable (`503` if not) |\n| GET | `/metrics` | Prometheus scrape (`--features metrics`) |\n| GET | `/v1/namespaces` | List namespaces + metadata |\n| GET | `/health?deep=1` | S3 probe (`HeadBucket` + `openpuffer/` read); `degraded` if down |\n| GET | `/v1/namespaces/{name}` | `approx_row_count`, `index_status`, `unindexed_bytes`, cursors |\n| GET/POST | `/v1/namespaces/{name}/export` | Paginated export (`last_id`, `limit`, `format=ndjson`) |\n| POST | `/v1/namespaces/{name}/warm` | Prefetch index + pin view |\n| POST | `/v1/namespaces/{name}/recall` | ANN vs exhaustive recall@k (`num`, `top_k`, optional `filters`, `vector_field`); response `avg_recall`, `avg_ann_count`, `avg_exhaustive_count` |\n| POST | `/v2/namespaces/{name}` | Write (upsert, patch, delete, `delete_by_filter`, `patch_by_filter`, `schema`) |\n| POST | `/v2/namespaces/{name}/query` | Vector / FTS / hybrid / filtered search |\n| DELETE | `/v2/namespaces/{name}` | Delete namespace prefix |\n\nQuery responses include `performance` (`candidates`, `candidates_ratio`, `exhaustive_search_count`, …) and optional headers `X-Openpuffer-Candidates`, `X-Openpuffer-Candidates-Fraction`.\n\n## Test\n\n### Test matrix\n\n| Suite | Command | Count | Docker | Notes |\n|-------|---------|-------|--------|-------|\n| **Unit** | `cargo test` | ~158 | No | Library + WAL/index logic |\n| **Integration (MinIO)** | `cargo test -F integration` | **51** | Yes | `tests/integration_s3.rs` — testcontainers MinIO; S3 Head/List/Get + WAL decode |\n| **Perf** | `cargo test -F perf` | 1 | Yes | 5k-doc ANN `candidates_ratio` regression |\n| **External S3** | `cargo test -F integration --test integration_external_s3 -- --ignored` | 1 | Optional | Compose MinIO 
or `OPENPUFFER_TEST_S3_*` |\n| **Large stress** | `cargo test --release -F large_stress --test stress_50k -- --ignored` | 3 | Yes | 50k warm + v3 cold probed mid-tier; not in default CI |\n\nTypical dev run (unit + integration + perf):\n\n```bash\ncargo test                              # unit (~158), no Docker\ncargo build --features integration      # build server binary for integration harness\ncargo test -F integration               # 51 MinIO scenarios (~60–70s)\ncargo test -F perf                      # ANN candidates_ratio on 5k docs\n```\n\n**S3 integration (requires Docker) — recommended:**\n\n```bash\n./scripts/run-integration-s3.sh\n```\n\nBuilds with `--features integration` and runs all **51** `integration_s3` tests against **real MinIO** (testcontainers). Tests assert **Head/List/Get** on `meta.json`, `wal/`, and `index/` (decode WAL, segment growth, copy key parity) — not HTTP-only mocks.\n\n### Optional 50k namespace stress (`large_stress`)\n\nNot part of the default matrix — `#[ignore]` so `cargo test -F integration` stays fast. **Nightly CI:** [`.github/workflows/nightly-stress.yml`](.github/workflows/nightly-stress.yml) (03:00 UTC + manual `workflow_dispatch`).\n\n```bash\ncargo build --release --features large_stress\ncargo test --release -F large_stress --test stress_50k -- --ignored --nocapture\n```\n\nUpserts **50k** docs in **5×10k** `upsert_columns` batches with **~1.1s** spacing (WAL rate limit), waits for `index_cursor == wal_commit_seq` (300s wall timeout). Tests:\n\n- `fifty_thousand_docs_indexed_query` — v2 default, warm ANN `candidates_ratio \u003c 0.2`\n- `fifty_thousand_docs_v3_cold_probed_validation` — `--ann-version 3`, strong cold probed path: `storage_roundtrips ≤ 4`, `recall@10 ≥ 0.86`, `candidates_ratio \u003c 0.2` (see [docs/BENCHMARKS.md](docs/BENCHMARKS.md) mid-tier)\n- `v3_cold_probed_wiring_at_2k` — fast 2k wiring for the same cold metrics\n\n**Use `--release`** — debug builds may not index 50k within 300s. 
On a typical dev machine (release): warm stress **~40–45s**; v3 cold gate **~1–2 min** including recall probes.\n\n### Testing against real S3\n\nThree ways to hit a **real** S3-compatible endpoint (MinIO or AWS) — not mocks:\n\n| Mode | Command | Backend |\n|------|---------|---------|\n| **Default** | `./scripts/run-integration-s3.sh` | Ephemeral MinIO (testcontainers) |\n| **Compose MinIO** | `./scripts/run-integration-s3.sh external` | `docker-compose.test.yml` on `:9000` |\n| **Your bucket** | Set `OPENPUFFER_TEST_S3_*` env vars | Any S3-compatible API |\n\n**Compose external tests** (starts MinIO if `:9000` is not already healthy, creates `openpuffer-integration` bucket):\n\n```bash\n./scripts/run-integration-s3.sh external\n```\n\n**Manual env** (same variables the script sets; use for CI or a shared MinIO/AWS bucket):\n\n```bash\nexport OPENPUFFER_TEST_S3_ENDPOINT=http://127.0.0.1:9000\nexport OPENPUFFER_TEST_S3_BUCKET=openpuffer-integration\nexport OPENPUFFER_TEST_S3_ACCESS_KEY=minioadmin\nexport OPENPUFFER_TEST_S3_SECRET_KEY=minioadmin\n\ncargo test -F integration --test integration_external_s3 -- --ignored\n```\n\n**Serve against the same bucket** (after `external` or with your own endpoint):\n\n```bash\nexport OPENPUFFER_S3_ENDPOINT=http://127.0.0.1:9000\nexport OPENPUFFER_S3_BUCKET=openpuffer-integration\nexport OPENPUFFER_S3_ACCESS_KEY=minioadmin\nexport OPENPUFFER_S3_SECRET_KEY=minioadmin\n./scripts/dev-serve.sh\n```\n\nStop the test compose stack: `docker compose -f docker-compose.test.yml down`.\n\n### What integration tests assert on S3\n\n- **Head/List/Get** on `meta.json`, `wal/{seq:08}.bin`, and `index/*` (not HTTP-only)\n- **Decode** bincode `WalEntry` from `wal/*.bin` and compare doc ids to HTTP export\n- **Index layout**: `fts-*.bin`, `filter-*.bin`, `centroids-l0.bin`, `centroids-l1-*.bin` (non-zero size)\n- **Incremental growth**: FTS/filter segment sizes or `fts_segment_ids` / `filter_segment_ids` chains grow after a second WAL batch 
(`s3_fts_and_filter_segments_grow_on_minio`)\n- **Copy parity**: `copy_from_namespace` duplicates every source key under the dest prefix (`s3_copy_from_namespace_duplicates_all_keys`)\n- **No** legacy `docs/{id}.json` or `manifest.json`\n\n## License\n\nMIT OR Apache-2.0","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fav%2Fopenpuffer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fav%2Fopenpuffer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fav%2Fopenpuffer/lists"}