https://github.com/hallelx2/vectorless-bench
An advanced benchmarking suite for the Vectorless reasoning-based RAG engine
- Host: GitHub
- URL: https://github.com/hallelx2/vectorless-bench
- Owner: hallelx2
- Created: 2026-05-26T21:41:47.000Z (25 days ago)
- Default Branch: main
- Last Pushed: 2026-05-27T00:30:29.000Z (25 days ago)
- Last Synced: 2026-05-27T01:21:51.526Z (25 days ago)
- Language: Python
- Size: 69.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# vectorless-bench
An advanced benchmarking suite for [Vectorless](https://vectorless.store) — the
reasoning-based ("vectorless") RAG engine where **the LLM is the retriever**, not
an embedding model.
It exists to turn the engine's claims (deterministic, citation-exact, accurate on
specialized domains, no vector DB) into **numbers you can defend**, measured head
to head against the systems Vectorless is positioned against.
```bash
# runs today, no API keys, no services — proves the harness end to end
pip install -e .
vlbench run --config configs/smoke.yaml
```
---
## Why benchmarking *this* engine is different
Standard RAG benchmarks assume retrieval is free and instant, so they only score
accuracy. Vectorless retrieves by **calling an LLM over a document map**, so every
query has a real **token cost** and **latency**. That single fact reframes the
whole exercise:
> The headline metric is not precision@k. It is **quality per dollar** and
> **quality per second** — the efficiency frontier.
A system that wins on F1 while costing 50× as much is not a win. `vlbench` puts quality,
cost, and latency in the **first table of every report** so that trade-off is
impossible to hide.
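As a rough illustration of the frontier idea, the sketch below computes quality per dollar, quality per second, and a quality-versus-cost Pareto frontier. The `SystemSummary` type, its field names, and the example numbers are all hypothetical and are not taken from `vlbench`; the report may define these quantities differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SystemSummary:
    """Hypothetical per-system aggregate; field names are illustrative only."""
    name: str
    mean_f1: float          # mean retrieval F1 over all questions
    total_cost_usd: float   # total spend across the run
    total_latency_s: float  # summed wall-clock retrieval time


def quality_per_dollar(s: SystemSummary) -> float:
    """Mean F1 per dollar spent; a zero-cost system is reported as infinite."""
    return s.mean_f1 / s.total_cost_usd if s.total_cost_usd else float("inf")


def quality_per_second(s: SystemSummary) -> float:
    """Mean F1 per second of retrieval time."""
    return s.mean_f1 / s.total_latency_s if s.total_latency_s else float("inf")


def pareto_frontier(systems: List[SystemSummary]) -> List[str]:
    """Names of systems that no other system beats on both quality and cost."""
    def dominates(x: SystemSummary, y: SystemSummary) -> bool:
        no_worse = x.mean_f1 >= y.mean_f1 and x.total_cost_usd <= y.total_cost_usd
        strictly_better = x.mean_f1 > y.mean_f1 or x.total_cost_usd < y.total_cost_usd
        return no_worse and strictly_better

    return [s.name for s in systems if not any(dominates(o, s) for o in systems)]


# Made-up numbers, purely to exercise the functions.
systems = [
    SystemSummary("a", mean_f1=0.80, total_cost_usd=5.00, total_latency_s=100.0),
    SystemSummary("b", mean_f1=0.70, total_cost_usd=0.10, total_latency_s=50.0),
    SystemSummary("c", mean_f1=0.60, total_cost_usd=1.00, total_latency_s=60.0),
]
frontier = pareto_frontier(systems)
```

Here `c` drops off the frontier because `b` is both more accurate and cheaper, while `a` stays on it: it costs more but is also the most accurate.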
Three things in the engine's own code shape the methodology (and silently corrupt
naive benchmarks):
1. **Section IDs are random `sec_`-prefixed values**, regenerated on every ingest. Gold
labels therefore can't be IDs; they are *stable anchors* (heading path /
answer span / page) resolved against whatever each system returns. See
[`anchors.py`](src/vectorless_bench/anchors.py).
2. **Caching zeroes cost.** Both the llmgate cache and the retrieval cache return
`cost_usd=0` on a hit. Fair cost/latency requires a **cold cache** — run the
server with `retrieval.cache.enabled=false`. The run manifest records the
declared cache mode.
3. **Determinism is a claim, not a guarantee.** A temperature of 0 reduces but doesn't
eliminate provider nondeterminism, so `vlbench` *measures* it (rerun the same
query N times, report set-stability) instead of assuming it.
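The rerun measurement in point 3 could be sketched as follows. This is a minimal illustration, assuming each rerun yields a list of stable keys (for example heading paths, since section IDs change per ingest); the function names are hypothetical and the actual aggregation in `vlbench` may differ.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_stability(runs: List[List[str]]) -> Dict[str, float]:
    """Compare every pair of reruns of the same query.

    Returns the fraction of pairs with identical retrieved sets and the
    mean pairwise Jaccard similarity. Order within a run is ignored.
    """
    sets = [frozenset(r) for r in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:  # zero or one run: trivially stable
        return {"exact_match": 1.0, "mean_jaccard": 1.0}
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    mean_j = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return {"exact_match": exact, "mean_jaccard": mean_j}


# Three reruns of one query; the third swaps one section.
stability = set_stability([
    ["Item 7 > Liquidity", "Item 8 > Cash Flows"],
    ["Item 7 > Liquidity", "Item 8 > Cash Flows"],
    ["Item 7 > Liquidity", "Item 1A > Risk Factors"],
])
```

Reporting both numbers is useful because exact match is strict (one swapped section fails the pair), while Jaccard shows how far apart the sets actually are.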
---
## What it measures — seven axes
| Axis | Metrics | What it tells you |
|---|---|---|
| **Retrieval quality** | precision/recall/F1@k, MRR, nDCG, hit@k | Did it fetch the right section? |
| **Citation exactness** | span-in-top1, **path-correct@1** | Can it point at the exact passage/heading? |
| **Near-miss** | sibling near-miss rate | Did it grab the *wrong fiscal year / wrong drug* (the vector failure mode)? |
| **Cost** | $/query, tokens/query, calls/query, **$/correct**, **quality per $1k** | The price of being right |
| **Latency** | p50 / p95 / p99, ingest time | Cold-cache, end to end |
| **Determinism** | exact-match + mean Jaccard across reruns | Is the published determinism claim real? |
| **Robustness** | abstention on no-answer, by-domain, by-answer-type | Does it over-retrieve when the answer isn't there? |
`path-correct@1` and `near-miss` are **structural** metrics: chunk systems (vector
RAG, BM25) score 0 on path-correctness by construction — that gap *is* the
differentiator the whitepaper argues for, made measurable.
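For reference, the retrieval-quality metrics in the table have standard textbook definitions, sketched below with binary relevance. This assumes retrieved items have already been mapped to stable keys and that a ranking contains no duplicates; it is not the code in `metrics/retrieval.py`.

```python
import math
from typing import List, Set, Tuple


def precision_recall_f1_at_k(ranked: List[str], gold: Set[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 over the top-k ranked items."""
    hits = sum(1 for item in ranked[:k] if item in gold)
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def hit_at_k(ranked: List[str], gold: Set[str], k: int) -> bool:
    """True if any gold item appears in the top k."""
    return any(item in gold for item in ranked[:k])


def reciprocal_rank(ranked: List[str], gold: Set[str]) -> float:
    """1 / rank of the first gold item; MRR is the mean of this over questions."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in gold
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["x", "a", "y", "b"]
gold = {"a", "b"}
p, r, f1 = precision_recall_f1_at_k(ranked, gold, k=3)
```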
---
## Systems compared
| System | What it is | Deps |
|---|---|---|
| `vectorless` | the engine under test, via the Python SDK | `vectorless-sdk` + a running server |
| `vector_rag` | pgvector + OpenAI embeddings + cosine top-k (the ROADMAP baseline) | `[vector]` + Postgres/pgvector |
| `pageindex` | the real upstream [PageIndex](https://github.com/VectifyAI/PageIndex) — *their* tree builder (`page_index`/`md_to_tree`) + their reasoning retrieval, priced on our table | clone of PageIndex + `[llm]` |
| `full_context` | stuff the whole doc in the prompt — the quality **ceiling** + cost worst case | `[llm]` |
| `bm25` | lexical floor; free, no API, strong on exact-term lookups | `[bm25]` |
| `mock` | deterministic fake for harness CI — no services | none |
All LLM-using systems are priced from the **same table** the engine uses
([`pricing.py`](src/vectorless_bench/pricing.py), mirrored from `llmgate/pricing`),
so cost is apples-to-apples. Each baseline is a *fair* representative (standard
chunking, optional reranker hook), not a strawman.
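A shared price table might look roughly like the sketch below. The model names and rates are placeholders invented for illustration, and `Price`, `cost_usd`, and `price_fingerprint` are hypothetical names, not the contents of `pricing.py`. The fingerprint helper shows one plausible way to record which price book a run used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (placeholder structure)."""
    input_per_mtok: float
    output_per_mtok: float


# Placeholder entries: not real models and not real prices.
PRICE_BOOK: Dict[str, Price] = {
    "example-small": Price(input_per_mtok=1.0, output_per_mtok=2.0),
    "example-large": Price(input_per_mtok=10.0, output_per_mtok=30.0),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int, cache_hit: bool = False) -> float:
    """Price one LLM call; a cache hit is free, which is why runs need a cold cache."""
    if cache_hit:
        return 0.0
    price = PRICE_BOOK[model]
    return (input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok) / 1_000_000


def price_fingerprint(book: Dict[str, Price]) -> str:
    """Stable SHA-256 over the price book, independent of dict ordering."""
    payload = json.dumps({name: asdict(p) for name, p in book.items()}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


call_cost = cost_usd("example-large", input_tokens=1000, output_tokens=500)
```

Because every system is priced through the same function and table, a difference in reported cost reflects token usage rather than a difference in price assumptions.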
---
## Datasets
- **`fixtures`** — a tiny in-repo curated set (finance + medicine) with stable
anchors, a no-answer item, and a sibling near-miss trap. Seeds the
"curated golden set" and powers the smoke test. Runs in seconds.
- **`financebench`** — the public 150-question [FinanceBench](https://github.com/patronus-ai/financebench)
set over real 10-Ks. QA loads from HuggingFace; fetch the source PDFs with
`python scripts/download_financebench.py`. Questions whose document text is
missing are skipped (not failed), so a partial corpus still produces a valid run.
Add your own by subclassing `Dataset` (see
[`datasets/base.py`](src/vectorless_bench/datasets/base.py)) and emitting
`Question`s with `GoldAnchor`s. The only rule: **gold is stable anchors, never
engine IDs.**
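To illustrate why anchors survive re-ingestion while IDs do not, here is a minimal matching sketch. The `Anchor` and `Retrieved` types are hypothetical stand-ins (the real `GoldAnchor` fields may differ), and treating a match on any one of heading path, answer span, or page as sufficient is an assumption of this example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


def _norm(text: str) -> str:
    """Lenient surface form: collapse whitespace and ignore case."""
    return re.sub(r"\s+", " ", text).strip().lower()


@dataclass
class Anchor:
    """Hypothetical stand-in for a gold anchor."""
    heading_path: Optional[Tuple[str, ...]] = None
    answer_span: Optional[str] = None
    page: Optional[int] = None


@dataclass
class Retrieved:
    """Hypothetical stand-in for one retrieved section."""
    section_id: str  # engine-specific and unstable, so never used for matching
    heading_path: Tuple[str, ...]
    text: str
    page: Optional[int] = None


def matches(anchor: Anchor, item: Retrieved) -> bool:
    """True if the retrieved section satisfies any populated anchor field."""
    if anchor.heading_path is not None:
        gold_path = tuple(_norm(h) for h in anchor.heading_path)
        if gold_path == tuple(_norm(h) for h in item.heading_path):
            return True
    if anchor.answer_span and _norm(anchor.answer_span) in _norm(item.text):
        return True
    if anchor.page is not None and anchor.page == item.page:
        return True
    return False


gold = Anchor(answer_span="net cash was $1.2 billion")
first_ingest = Retrieved("sec_aaa", ("Item 7", "Liquidity"), "Net  cash was $1.2 billion.")
second_ingest = Retrieved("sec_bbb", ("Item 7", "Liquidity"), "Net cash was $1.2 billion.")
```

Both ingests match the same gold anchor even though their section IDs differ, which is the property the rule above is protecting.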
---
## Running the real benchmark (FinanceBench)
```bash
pip install -e ".[all]"
cp .env.example .env # fill in keys + DSN
# 1. fetch the source filings
python scripts/download_financebench.py
# 2. start a Vectorless server with caches OFF (fair cold-cache), then:
vlbench run --config configs/financebench.yaml
# 3. re-render the report from raw records any time
vlbench report runs/ --k 5
```
Each run writes a self-contained directory:
```
runs//
records.jsonl one scored (system, question, repeat) row each
results.json aggregated per-system summary
report.md the human report (frontier + per-axis tables)
report.html self-contained HTML report (frontier scatter + tables) — open this
pareto.csv quality vs cost vs latency, for plotting
setup.json per-system ingest time + cost
manifest.json repro: git sha, models, price fingerprint, cache mode, seed
```
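Since `records.jsonl` holds one row per scored retrieval, the raw records can be post-processed independently of the built-in report. The sketch below computes nearest-rank latency percentiles per system; the field names `system` and `latency_s` are assumptions about the record schema, which is not documented here.

```python
import json
import math
from pathlib import Path
from typing import Dict, List, Union


def percentile(values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_system(records_path: Union[str, Path]) -> Dict[str, Dict[str, float]]:
    """Group latencies by system and report p50 / p95 / p99 for each.

    Assumes each JSON line has "system" and "latency_s" keys (hypothetical).
    """
    by_system: Dict[str, List[float]] = {}
    with Path(records_path).open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            by_system.setdefault(record["system"], []).append(record["latency_s"])
    return {
        name: {f"p{q}": percentile(vals, q) for q in (50, 95, 99)}
        for name, vals in by_system.items()
    }
```

With few samples per system, p95 and p99 collapse onto the maximum, so tail latencies are only meaningful on larger runs.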
---
## Running on a VM (bundle → run → view)
Real runs are long and need keys + Postgres, so the supported path is a Docker
bundle you run on a cloud VM, with results shipped to GCS for viewing. One command
runs the benchmark and a second fetches the results:
```bash
PROJECT= BUCKET=gs:// ./deploy/gcp/run_on_gce.sh # provision, run, upload, delete VM
BUCKET=gs:// RUN_ID= ./deploy/gcp/fetch_results.sh # download + open report.html
```
Or locally with Docker (bundles pgvector + the real PageIndex repo):
```bash
docker compose build
docker compose run --rm --entrypoint python bench scripts/download_financebench.py
docker compose run --rm bench run --config configs/financebench.yaml --out /results --limit 10
```
Full details and prerequisites: [`deploy/README.md`](deploy/README.md).
---
## Validity controls (what makes the numbers credible)
- **Cost is never reported alone** — always beside quality, plus `$/correct`.
- **Cold-cache mode** is enforced or declared, and always recorded in the manifest.
- **Gold defined independently** of any system's output (human/strong-model, then
verified), and matched leniently on surface form, strictly on substance.
- **Retrieval quality scored separately from answer quality** — Vectorless is a
retriever, so retrieval is the primary axis; the optional LLM-judge answer axis
uses one judge model for all systems, blind to which system produced the answer.
- **Determinism uses real reruns**, not an assumption.
- **Bootstrap CIs** on primary quality so A-vs-B gaps come with uncertainty.
- **Reproducibility manifest** on every run, including a price-book fingerprint.
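The bootstrap control above can be illustrated with a paired percentile bootstrap on per-question score differences. This is a sketch under assumptions: the pairing by question, the resample count, and the function name are choices made for this example, not a description of how `vlbench` computes its intervals.

```python
import random
from typing import List, Tuple


def bootstrap_diff_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean of (A - B) per question with a percentile bootstrap interval.

    Both lists must be aligned so index i refers to the same question.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample questions with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Made-up per-question F1 scores for two systems.
mean_diff, lower, upper = bootstrap_diff_ci(
    [1.0, 0.5, 1.0, 0.0, 1.0, 0.5],
    [0.5, 0.5, 0.0, 0.0, 1.0, 0.0],
)
```

If the interval includes zero, the observed gap between the two systems is not distinguishable from resampling noise at the chosen level.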
---
## Architecture
```
src/vectorless_bench/
schema.py dataclasses (Question, GoldAnchor, RetrievalResult, Usage, ...)
pricing.py engine-mirrored price book + token counting
anchors.py gold matching: the single definition of "right thing retrieved"
metrics/ retrieval.py · citation.py · aggregate.py (+ score_question)
retrievers/ base + registry + vectorless, vector_rag, pageindex,
full_context, bm25, mock
datasets/ base + fixtures + financebench
judge.py optional LLM-as-judge answer axis
runner.py orchestrator -> records.jsonl + manifest
report.py records -> results.json + report.md + report.html + pareto.csv
cli.py `vlbench run | report | systems`
Dockerfile, docker-compose.yml the runnable bundle (+ pgvector + real PageIndex)
deploy/gcp/ run_on_gce.sh · fetch_results.sh · startup-script.sh
```
Core (schema, pricing, anchors, metrics, runner, report, mock, fixtures) has
**zero third-party dependencies** and is fully unit-tested; everything that needs
a network or heavy library is an optional extra, imported lazily by the system
that uses it. `pytest` runs the whole harness, including a real end-to-end run,
with no keys.
### Extending
- **New retriever:** implement `setup(corpus)` + `retrieve(question, k, cold) ->
RetrievalResult`, then `register("name", factory)` in `retrievers/registry.py`.
- **New dataset:** subclass `Dataset`, return a corpus + gold-anchored questions.
- **New metric:** add to `metrics/` and surface it in `report.py`.
---
## Roadmap
This closes the deferred item in the engine's own `ROADMAP.md`
("Benchmarks vs. traditional RAG … publish in benchmarks/README.md"):
- [x] **Phase 0** — harness, metrics, mock + fixtures, cold-cache control, reports (this repo)
- [x] **Phase 1** — vector RAG, BM25, full-context, real-PageIndex baselines; cost/latency frontier
- [x] **Phase 1.5** — Docker bundle + GCE run-and-view-results flow; HTML report
- [ ] **Phase 2** — full FinanceBench run on a VM with a live engine
- [ ] **Phase 3** — curated finance/law/medicine golden set; CUAD + multi-hop; CI regression gate; published leaderboard
## License
MIT.