{"id":50391011,"url":"https://github.com/hallelx2/vectorless-bench","last_synced_at":"2026-05-30T18:01:35.637Z","repository":{"id":360565863,"uuid":"1250679271","full_name":"hallelx2/vectorless-bench","owner":"hallelx2","description":"An advanced benchmarking suite for the Vectorless reasoning-based RAG engine","archived":false,"fork":false,"pushed_at":"2026-05-27T00:30:29.000Z","size":71,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T01:21:51.526Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hallelx2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-26T21:41:47.000Z","updated_at":"2026-05-27T00:30:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hallelx2/vectorless-bench","commit_stats":null,"previous_names":["hallelx2/vectorless-bench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/hallelx2/vectorless-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fvectorless-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fvectorless-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fvectorless-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fvectorless-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hallelx2","download_url":"https://codeload.github.com/hallelx2/vectorless-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fvectorless-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33703065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-30T18:01:34.825Z","updated_at":"2026-05-30T18:01:35.610Z","avatar_url":"https://github.com/hallelx2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vectorless-bench\n\nAn advanced benchmarking suite for [Vectorless](https://vectorless.store) — the\nreasoning-based (\"vectorless\") RAG engine where **the LLM is the retriever**, not\nan embedding model.\n\nIt exists to turn the engine's claims (deterministic, citation-exact, accurate on\nspecialized domains, no vector DB) into **numbers you can defend**, measured head\nto head against the systems Vectorless is positioned against.\n\n```bash\n# runs today, no API keys, no services — proves the harness end to end\npip install -e .\nvlbench run --config configs/smoke.yaml\n```\n\n---\n\n## Why benchmarking *this* engine is different\n\nStandard RAG benchmarks assume retrieval is free and instant, so they only score\naccuracy. Vectorless retrieves by **calling an LLM over a document map**, so every\nquery has a real **token cost** and **latency**. That single fact reframes the\nwhole exercise:\n\n\u003e The headline metric is not precision@k. It is **quality per dollar** and\n\u003e **quality per second** — the efficiency frontier.\n\nA system that wins on F1 while costing 50× is not a win. `vlbench` puts quality,\ncost, and latency in the **first table of every report** so that trade-off is\nimpossible to hide.\n\nThree things in the engine's own code shape the methodology (and silently corrupt\nnaive benchmarks):\n\n1. **Section IDs are random `sec_\u003cuuid\u003e`s**, regenerated on every ingest. Gold\n   labels therefore can't be IDs — they are *stable anchors* (heading path /\n   answer span / page) resolved to whatever each system returns. See\n   [`anchors.py`](src/vectorless_bench/anchors.py).\n2. **Caching zeroes cost.** Both the llmgate cache and the retrieval cache return\n   `cost_usd=0` on a hit. Fair cost/latency requires a **cold cache** — run the\n   server with `retrieval.cache.enabled=false`. The run manifest records the\n   declared cache mode.\n3. **Determinism is a claim, not a guarantee.** Temp=0 reduces but doesn't\n   eliminate provider nondeterminism, so `vlbench` *measures* it (rerun the same\n   query N times, report set-stability) instead of assuming it.\n\n---\n\n## What it measures — seven axes\n\n| Axis | Metrics | What it tells you |\n|---|---|---|\n| **Retrieval quality** | precision/recall/F1@k, MRR, nDCG, hit@k | Did it fetch the right section? |\n| **Citation exactness** | span-in-top1, **path-correct@1** | Can it point at the exact passage/heading? |\n| **Near-miss** | sibling near-miss rate | Did it grab the *wrong fiscal year / wrong drug* (the vector failure mode)? |\n| **Cost** | $/query, tokens/query, calls/query, **$/correct**, **quality per $1k** | The price of being right |\n| **Latency** | p50 / p95 / p99, ingest time | Cold-cache, end to end |\n| **Determinism** | exact-match + mean Jaccard across reruns | Is the published determinism claim real? |\n| **Robustness** | abstention on no-answer, by-domain, by-answer-type | Does it over-retrieve when the answer isn't there? |\n\n`path-correct@1` and `near-miss` are **structural** metrics: chunk systems (vector\nRAG, BM25) score 0 on path-correctness by construction — that gap *is* the\ndifferentiator the whitepaper argues for, made measurable.\n\n---\n\n## Systems compared\n\n| System | What it is | Deps |\n|---|---|---|\n| `vectorless` | the engine under test, via the Python SDK | `vectorless-sdk` + a running server |\n| `vector_rag` | pgvector + OpenAI embeddings + cosine top-k (the ROADMAP baseline) | `[vector]` + Postgres/pgvector |\n| `pageindex` | the real upstream [PageIndex](https://github.com/VectifyAI/PageIndex) — *their* tree builder (`page_index`/`md_to_tree`) + their reasoning retrieval, priced on our table | clone of PageIndex + `[llm]` |\n| `full_context` | stuff the whole doc in the prompt — the quality **ceiling** + cost worst case | `[llm]` |\n| `bm25` | lexical floor; free, no API, strong on exact-term lookups | `[bm25]` |\n| `mock` | deterministic fake for harness CI — no services | none |\n\nAll LLM-using systems are priced from the **same table** the engine uses\n([`pricing.py`](src/vectorless_bench/pricing.py), mirrored from `llmgate/pricing`),\nso cost is apples-to-apples. Each baseline is a *fair* representative (standard\nchunking, optional reranker hook), not a strawman.\n\n---\n\n## Datasets\n\n- **`fixtures`** — a tiny in-repo curated set (finance + medicine) with stable\n  anchors, a no-answer item, and a sibling near-miss trap. Seeds the\n  \"curated golden set\" and powers the smoke test. Runs in seconds.\n- **`financebench`** — the public 150-question [FinanceBench](https://github.com/patronus-ai/financebench)\n  set over real 10-Ks. QA loads from HuggingFace; fetch the source PDFs with\n  `python scripts/download_financebench.py`. Questions whose document text is\n  missing are skipped (not failed), so a partial corpus still produces a valid run.\n\nAdd your own by subclassing `Dataset` (see\n[`datasets/base.py`](src/vectorless_bench/datasets/base.py)) and emitting\n`Question`s with `GoldAnchor`s. The only rule: **gold is stable anchors, never\nengine IDs.**\n\n---\n\n## Running the real benchmark (FinanceBench)\n\n```bash\npip install -e \".[all]\"\ncp .env.example .env            # fill in keys + DSN\n\n# 1. fetch the source filings\npython scripts/download_financebench.py\n\n# 2. start a Vectorless server with caches OFF (fair cold-cache), then:\nvlbench run --config configs/financebench.yaml\n\n# 3. re-render the report from raw records any time\nvlbench report runs/\u003cstamp\u003e --k 5\n```\n\nEach run writes a self-contained directory:\n\n```\nruns/\u003cstamp\u003e/\n  records.jsonl    one scored (system, question, repeat) row each\n  results.json     aggregated per-system summary\n  report.md        the human report (frontier + per-axis tables)\n  report.html      self-contained HTML report (frontier scatter + tables) — open this\n  pareto.csv       quality vs cost vs latency, for plotting\n  setup.json       per-system ingest time + cost\n  manifest.json    repro: git sha, models, price fingerprint, cache mode, seed\n```\n\n---\n\n## Running on a VM (bundle → run → view)\n\nReal runs are long and need keys + Postgres, so the supported path is a Docker\nbundle you run on a cloud VM, with results shipped to GCS for viewing. One command:\n\n```bash\nPROJECT=\u003cgcp\u003e BUCKET=gs://\u003cbucket\u003e ./deploy/gcp/run_on_gce.sh   # provision, run, upload, delete VM\nBUCKET=gs://\u003cbucket\u003e RUN_ID=\u003cname\u003e ./deploy/gcp/fetch_results.sh # download + open report.html\n```\n\nOr locally with Docker (bundles pgvector + the real PageIndex repo):\n\n```bash\ndocker compose build\ndocker compose run --rm --entrypoint python bench scripts/download_financebench.py\ndocker compose run --rm bench run --config configs/financebench.yaml --out /results --limit 10\n```\n\nFull details and prerequisites: [`deploy/README.md`](deploy/README.md).\n\n---\n\n## Validity controls (what makes the numbers credible)\n\n- **Cost is never reported alone** — always beside quality, plus `$/correct`.\n- **Cold-cache** enforced/declared and recorded in the manifest.\n- **Gold defined independently** of any system's output (human/strong-model, then\n  verified), and matched leniently on surface form, strictly on substance.\n- **Retrieval quality scored separately from answer quality** — Vectorless is a\n  retriever, so retrieval is the primary axis; the optional LLM-judge answer axis\n  uses one judge model for all systems, blind to which system produced the answer.\n- **Determinism uses real reruns**, not an assumption.\n- **Bootstrap CIs** on primary quality so A-vs-B gaps come with uncertainty.\n- **Reproducibility manifest** on every run, including a price-book fingerprint.\n\n---\n\n## Architecture\n\n```\nsrc/vectorless_bench/\n  schema.py        dataclasses (Question, GoldAnchor, RetrievalResult, Usage, ...)\n  pricing.py       engine-mirrored price book + token counting\n  anchors.py       gold matching: the single definition of \"right thing retrieved\"\n  metrics/         retrieval.py · citation.py · aggregate.py (+ score_question)\n  retrievers/      base + registry + vectorless, vector_rag, pageindex,\n                   full_context, bm25, mock\n  datasets/        base + fixtures + financebench\n  judge.py         optional LLM-as-judge answer axis\n  runner.py        orchestrator -\u003e records.jsonl + manifest\n  report.py        records -\u003e results.json + report.md + report.html + pareto.csv\n  cli.py           `vlbench run | report | systems`\nDockerfile, docker-compose.yml   the runnable bundle (+ pgvector + real PageIndex)\ndeploy/gcp/      run_on_gce.sh · fetch_results.sh · startup-script.sh\n```\n\nCore (schema, pricing, anchors, metrics, runner, report, mock, fixtures) has\n**zero third-party dependencies** and is fully unit-tested; everything that needs\na network or heavy library is an optional extra, imported lazily by the system\nthat uses it. `pytest` runs the whole harness, including a real end-to-end run,\nwith no keys.\n\n### Extending\n\n- **New retriever:** implement `setup(corpus)` + `retrieve(question, k, cold) -\u003e\n  RetrievalResult`, then `register(\"name\", factory)` in `retrievers/registry.py`.\n- **New dataset:** subclass `Dataset`, return a corpus + gold-anchored questions.\n- **New metric:** add to `metrics/` and surface it in `report.py`.\n\n---\n\n## Roadmap\n\nThis closes the deferred item in the engine's own `ROADMAP.md`\n(\"Benchmarks vs. traditional RAG … publish in benchmarks/README.md\"):\n\n- [x] **Phase 0** — harness, metrics, mock + fixtures, cold-cache control, reports (this repo)\n- [x] **Phase 1** — vector RAG, BM25, full-context, real-PageIndex baselines; cost/latency frontier\n- [x] **Phase 1.5** — Docker bundle + GCE run-and-view-results flow; HTML report\n- [ ] **Phase 2** — full FinanceBench run on a VM with a live engine\n- [ ] **Phase 3** — curated finance/law/medicine golden set; CUAD + multi-hop; CI regression gate; published leaderboard\n\n## License\n\nMIT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhallelx2%2Fvectorless-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhallelx2%2Fvectorless-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhallelx2%2Fvectorless-bench/lists"}