{"id":50455132,"url":"https://github.com/unshdee/proofrag","last_synced_at":"2026-06-01T02:00:57.657Z","repository":{"id":361720368,"uuid":"1255073151","full_name":"unshDee/proofrag","owner":"unshDee","description":"Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge \u0026 retrieval scorecard, in one command.","archived":false,"fork":false,"pushed_at":"2026-06-01T00:26:55.000Z","size":2307,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-01T02:00:49.023Z","etag":null,"topics":["agent-skills","ci","claude","claude-code","codex","evaluation","llm","llm-as-judge","python","rag","rag-evaluation","retrieval"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/unshDee.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-31T11:20:49.000Z","updated_at":"2026-06-01T00:29:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/unshDee/proofrag","commit_stats":null,"previous_names":["unshdee/proofrag"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/unshDee/proofrag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unshDee%2Fproofrag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unshDee%2Fproofrag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unshDee%2Fproofrag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unshDee%2Fproofrag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/unshDee","download_url":"https://codeload.github.com/unshDee/proofrag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unshDee%2Fproofrag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33756581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-skills","ci","claude","claude-code","codex","evaluation","llm","llm-as-judge","python","rag","rag-evaluation","retrieval"],"created_at":"2026-06-01T02:00:49.202Z","updated_at":"2026-06-01T02:00:57.646Z","avatar_url":"https://github.com/unshDee.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# proofrag\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/proofrag/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/proofrag?color=2563eb\u0026label=pypi\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/proofrag/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/proofrag\" alt=\"Python\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/unshDee/proofrag/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/unshDee/proofrag/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/proofrag\"\u003e\u003cimg src=\"https://static.pepy.tech/badge/proofrag/month\" alt=\"Downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-green.svg\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**Point your agent at your docs and your RAG app. Get a golden test set, an\nLLM-as-judge + retrieval scorecard, and a CI gate — in one command.**\n\nEvaluation is the #1 unmet pain in production RAG/LLM work, and the hardest part\nis building a good test set in the first place. `proofrag` generates one from\n*your own corpus*, judges your system on it, and emits a shareable HTML scorecard.\nIt's an [Agent Skill](https://agentskills.io) (works in Claude Code, Codex, Cursor)\n**and** a plain Python CLI — wrapping the eval loop, not reinventing the metrics.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/demo.gif\" alt=\"proofrag — generate a golden set, judge, and score in one loop\" width=\"820\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cem\u003e…and the scorecard it produces:\u003c/em\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/scorecard.png\" alt=\"RAG eval scorecard\" width=\"760\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\u003cem\u003eSee a scorecard in 5 seconds — no API key needed:\u003c/em\u003e\u003c/p\u003e\n\n```bash\npipx install \"proofrag[anthropic]\"        # or: pip install / uv tool install / uvx\nproofrag demo --out scorecard.html \u0026\u0026 open scorecard.html\n```\n\n\u003e Use `[openai]` instead of `[anthropic]` for an OpenAI-compatible or local (Ollama) backend.\n\u003e No install? Run it ad-hoc: `uvx \"proofrag[anthropic]\" demo`.\n\n## Install as an Agent Skill\n\n`proofrag` is a skill (the [agentskills.io](https://agentskills.io) open standard) backed\nby a real CLI — so any agent can run *\"evaluate my RAG\"* and get a reproducible scorecard.\n\n**Claude Code (plugin):**\n```\n/plugin marketplace add unshDee/proofrag\n/plugin install proofrag@proofrag\n```\nThen ask *\"evaluate my RAG\"* (auto-triggered) or type `/proofrag`.\n\n**Claude Code (manual)** — `cp -r skills/proofrag ~/.claude/skills/`\n**Codex / other agents** — `cp -r skills/proofrag .agents/skills/`\n\nThe skill drives the `proofrag` CLI; install it with `uv tool install \"proofrag[anthropic]\"`\n(or `pipx install`, or run ad-hoc via `uvx`). See [AGENTS.md](AGENTS.md) for details.\n\n## Why this exists\n\n\u003e \"Running evals aren't the problem — the problem is acquiring or building a\n\u003e high-quality, non-contaminated dataset.\"\n\nMost RAG systems reach production with no evals because writing a balanced golden\nset by hand is tedious. So teams ship prompt and model changes blind. This closes\nthat loop: **change something → re-run → see if quality moved → gate the merge.**\n\n## The loop\n\n```bash\n# 1. Generate a golden set from YOUR docs (questions + gold answers + gold contexts)\nproofrag generate --corpus ./docs --out goldenset.jsonl --n 20\n\n# 2. Run your RAG over each question -\u003e predictions.jsonl  (one line per question)\n#    {\"id\": \"q000\", \"answer\": \"...\", \"retrieved_contexts\": [\"...\", \"...\"]}\n#    See examples/docs-rag/naive_rag.py for a runnable driver.\n\n# 3. Judge: groundedness, correctness, completeness, citation quality + retrieval metrics\nproofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json\n\n# 4. Shareable HTML scorecard\nproofrag report --results results.json --out scorecard.html\n```\n\nRun the whole thing end-to-end against the bundled example:\n\n```bash\nuv sync --extra anthropic \u0026\u0026 export ANTHROPIC_API_KEY=...\nuv run proofrag generate --corpus examples/docs-rag/corpus --out goldenset.jsonl --n 8\nuv run python examples/docs-rag/naive_rag.py --goldenset goldenset.jsonl --corpus examples/docs-rag/corpus --out predictions.jsonl\nuv run proofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl --out results.json\nuv run proofrag report --results results.json --out scorecard.html\n```\n\n## CI gate\n\nTwo kinds of gate. An **absolute** floor:\n\n```bash\nproofrag evaluate --goldenset goldenset.jsonl --predictions predictions.jsonl \\\n  --out results.json --fail-under 0.7      # non-zero exit if overall score drops below 0.7\n```\n\n…and a **regression** gate against a committed baseline (a known-good results.json):\n\n```bash\nproofrag diff --baseline baseline.json --candidate results.json --tolerance 0.02\n# prints a per-metric delta table; exits 1 if any metric dropped \u003e tolerance.\n# Refuses to compare across different judge models unless --allow-judge-mismatch.\n```\n\n### GitHub Action\n\nDrop proofrag into any repo's CI in a few lines — it installs the CLI, evaluates,\nwrites the scorecard, and gates on both the floor and the baseline:\n\n```yaml\n- uses: unshDee/proofrag@v0\n  env:\n    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n  with:\n    goldenset: eval/goldenset.jsonl\n    predictions: predictions.jsonl     # produced by your RAG earlier in the job\n    baseline: eval/baseline.json        # optional regression gate\n    fail-under: \"0.7\"                   # optional absolute gate\n```\n\nFull runnable workflow (with artifact upload): [`examples/ci/proofrag-eval.yml`](examples/ci/proofrag-eval.yml).\n\n## What makes it different\n\n- **Golden set from your corpus** — the wedge. Difficulty tiers: single-doc,\n  multi-doc, and *unanswerable* (so you catch hallucination-instead-of-refusal).\n- **Retriever vs generator split** — rank-aware retrieval metrics (Recall@k,\n  Precision@k, NDCG@k, MRR) separate \"the context never arrived / ranked too low\"\n  from \"the model fluffed it.\" Lexical by default; `--semantic` for embedding match.\n- **Pinned, fingerprinted judge** — every scorecard records its judge model, so you\n  never compare scores produced by different judges.\n- **Cheap \u0026 portable** — defaults to a small model; Anthropic, OpenAI, or local/Ollama\n  (`OPENAI_BASE_URL`). Self-contained HTML, zero JS, zero external assets.\n- **Agent-native** — drop it in as a skill and say *\"evaluate my RAG\"*; the agent\n  wires your pipeline to the kit.\n\n## Configuration\n\n| Env | Default | Purpose |\n|-----|---------|---------|\n| `ANTHROPIC_API_KEY` | — | Anthropic backend (default) |\n| `OPENAI_API_KEY` / `OPENAI_BASE_URL` | — | OpenAI-compatible / local |\n| `PROOFRAG_PROVIDER` | auto | `anthropic` or `openai` |\n| `PROOFRAG_MODEL` | Haiku / gpt-4o-mini | judge \u0026 generator model |\n| `PROOFRAG_EMBED_MODEL` | text-embedding-3-small | embeddings for `--semantic` retrieval match |\n\n## Contributing\n\nIssues and PRs welcome — see [CONTRIBUTING.md](CONTRIBUTING.md). MIT licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funshdee%2Fproofrag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funshdee%2Fproofrag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funshdee%2Fproofrag/lists"}