https://github.com/auraoneai/judge-bench
Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.
ai-evaluation benchmark evals llm-as-judge
Last synced: 6 days ago
- Host: GitHub
- URL: https://github.com/auraoneai/judge-bench
- Owner: auraoneai
- License: MIT
- Created: 2026-05-12T01:32:31.000Z (22 days ago)
- Default Branch: main
- Last Pushed: 2026-05-12T07:18:39.000Z (22 days ago)
- Last Synced: 2026-05-12T08:12:12.006Z (22 days ago)
- Topics: ai-evaluation, benchmark, evals, llm-as-judge
- Language: Python
- Homepage: https://auraone.ai/open
- Size: 30.3 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
# judge-bench
`judge-bench` runs synthetic diagnostics for LLM-as-judge reliability: position bias, verbosity bias, self-preference, paraphrase stability, anchoring, and calibration. It emits JSON, Markdown, and plot-ready summaries.
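To make the first probe concrete, here is a minimal sketch of how a position-bias check can work in principle: judge each pair twice with the responses swapped and count how often the verdict fails to follow the swap. This is not `judge-bench`'s actual implementation; the `Judge` signature, `position_flip_rate`, and the toy judges are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature (an assumption, not the judge-bench API):
# takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs whose verdict does not follow the responses when they are swapped."""
    if not pairs:
        return 0.0
    inconsistent = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)
        swapped = judge(prompt, second, first)
        # A position-independent judge prefers the same underlying response,
        # so its label must change from "A" to "B" (or "B" to "A") after the swap.
        follows_swap = {original, swapped} == {"A", "B"}
        if not follows_swap:
            inconsistent += 1
    return inconsistent / len(pairs)


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Toy judge that prefers the longer response regardless of position."""
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "detailed reply here", "ok"),
]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(longer_wins, pairs))
```

The second toy judge shows why probes need to be read together: it has no position bias at all, yet it would fail a verbosity-bias check.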
## Quickstart
```bash
pip install judge-bench
judge-bench run --backend openai --model gpt-4o --probes position_bias --dry-run
judge-bench run --backend local --probes all --pairs 20 --cache-dir .judge-bench-cache --output report.json
```
Provider backends call the providers' public APIs directly and read credentials from the standard environment variables:
`OPENAI_API_KEY` for `--backend openai`, `ANTHROPIC_API_KEY` for `--backend anthropic`,
and `GEMINI_API_KEY` for `--backend google`. Any run without `--dry-run` requires `--confirm-cost`.
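As a pre-flight check before a paid run, the backend-to-variable mapping above can be tested in a few lines. The sketch below is a standalone helper, not part of `judge-bench`; `API_KEY_ENV` and `missing_api_key` are hypothetical names, and only the variable names themselves come from the README.

```python
import os
from typing import Mapping, Optional

# Backend-to-variable mapping as stated in the README text.
API_KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
}


def missing_api_key(backend: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for `backend`, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = API_KEY_ENV.get(backend)
    if var is None:
        # Assumption: backends outside the mapping (e.g. "local") need no provider key.
        return None
    return None if env.get(var) else var


problem = missing_api_key("openai", env={})
if problem:
    print(f"Set {problem} before running with --backend openai")
```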
Repeated judge calls are cached by `(backend family, model, prompt, response_a, response_b)` under `.judge-bench-cache` so paid backends do not re-run the same synthetic diagnostic pair. Use `--cache-dir` to isolate or share caches across runs.
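One way such a cache can be keyed is to hash the five-part tuple and store one file per key. The sketch below illustrates that idea under stated assumptions: the SHA-256 hashing, the one-JSON-file-per-key layout, and the function names are all hypothetical, and the on-disk format `judge-bench` actually uses may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def cache_key(backend_family: str, model: str, prompt: str,
              response_a: str, response_b: str) -> str:
    """Stable hex digest of the ordered tuple that identifies one judge call."""
    payload = json.dumps(
        [backend_family, model, prompt, response_a, response_b],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_verdict(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached verdict for `key`, or None on a cache miss."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def store_verdict(cache_dir: Path, key: str, verdict: dict) -> None:
    """Write one verdict to the cache, creating the directory if needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{key}.json").write_text(json.dumps(verdict), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / ".judge-bench-cache"
    key = cache_key("openai", "gpt-4o", "Which is better?", "answer one", "answer two")
    first_lookup = load_verdict(cache_dir, key)    # miss: nothing stored yet
    store_verdict(cache_dir, key, {"winner": "A"})
    second_lookup = load_verdict(cache_dir, key)   # hit: no second paid call needed

swapped_key = cache_key("openai", "gpt-4o", "Which is better?", "answer two", "answer one")
```

Because the tuple is ordered, swapping `response_a` and `response_b` yields a different key in this sketch, so the swapped call of a position-bias probe would be cached separately from the original.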
Each run writes JSON, Markdown, and plot artifacts next to the requested output path: `.md`, `.plots.json`, `.svg`, and `.png` when `matplotlib` is installed.
The local backend can run against local model servers without API spend:
```bash
judge-bench run --backend local --model ollama:llama3.1 --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8000/v1 judge-bench run --backend local --model vllm:meta-llama/Llama-3.1-8B-Instruct --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8080 judge-bench run --backend local --model hf:mistral --probes position_bias --output report.json
```
Supported local modes are `ollama:` for Ollama `/api/generate`, `vllm:` for OpenAI-compatible `/chat/completions`, and `hf:` or `transformers:` for Hugging Face text generation. `JUDGE_BENCH_LOCAL_BACKEND`, `JUDGE_BENCH_LOCAL_URL`, and `JUDGE_BENCH_LOCAL_API_KEY` can override mode, endpoint, and bearer token. If no local mode is selected, `local-judge` uses a deterministic lexical heuristic for offline smoke tests.
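The prefix convention above can be illustrated with a small parser. This is a sketch of the convention as described, not the tool's own code: `resolve_local_mode` and `KNOWN_MODES` are hypothetical names, and giving `JUDGE_BENCH_LOCAL_BACKEND` precedence over the prefix is an assumption based on the word "override".

```python
import os
from typing import Mapping, Optional, Tuple

# Mode prefixes listed in the README text.
KNOWN_MODES = ("ollama", "vllm", "hf", "transformers")


def resolve_local_mode(
    model: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (mode, model_name); a mode of None means no local mode was selected."""
    env = os.environ if env is None else env
    mode, name = None, model
    if model and ":" in model:
        # Split only on the first colon so tags such as "llama3.1:8b" stay intact.
        prefix, rest = model.split(":", 1)
        if prefix in KNOWN_MODES:
            mode, name = prefix, rest
    # Assumption: the environment variable wins over the prefix when both are set.
    override = env.get("JUDGE_BENCH_LOCAL_BACKEND")
    if override:
        mode = override
    return mode, name


print(resolve_local_mode("ollama:llama3.1", env={}))
print(resolve_local_mode("vllm:meta-llama/Llama-3.1-8B-Instruct", env={}))
print(resolve_local_mode(None, env={}))
```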
## What This Is Not
This is not a benchmark, leaderboard, or claim of model superiority. All bundled pairs are synthetic and disclosed as such.