https://github.com/auraoneai/judge-bench

Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.

ai-evaluation benchmark evals llm-as-judge


Host: GitHub
URL: https://github.com/auraoneai/judge-bench
Owner: auraoneai
License: MIT
Created: 2026-05-12T01:32:31.000Z (22 days ago)
Default Branch: main
Last Pushed: 2026-05-12T07:18:39.000Z (22 days ago)
Last Synced: 2026-05-12T08:12:12.006Z (22 days ago)
Topics: ai-evaluation, benchmark, evals, llm-as-judge
Language: Python
Homepage: https://auraone.ai/open
Size: 30.3 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

README

# judge-bench

`judge-bench` runs synthetic diagnostics for LLM-as-judge reliability: position bias, verbosity bias, self-preference, paraphrase stability, anchoring, and calibration. It emits JSON, Markdown, and plot-ready summaries.
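To make the first probe concrete, here is a minimal sketch of one common way to measure position bias: show the judge the same pair twice with the slots swapped and count how often its preferred response changes. This is an illustration only, not the tool's implementation; the `Judge` signature and the `position_inconsistency_rate` helper are hypothetical names.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature (an assumption, not judge-bench's API):
# takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_inconsistency_rate(
    judge: Judge, pairs: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of pairs where the preferred response changes after swapping slots."""
    if not pairs:
        return 0.0
    inconsistent = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)  # `first` shown in slot A
        swapped = judge(prompt, second, first)   # `first` shown in slot B
        # A position-insensitive judge prefers the same response both times,
        # so the slot label it returns must flip between the two calls.
        same_response = (original, swapped) in {("A", "B"), ("B", "A")}
        if not same_response:
            inconsistent += 1
    return inconsistent / len(pairs)


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge that always prefers whatever is in slot A."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> str:
    """Toy judge that prefers the longer response regardless of slot."""
    return "A" if len(a) > len(b) else "B"


toy_pairs = [("q1", "short", "a much longer answer"), ("q2", "abc", "abcdef")]
print(position_inconsistency_rate(always_first, toy_pairs))
print(position_inconsistency_rate(prefers_longer, toy_pairs))
```

The second toy judge is fully consistent under swapping, which shows that a low rate here says nothing about other biases such as verbosity; each probe has to be read separately.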

## Quickstart

```bash
pip install judge-bench
judge-bench run --backend openai --model gpt-4o --probes position_bias --dry-run
judge-bench run --backend local --probes all --pairs 20 --cache-dir .judge-bench-cache --output report.json
```

Provider backends call their public APIs directly with standard environment variables:
`OPENAI_API_KEY` for `--backend openai`, `ANTHROPIC_API_KEY` for `--backend anthropic`,
and `GEMINI_API_KEY` for `--backend google`. Runs without `--dry-run` require `--confirm-cost`.
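A small pre-flight check can catch a missing key before starting a run. The sketch below only restates the backend-to-variable mapping from the paragraph above; the `missing_key` helper is a hypothetical name and is not part of `judge-bench`.

```python
import os
from typing import Mapping, Optional

# Mapping taken from the README text; the helper itself is illustrative.
BACKEND_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
}


def missing_key(backend: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for `backend`, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = BACKEND_ENV_VARS.get(backend)
    if var is None:
        # Backends not listed above (for example `local`) are not checked here.
        return None
    return None if env.get(var) else var


if __name__ == "__main__":
    for name in BACKEND_ENV_VARS:
        problem = missing_key(name)
        print(name, "ok" if problem is None else "missing " + problem)
```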

Repeated judge calls are cached by `(backend family, model, prompt, response_a, response_b)` under `.judge-bench-cache` so paid backends do not re-run the same synthetic diagnostic pair. Use `--cache-dir` to isolate or share caches across runs.
Each run writes JSON, Markdown, and plot artifacts next to the requested output path: `.md`, `.plots.json`, `.svg`, and `.png` when `matplotlib` is installed.
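One way to picture the cache key is as a hash over the five fields named above. The sketch below is an assumption about how such a key could be derived, not the on-disk format `judge-bench` actually uses; `cache_key` is a hypothetical helper.

```python
import hashlib
import json


def cache_key(backend_family: str, model: str, prompt: str,
              response_a: str, response_b: str) -> str:
    """Derive a stable hex digest from the five fields that identify a judge call."""
    # Serialising as a JSON list keeps field boundaries unambiguous, so
    # ("ab", "c") and ("a", "bc") cannot collide the way plain concatenation could.
    payload = json.dumps(
        [backend_family, model, prompt, response_a, response_b],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("openai", "gpt-4o", "Which answer is better?", "Answer one", "Answer two")
swapped = cache_key("openai", "gpt-4o", "Which answer is better?", "Answer two", "Answer one")
print(key != swapped)
```

Because `response_a` and `response_b` are separate key fields, the swapped ordering is a distinct entry under this scheme, which is what a position-bias probe would need.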

The local backend can run against local model servers without API spend:

```bash
judge-bench run --backend local --model ollama:llama3.1 --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8000/v1 judge-bench run --backend local --model vllm:meta-llama/Llama-3.1-8B-Instruct --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8080 judge-bench run --backend local --model hf:mistral --probes position_bias --output report.json
```

Supported local modes are `ollama:` for Ollama `/api/generate`, `vllm:` for OpenAI-compatible `/chat/completions`, and `hf:` or `transformers:` for Hugging Face text generation. `JUDGE_BENCH_LOCAL_BACKEND`, `JUDGE_BENCH_LOCAL_URL`, and `JUDGE_BENCH_LOCAL_API_KEY` can override mode, endpoint, and bearer token. If no local mode is selected, `local-judge` uses a deterministic lexical heuristic for offline smoke tests.
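The prefix convention in `--model` can be read as `mode:model-name`. The following sketch shows one plausible way to split such a value; `split_model_spec` is a hypothetical helper, and the real parser may handle edge cases and the `JUDGE_BENCH_LOCAL_BACKEND` override differently.

```python
from typing import Optional, Tuple

# Prefixes listed in the README; everything else here is an assumption.
KNOWN_MODES = ("ollama", "vllm", "hf", "transformers")


def split_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split `mode:name` into (mode, name); return (None, spec) when no known mode is given."""
    prefix, sep, name = spec.partition(":")  # split on the first colon only
    if sep and prefix in KNOWN_MODES:
        return prefix, name
    return None, spec


print(split_model_spec("ollama:llama3.1"))
print(split_model_spec("vllm:meta-llama/Llama-3.1-8B-Instruct"))
print(split_model_spec("local-judge"))
```

Splitting on the first colon only keeps model names that themselves contain a colon intact after the mode prefix.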

## What This Is Not

This is not a benchmark, leaderboard, or claim of model superiority. All bundled pairs are synthetic and disclosed as such.
