https://github.com/auraoneai/judge-bench

Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.

# judge-bench

`judge-bench` runs synthetic diagnostics for LLM-as-judge reliability: position bias, verbosity bias, self-preference, paraphrase stability, anchoring, and calibration. It emits JSON, Markdown, and plot-ready summaries.
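To make the first probe concrete, here is a minimal sketch of how position bias can be measured: judge each pair twice with the two responses swapped, and count how often the verdict fails to follow the response. This is an illustration under stated assumptions, not the package's implementation; the `Judge` signature, `position_flip_rate`, and the two toy judges are hypothetical names.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature (an assumption, not the judge-bench API):
# takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, pairs: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the verdict does not follow the response after a swap."""
    inconsistent = 0
    total = 0
    for prompt, first, second in pairs:
        original = judge(prompt, first, second)
        swapped = judge(prompt, second, first)
        # A position-independent judge prefers the same underlying response,
        # so its label must change from "A" to "B" (or "B" to "A") after the swap.
        follows_response = {original, swapped} == {"A", "B"}
        if not follows_response:
            inconsistent += 1
        total += 1
    return inconsistent / total if total else 0.0


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge that always prefers slot A (maximal position bias)."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Toy judge that prefers the longer response regardless of slot."""
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "detailed reply here", "ok"),
]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(longer_wins, pairs))
```

The length-based toy judge scores as position-consistent here while still being biased in another way, which is why verbosity bias is listed as a separate probe.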

## Quickstart

```bash
pip install judge-bench
judge-bench run --backend openai --model gpt-4o --probes position_bias --dry-run
judge-bench run --backend local --probes all --pairs 20 --cache-dir .judge-bench-cache --output report.json
```

Provider backends call each provider's public API directly and read credentials from the standard environment variables:
`OPENAI_API_KEY` for `--backend openai`, `ANTHROPIC_API_KEY` for `--backend anthropic`,
and `GEMINI_API_KEY` for `--backend google`. For these backends, any run without `--dry-run` requires `--confirm-cost`.
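A small pre-flight check can catch a missing key before a run starts. The sketch below only encodes the backend-to-variable mapping stated above; the helper name `missing_credential` is hypothetical and is not part of `judge-bench`.

```python
import os
from typing import Mapping, Optional

# Backend-to-variable mapping as documented above.
BACKEND_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GEMINI_API_KEY",
}


def missing_credential(backend: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for a provider backend, or None."""
    env = os.environ if env is None else env
    var = BACKEND_ENV_VARS.get(backend)
    if var is None:
        # Not a provider backend in this mapping (for example "local").
        return None
    # Treat an empty string the same as an unset variable.
    return None if env.get(var) else var


missing = missing_credential("openai")
if missing:
    print(f"Set {missing} before running with --backend openai")
```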

Repeated judge calls are cached by `(backend family, model, prompt, response_a, response_b)` under `.judge-bench-cache` so paid backends do not re-run the same synthetic diagnostic pair. Use `--cache-dir` to isolate or share caches across runs.

Each run writes JSON, Markdown, and plot artifacts next to the requested output path: `.md`, `.plots.json`, `.svg`, and `.png` when `matplotlib` is installed.
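The cache key described above can be pictured as a stable hash over the five fields. The following is a sketch only: the source does not say how `judge-bench` serializes or hashes the tuple, so the JSON encoding, the SHA-256 digest, and the function name `cache_key` are assumptions.

```python
import hashlib
import json


def cache_key(backend_family: str, model: str, prompt: str,
              response_a: str, response_b: str) -> str:
    """Derive a deterministic key from the documented five-field tuple.

    Assumption: JSON-encode the fields in order, then hash with SHA-256.
    JSON encoding keeps field boundaries unambiguous, unlike plain concatenation.
    """
    payload = json.dumps(
        [backend_family, model, prompt, response_a, response_b],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_ab = cache_key("openai", "gpt-4o", "Which answer is better?", "first", "second")
key_ba = cache_key("openai", "gpt-4o", "Which answer is better?", "second", "first")
print(key_ab != key_ba)
```

Because the tuple is ordered, swapping `response_a` and `response_b` yields a different key in this sketch, so a swapped-order judgment would be stored as its own entry instead of reusing the original verdict.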

The local backend can run against local model servers without API spend:

```bash
judge-bench run --backend local --model ollama:llama3.1 --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8000/v1 judge-bench run --backend local --model vllm:meta-llama/Llama-3.1-8B-Instruct --probes position_bias --output report.json
JUDGE_BENCH_LOCAL_URL=http://localhost:8080 judge-bench run --backend local --model hf:mistral --probes position_bias --output report.json
```

Supported local modes are `ollama:` for Ollama `/api/generate`, `vllm:` for OpenAI-compatible `/chat/completions`, and `hf:` or `transformers:` for Hugging Face text generation. `JUDGE_BENCH_LOCAL_BACKEND`, `JUDGE_BENCH_LOCAL_URL`, and `JUDGE_BENCH_LOCAL_API_KEY` can override the mode, endpoint, and bearer token, respectively. If no local mode is selected, `local-judge` uses a deterministic lexical heuristic for offline smoke tests.
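The prefix convention can be read as a simple split on the first colon. The sketch below shows one plausible resolution order, assuming that `JUDGE_BENCH_LOCAL_BACKEND` takes precedence over the prefix; the source only says the variable "can override" the mode, and `resolve_local_mode` is a hypothetical helper, not the package's code.

```python
import os
from typing import Mapping, Optional, Tuple

# Mode prefixes listed in the paragraph above.
KNOWN_MODES = ("ollama", "vllm", "hf", "transformers")


def resolve_local_mode(model: str,
                       env: Optional[Mapping[str, str]] = None) -> Tuple[Optional[str], str]:
    """Split a --model value into (mode, model_name).

    A mode of None means no local mode was selected.
    """
    env = os.environ if env is None else env
    prefix, sep, rest = model.partition(":")
    if sep and prefix in KNOWN_MODES:
        mode, name = prefix, rest
    else:
        # No recognized prefix: keep the whole string as the model name.
        mode, name = None, model
    # Assumption: the environment variable wins over the prefix when both are set.
    override = env.get("JUDGE_BENCH_LOCAL_BACKEND")
    if override:
        mode = override
    return mode, name


print(resolve_local_mode("ollama:llama3.1", {}))
print(resolve_local_mode("local-judge", {}))
```

Splitting on the first colon only keeps model names that contain their own colons or slashes intact, such as `vllm:meta-llama/Llama-3.1-8B-Instruct`.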

## What This Is Not

This is not a benchmark, leaderboard, or claim of model superiority. All bundled pairs are synthetic and disclosed as such.