# ⚖️ **Awesome LLM Judges** ⚖️

_This repo curates recent research on LLM judges: language models used as automated evaluators of other models' outputs._

> [!TIP]
> ⚖️ Check out [Verdict](https://verdict.haizelabs.com) — our in-house library for hassle-free implementations of the papers below!

---

## 📚 Table of Contents

- [🌱 Starter](#-starter)
- [🎭 Multi-Judge](#-multi-judge)
  - [🤔 Debate](#-debate)
- [🎯 Finetuned Models](#-finetuned-models)
  - [🌀 Hallucination](#-hallucination)
  - [🏆 Generative Reward Models](#-generative-reward-models)
- [🛡️ Safety](#️-safety)
  - [🛑 Content Moderation](#-content-moderation)
  - [🔍 Scalable Oversight](#-scalable-oversight)
- [👨‍⚖️ Judging the Judges: Meta-Evaluation](#-judging-the-judges-meta-evaluation)
  - [⚖️ Biases](#-biases)
- [🤖 Agents](#-agents)
- [✨ Contributing](#-contributing)

---

## 🌱 Starter

- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Benchmarking Foundation Models with Language-Model-as-an-Examiner](https://arxiv.org/abs/2306.04181)
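
The papers above share one core recipe: prompt a strong model with a grading rubric and the output under evaluation, then parse its verdict. A minimal sketch of that recipe (not code from any of these papers; the OpenAI SDK, model name, rubric, and 1-5 scale are illustrative placeholders):

```python
# Single-judge direct scoring, in the spirit of the starter papers above.
# Illustrative only: the model name, rubric, and 1-5 scale are placeholders.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the response below for factual
accuracy and helpfulness on a scale of 1 to 5. Reply with the score only.

[Question]
{question}

[Response]
{response}"""


def judge(question: str, response: str, model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    match = re.search(r"[1-5]", completion.choices[0].message.content)
    if match is None:
        raise ValueError("Judge did not return a parseable score")
    return int(match.group())
```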

---

## 🎭 Multi-Judge

- [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)
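
The panel-of-judges idea above replaces one large judge with several smaller, independent judges whose scores are pooled. A rough sketch under simplifying assumptions (the paper draws judges from different model families; a single provider and an average-score pooling rule are used here only to keep the example short):

```python
# Panel-of-judges sketch: several independent judges, scores pooled by averaging.
# The panel below and the pooling rule are placeholders, not the paper's exact recipe.
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

PANEL = ["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo"]  # placeholder judge panel


def score_once(model: str, question: str, response: str) -> int:
    """One judge's 1-5 correctness rating for a single response."""
    prompt = (
        "Rate the response to the question below for correctness on a 1-5 scale. "
        f"Reply with the number only.\n\nQuestion: {question}\n\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5]", out.choices[0].message.content)
    if match is None:
        raise ValueError(f"{model} did not return a parseable score")
    return int(match.group())


def panel_score(question: str, response: str) -> float:
    """Pool the panel's independent ratings into a single score."""
    return mean(score_once(m, question, response) for m in PANEL)
```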

### 🤔 Debate

- [ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate](https://arxiv.org/abs/2401.16788)
- [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
- [Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate](https://arxiv.org/abs/2305.13160)
- [Debating with More Persuasive LLMs Leads to More Truthful Answers](https://arxiv.org/abs/2402.06782)

---

## 🎯 Finetuned Models

- [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)

### 🌀 Hallucination

- [HALU-J: Critique-Based Hallucination Judge](https://arxiv.org/abs/2407.12943)
- [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https://aclanthology.org/2024.emnlp-main.499/)
- [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

### 🏆 Generative Reward Models

- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https://arxiv.org/abs/2408.15240)
- [Critique-out-Loud Reward Models](https://arxiv.org/abs/2408.11791)

---

## 🛡️ Safety

### 🛑 Content Moderation

- [A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)](https://arxiv.org/pdf/2402.10260)
- [OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)](https://arxiv.org/abs/2405.20947)
- [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495)
- [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674)

### 🔍 Scalable Oversight

- [On Scalable Oversight with Weak LLMs Judging Strong LLMs](https://arxiv.org/abs/2407.04622)
- [Debate Helps Supervise Unreliable Experts](https://arxiv.org/abs/2311.08702)
- [Great Models Think Alike and this Undermines AI Oversight](https://arxiv.org/abs/2502.04313)
- [LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)

---

## 👨‍⚖️ Judging the Judges: Meta-Evaluation

- [JudgeBench: A Benchmark for Evaluating LLM-based Judges](https://arxiv.org/abs/2410.12784)
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
- [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/abs/2310.07641)
- [Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge](https://arxiv.org/abs/2407.19594)
- [From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge](https://arxiv.org/abs/2411.16594)
- [Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences](https://arxiv.org/abs/2404.12272)
- [Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference](https://arxiv.org/abs/2501.00560)
- [The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs](https://arxiv.org/abs/2501.10970)
- [ReIFE: Re-evaluating Instruction-Following Evaluation](https://arxiv.org/abs/2410.07069)

### ⚖️ Biases

- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)
- [Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions](https://arxiv.org/abs/2308.11483)
- [Large Language Models are Inconsistent and Biased Evaluators](https://arxiv.org/abs/2405.01724)
- [Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges](https://arxiv.org/abs/2406.12624)
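
Much of the inconsistency documented above comes down to position bias in pairwise comparisons: the judge favors whichever answer appears first. One common mitigation, studied in "Large Language Models are not Fair Evaluators", is to query the judge in both orders and only accept verdicts that survive the swap. A hypothetical sketch (`pairwise_judge` stands in for any judge call that returns "A" or "B"):

```python
# Swap-and-compare check for position bias in pairwise judging. `pairwise_judge`
# is a placeholder for any judge call that sees (question, answer shown as A,
# answer shown as B) and returns "A" or "B".
from typing import Callable, Optional


def debiased_verdict(
    pairwise_judge: Callable[[str, str, str], str],
    question: str,
    answer_1: str,
    answer_2: str,
) -> Optional[str]:
    first = pairwise_judge(question, answer_1, answer_2)   # answer_1 shown in position A
    second = pairwise_judge(question, answer_2, answer_1)  # answer_1 shown in position B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # verdict flipped with the order: treat as a tie / position-biased
```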

---

## 🤖 Agents

_🚧 Coming soon. Stay tuned!_

---

## ✨ Contributing

Have a paper to add? Found a mistake? 🧐

- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [[email protected]](mailto:[email protected]).