# ⚖️ **Awesome LLM Judges** ⚖️

_This repo curates recent research on LLM judges: language models used as automated evaluators of other models' outputs._

> [!TIP]
> ⚖️ Check out [Verdict](https://verdict.haizelabs.com) — our in-house library for hassle-free implementations of the papers below!

---

## 📚 Table of Contents

- [🌱 Starter](#-starter)
- [🎭 Multi-Judge](#-multi-judge)
  - [🤔 Debate](#-debate)
- [🎯 Finetuned Models](#-finetuned-models)
  - [🌀 Hallucination](#-hallucination)
  - [🏆 Generative Reward Models](#-generative-reward-models)
- [🛡️ Safety](#️-safety)
  - [🛑 Content Moderation](#-content-moderation)
  - [🔍 Scalable Oversight](#-scalable-oversight)
- [👨‍⚖️ Judging the Judges: Meta-Evaluation](#-judging-the-judges-meta-evaluation)
  - [⚖️ Biases](#-biases)
- [🤖 Agents](#-agents)
- [✨ Contributing](#-contributing)

---

## 🌱 Starter

- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Benchmarking Foundation Models with Language-Model-as-an-Examiner](https://arxiv.org/abs/2306.04181)
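
The papers above share one core recipe: prompt a strong model with a grading rubric and the output under evaluation, then parse its verdict. A minimal sketch of that recipe (not code from any of these papers; the OpenAI SDK, model name, rubric, and 1-5 scale are illustrative placeholders):

```python
# Single-judge direct scoring, in the spirit of the starter papers above.
# Illustrative only: the model name, rubric, and 1-5 scale are placeholders.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the response below for factual
accuracy and helpfulness on a scale of 1 to 5. Reply with the score only.

[Question]
{question}

[Response]
{response}"""


def judge(question: str, response: str, model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    match = re.search(r"[1-5]", completion.choices[0].message.content)
    if match is None:
        raise ValueError("Judge did not return a parseable score")
    return int(match.group())
```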

---

## 🎭 Multi-Judge

- [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)
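
The panel-of-judges idea above replaces one large judge with several smaller, independent judges whose scores are pooled. A rough sketch under simplifying assumptions (the paper draws judges from different model families; a single provider and an average-score pooling rule are used here only to keep the example short):

```python
# Panel-of-judges sketch: several independent judges, scores pooled by averaging.
# The panel below and the pooling rule are placeholders, not the paper's exact recipe.
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

PANEL = ["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo"]  # placeholder judge panel


def score_once(model: str, question: str, response: str) -> int:
    """One judge's 1-5 correctness rating for a single response."""
    prompt = (
        "Rate the response to the question below for correctness on a 1-5 scale. "
        f"Reply with the number only.\n\nQuestion: {question}\n\nResponse: {response}"
    )
    out = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5]", out.choices[0].message.content)
    if match is None:
        raise ValueError(f"{model} did not return a parseable score")
    return int(match.group())


def panel_score(question: str, response: str) -> float:
    """Pool the panel's independent ratings into a single score."""
    return mean(score_once(m, question, response) for m in PANEL)
```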

### 🤔 Debate

- [ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate](https://arxiv.org/abs/2401.16788)
- [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
- [Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate](https://arxiv.org/abs/2305.13160)
- [Debating with More Persuasive LLMs Leads to More Truthful Answers](https://arxiv.org/abs/2402.06782)

---

## 🎯 Finetuned Models

- [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)

### 🌀 Hallucination

- [HALU-J: Critique-Based Hallucination Judge](https://arxiv.org/abs/2407.12943)
- [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https://aclanthology.org/2024.emnlp-main.499/)
- [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

### 🏆 Generative Reward Models

- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https://arxiv.org/abs/2408.15240)
- [Critique-out-Loud Reward Models](https://arxiv.org/abs/2408.11791)

---

## 🛡️ Safety

### 🛑 Content Moderation

- [A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)](https://arxiv.org/pdf/2402.10260)
- [OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)](https://arxiv.org/abs/2405.20947)
- [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495)
- [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674)

### 🔍 Scalable Oversight

- [On Scalable Oversight with Weak LLMs Judging Strong LLMs](https://arxiv.org/abs/2407.04622)
- [Debate Helps Supervise Unreliable Experts](https://arxiv.org/abs/2311.08702)
- [Great Models Think Alike and this Undermines AI Oversight](https://arxiv.org/abs/2502.04313)
- [LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)

---

## 👨‍⚖️ Judging the Judges: Meta-Evaluation

- [JudgeBench: A Benchmark for Evaluating LLM-based Judges](https://arxiv.org/abs/2410.12784)
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
- [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/abs/2310.07641)
- [Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge](https://arxiv.org/abs/2407.19594)
- [From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge](https://arxiv.org/abs/2411.16594)
- [Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences](https://arxiv.org/abs/2404.12272)
- [Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference](https://arxiv.org/abs/2501.00560)
- [The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs](https://arxiv.org/abs/2501.10970)
- [ReIFE: Re-evaluating Instruction-Following Evaluation](https://arxiv.org/abs/2410.07069)

### ⚖️ Biases

- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)
- [Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions](https://arxiv.org/abs/2308.11483)
- [Large Language Models are Inconsistent and Biased Evaluators](https://arxiv.org/abs/2405.01724)
- [Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges](https://arxiv.org/abs/2406.12624)
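
Much of the inconsistency documented above comes down to position bias in pairwise comparisons: the judge favors whichever answer appears first. One common mitigation, studied in "Large Language Models are not Fair Evaluators", is to query the judge in both orders and only accept verdicts that survive the swap. A hypothetical sketch (`pairwise_judge` stands in for any judge call that returns "A" or "B"):

```python
# Swap-and-compare check for position bias in pairwise judging. `pairwise_judge`
# is a placeholder for any judge call that sees (question, answer shown as A,
# answer shown as B) and returns "A" or "B".
from typing import Callable, Optional


def debiased_verdict(
    pairwise_judge: Callable[[str, str, str], str],
    question: str,
    answer_1: str,
    answer_2: str,
) -> Optional[str]:
    first = pairwise_judge(question, answer_1, answer_2)   # answer_1 shown in position A
    second = pairwise_judge(question, answer_2, answer_1)  # answer_1 shown in position B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None  # verdict flipped with the order: treat as a tie / position-biased
```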

---

## 🤖 Agents

_🚧 Coming soon. Stay tuned!_

---

## ✨ Contributing

Have a paper to add? Found a mistake? 🧐

- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [[email protected]](mailto:[email protected]).