# ⚖️ **Awesome LLM Judges** ⚖️
_This repo curates recent research on LLM Judges for automated evaluation._
> [!TIP]
> ⚖️ Check out [Verdict](https://verdict.haizelabs.com) — our in-house library for hassle-free implementations of the papers below!

---
## 📚 Table of Contents
- [🌱 Starter](#-starter)
- [🎭 Multi-Judge](#-multi-judge)
  - [🤔 Debate](#-debate)
- [🎯 Finetuned Models](#-finetuned-models)
  - [🌀 Hallucination](#-hallucination)
  - [🏆 Generative Reward Models](#-generative-reward-models)
- [🛡️ Safety](#️-safety)
  - [🛑 Content Moderation](#-content-moderation)
  - [🔍 Scalable Oversight](#-scalable-oversight)
- [👨‍⚖️ Judging the Judges: Meta-Evaluation](#-judging-the-judges-meta-evaluation)
  - [⚖️ Biases](#-biases)
- [🤖 Agents](#-agents)
- [✨ Contributing](#-contributing)

---
## 🌱 Starter
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Benchmarking Foundation Models with Language-Model-as-an-Examiner](https://arxiv.org/abs/2306.04181)
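
The papers above share one core pattern: prompt a capable model to grade or compare outputs against a rubric. Below is a minimal sketch of that pattern, assuming the `openai` Python client; the judge model, prompt wording, and `judge_pairwise` helper are illustrative, not taken from any of the listed papers.

```python
# Minimal LLM-as-a-judge sketch: pairwise comparison with a rubric prompt.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
question below and decide which is more helpful, accurate, and concise.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Answer with exactly one of: "A", "B", or "tie"."""


def judge_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge's verdict ("A", "B", or "tie") for a pair of responses."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        temperature=0,   # deterministic verdicts keep evaluations reproducible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return completion.choices[0].message.content.strip()
```

---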
## 🎭 Multi-Judge
- [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)
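
The jury idea above swaps a single large judge for votes from several smaller, diverse models. A minimal sketch of majority voting over independent judges follows; the judge callables (e.g. the `judge_pairwise` sketch from the Starter section, bound to different models) are illustrative assumptions.

```python
# Panel-of-judges sketch: aggregate verdicts from several judge models by
# majority vote. Each entry in `judges` is any callable that returns a label;
# the names below are placeholders, not APIs from the papers.
from collections import Counter
from typing import Callable, List

Judge = Callable[[str, str, str], str]  # (question, answer_a, answer_b) -> label


def jury_verdict(judges: List[Judge], question: str, answer_a: str, answer_b: str) -> str:
    """Collect one verdict per judge and return the most common label."""
    votes = Counter(judge(question, answer_a, answer_b) for judge in judges)
    return votes.most_common(1)[0][0]
```
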
### 🤔 Debate
- [ScaleEval: Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate](https://arxiv.org/abs/2401.16788)
- [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
- [Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate](https://arxiv.org/abs/2305.13160)
- [Debating with More Persuasive LLMs Leads to More Truthful Answers](https://arxiv.org/abs/2402.06782)

---
## 🎯 Finetuned Models
- [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)
- [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
- [JudgeLM: Fine-tuned Large Language Models are Scalable Judges](https://arxiv.org/abs/2310.17631)

### 🌀 Hallucination
- [HALU-J: Critique-Based Hallucination Judge](https://arxiv.org/abs/2407.12943)
- [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https://aclanthology.org/2024.emnlp-main.499/)
- [Lynx: An Open Source Hallucination Evaluation Model](https://arxiv.org/abs/2407.08488)

### 🏆 Generative Reward Models
- [Generative Verifiers: Reward Modeling as Next-Token Prediction](https://arxiv.org/abs/2408.15240)
- [Critique-out-Loud Reward Models](https://arxiv.org/abs/2408.11791)

---
## 🛡️ Safety
### 🛑 Content Moderation
- [A STRONGREJECT for Empty Jailbreaks (Sections C.4 & C.5)](https://arxiv.org/pdf/2402.10260)
- [OR-Bench: An Over-Refusal Benchmark for Large Language Models (Sections A.3 & A.11)](https://arxiv.org/abs/2405.20947)
- [WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs](https://arxiv.org/abs/2406.18495)
- [Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674)

### 🔍 Scalable Oversight
- [On Scalable Oversight with Weak LLMs Judging Strong LLMs](https://arxiv.org/abs/2407.04622)
- [Debate Helps Supervise Unreliable Experts](https://arxiv.org/abs/2311.08702)
- [Great Models Think Alike and this Undermines AI Oversight](https://arxiv.org/abs/2502.04313)
- [LLM Critics Help Catch LLM Bugs](https://arxiv.org/abs/2407.00215)

---
## 👨‍⚖️ Judging the Judges: Meta-Evaluation
- [JudgeBench: A Benchmark for Evaluating LLM-based Judges](https://arxiv.org/abs/2410.12784)
- [RewardBench: Evaluating Reward Models for Language Modeling](https://arxiv.org/abs/2403.13787)
- [Evaluating Large Language Models at Evaluating Instruction Following](https://arxiv.org/abs/2310.07641)
- [Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge](https://arxiv.org/abs/2407.19594)
- [From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge](https://arxiv.org/abs/2411.16594)
- [Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences](https://arxiv.org/abs/2404.12272)
- [Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference](https://arxiv.org/abs/2501.00560)
- [The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs](https://arxiv.org/abs/2501.10970)
- [ReIFE: Re-evaluating Instruction-Following Evaluation](https://arxiv.org/abs/2410.07069)

### ⚖️ Biases
- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)
- [Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions](https://arxiv.org/abs/2308.11483)
- [Large Language Models are Inconsistent and Biased Evaluators](https://arxiv.org/abs/2405.01724)
- [Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges](https://arxiv.org/abs/2406.12624)
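
Several of these papers document position bias: judges tend to favor whichever response appears first. A common mitigation, sketched below under the assumption of the `judge_pairwise` helper from the Starter section, is to query the judge in both orders and only keep verdicts that agree.

```python
# Position-bias check sketch: run the pairwise judge with both response orders
# and keep only consistent verdicts. `judge_pairwise` is the illustrative
# helper defined in the Starter section, not an API from the listed papers.
def debiased_verdict(question: str, answer_a: str, answer_b: str) -> str:
    forward = judge_pairwise(question, answer_a, answer_b)   # A shown first
    backward = judge_pairwise(question, answer_b, answer_a)  # B shown first
    # Map the swapped-order label back to the original A/B naming.
    flipped = {"A": "B", "B": "A", "tie": "tie"}.get(backward, "tie")
    return forward if forward == flipped else "tie"          # disagree -> tie
```

---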
## 🤖 Agents
_🚧 Coming Soon -- Stay tuned!_
---
## ✨ Contributing
Have a paper to add? Found a mistake? 🧐
- Open a pull request or submit an issue! Contributions are welcome. 🙌
- Questions? Reach out to [[email protected]](mailto:[email protected]).