Awesome-LLMs-Evaluation-Papers
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers
Related Surveys for LLMs Evaluation
Papers
:books: Knowledge and Capability Evaluation
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - iitd/jeebench)]
- [Paper
- [Paper
- [Paper
- [Paper - zentroa/This-is-not-a-Dataset)][[Source](https://huggingface.co/datasets/HiTZ/This-is-not-a-dataset)]
- [Paper - school-math)]
- [Paper
- [Paper - deepmind/narrativeqa)]
- [Paper
- [Paper
- [Paper - research-datasets/natural-questions)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - solvers)]
- [Paper
- [Paper
- [Paper - research-datasets/timedial)]
- [Paper
- [Paper
- [Paper
- [Paper - Semantic-Plausibility-NAACL18/)]
- [Paper
- [Paper
- [Paper
- [Paper - evaluation)]
- [Paper
- [Paper
- [Paper - mll/multiNLI/)]
- [Paper
- [Paper - dataset)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - dataset)]
- [Paper
- [Paper
- [Paper - research/google-research/tree/master/logic_inference_dataset)]
- [Paper - LILY/FOLIO)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - school-math)]
- [Paper
- [Paper - iitd/jeebench)]
- [Paper - llm)]
- [Paper
- [Paper
- [Paper
- [Paper - Song793/RestGPT)]
- [Paper - can.github.io/)]
- [Paper
- [Paper - nlp/WebShop)] [[Source](https://webshop-pnlp.github.io/)]
- [Paper
- [Paper
- [Paper - chen/ToolQA)]
- [Paper - pytorch)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - llm-benchmark)]
- [Paper
- [Paper
- [Paper
- [Paper - rlhf-pytorch)]
- [Paper
- [Paper - research/google-research/tree/master/code_as_policies)]
- [Paper
- [Paper
- [Paper - zh15/NeQA)]
- [Paper
- [Paper
- [Paper
- [Paper - zentroa/This-is-not-a-Dataset)][[Source](https://huggingface.co/datasets/HiTZ/This-is-not-a-dataset)]
- [Paper - inf.mpg.de/)]
- [Paper
- [Paper
:earth_americas: Evaluation Organization
- [Paper - lab/M3KE)]
- [Paper
- [GitHub
- [GitHub - blue)
- [Paper - sys/FastChat/tree/main/fastchat/llm_judge)]
- [Paper - lab/M3KE)]
- [Paper - bench)]
- [Paper - sys/FastChat/tree/main/fastchat/llm_judge)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - li/CMMLU)]
- [Paper
- [Paper - NLP-SG/M3Exam)]
- [Paper - Eval)]
- [Paper - crfm/helm)]
- [Paper - Fathom/GPT-Fathom)]
- [Paper - lab/InstructEvalImpact)] [[GitHub](https://github.com/declare-lab/instruct-eval)]
- [Paper - Lab/CLEVA)]
- [Paper - NLP-SG/M3Exam)]
- [Paper - lab/InstructEvalImpact)] [[GitHub](https://github.com/declare-lab/instruct-eval)]
- [Paper
- [Paper - bench)]
:triangular_ruler: Alignment Evaluation
- [Paper - NLP/mic)]
- [Paper - dialog)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - dataset)]
- [Paper - zh15/FactCCX)]
- [Paper - LILY/SummEval)]
- [Paper
- [Paper
- [Paper
- [Paper - NLP/factool)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - NLP/FaithDial)]
- [Paper
- [Paper
- [GitHub
- [Paper - Eval/FairEval)]
- [Paper
- [Paper
- [Paper
- [Paper - chemistry-101)]
- [Paper
- [Paper
- [Paper
- [Paper - NLP/mic)]
- [Paper
- [Paper - dialog)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - Understanding-Gender-Bias-in-Neural-Relation-Extraction)]
- [Paper
- [Paper
- [Paper
- [Paper - salt/implicit-hate)]
- [Paper - Generated-Hate-Speech-Dataset)]
- [Paper - data)]
- [Paper - zhou/CDial-Bias)]
- [Paper - PM)]
- [Paper - rottger/hatecheck-data)]
- [Paper
- [Paper - mll/crows-pairs)] [[Source](https://huggingface.co/datasets/crows_pairs)]
- [Paper
- [Paper - science/bold)] [[Source](https://huggingface.co/datasets/AlexaAI/bold)]
- [Paper
- [Paper
- [Paper
- [Paper - mll/bbq)]
- [Paper
- [Paper - LIT/MultilingualBias)]
- [Paper
- [Paper
- [Paper
- [Paper - biases)]
- [Paper
- [Paper
- [Paper
- [Paper - br)]
- [Paper
- [Paper
- [Paper - NLP/chain-of-thought-bias)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - squared)]
- [Paper - dataset)]
- [Paper
- [Paper - research-datasets/xsum_hallucination_annotations)]
- [Paper - zh15/FactCCX)]
- [Paper - LILY/SummEval)]
- [Paper
- [Paper
- [Paper
- [Paper - datasets)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - three/fib)]
- [Paper - NLP/factool)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - NLP/FaithDial)]
- [Paper - LieDetector)]
- [Paper - Instruction)]
- [Paper - lab/HallusionBench)]
- [Paper
- [Paper - coai/Safety-Prompts)] [[Source](http://coai.cs.tsinghua.edu.cn/leaderboard/)]
- [Paper
- [GitHub
- [Paper
- [Paper
- [Paper - Eval/FairEval)]
- [Paper
- [Paper
- [Paper
- [Paper - instruct)]
- [Paper - PM)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - SA.html)]
- [Paper - Understanding-Gender-Bias-in-Neural-Relation-Extraction)]
- [Paper - salt/implicit-hate)]
- [Paper - Generated-Hate-Speech-Dataset)]
- [Paper - data)]
- [Paper - zhou/CDial-Bias)]
- [Paper
- [Paper - ml-bias-analysis)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
:closed_lock_with_key: Safety Evaluation
- [Paper - formatted-problems)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - science/recode)]
- [Paper - formatted-problems)]
- [Paper - math)]
- [Paper
- [Paper
- [Paper - Waver-In-Judgements)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - bench)] [[Source](https://xingyaoww.github.io/mint-bench/)]
- [Paper
- [Paper
- [Paper
- [Paper
:syringe::woman_judge::computer::moneybag: Specialized LLMs Evaluation
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - reasoning)]
- [Paper
- [Paper
- [Paper
- [Paper - teacher-test)]
- [Paper - shot-teacher-feedback)]
- [Paper - LLM-Studies/#/)]
- [Paper
- [Paper - passes-the-bar)]
- [Paper
- [Paper - takes-the-bar-exam)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - statutes)]
- [Paper - compass/LawBench)]
- [Paper - CompEval-Legal)]
- [Paper - LMs)]
- [Paper
- [Paper
- [Paper - nlp/swe-bench)] [[Source](https://swe-bench.github.io/)]
- [Paper - benchmark.github.io/)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper - nlp/intercode)] [[Source](https://intercode-benchmark.github.io/)]
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
- [Paper
LLM Leaderboards