Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-foundation-model-leaderboards
A curated list of awesome leaderboard-oriented resources for foundation models
https://github.com/SAILResearch/awesome-foundation-model-leaderboards
Last synced: about 4 hours ago
JSON representation
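Because this index is served by an open API, the JSON representation linked above can be fetched programmatically. The sketch below is a minimal example only: the endpoint path and the response field names are assumptions inferred from the "JSON representation" link on this page, not documented API details, so verify them against the ecosyste.ms API documentation before relying on them.

```python
# Minimal sketch: fetch this list's JSON representation from the ecosyste.ms
# Awesome service. The URL below is an assumption inferred from the
# "JSON representation" link on this page, not a documented endpoint.
import requests

LIST_URL = (
    "https://awesome.ecosyste.ms/lists/"
    "SAILResearch%2Fawesome-foundation-model-leaderboards.json"  # assumed path
)

response = requests.get(LIST_URL, timeout=30)
response.raise_for_status()
data = response.json()

# Field names are assumptions; inspect `data` to see what the API actually returns.
for field in ("name", "description", "url", "last_synced_at"):
    print(f"{field}: {data.get(field)}")
```

If the endpoint differs, the same pattern applies: request the page's JSON view, then read the list metadata and project entries from the returned document.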
Model Ranking
Comprehensive
- SuperCLUE
- FlagEval
- GenAI Arena
- Generative AI Leaderboards - Ranks the top-performing generative AI models based on various metrics.
- Vals AI - Benchmarks LLMs on real-world legal tasks.
- Vellum LLM Leaderboard - Compares leading commercial and open-source LLMs.
- MEGA-Bench - MEGA-Bench is a benchmark for multimodal evaluation with diverse tasks across 8 application types, 7 input formats, 6 output formats, and 10 multimodal skills, spanning single-image, multi-image, and video tasks.
Text
- ACLUE
- African Languages LLM Eval Leaderboard
- AgentBoard - A benchmark for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates.
- AGIEval - A human-centric benchmark to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- Aiera Leaderboard - Evaluates LLMs on financial tasks, including transcript-based Q&A and financial sentiment tagging.
- AIR-Bench - AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models.
- AI Energy Score Leaderboard
- ai-benchmarks - ai-benchmarks contains a handful of evaluation results for the response latency of popular AI services.
- AlignBench - A multi-dimensional benchmark for evaluating LLMs' alignment in Chinese.
- ANGO - A generation-oriented Chinese language model evaluation benchmark.
- Arabic Tokenizers Leaderboard
- Arena-Hard-Auto - Arena-Hard-Auto is a benchmark for instruction-tuned LLMs.
- AutoRace
- Auto Arena - Pits LLM agents against each other in peer-battles to evaluate their performance.
- Auto-J - Auto-J hosts evaluation results on the pairwise response comparison and critique generation tasks.
- BABILong
- BBL - BBL (BIG-bench Lite) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench.
- BeHonest - A benchmark that assesses three essential aspects of honesty in LLMs: awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency).
- BenBench
- BenCzechMark
- CyberMetric
- CzechBench
- BiGGen-Bench - BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks.
- BotChat - Evaluates the multi-round chatting capabilities of LLMs through a proxy task.
- CaselawQA
- CFLUE
- Ch3Ef - A benchmark of human-annotated samples across 12 domains and 46 tasks, based on the hhh (helpful, honest, harmless) principle.
- Chain-of-Thought Hub - Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs.
- Chatbot Arena
- ChemBench
- CLEM Leaderboard - Evaluates chat-optimized LLMs as conversational agents.
- CLEVA
- Chinese Large Model Leaderboard
- CMB - A multi-level medical benchmark in Chinese.
- CMMLU
- CMMMU - A benchmark of tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
- CommonGen
- Compression Rate Leaderboard
- CopyBench
- CoTaEval
- ConvRe
- CriticEval
- CS-Bench - CS-Bench is a bilingual benchmark designed to evaluate LLMs' performance across 26 computer science subfields, focusing on knowledge and reasoning.
- C-Eval - C-Eval is a Chinese evaluation suite for LLMs.
- CUTE
- Decentralized Arena Leaderboard - Evaluates LLMs across user-defined dimensions, including mathematics, logic, and science.
- DecodingTrust
- Domain LLM Leaderboard - Ranks domain-specific LLMs.
- Enterprise Scenarios leaderboard - Evaluates LLMs on real-world enterprise use cases.
- EQ-Bench - EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs.
- European LLM Leaderboard
- EvalGPT.ai
- Eval Arena - Provides example-level analysis and pairwise comparisons of model evaluations.
- Factuality Leaderboard
- FanOutQA - A high-quality, multi-hop, multi-document benchmark for LLMs using English Wikipedia as its knowledge base.
- FastEval - Evaluates instruction-following and chat language models on various benchmarks with fast inference and detailed performance insights.
- FinEval
- Fine-tuning Leaderboard - Fine-tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks.
- Flames
- FollowBench - A multi-level, fine-grained constraints-following benchmark to evaluate the instruction-following capability of LLMs.
- Forbidden Question Dataset
- FuseReviews - Benchmarks grounded text-generation tasks, including long-form question-answering and summarization.
- GAIA
- GTBench - Evaluates LLMs' reasoning on game-theoretic tasks, e.g., board and card games.
- Guerra LLM AI Leaderboard
- Hallucinations Leaderboard
- HalluQA
- GAVIE - A GPT-4-assisted benchmark for evaluating hallucination in LMMs by scoring accuracy and relevancy without relying on human-annotated groundtruth.
- GPT-Fathom - GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
- GrailQA - A large-scale, high-quality benchmark for question answering on knowledge bases (KBQA) on Freebase, with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.).
- Hebrew LLM Leaderboard
- Hughes Hallucination Evaluation Model leaderboard
- Icelandic LLM leaderboard - Tracks and compares LLMs on Icelandic-language tasks.
- IFEval
- IL-TUR - IL-TUR is a benchmark for evaluating language models on monolingual and multilingual tasks focused on understanding and reasoning over Indian legal documents.
- Indic LLM Leaderboard
- InstructEval
- Italian LLM-Leaderboard - Italian LLM-Leaderboard tracks and compares LLMs in Italian-language tasks.
- Japanese Chatbot Arena
- Japanese Language Model Financial Evaluation Harness
- Japanese LLM Roleplay Benchmark
- JMED-LLM - JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models) is a benchmark for evaluating LLMs in the Japanese medical domain.
- JMMMU
- JustEval - Provides fine-grained evaluation of LLMs.
- KoLA
- LaMP
- Language Model Council
- LawBench
- La Leaderboard
- LogicKor
- LongICL Leaderboard - Tracks long in-context learning evaluations for LLMs.
- LooGLE
- LAiW
- Large Language Model Assessment in English Contexts
- Large Language Model Assessment in the Chinese Context
- LIBRA
- LibrAI-Eval GenAI Leaderboard - LibrAI-Eval GenAI Leaderboard focuses on the balance between the LLM's capability and safety in English.
- Chinese SimpleQA
Leaderboard Development
- Demo Leaderboard
- Demo Leaderboard Backend
- Leaderboard Explorer
- Open LLM Leaderboard Renamer - open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily.
- Open LLM Leaderboard Results PR Opener
- Open LLM Leaderboard Scraper
- Progress Tracker - Tracks the progress of open-source LLMs over time, as scored by the [LMSYS Chatbot Arena](https://lmarena.ai/?leaderboard).
- AIcrowd
- AI Hub - Hosts AI competitions for solving real-world problems, with a focus on innovation and collaboration.
- AI Studio - Hosts competitions for AI-driven tasks, allowing users to develop and showcase their AI skills.
- Allen Institute for AI
- Codabench - An open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains.
- DataFountain - A competition platform for data- and AI-related problems.
- Dynabench
- Eval AI
- Hilti - Hosts challenges on industry-relevant applications.
- InsightFace
- Kaggle
- Robust Reading Competition - Benchmarks the reading of text in real-world environments.
- Tianchi
- Kaggle Competition Creation
Keywords
llm (4), large-language-models (4), benchmark (3), evaluation (3), nlp (2), chatgpt (2), foundation-models (1), evaluation-metrics (1), multi-level (1), instruction-following (1), constraints (1), python (1), natural-language-processing (1), generative-ai (1), fact-checking (1), leakage-detection (1), dataset (1), benchmarks (1), honesty (1), long-context (1), acl2024 (1), instruct-tuning (1), question-answering (1), pytorch (1), vqa (1), vision-and-language (1), vision (1), vicuna (1), prompt-engineering (1), object-detection (1), multimodal (1), llava (1), llama (1), iclr2024 (1), iclr (1), hallucination (1), gpt-4 (1), gpt (1), alignment (1)