Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-foundation-model-leaderboards
A curated list of awesome leaderboard-oriented resources for foundation models
https://github.com/SAILResearch/awesome-foundation-model-leaderboards
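Since ecosyste.ms exposes its index through an open API, this list can also be consumed programmatically. The sketch below shows the general idea only; the endpoint path and the response fields (`name`, `repository_url`) are assumptions for illustration rather than documented routes, so check the service's API reference before relying on them.

```python
# Minimal sketch: fetch the JSON form of an indexed awesome list from ecosyste.ms.
# The URL and the field names used below are assumptions, not documented here.
import requests

# Hypothetical endpoint for the projects indexed under this awesome list.
URL = (
    "https://awesome.ecosyste.ms/api/v1/lists/"
    "awesome-foundation-model-leaderboards/projects"
)

response = requests.get(URL, timeout=30)
response.raise_for_status()
projects = response.json()  # assumed: a JSON array of project records

# Print a name and repository URL for each indexed project, if present.
for project in projects:
    print(project.get("name"), project.get("repository_url"))
```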
Model Ranking
Comprehensive
- SuperCLUE
- FlagEval
- GenAI Arena
- Generative AI Leaderboards - ranks the top-performing generative AI models based on various metrics.
- Vals AI - evaluates LLMs on real-world legal tasks.
- Vellum LLM Leaderboard - compares leading commercial and open-source LLMs.
Text
- ACLUE
- African Languages LLM Eval Leaderboard
- AgentBoard - AgentBoard is a benchmark for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates.
- AGIEval - AGIEval is a human-centric benchmark to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- Aiera Leaderboard - evaluates LLMs on financial tasks, including Q&A and financial sentiment tagging.
- AIR-Bench - AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models.
- AI Energy Score Leaderboard
- ai-benchmarks - ai-benchmarks contains a handful of evaluation results for the response latency of popular AI services.
- AlignBench - AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese.
- ANGO - ANGO is a Chinese language model evaluation benchmark.
- Arabic Tokenizers Leaderboard
- Arena-Hard-Auto - Arena-Hard-Auto is a benchmark for instruction-tuned LLMs.
- AutoRace
- Auto Arena - pits LLMs against each other in battles to evaluate their performance.
- Auto-J - Auto-J hosts evaluation results on the pairwise response comparison and critique generation tasks.
- BABILong
- BBL - BBL (BIG-bench Lite) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench.
- BeHonest - BeHonest is a benchmark to evaluate honesty in LLMs: awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency).
- BenBench
- BenCzechMark
- CyberMetric
- CzechBench
- BiGGen-Bench - BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks.
- BotChat - BotChat evaluates the multi-round chatting capabilities of LLMs through a proxy task.
- CaselawQA
- CFLUE
- Ch3Ef - Ch3Ef is a benchmark of human-annotated samples across 12 domains and 46 tasks based on the hhh (helpful, honest, harmless) principle.
- Chain-of-Thought Hub - Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs.
- Chatbot Arena
- ChemBench
- CLEM Leaderboard - CLEM is a framework for the systematic evaluation of chat-optimized LLMs as conversational agents.
- CLEVA
- Chinese Large Model Leaderboard
- CMB - CMB is a multi-level medical benchmark in Chinese.
- CMMLU
- CMMMU - CMMMU is a benchmark to evaluate college-level subject knowledge and deliberate reasoning in a Chinese context.
- CommonGen
- Compression Rate Leaderboard
- CopyBench
- CoTaEval
- ConvRe
- CriticEval
- CS-Bench - CS-Bench is a bilingual benchmark designed to evaluate LLMs' performance across 26 computer science subfields, focusing on knowledge and reasoning.
- C-Eval - C-Eval is a Chinese evaluation suite for LLMs.
- CUTE
- Decentralized Arena Leaderboard - evaluates LLMs across user-defined dimensions, including mathematics, logic, and science.
- DecodingTrust
- Domain LLM Leaderboard - ranks domain-specific LLMs.
- Enterprise Scenarios leaderboard - evaluates the performance of LLMs on real-world enterprise use cases.
- EQ-Bench - EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs.
- European LLM Leaderboard
- EvalGPT.ai
- Eval Arena - provides example-level analysis and pairwise comparisons for model evaluation.
- Factuality Leaderboard
- FanOutQA - FanOutQA is a multi-hop, multi-document benchmark for LLMs using English Wikipedia as its knowledge base.
- FastEval - FastEval evaluates instruction-following and chat language models on various benchmarks with fast inference and detailed performance insights.
- FinEval
- Fine-tuning Leaderboard - Fine-tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks.
- Flames
- FollowBench - FollowBench is a multi-level fine-grained constraints following benchmark to evaluate the instruction-following capability of LLMs.
- Forbidden Question Dataset
- FuseReviews - FuseReviews is a benchmark for long-form question-answering and summarization.
- GAIA
- GTBench - GTBench evaluates LLMs on game-theoretic tasks, e.g., board and card games.
- Guerra LLM AI Leaderboard
- Hallucinations Leaderboard
- HalluQA
- GAVIE - GAVIE is a GPT-4-assisted benchmark for evaluating hallucination in LMMs by scoring accuracy and relevancy without relying on human-annotated groundtruth.
- GPT-Fathom - GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
- GrailQA - GrailQA is a large-scale, high-quality benchmark for question answering on knowledge bases (KBQA) on Freebase, with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.).
- Hebrew LLM Leaderboard
- Hughes Hallucination Evaluation Model leaderboard
- Icelandic LLM leaderboard - evaluates LLMs on Icelandic-language tasks.
- IFEval
- IL-TUR - IL-TUR is a benchmark for evaluating language models on monolingual and multilingual tasks focused on understanding and reasoning over Indian legal documents.
- Indic LLM Leaderboard
- InstructEval
- Italian LLM-Leaderboard - Italian LLM-Leaderboard tracks and compares LLMs in Italian-language tasks.
- Japanese Chatbot Arena
- Japanese Language Model Financial Evaluation Harness
- Japanese LLM Roleplay Benchmark
- JMED-LLM - JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models) is a benchmark for evaluating LLMs in the Japanese medical field.
- JMMMU
- JustEval - JustEval provides fine-grained evaluation of LLMs.
- KoLA
- LaMP
- Language Model Council
- LawBench
- La Leaderboard
- LogicKor
- LongICL Leaderboard - provides long in-context learning evaluations for LLMs.
- LooGLE
- LAiW
- Large Language Model Assessment in English Contexts
- Large Language Model Assessment in the Chinese Context
- LIBRA
- LibrAI-Eval GenAI Leaderboard - LibrAI-Eval GenAI Leaderboard focuses on the balance between the LLM’s capability and safety in English.
- Demo Leaderboard
- Demo Leaderboard Backend
- Leaderboard Explorer
- Open LLM Leaderboard Renamer - open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily.
- Open LLM Leaderboard Results PR Opener
- Open LLM Leaderboard Scraper
- Progress Tracker - visualizes the progress of open-source LLMs over time as scored by the [LMSYS Chatbot Arena](https://lmarena.ai/?leaderboard); see the sketch just below for pulling such leaderboard data programmatically.
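Tools like Leaderboard Explorer and Open LLM Leaderboard Scraper automate what is, at its core, a fetch-and-parse loop over published leaderboard data. The sketch below illustrates that loop against the Open LLM Leaderboard's raw results; the dataset id and the JSON schema keys are assumptions for illustration and may not match the leaderboard's current layout.

```python
# Sketch of the fetch-and-parse loop such scraper tools automate.
# Assumptions: results live in a Hugging Face dataset repo named
# "open-llm-leaderboard/results" and each JSON file carries a "config" block;
# adjust both to the leaderboard's current published layout.
import json
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/results",  # assumed dataset id
    repo_type="dataset",
    allow_patterns=["*.json"],  # only pull the JSON result files
)

# Walk a few downloaded result files and print which model each one covers.
for path in sorted(Path(local_dir).rglob("*.json"))[:10]:
    with open(path) as f:
        data = json.load(f)
    model = data.get("config", {}).get("model_name", path.stem)  # assumed keys
    print(model)
```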
- AIcrowd
- AI Hub - hosts competitions addressing real-world problems, with a focus on innovation and collaboration.
- AI Studio - hosts competitions that allow users to develop and showcase their AI skills.
- Allen Institute for AI
- Codabench - Codabench is an open-source platform for benchmarking AI models, enabling customizable, user-driven challenges across various AI domains.
- DataFountain - hosts competitions on data- and AI-related problems.
- Dynabench
- Eval AI
- Hilti - hosts challenges on industry-relevant applications.
- InsightFace
- Kaggle - hosts data science and machine learning competitions; a usage sketch of its official API client follows this list.
- Robust Reading Competition - addresses the reading of text in real-world environments.
- Tianchi
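Most of these platforms offer programmatic access alongside their web UIs. As one concrete example, Kaggle ships an official Python client; the sketch below lists competitions matching a search term, assuming an API token is stored at `~/.kaggle/kaggle.json`. The search term is arbitrary, and the printed form of each competition object may vary between client versions.

```python
# Minimal sketch of programmatic access to a challenge platform, using the
# official `kaggle` package (pip install kaggle). Requires a Kaggle API token
# stored at ~/.kaggle/kaggle.json before authenticate() is called.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from the API token file

# List competitions matching an example search term; each item's string form
# identifies the competition.
for competition in api.competitions_list(search="llm"):
    print(competition)
```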
Keywords
llm (4), large-language-models (4), benchmark (3), evaluation (3), nlp (2), chatgpt (2), foundation-models (1), evaluation-metrics (1), multi-level (1), instruction-following (1), constraints (1), python (1), natural-language-processing (1), generative-ai (1), fact-checking (1), leakage-detection (1), dataset (1), benchmarks (1), honesty (1), long-context (1), acl2024 (1), instruct-tuning (1), question-answering (1), pytorch (1), vqa (1), vision-and-language (1), vision (1), vicuna (1), prompt-engineering (1), object-detection (1), multimodal (1), llava (1), llama (1), iclr2024 (1), iclr (1), hallucination (1), gpt-4 (1), gpt (1), alignment (1)