Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-llm-eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of foundation LLMs, aiming to explore the technical frontier of generative AI.
https://github.com/onejune2018/awesome-llm-eval
-
Datasets-or-Benchmark
-
General
- CRUXEval - GPT-4 with chain-of-thought (CoT) reaches pass@1 of 75% on input prediction and 81% on output prediction. The benchmark exposes the gap between open-source and closed-source models, and even GPT-4 does not fully solve CRUXEval, offering insight into its limitations and directions for improvement; see the pass@k sketch at the end of this list (2024-01-05) |
- LLMBar - 10-11) |
- Zhujiu
- InfoQ evaluation
- LMSYS leaderboard
- MT-bench - MT-bench tests multi-turn conversation and instruction-following ability with 80 high-quality multi-turn questions covering common use cases, emphasizing challenging questions that differentiate models. Its user prompts span 8 common categories: writing, roleplay, extraction, reasoning, math, coding, and more |
- MultiNLI
- HellaSwag
- LAMBADA
- OpenAI Moderation API
- GLUE Benchmark
- C_Eval - Measures the performance of GPT-4, ChatGPT, Claude, LLaMA, Moss, and other models. |
- AGIEval
- Safety Eval - safety evaluation for large models
- JioNLP LLM evaluation dataset
- KoLA - The Knowledge-oriented LLM Assessment benchmark (KoLA), hosted by the Knowledge Engineering Group at Tsinghua University (THU-KEG), carefully tests the world knowledge of LLMs through deliberate design of its data, ability taxonomy, and evaluation metrics |
- TrustLLM
- RewardBench - Benchmark for evaluating reward models; [Code](https://github.com/allenai/reward-bench) and [Dataset](https://hf.co/datasets/allenai/reward-bench) (2024-03-20) |
- LLM-Uncertainty-Bench - 01-22)|
- CommonGen-Eval - Evaluation of LLMs on the CommonGen-lite dataset, using GPT-4 as the evaluator; compares model performance and reports leaderboard results (2024-01-04) |
- felm - Provides fine-grained annotation at the segment level, which includes reference links, identified error types, and the reasons behind these errors as provided by the annotators. (2023-10-03) |
- just-eval - 12-05) |
- EQ-Bench - 12-20) |
- MLAgentBench - 10-05) |
- UltraEval - 11-24) |
- HalluQA - A Chinese hallucination QA benchmark: 69 questions in the hard split and 206 in the knowledge split, with on average 2.8 annotated correct and incorrect answers per question. To make HalluQA easy to use, the authors designed an evaluation method with GPT-4 as the judge: the hallucination criteria and the reference correct answers are passed to GPT-4 as instructions, and GPT-4 decides whether a model's reply contains hallucination; a judge-style sketch appears at the end of this list (2023-11-08) |
- BAMBOO - 10-11) |
- Do-Not-Answer - achieves results comparable to GPT-4 |
- ChatEval
- Z-Bench
- zeno-build
- CMMLU
- llm-benchmark
- SQUAD
- LogiQA
- CoQA
- ParlAI
- LIT
- Adversarial NLI (ANLI)
- XieZhi - A comprehensive domain-knowledge evaluation benchmark. For multiple-choice questions, Xiezhi covers 220,000 unique questions spanning 516 diverse disciplines across 13 subject areas and four difficulty levels. The authors also propose Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions, and use the benchmark to evaluate 47 cutting-edge LLMs |
- MMCU
- Gaokao
- GAOKAO-Bench - An evaluation framework that uses Chinese college entrance exam (Gaokao) questions as its dataset to measure the language understanding and logical reasoning abilities of large models |
- SuperCLUE
- BIG-Bench-Hard - A suite of challenging BIG-Bench tasks, termed BIG-Bench Hard (BBH): the tasks on which prior language model evaluations failed to outperform the average human rater |
- BIG-bench
- M3Exam
- Instruction_Following_Eval - 11-15) |
- DyVal - A dynamic evaluation protocol tested on LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, highlighting the importance of dynamic evaluation. The authors also analyze the failure cases and results of different prompting methods. Moreover, DyVal-generated samples serve not only as evaluation sets but also help fine-tuning, improving LLM performance on existing benchmarks (2024-04-20) |
- FMTI - 10-18) |
- TRACE - 10-05) |
- LLMEval²
- LucyEval
- MMLU
- LVEval - LV-Eval is a long-context benchmark with five length tiers (16k, 32k, 64k, 128k, and 256k) and a maximum test length of 256k tokens. Its average text length is 102,380 words, with minimum/maximum lengths of 11,896/387,406 words. LV-Eval has two task types, single-hop QA and multi-hop QA, across 11 bilingual (Chinese/English) evaluation subsets. Its design introduces three key techniques: Confusing Facts Insertion (CFI) to raise difficulty, Keyword and Phrase Replacement (KPR) to reduce information leakage, and an Answer Keywords (AK) metric, keyword recall combined with a word blacklist, to make scores more objective (2024-02-06) |
- AlignBench - Uses LLM-as-Judge combined with chain-of-thought to produce multi-dimensional analyses of model replies and a final overall score, improving the reliability and interpretability of the evaluation (2023-12-01) |
- Psychometrics Eval - 10-19) |
- MMLU-Pro - MMLU-Pro is an improved version of the MMLU dataset. MMLU has long been the reference multiple-choice knowledge benchmark, but recent research shows it is both noisy (some questions are unanswerable) and too easy (through improving model capability and increasing contamination). MMLU-Pro presents models with ten choices instead of four, requires reasoning on a larger share of questions, and was expert-reviewed to reduce noise, making it higher quality and harder than the original. MMLU-Pro also reduces the sensitivity of model scores to prompt variations, a common problem with MMLU. Models using chain-of-thought reasoning perform better on it, suggesting MMLU-Pro is better suited to evaluating nuanced reasoning ability (2024-05-20) |
- ColossalEval
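
The pass@1 numbers quoted for CRUXEval above are instances of the pass@k metric. Below is a minimal sketch of the standard unbiased estimator (the combinatorial form popularized by the HumanEval paper), assuming you have already run each problem's tests over n samples and counted c passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 8 passed -> pass@1 = 0.8
print(pass_at_k(n=10, c=8, k=1))
```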
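Several benchmarks above (MT-bench, HalluQA, AlignBench) grade replies with GPT-4 as the judge. A minimal sketch of that protocol under stated assumptions: it uses the `openai` v1 Python client, and the rubric wording is illustrative, not the actual prompt of any of these benchmarks:

```python
from openai import OpenAI  # assumes the openai>=1.0 client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a strict evaluator. Criteria for hallucination: {criteria}
Reference answers: {references}
Model reply: {reply}
Answer with exactly one word, "hallucinated" or "faithful"."""

def judge(reply: str, criteria: str, references: list[str]) -> str:
    # One judgment call per reply; benchmarks average many of these.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            criteria=criteria, references="; ".join(references), reply=reply)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```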
-
Vertical Domains
- PsyEval - 11-15) |
- RGB - A benchmark for evaluating Retrieval-Augmented Generation (RAG), analyzing the performance of different large language models on the 4 fundamental abilities RAG requires: noise robustness, negative rejection, information integration, and counterfactual robustness. It establishes the bilingual (Chinese and English) Retrieval-Augmented Generation Benchmark (RGB), split into 4 separate test sets by required ability (2023-09-04) |
- OpsEval - 10-02) |
- BLURB
- Fin-Eva - Fin-Eva Version 1.0 covers wealth management, insurance, investment research and other financial scenarios, as well as financial academic subjects, with 13k+ evaluation questions in total. Ant Group's data sources include business-domain data and public internet data, processed through de-identification, text clustering, corpus screening, and data rewriting, then reviewed by financial-domain experts. Shanghai University of Finance and Economics' data is mainly based on the knowledge-outline requirements of real and mock questions from authoritative exams in related fields. The Ant Group portion covers five ability categories (financial cognition, financial knowledge, financial logic, content generation, and safety compliance) across 33 sub-dimensions with 8,445 questions; the SUFE portion covers four domains (finance, economics, accounting, and certification) with 4,661 questions across 34 subjects. Fin-Eva Version 1.0 consists entirely of single-choice questions with fixed answers, with instructions prompting the model to output a standard format (2023-12-20) |
- GenMedicalEval - A comprehensive medical evaluation benchmark; shows unique advantages compared with GPT-4 and other models (2023-12-08) |
- DebugBench - A debugging benchmark built by using GPT-4 to implant bugs into source data, with strict quality checks (2024-01-09) |
- LAiW - 10-25)|
- LawBench - 09-28) |
- PPTC - Proposes the PPTX-Match evaluation system, which checks whether a large language model completed an instruction based on the predicted file rather than on a labeled API sequence, so it supports API sequences generated by any LLM. Current PPT generation shows three shortcomings: error accumulation in multi-turn sessions, handling of long PPT templates, and multi-modal perception (2023-11-04) |
- LLMRec - 10-08)|
- SWE-bench - SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub: given a codebase and an issue, the model must generate a patch that resolves the described problem |
- GSM8K - Grade-school math word problems whose solutions chain basic arithmetic operations (+ - / *) to reach the final answer; typically scored by extracting the final number, as in the sketch below |
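
GSM8K solutions in the dataset end with a `#### <number>` line, so a common (though not the only) scoring recipe is to extract the last number from the model's output and compare it with that gold value. A minimal sketch, assuming plain-text model outputs:

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, commas stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, gold_solution: str) -> bool:
    # Gold GSM8K solutions end with "#### <answer>".
    gold = gold_solution.split("####")[-1].strip().replace(",", "")
    return extract_final_number(model_output) == gold

print(is_correct("So 3 * 4 + 6 = 18. The answer is 18.", "... #### 18"))  # True
```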
-
RAG (Retrieval-Augmented Generation) Evaluation
- BERGEN - augmented GENeration), a library to benchmark RAG systems, focusing on question-answering (QA). Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in a RAG pipeline. BERGEN was designed to ease the reproducibility and integration of new datasets and models thanks to HuggingFace. (2024-05-31) |
- raga-llm-hub - A comprehensive evaluation toolkit for language and learning models (LLMs). With over 100 carefully designed metrics, it is among the most comprehensive platforms for developers and organizations to evaluate and compare LLMs effectively and to establish essential guardrails for LLM and Retrieval-Augmented Generation (RAG) applications. The tests cover relevance and understanding, content quality, hallucination, safety and bias, context relevance, guardrails, and vulnerability scanning, along with a suite of metric-based tests for quantitative analysis (2024-03-10) |
- ARES - Evaluates RAG systems over query-document-answer triples. The ARES training pipeline has three steps: (1) generate synthetic queries and answers from in-domain passages; (2) prepare LLM judges for scoring RAG systems by fine-tuning on the synthetically generated training data; (3) deploy the prepared LLM judges to evaluate your RAG system on key performance metrics; a toy scoring loop appears after this list (2023-09-27) |
- RGB - 09-04) |
- CRAG - answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. (2024-06-07) |
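
As referenced in the ARES entry, a toy end-to-end scoring loop for a RAG system might look like the sketch below; `rag_answer` is a hypothetical stand-in for your pipeline, and keyword recall (in the spirit of answer-keyword metrics) is just one of many metrics these benchmarks combine:

```python
def keyword_recall(answer: str, gold_keywords: list[str]) -> float:
    """Fraction of gold answer keywords that appear in the generated answer."""
    answer = answer.lower()
    hits = sum(1 for kw in gold_keywords if kw.lower() in answer)
    return hits / len(gold_keywords) if gold_keywords else 0.0

def evaluate_rag(rag_answer, dataset: list[dict]) -> float:
    """dataset items: {"question": str, "keywords": [str, ...]}.
    `rag_answer(question)` is your retrieval-augmented pipeline (hypothetical)."""
    scores = [keyword_recall(rag_answer(ex["question"]), ex["keywords"])
              for ex in dataset]
    return sum(scores) / len(scores)
```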
-
Agent Capabilities
-
Code Capabilities
- McEval - Open-source models still lag well behind GPT-4 in multilingual programming ability, and most cannot even surpass GPT-3.5; the tests also show that open models such as Codestral, DeepSeek-Coder, CodeQwen, and their derivatives exhibit strong multilingual ability. McEval is a massively multilingual code benchmark covering 40 programming languages with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation evaluation tasks with the finely curated massively multilingual instruction corpus McEval-Instruct; a toy execution harness appears after this list. The McEval leaderboard can be found [here](https://mceval.github.io/). (2024-06-11) |
- HumanEval-XL - A large-scale multilingual code generation benchmark. HumanEval-XL connects 23 natural languages with 12 programming languages and contains 22,080 prompts, each with 8.33 test cases on average. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL provides a comprehensive evaluation platform for multilingual LLMs, allowing assessment of understanding across different NLs, a pioneering step toward filling the gap in evaluating NL generalization for multilingual code generation (2024-02-26) |
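
Code benchmarks such as McEval and HumanEval-XL score generated programs by executing them against test cases. A toy Python-only harness under stated assumptions (no sandboxing or timeouts, which real harnesses require; each problem is assumed to carry `assert`-style test snippets):

```python
def run_candidate(code: str, tests: list[str]) -> bool:
    """Execute a candidate solution, then its tests; any exception = fail.
    Real harnesses run this in a sandboxed subprocess with timeouts."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # define the candidate function(s)
        for test in tests:
            exec(test, namespace)      # e.g. "assert add(2, 3) == 5"
        return True
    except Exception:
        return False

passed = run_candidate("def add(a, b):\n    return a + b",
                       ["assert add(2, 3) == 5"])
print(passed)  # True
```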
-
Multimodal / Cross-modal
- ChartVLM - 02-19) |
- ReForm-Eval - A benchmark for comprehensive evaluation of large vision-language models. ReForm-Eval re-formulates existing multi-modal benchmarks with diverse task formats into a unified form suitable for evaluating large models. It spans 8 evaluation dimensions with ample data for each (over 4,000 items per dimension on average); uses unified question forms (multiple-choice and text generation); is easy to use, with reliable and efficient evaluation that does not depend on external services such as ChatGPT; and efficiently reuses existing data resources, requiring no extra manual annotation and extensible to more datasets (2023-10-24) |
- LVLM-eHub - "Multi-Modality Arena" is an evaluation platform for large multi-modal models. Following FastChat, two anonymous models are compared side by side on visual question answering; Multi-Modality Arena lets you benchmark vision-language models side by side while providing image inputs, supporting MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and more |
-
Long Context
- InfiniteBench - Evaluates leading models such as GPT-4 and Claude 2 on long context. It mixes real and synthetic scenarios: real-world data probes models on practical problems, while synthetic data makes it convenient to extend test context lengths (a needle-style sketch follows). InfiniteBench is the first LLM benchmark featuring an average data length surpassing 100K tokens. It comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks are designed to require a good understanding of long dependencies in context, so that simply retrieving a limited number of passages is not sufficient (2024-03-19) |
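
The synthetic side of long-context suites like InfiniteBench often plants a key fact at a controlled depth inside filler text and checks whether the model can still retrieve it. A minimal sketch; `ask_model` is a hypothetical callable, and token counts are approximated by word counts:

```python
def build_context(fact: str, depth: float, target_tokens: int) -> str:
    """Plant `fact` at relative `depth` (0.0 = start, 1.0 = end) in filler."""
    filler_unit = "The sky was clear and the day was uneventful. "
    n_units = target_tokens // len(filler_unit.split())
    units = [filler_unit] * n_units
    units.insert(int(depth * n_units), fact + " ")
    return "".join(units)

def needle_test(ask_model, fact: str, question: str, answer: str) -> bool:
    context = build_context(fact, depth=0.5, target_tokens=100_000)
    reply = ask_model(context + "\n\n" + question)  # hypothetical model call
    return answer.lower() in reply.lower()
```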
-
Inference Speed
- llmperf - 11-03)|
- llm-inference-benchmark
- llm-inference-bench
- GPU-Benchmarks-on-LLM-Inference - inch M1 Max MacBook Pro, M2 Ultra Mac Studio, 14-inch M3 MacBook Pro and 16-inch M3 Max MacBook Pro. |
- llm-analysis
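
The tools in this section measure serving metrics such as time-to-first-token (TTFT) and decode throughput. A minimal sketch of both measurements, assuming a hypothetical streaming `generate(prompt)` that yields tokens one at a time:

```python
import time

def benchmark(generate, prompt: str) -> dict:
    """Measure TTFT and decode throughput for one request."""
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in generate(prompt):
        if first is None:
            first = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    return {
        "ttft_s": (first - start) if first is not None else None,
        # Tokens after the first one, divided by pure decode time.
        "decode_tok_per_s": ((n_tokens - 1) / (end - first)
                             if n_tokens > 1 else None),
        "total_s": end - start,
    }
```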
-
Quantization & Compression
- LLM-QBench - A benchmark towards the best practice for post-training quantization of large language models; also an efficient LLM compression tool implementing various advanced compression methods and supporting multiple inference backends (see the sketch below) (2024-05-09) |
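
Post-training quantization, the subject of LLM-QBench, maps float weights to low-bit integers plus a scale factor. A minimal symmetric per-tensor int8 sketch (real toolkits use per-channel or per-group scales, calibration data, and more sophisticated rounding):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # worst-case quantization error
```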
-
-
Demos
-
Quantization & Compression
-
Multimodal / Cross-modal
-
LLM Inference
-
-
LLM-List
-
Pre-trained-LLM
- api - 08 | [Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf) |
- api - 05 | [Paper](https://arxiv.org/pdf/2205.01068.pdf) |
- ckpt - 05 | [Paper](https://arxiv.org/pdf/2205.05131v1.pdf) |
- ckpt - 04 | [Paper](https://arxiv.org/pdf/2104.12369.pdf) |
- ckpt - 10 | [Paper](https://jmlr.org/papers/v21/20-074.html) |
- api - 10 | [Paper](https://arxiv.org/pdf/2012.00413.pdf) |
- ckpt - 09 | [Github](https://github.com/BlinkDL/RWKV-LM) |
- ckpt - 10 | [Paper](https://arxiv.org/pdf/2210.02414.pdf) |
- ckpt - 06 | [Blog](https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6) |
- ckpt - 04 | [Paper](https://arxiv.org/pdf/2204.06745.pdf) |
- api - 05 | [Paper](https://arxiv.org/pdf/2005.14165.pdf) |
-
Instruction-finetuned-LLM
- api - 03 | [Paper](https://arxiv.org/pdf/2203.02155.pdf) |
- ckpt - 11| [Paper](https://arxiv.org/pdf/2211.09085.pdf)|
- ckpt - 03 | [Blog](https://www.yitay.net/blog/flan-ul2-20b)|
- ckpt - 10|[Paper](https://arxiv.org/pdf/2210.11416.pdf)|
- ckpt - 10|[Paper](https://arxiv.org/pdf/2110.08207.pdf)|
- demo - 03|[Github](https://github.com/tatsu-lab/stanford_alpaca)|
- ckpt - 3 |-|
-
Aligned-LLM
-
Open-LLM
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. [Alpaca.cpp](https://github.com/antimatter15/alpaca.cpp) [Alpaca-LoRA](https://github.com/tloen/alpaca-lora)
- Koala - A Dialogue Model for Academic Research
- StackLLaMA - A hands-on guide to train LLaMA with RLHF.
- T5 - Text-to-Text Transfer Transformer
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
- YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
- Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, MindSpore Team and Peng Cheng Laboratory.
- HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with [LoRA](https://github.com/microsoft/LoRA). It uses 100k dialogs generated by letting ChatGPT chat with itself.
- Cabrita - A portuguese finetuned instruction LLaMA.
- Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
- Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
- Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
- GPTQ-for-LLaMA - 4 bits quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323).
- GPT4All - Demo, data, and code to train open-source assistant-style large language model based on GPT-J and LLaMa.
- BELLE - Be Everyone's Large Language model Engine
- RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - ChatGLM-6B is an open bilingual (Chinese-English) dialogue language model based on the [General Language Model (GLM)](https://github.com/THUDM/GLM) architecture, with 6.2 billion parameters.
- ChatGLM2-6B - The second-generation version of the open bilingual dialogue model ChatGLM-6B; it retains the smooth conversation and low deployment threshold of the first generation while introducing longer context, better performance, and more efficient inference.
- RWKV - Parallelizable RNN with Transformer-level LLM Performance.
- ChatRWKV - ChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model.
- GPT-Neo - An implementation of model & data parallel [GPT3](https://arxiv.org/abs/2005.14165)-like models using the [mesh-tensorflow](https://github.com/tensorflow/mesh) library.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale
- OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
- h2oGPT
- Open-Assistant - a project meant to give everyone access to a great chat based large language model.
- BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model [BLOOM-LoRA](https://github.com/linhduongtuan/BLOOM-LORA)
-
-
Others
-
Popular-LLM
- Evaluating Language Models by OpenAI, DeepMind, Google, Microsoft.
- Efficient Finetuning of Quantized LLMs - Low-resource quantized training/deployment for large language models: aims to build and open-source instruction-following fine-tuning recipes for Chinese baichuan/LLaMA/Pythia/GLM models that can train on a single Nvidia RTX-2080Ti, while a multi-turn chatbot can be trained with 2048-token context on a single Nvidia RTX-3090. Uses bitsandbytes for quantization, integrated with Hugging Face's PEFT and transformers libraries (a configuration sketch follows this list).
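
The quantized fine-tuning recipe above combines bitsandbytes 4-bit loading with Hugging Face PEFT. A minimal configuration sketch; the base model name and LoRA hyperparameters are placeholders, not the repository's actual settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan-7B",       # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```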
-
-
Leaderboards
-
Quantization & Compression
-
Multimodal / Cross-modal
-
Leaderboards for popular providers (performance and cost, 2024-05-14)
- Providers compared: [OpenAI](https://openai.com/pricing), [Google](https://cloud.google.com/vertex-ai/generative-ai/pricing), [Replicate](https://replicate.com/pricing), [DeepSeek](https://www.deepseek.com/), [Mistral](https://docs.mistral.ai/platform/pricing/), [Anthropic](https://www.anthropic.com/api), [Cohere](https://cohere.com/command), [Groq](https://wow.groq.com/), and Microsoft.
- Models compared: Llama 3 70B, [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat), [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), Claude 3 Sonnet, Gemini 1.5 Flash, Mistral Large, [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus), Claude 3 Haiku, Mistral Small, [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), GPT-3.5 Turbo, Llama 3 70B (Groq), GPT-4, Mistral Medium, Claude 2.0, Mixtral 8x7B (Groq), Claude 2.1, Claude Instant, [Phi-3-Medium 4k](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct), [Phi-3-Small 8k](https://huggingface.co/microsoft/Phi-3-small-8k-instruct), and Mistral 7B.
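
Cost comparisons like this are usually normalized to price per million tokens, with separate input and output rates. A small helper with purely illustrative prices (consult each provider's pricing page for current rates):

```python
def request_cost(in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in USD for one request, given per-million-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6

# Hypothetical rates, for illustration only ($ per 1M tokens):
print(request_cost(2_000, 500, in_price_per_m=5.0, out_price_per_m=15.0))
```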
-
-
Papers
-
Multimodal / Cross-modal
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/ACL-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**GLGE: A New General Language Generation Evaluation Benchmark**](https://aclanthology.org/2021.findings-acl.36),<br> by *Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.*
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> **Mathematical Capabilities of ChatGPT**,<br> by *Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner*
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Exploring the Limits of ChatGPT for Query/Aspect-based Text Summarization**](https://doi.org/10.48550/arXiv.2302.08081),<br> by *Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng*
- <img src=https://img.shields.io/badge/CoRR-2022-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/COLING-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Are Visual-Linguistic Models Commonsense Knowledge Bases?**](https://aclanthology.org/2022.coling-1.491),<br> by *Hsiu-Yu Yang and Carina Silberer*
- <img src=https://img.shields.io/badge/CoRR-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> **Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective**
- <img src=https://img.shields.io/badge/EMNLP-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> **GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models**
- <img src=https://img.shields.io/badge/EMNLP-2022-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/ACL_Findings-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Do Language Models Perform Generalizable Commonsense Inference?**](https://aclanthology.org/2021.findings-acl.322), [<img src=https://img.shields.io/badge/Code-skyblue alt="img" style="zoom:100%; vertical-align: middle" />](https://github.com/wangpf3/LM-for-CommonsenseInference)<br> by *Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren*
- <img src=https://img.shields.io/badge/EMNLP-2021-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> **On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective**
-
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> - tuned
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2022-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2020-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2020-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/ICLR-2020-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/EMNLP-2023-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**GPTEval: NLG Evaluation using GPT-4 with Better Human Alignment**](https://www.microsoft.com/en-us/research/publication/gpteval-nlg-evaluation-using-gpt-4-with-better-human-alignment/),<br> by *Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu*
- <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Is ChatGPT a General-Purpose Natural Language Processing Task Solver?**](https://arxiv.org/abs/2302.06476),<br> by *Qin, Chengwei, Zhang, Aston, Zhang, Zhuosheng, Chen, Jiaao, Yasunaga, Michihiro and Yang, Diyi*
- <img src=https://img.shields.io/badge/CoRR-2021-blue alt="img" style="zoom:100%; vertical-align: middle" />
- <img src=https://img.shields.io/badge/CoRR-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> - Trained Models for User Feedback Analysis in Software
-
-
LLMOps
-
Popular-LLM
- TrueFoundry - Deploy LLMOps tools on your own cloud or on-prem infrastructure, including vector DBs, embedding servers, and more; deploy, fine-tune, and track prompts for open-source LLM models with complete data security and optimal GPU management. Train and launch your LLM applications at production scale with best software-engineering practices. |
- Weights & Biases (Prompts)
- Byzer-LLM - Byzer-LLM is large-model infrastructure supporting pretraining, fine-tuning, deployment, serving, and other LLM-related capabilities. Byzer-Retrieval is storage infrastructure built for large models, supporting batch import from various data sources, real-time single-record updates, and full-text, vector, and hybrid retrieval for Byzer-LLM to consume. Byzer-SQL/Python provides an easy-to-use human-computer interaction API that greatly lowers the barrier to using the above. |
- agenta - An LLMOps platform for building robust LLM apps: easily try and evaluate different prompts, models, and workflows. |
- Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models. |
- BudgetML
- deeplake
- Dify
- Dstack
- GPTCache
- Haystack - Quickly build applications with LLM agents, semantic search, question answering, and more. |
- LangKit
- LLMApp - LLM App is a Python library that helps you build real-time LLM-enabled data pipelines in a few lines of code. |
- LLMFlows - LLMFlows is a framework for building simple, explicit, and transparent LLM applications such as chatbots, question-answering systems, and agents. |
- magentic
- Pezzo 🕹️
- prompttools
- xTuring
- ZenML - Open-source framework for orchestrating, experimenting with, and deploying production-grade ML solutions, with built-in `langchain` and `llama_index` integrations. |
-
-
Courses
-
Popular-LLM
- Full+Stack+LLM+Bootcamp - A collection of LLM learning and application resources.
- Large Language Model Course - Course with a roadmap and notebooks to get into Large Language Models (LLMs).
-
-
Other-Awesome-Lists
-
Popular-LLM
- Awesome-llm-tools - Curated list of useful LLM tools.
- Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
- Awesome-Efficient-LLM - A curated list for Efficient Large Language Models.
- Awesome-production-machine-learning - A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning.
- Awesome-LLM-Compression - A curated list for Efficient LLM Compression.
- Awesome-Multimodal-Large-Language-Models - A curated list of Multimodal Large Language Models.
- Awesome-LLMOps - An awesome & curated list of the best LLMOps tools for developers.
- Awesome-MLops - An awesome list of references for MLOps - Machine Learning Operations.
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Chain-of-Thoughts Papers - A trend that starts from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
- Instruction-Tuning-Papers - A trend that starts from `Natural-Instruction` (ACL 2022), `FLAN` (ICLR 2022) and `T0` (ICLR 2022).
- LLM Reading List - A paper & resource list of large language models.
- Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3 - a collection of demos and articles about the [OpenAI GPT-3 API](https://openai.com/blog/openai-api/).
-
-
Licenses
-
Popular-LLM
- MIT License
- Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
-
-
Tools
- OpenCompass
- lighteval - 02-08) |
- alpaca_eval - An LLM evaluation tool from tatsu-lab that can test across multiple languages, domains, and tasks, reporting metrics such as interpretability, robustness, and trustworthiness. |
- Arthur Bench - 10-06) |
- llm-benchmarker-suite - This open-source effort aims to address fragmentation and ambiguity in LLM benchmarking. The suite provides a structured methodology, a collection of diverse benchmarks, and toolkits to streamline assessing LLM performance. By offering a common platform, this project seeks to promote collaboration, transparency, and quality research in NLP. (2023-07-06) |
- EVAL
- lm-evaluation - Evaluation suite from the Jurassic-1 Technical [Paper](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1), with current support for running the tasks through both the AI21 Studio API and OpenAI's GPT3 API. |
- phasellm
- LLMZoo
- HELM
- auto-evaluator - auto-evaluator is a lightweight tool from Langchain for evaluating question-answering systems: it automatically generates questions and answers and computes metrics such as accuracy, recall, and F1. |
- PandaLM
- CONNER
- lm-evaluation-harness - lm-evaluation-harness is a tool from EleutherAI for evaluating large language models (LLMs), testing model performance and generalization across many tasks and datasets; a usage sketch appears after this list. |
- LeaderboardFinder - 04-02) |
- LLM Comparator - 02-16) |
- prometheus-eval - An open evaluator language model whose judgments closely track GPT-4's. It handles both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, PROMETHEUS 2 achieves the highest correlation and agreement with human and proprietary LM judges among all tested open evaluator LMs (2024-05-04) |
- athina-ai - athina-ai is an open-source library offering plug-and-play preset evals and a modular, extensible framework for writing and running evaluations, helping engineers systematically improve the reliability and performance of their LLMs through eval-driven development. It overcomes the limitations of traditional workflows, enabling rapid experimentation and customizable evaluators with consistent metrics |
- autoevals - Model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs. |
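
Most of these tools are driven from a short script or CLI. As one example, a sketch of invoking lm-evaluation-harness from Python, assuming a 0.4.x version where `lm_eval.simple_evaluate` is exposed (check the project README for the current API):

```python
import lm_eval  # pip install lm-eval

# Evaluate a Hugging Face model on one task; arguments follow the
# harness's "hf" model interface (assumed, per the 0.4.x docs).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])
```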
-
News
- 2023/08/03 - 7B).
- 2023/10/20
-
Frameworks-for-Training
-
Popular-LLM
- Accelerate - 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
- Apache MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
- Caffe - A fast open framework for deep learning.
- ColossalAI - An integrated large-scale model training system with efficient parallelization techniques.
- DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
- Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
- Kedro - Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.
- Keras - Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.
- LightGBM - A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
- MegEngine - MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.
- metric-learn - Metric Learning Algorithms in Python.
- MindSpore - MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
- Oneflow - OneFlow is a performance-centered and open-source deep learning framework.
- PaddlePaddle - Machine Learning Framework from Industrial Practice.
- PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
- XGBoost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library.
- scikit-learn - Machine Learning in Python.
- TensorFlow - An Open Source Machine Learning Framework for Everyone.
- VectorFlow - A minimalist neural network library optimized for sparse data and single machine environments.
-
-
Citation
-
Popular-LLM
-
Keywords
llm (31), machine-learning (29), chatgpt (23), large-language-models (22), deep-learning (20), python (19), language-model (15), evaluation (13), ai (12), llmops (11), pytorch (9), mlops (9), gpt-3 (8), llms (8), gpt (8), benchmark (8), gpt-4 (8), chatbot (7), openai (7), nlp (7), prompt-engineering (6), transformers (6), data-science (6), llama (6), llm-evaluation (6), rag (5), ml (5), tensorflow (5), instruction-following (5), rlhf (5), neural-network (4), evaluation-framework (4), langchain (4), in-context-learning (4), gpu (4), foundation-models (4), natural-language-processing (4), chinese (4), gpt-2 (3), alpaca (3), gpt4 (3), llm-eval (3), generative-ai (3), prompt (3), llm-inference (3), distributed (3), inference (3), vector-database (3), vector-search (3), leaderboard (3)