Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-llm-eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of LLMs, aimed at exploring the technical boundaries of generative AI.
https://github.com/onejune2018/awesome-llm-eval

  • 2024/02/08
  • 2024/01/15 - Add [LAiW](https://github.com/Dai-shen/LAiW).
  • 2023/11/15 - Add [LLMBar](https://github.com/princeton-nlp/LLMBar) for the evaluation of the instruction-following ability of LLMs.
  • 2023/10/20
  • 2023/09/25 - AI.
  • 2023/09/20 - Add [FinEval](https://github.com/SUFE-AIFLM-Lab/FinEval) and [SuperCLUE-Safety](https://github.com/CLUEbenchmark/SuperCLUE-Safety) from CLUEbenchmark.
  • 2023/09/18
  • 2023/08/03 - 7B).
  • 2023/06/28
  • prometheus-eval - PROMETHEUS 2 is an open-source evaluator language model that closely mirrors human and GPT-4 judgments. It handles both direct assessment and pairwise ranking formats, combined with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, PROMETHEUS 2 achieves the highest correlation and agreement with human and proprietary LM judges among all tested open evaluator LMs. (2024-05-04)
  • athina-ai - athina-ai is an open-source library providing plug-and-play preset evals and a modular, extensible framework for writing and running evaluations, helping engineers systematically improve their LLMs' reliability and performance through eval-driven development. It overcomes the limitations of traditional workflows, enabling rapid experimentation and customizable evaluators with consistent metrics.
  • LeaderboardFinder - A Hugging Face space that helps you find the leaderboard best suited to your use case. (2024-04-02)
  • lighteval - A lightweight LLM evaluation suite from Hugging Face, used internally together with the data-processing library datatrove and the training library nanotron. (2024-02-08)
  • LLM Comparator - An interactive visualization tool from Google for side-by-side evaluation of LLM responses, helping analyze when and why one model performs better or worse than a baseline. (2024-02-16)
  • Arthur Bench - An open-source evaluation tool from Arthur AI for comparing LLMs, prompts, and hyperparameters for generative text models. (2023-10-06)
  • llm-benchmarker-suite - This open-source effort aims to address fragmentation and ambiguity in LLM benchmarking. The suite provides a structured methodology, a collection of diverse benchmarks, and toolkits to streamline assessing LLM performance. By offering a common platform, this project seeks to promote collaboration, transparency, and quality research in NLP. (2023-07-06)
  • autoevals - AutoEvals is a tool for quickly and easily evaluating AI model outputs. It uses model-graded evaluation for a variety of subjective tasks including fact checking, safety, and more. Many of these evaluations are adapted from OpenAI's excellent evals project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs. See the sketch below.
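A minimal sketch of autoevals' documented usage pattern (assumes `pip install autoevals` and an `OPENAI_API_KEY` in the environment; the example strings are illustrative):

```python
# Model-graded factuality check with autoevals.
from autoevals.llm import Factuality

evaluator = Factuality()
result = evaluator(
    output="People's Republic of China",                # answer being judged
    expected="China",                                   # reference answer
    input="Which country has the highest population?",  # original question
)
print(result.score)  # judge score in [0, 1]
```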
  • EVAL
  • lm-evaluation-harness - lm-evaluation-harness is a tool developed by EleutherAI for evaluating large language models (LLMs); it tests model performance and generalization across many different tasks and datasets. A minimal usage sketch follows.
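A minimal sketch using the harness's Python entry point (v0.4+ exposes `lm_eval.simple_evaluate`; the model and task names below are just examples):

```python
# Evaluate a small HuggingFace model on two tasks with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])  # per-task metrics such as acc / acc_norm
```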
  • lm-evaluation - An evaluation suite from AI21 Labs accompanying the Jurassic-1 Technical [Paper](https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1), with current support for running the tasks through both the AI21 Studio API and OpenAI's GPT3 API.
  • OpenCompass
  • phasellm
  • LLMZoo
  • HELM
  • auto-evaluator - auto-evaluator is a lightweight tool from Langchain for evaluating question-answering systems; it automatically generates question-answer pairs and computes metrics such as accuracy, recall, and F1 (see the illustrative metric sketch below).
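For reference, the token-overlap F1 commonly used to grade QA answers looks like this; an illustrative sketch only, not code from the tool itself:

```python
# SQuAD-style token-level F1 between a predicted and a reference answer.
def qa_f1(prediction: str, reference: str) -> float:
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("the cat sat", "a cat sat"))  # ~0.667
```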
  • PandaLM
  • FlagEval
  • alpaca_eval - A tool from tatsu-lab for evaluating LLMs; it supports testing across multiple languages, domains, and tasks, and provides metrics such as interpretability, robustness, and trustworthiness.
  • CONNER
  • TrustLLM
  • DyVal - A dynamic evaluation protocol for LLMs. The authors evaluate a range of models from Flan-T5-large to GPT-3.5-Turbo and GPT-4; experiments show that LLMs perform worse on DyVal-generated evaluation samples of varying complexity, highlighting the importance of dynamic evaluation. The authors also analyze the failure cases and results of different prompting methods. DyVal-generated samples are not only evaluation sets but also useful for fine-tuning, improving LLM performance on existing benchmarks. (2024-04-20)
  • RewardBench - A benchmark for evaluating reward models: [Leaderboard](https://hf.co/spaces/allenai/reward-bench), [Code](https://github.com/allenai/reward-bench) and [Dataset](https://hf.co/datasets/allenai/reward-bench) (2024-03-20)
  • LVEval - LV-Eval is a long-context benchmark with five length tiers (16k, 32k, 64k, 128k, and 256k), reaching a maximum test length of 256k. Its average text length is 102,380 words, with minimum/maximum lengths of 11,896/387,406 words. LV-Eval has two main task types, single-hop QA and multi-hop QA, across 11 evaluation subsets in Chinese and English. Its design introduces three key techniques: Confusing Facts Insertion (CFI) to raise difficulty, Keyword and Phrase Replacement (KPR) to reduce information leakage, and a keyword-recall-based metric (Answer Keywords, AK, combining answer keywords with a word blacklist) to make scores more objective (an illustrative sketch of this idea follows). (2024-02-06)
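An illustrative sketch of the keyword-recall idea behind the AK metric (LV-Eval's actual scoring differs in detail; the keyword and blacklist values are made up):

```python
# Score a response by recall of gold answer keywords, ignoring blacklisted words.
def answer_keyword_recall(response: str, keywords: list[str],
                          blacklist: set[str]) -> float:
    keywords = [k for k in keywords if k not in blacklist]
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in response)
    return hits / len(keywords)

print(answer_keyword_recall("The treaty was signed in Paris in 1856.",
                            keywords=["Paris", "1856"], blacklist={"the"}))  # 1.0
```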
  • LLM-Uncertainty-Bench - Benchmarking LLMs via uncertainty quantification: complements accuracy with an uncertainty measure derived from conformal prediction across multiple NLP tasks. (2024-01-22)
  • Psychometrics Eval - Evaluating general-purpose AI systems with psychometrics rather than task-specific benchmarks. (2023-10-19)
  • CommonGen-Eval - A study evaluating LLMs with the CommonGen-lite dataset, using GPT-4 for the evaluation, comparing the performance of different models and reporting the results on a leaderboard. (2024-01-04)
  • felm - A benchmark for factuality evaluation of text generated by LLMs, with fine-grained annotation at the segment level, including reference links, identified error types, and the reasons behind these errors as provided by the annotators. (2023-10-03)
  • just-eval - A GPT-based evaluation tool providing multi-aspect, explainable assessments of LLM outputs. (2023-12-05)
  • EQ-Bench - A benchmark for emotional intelligence in LLMs: models rate the intensity of emotional states of characters in dialogues. (2023-12-20)
  • CRUXEval - A benchmark for code reasoning, understanding, and execution. The best configuration, GPT-4 with chain-of-thought (CoT), achieves pass@1 of 75% on input prediction and 81% on output prediction. The benchmark exposes the gap between open-source and closed-source models, and GPT-4's failure to fully pass CRUXEval offers insights into its limitations and directions for improvement (the standard pass@k estimator is sketched below). (2024-01-05)
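The pass@k numbers reported by code benchmarks like CRUXEval are usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021):

```python
# pass@k: probability that at least one of k samples (out of n, c correct) passes.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3; for k=1 this reduces to c/n
```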
  • MLAgentBench - A suite of end-to-end machine-learning research tasks for benchmarking AI agents that perform ML experimentation. (2023-10-05)
  • AlignBench - A multi-dimensional benchmark for evaluating the alignment of Chinese LLMs. It uses rule-calibrated LLM-as-Judge scoring combined with chain-of-thought to produce multi-dimensional analyses and a final overall score, improving the reliability and interpretability of the evaluation. (2023-12-01)
  • UltraEval - A lightweight, comprehensive, open-source framework for LLM evaluation from OpenBMB. (2023-11-24)
  • Instruction_Following_Eval - IFEval: a set of verifiable instructions (e.g., "write more than 400 words") enabling automatic, objective evaluation of instruction following. (2023-11-15)
  • LLMBar - A meta-evaluation benchmark testing whether LLM evaluators can detect instruction-following outputs; each instance pairs an instruction-following output with a superficially appealing one that deviates from the instruction. (2023-10-11)
  • HalluQA - A Chinese hallucination benchmark, with 69 questions in the misleading-hard split and 206 in the knowledge split; each question has on average 2.8 annotated correct and incorrect answers. To make HalluQA easy to use, the authors designed an evaluation method with GPT-4 as the judge: the hallucination criteria and the reference correct answers are given to GPT-4 as instructions, and GPT-4 judges whether a model's reply contains hallucinations. (2023-11-08)
  • llmperf - A tool for benchmarking the performance of LLM APIs, e.g., time to first token and token throughput. (2023-11-03)
  • FMTI - The Foundation Model Transparency Index, scoring major foundation-model developers on 100 transparency indicators. (2023-10-18)
  • BAMBOO - A comprehensive benchmark for evaluating the long-text modeling capacity of LLMs. (2023-10-11)
  • TRACE - A benchmark for continual learning in LLMs, with challenging datasets spanning domain-specific, multilingual, code, and math reasoning tasks. (2023-10-05)
  • ColossalEval
  • LLMEval²
  • Do-Not-Answer - An open dataset for evaluating LLM safety mechanisms at low cost, consisting of prompts that responsible models should decline to answer; a fine-tuned 6B evaluator achieves results comparable to GPT-4.
  • LucyEval
  • Zhujiu
  • ChatEval
  • InfoQ评测 (InfoQ LLM evaluation)
  • COT评估 (CoT evaluation)
  • Z-Bench
  • zeno-build
  • lmsys排名榜 (LMSYS Chatbot Arena leaderboard)
  • HF开源LLM排行榜 (Hugging Face Open LLM Leaderboard)
  • AlpacaEval - An automatic evaluator for instruction-following models: GPT-4, Claude, or ChatGPT acts as the annotator and compares model outputs against reference Davinci003 responses, producing the win rates shown on the leaderboard (a toy win-rate computation is sketched below).
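A toy sketch of the win-rate computation described above (the `judge` callable stands in for the GPT-4/Claude/ChatGPT auto-annotator and is hypothetical):

```python
# Mean preference of a judge for candidate outputs over reference outputs.
def win_rate(examples, judge):
    # examples: iterable of (instruction, candidate_output, reference_output)
    prefs = [judge(inst, cand, ref)  # 1.0 win, 0.5 tie, 0.0 loss
             for inst, cand, ref in examples]
    return sum(prefs) / len(prefs)
```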
  • llm-benchmark
  • Leaderboard
  • SQuAD
  • MultiNLI
  • LogiQA
  • HellaSwag
  • LAMBADA
  • CoQA
  • ParlAI
  • LIT
  • Adversarial NLI (ANLI)
  • OpenAI Evals
  • EleutherAI LM Eval
  • OpenAI Moderation API
  • GLUE Benchmark
  • MT-bench - MT-bench is designed to test multi-turn conversation and instruction-following ability, with 80 high-quality multi-turn questions covering common use cases and focusing on challenging questions that differentiate models. It spans 8 common categories of user prompts: writing, roleplay, extraction, reasoning, math, coding, and more (an LLM-as-judge grading sketch follows).
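A hedged sketch of MT-bench-style single-answer grading with an LLM judge (the real judge prompts live in the FastChat repo; this prompt is paraphrased and the model name is an example):

```python
# Ask a judge model to grade an answer on a 1-10 scale, MT-bench style.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_answer(question: str, answer: str) -> str:
    prompt = (
        "[Instruction] Rate the assistant's response to the user question on a "
        "scale of 1-10, briefly explain, and output the score as [[N]].\n"
        f"[Question] {question}\n[Answer] {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```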
  • XieZhi - A new comprehensive benchmark for domain-knowledge evaluation: multiple-choice questions spanning 516 diverse disciplines across 13 subject areas and four difficulty levels, with 220,000 unique questions in total. The authors also propose Xiezhi-Specialty and Xiezhi-Interdiscipline, each with 15k questions, and evaluate 47 cutting-edge LLMs on Xiezhi.
  • C_Eval - A comprehensive Chinese evaluation suite for foundation models, reporting the performance of GPT-4, ChatGPT, Claude, LLaMA, Moss, and other models.
  • AGIEval
  • MMCU
  • CMMLU
  • MMLU
  • Gaokao
  • GAOKAO-Bench - GAOKAO-Bench is an evaluation framework that uses Chinese college entrance exam (Gaokao) questions as its dataset to assess large models' language understanding and logical reasoning abilities.
  • Safety Eval - LLM safety evaluation
  • SuperCLUE
  • BIG-Bench-Hard - A suite of 23 challenging BIG-Bench tasks, termed BIG-Bench Hard (BBH), on which prior language-model evaluations failed to beat the average human rater.
  • BIG-bench
  • JioNLP-LLM评测数据集 (JioNLP LLM evaluation dataset)
  • promptbench
  • KoLA - The Knowledge-oriented LLM Assessment benchmark (KoLA), hosted by the Knowledge Engineering Group of Tsinghua University (THU-KEG), aims to carefully test the world knowledge of LLMs through deliberate design of the data, the ability taxonomy, and the evaluation metrics.
  • M3Exam
  • Fin-Eva - Fin-Eva Version 1.0, jointly released by Ant Group and Shanghai University of Finance and Economics (SUFE), covers multiple financial scenarios (wealth management, insurance, investment research) and financial academic subjects, with 13k+ evaluation questions in total. Ant Group's data sources include business-domain data and public internet data, processed through de-identification, text clustering, corpus filtering, and data rewriting, then reviewed by financial-domain experts. The SUFE portion is mainly built from real and mock questions of authoritative exams in the relevant fields, following their knowledge syllabi. The Ant portion covers five capability categories (financial cognition, financial knowledge, financial logic, content generation, and safety compliance) with 33 sub-dimensions and 8,445 questions; the SUFE portion covers four areas (finance, economics, accounting, and certification) with 4,661 questions across 34 subjects. Fin-Eva Version 1.0 uses only single-choice questions with fixed answers, with instructions prompting models to output a standard format. (2023-12-20)
  • GenMedicalEval - A comprehensive medical LLM evaluation benchmark, with distinctive advantages over GPT-4 and other models. (2023-12-08)
  • DebugBench - An LLM debugging benchmark; bugs were implanted into the source data with GPT-4, under strict quality checks. (2024-01-09)
  • OpenFinData - An open benchmark of financial evaluation data drawn from real business scenarios. (2024-01-04)
  • LAiW - A benchmark for evaluating LLMs on Chinese legal tasks, organizing legal capabilities into three levels, from basic legal NLP to complex legal application. (2023-10-25)
  • LawBench - A benchmark for precisely assessing the legal capabilities of LLMs under the Chinese civil-law system, with 20 tasks covering three cognitive levels: legal knowledge memorization, understanding, and application. (2023-09-28)
  • PsyEval - A suite of mental-health-related tasks for evaluating LLMs. (2023-11-15)
  • PPTC - A benchmark for PowerPoint task completion. Its PPTX-Match evaluation system judges whether an LLM has completed the instruction based on the predicted file rather than on a labeled API sequence, so it supports API sequences generated by any LLM. Three shortcomings of current PPT generation emerge: error accumulation across multi-turn sessions, handling of long PPT templates, and multimodal perception. (2023-11-04)
  • RGB - An evaluation benchmark for retrieval-augmented generation (RAG): it analyzes the performance of different LLMs on the four basic abilities RAG requires (noise robustness, negative rejection, information integration, and counterfactual robustness), establishing the Retrieval-Augmented Generation Benchmark (RGB) in Chinese and English, divided into four separate test sets by ability. (2023-09-04)
  • LLMRec - A benchmark for evaluating LLMs on recommendation tasks. (2023-10-08)
  • OpsEval - A comprehensive, task-oriented AIOps benchmark for evaluating LLMs on IT-operations tasks. (2023-10-02)
  • SWE-bench - SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
  • BLURB
  • GSM8K - Grade-school math word problems; solutions involve a sequence of elementary arithmetic operations (+, -, /, *) to reach the final answer (an answer-extraction sketch follows).
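GSM8K reference solutions end with a final line of the form `#### <number>`, so exact-match evaluation typically compares just that extracted number; a minimal sketch:

```python
# Extract the final "#### <number>" answer from a GSM8K-style solution.
import re

def extract_gsm8k_answer(solution: str) -> str | None:
    match = re.search(r"####\s*(-?[\d,.]+)", solution)
    return match.group(1).replace(",", "") if match else None

gold = "She has 3 + 4 = 7 apples.\n#### 7"
assert extract_gsm8k_answer(gold) == "7"
```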
  • CRAG - Comprehensive RAG Benchmark: a factual question-answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. (2024-06-07)
  • raga-llm-hub - A comprehensive evaluation toolkit for language and learning models (LLMs). With over 100 carefully designed metrics, it is among the most comprehensive platforms for developers and organizations to evaluate and compare LLMs and to establish guardrails for LLM and retrieval-augmented generation (RAG) applications. The tests cover relevance and understanding, content quality, hallucination, safety and bias, context relevance, guardrails, and vulnerability scanning, along with a set of metric-based tests for quantitative analysis. (2024-03-10)
  • ARES - An automated evaluation framework for RAG systems built on query-document-answer triples. The ARES training pipeline has three steps: (1) generate synthetic queries and answers from in-domain passages; (2) prepare LLM judges by fine-tuning on the synthetically generated training data; (3) deploy the prepared LLM judges to evaluate your RAG system on key performance metrics (a pipeline skeleton is sketched below). (2023-09-27)
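A skeleton of the three-step ARES recipe described above; the function names and bodies are placeholders, not the ARES package API:

```python
# Step 1: synthesize in-domain training data for the judges.
def generate_synthetic_data(passages):
    """Generate (query, answer) pairs from in-domain passages with an LLM."""
    ...

# Step 2: fine-tune lightweight LM judges on the synthetic triples.
def finetune_judge(base_lm, synthetic_data):
    ...

# Step 3: score a RAG system on context relevance, answer faithfulness,
# and answer relevance with the prepared judges.
def evaluate_rag(rag_system, judge, queries):
    ...
```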
  • tvalmetrics - Tonic Validate Metrics uses LLM-assisted evaluation (e.g., GPT-4) to score different aspects of a RAG application's outputs: (1) answer similarity: how well does the RAG answer match what the answer should be? (2) retrieval precision: is the retrieved context relevant to the question? (3) augmentation precision: does the answer contain the retrieved context relevant to the question? (4) augmentation accuracy: what fraction of the retrieved context appears in the answer? (5) answer consistency (binary): does the answer contain any information outside the retrieved context? (6) retrieval k-recall: for the top-k context vectors, does the retrieved context cover all of the relevant contexts for answering the question? (2023-11-11)
  • SuperCLUE-Agent - A multi-dimensional benchmark focused on agent capabilities, covering 3 core abilities and 10 basic tasks; it can be used to evaluate LLMs' core agent capabilities, including tool use, task planning, and long/short-term memory. An evaluation of 16 Chinese-capable LLMs finds that GPT-4 leads by a wide margin on core Chinese agent tasks, while representative domestic models, both open and closed source, are already close to GPT-3.5. (2023-10-20)
  • AgentBench
  • ToolBench
  • McEval - Testing shows that open-source models still lag considerably behind GPT-4 in multilingual programming ability, with most unable to surpass even GPT-3.5; among open-source models, Codestral, DeepSeek-Coder, CodeQwen, and some derivative models also show strong multilingual ability. McEval is a massively multilingual code benchmark covering 40 programming languages with 16K test samples, which substantially pushes the limits of code LLMs in multilingual scenarios. The benchmark contains challenging code completion, understanding, and generation evaluation tasks with the finely curated, massively multilingual instruction corpus McEval-Instruct. The McEval leaderboard can be found [here](https://mceval.github.io/). (2024-06-11)
  • ChartVLM - ChartX & ChartVLM: a versatile benchmark and foundation model for complicated chart reasoning. (2024-02-19)
  • ReForm-Eval - A benchmark dataset for the comprehensive evaluation of large vision-language models. ReForm-Eval reformulates existing multimodal benchmarks of differing task formats into a unified form suitable for evaluating large models. It features: 8 evaluation dimensions with ample data per dimension (over 4,000 items on average); unified question formats (multiple choice and text generation); ease of use, with reliable and efficient evaluation that does not depend on external services such as ChatGPT; and efficient reuse of existing data resources without extra human annotation, extensible to more datasets. (2023-10-24)
  • LVLM-eHub - "Multi-Modality Arena" is an evaluation platform for large multimodal models. Following FastChat, two anonymous models are compared side by side on visual question answering; Multi-Modality Arena lets you benchmark vision-language models side by side while providing image inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and more.
  • InfiniteBench - Covers state-of-the-art proprietary and open-source models such as GPT-4 and Claude 2. It mixes real and synthetic scenarios: real-world data probes models on practical problems, while synthetic data makes it convenient to extend test context lengths. InfiniteBench is the first LLM benchmark featuring an average data length surpassing 100K tokens. It comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks are designed to require good understanding of long dependencies in context, so that simply retrieving a limited number of passages is not sufficient. (2024-03-19)
  • llm-analysis
  • llm-inference-benchmark
  • llm-inference-bench
  • GPU-Benchmarks-on-LLM-Inference - Benchmarks of LLaMA inference speed on NVIDIA GPUs and Apple silicon, including a 16-inch M1 Max MacBook Pro, M2 Ultra Mac Studio, 14-inch M3 MacBook Pro and 16-inch M3 Max MacBook Pro.
  • LLM-QBench - LLM-QBench is a benchmark toward the best practice for post-training quantization of large language models; it is also an efficient LLM compression tool with various advanced compression methods, supporting multiple inference backends. (2024-05-09)
  • Chat Arena: anonymous models side-by-side and vote for which one is better - An "anonymous" arena for open AI models. You act as a judge, scoring the answers of two models whose names are hidden until after you vote. Contestants so far include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more (an illustrative Elo update is sketched below).
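Arena-style leaderboards have traditionally summarized pairwise votes with Elo-like ratings; a standard Elo update looks like this (illustrative, not LMSYS's exact pipeline):

```python
# One Elo update after a head-to-head vote; score_a is 1 win / 0.5 tie / 0 loss.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

print(elo_update(1000.0, 1000.0, score_a=1.0))  # winner gains 16 points
```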
| Model | Provider (pricing) |
| --- | --- |
| Llama 3 70B | [Replicate](https://replicate.com/pricing) |
| [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat) | [DeepSeek](https://www.deepseek.com/) |
| [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) | [Mistral](https://docs.mistral.ai/platform/pricing/) |
| Claude 3 Sonnet | [Anthropic](https://www.anthropic.com/api) |
| Gemini 1.5 Flash | [Google](https://cloud.google.com/vertex-ai/generative-ai/pricing) |
| Mistral Large | [Mistral](https://docs.mistral.ai/platform/pricing/) |
| [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus) | [Cohere](https://cohere.com/command) |
| Claude 3 Haiku | [Anthropic](https://www.anthropic.com/api) |
| Mistral Small | [Mistral](https://docs.mistral.ai/platform/pricing/) |
| [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | [Replicate](https://replicate.com/pricing) |
| [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | [Mistral](https://docs.mistral.ai/platform/pricing/) |
| GPT-3.5 Turbo | [OpenAI](https://openai.com/pricing) |
| Llama 3 70B (Groq) | [Groq](https://wow.groq.com/) |
| GPT-4 | [OpenAI](https://openai.com/pricing) |
| Mistral Medium | [Mistral](https://docs.mistral.ai/platform/pricing/) |
| Claude 2.0 | [Anthropic](https://www.anthropic.com/api) |
| Mixtral 8x7B (Groq) | [Groq](https://wow.groq.com/) |
| Claude 2.1 | [Anthropic](https://www.anthropic.com/api) |
| Claude Instant | [Anthropic](https://www.anthropic.com/api) |
| [Phi-3-Medium 4k](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) | Microsoft |
| [Phi-3-Small 8k](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) | Microsoft |
| Mistral 7B | [Mistral](https://docs.mistral.ai/platform/pricing/) |
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Can Large Language Models Be an Alternative to Human Evaluations?**](https://arxiv.org/abs/2305.01937),<br> by *Cheng-Han Chiang and Hung-yi Lee*
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment**](https://www.microsoft.com/en-us/research/publication/gpteval-nlg-evaluation-using-gpt-4-with-better-human-alignment/),<br> by *Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu and Chenguang Zhu*
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Is ChatGPT a General-Purpose Natural Language Processing Task Solver?**](https://arxiv.org/abs/2302.06476),<br> by *Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga and Diyi Yang*
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Mathematical Capabilities of ChatGPT**](https://arxiv.org/abs/2301.13867),<br> by *Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner*
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization**](https://doi.org/10.48550/arXiv.2302.08081),<br> by *Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng*
  • <img src=https://img.shields.io/badge/CoRR-2023-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective**](https://arxiv.org/abs/2302.12095),<br> by *Jindong Wang et al.*
  • <img src=https://img.shields.io/badge/CoRR-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Evaluating the Text-to-SQL Capabilities of Large Language Models**](https://doi.org/10.48550/arXiv.2204.00498),<br> by *Nitarshan Rajkumar, Raymond Li and Dzmitry Bahdanau*
  • <img src=https://img.shields.io/badge/COLING-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Are Visual-Linguistic Models Commonsense Knowledge Bases?**](https://aclanthology.org/2022.coling-1.491),<br> by *Hsiu-Yu Yang and Carina Silberer*
  • <img src=https://img.shields.io/badge/CoRR-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> **Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective**
  • <img src=https://img.shields.io/badge/EMNLP-2022-blue alt="img" style="zoom:100%; vertical-align: middle" /> **GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models**
  • <img src=https://img.shields.io/badge/ACL-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**GLGE: A New General Language Generation Evaluation Benchmark**](https://aclanthology.org/2021.findings-acl.36),<br> by *Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.*
  • <img src=https://img.shields.io/badge/CoRR-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> **Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering**
  • <img src=https://img.shields.io/badge/ACL_Findings-2021-blue alt="img" style="zoom:100%; vertical-align: middle" /> [**Do Language Models Perform Generalizable Commonsense Inference?**](https://aclanthology.org/2021.findings-acl.322), [<img src=https://img.shields.io/badge/Code-skyblue alt="img" style="zoom:100%; vertical-align: middle" />](https://github.com/wangpf3/LM-for-CommonsenseInference)<br> by *Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren*
  • Jurassic-1 - api | 2021-08 | [Paper](https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf)
  • OPT - api | 2022-05 | [Paper](https://arxiv.org/pdf/2205.01068.pdf)
  • BLOOM - api | 2022-11 | [Paper](https://arxiv.org/pdf/2211.05100.pdf)
  • GPT-3 - api | 2020-05 | [Paper](https://arxiv.org/pdf/2005.14165.pdf)
  • GLM-130B - ckpt | 2022-10 | [Paper](https://arxiv.org/pdf/2210.02414.pdf)
  • CPM - api | [Paper](https://arxiv.org/pdf/2012.00413.pdf)
  • InstructGPT - api | 2022-03 | [Paper](https://arxiv.org/pdf/2203.02155.pdf)
  • Alpaca - demo | 2023-03 | [Github](https://github.com/tatsu-lab/stanford_alpaca)
  • ChatGPT - demo | 2022-11 | [Blog](https://openai.com/blog/chatgpt/)
  • Claude - demo | 2023-03 | [Blog](https://www.anthropic.com/index/introducing-claude)
  • LLaMA - A foundational, 65-billion-parameter large language model. [LLaMA.cpp](https://github.com/ggerganov/llama.cpp) [Lit-LLaMA](https://github.com/Lightning-AI/lit-llama)
  • Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. [Alpaca.cpp](https://github.com/antimatter15/alpaca.cpp) [Alpaca-LoRA](https://github.com/tloen/alpaca-lora)
  • Flan-Alpaca - Instruction Tuning from Humans and Machines.
  • Baize - Baize is an open-source chat model trained with [LoRA](https://github.com/microsoft/LoRA). It uses 100k dialogs generated by letting ChatGPT chat with itself.
  • Cabrita - A Portuguese instruction-finetuned LLaMA.
  • Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
  • Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
  • Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
  • GPTQ-for-LLaMA - 4 bits quantization of [LLaMA](https://arxiv.org/abs/2302.13971) using [GPTQ](https://arxiv.org/abs/2210.17323).
  • GPT4All - Demo, data, and code to train open-source assistant-style large language model based on GPT-J and LLaMa.
  • Koala - A Dialogue Model for Academic Research
  • BELLE - Be Everyone's Large Language model Engine
  • StackLLaMA - A hands-on guide to train LLaMA with RLHF.
  • RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
  • Chimera - Latin Phoenix.
  • BLOOM - BigScience Large Open-science Open-access Multilingual Language Model [BLOOM-LoRA](https://github.com/linhduongtuan/BLOOM-LORA)
  • BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
  • Phoenix
  • T5 - Text-to-Text Transfer Transformer
  • T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
  • OPT - Open Pre-trained Transformer Language Models.
  • UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.
  • GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
  • ChatGLM-6B - ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型,基于 [General Language Model (GLM)](https://github.com/THUDM/GLM) 架构,具有 62 亿参数.
  • ChatGLM2-6B - 开源中英双语对话模型 ChatGLM-6B 的第二代版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM2-6B 引入了更长的上下文、更好的性能和更高效的推理.
  • RWKV - Parallelizable RNN with Transformer-level LLM Performance.
  • ChatRWKV - ChatRWKV is like ChatGPT but powered by my RWKV (100% RNN) language model.
  • StableLM - Stability AI Language Models.
  • YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
  • GPT-Neo - An implementation of model & data parallel [GPT3](https://arxiv.org/abs/2005.14165)-like models using the [mesh-tensorflow](https://github.com/tensorflow/mesh) library.
  • GPT-J - A 6 billion parameter, autoregressive text generation model trained on [The Pile](https://pile.eleuther.ai/).
  • Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
  • Pythia - Interpreting Autoregressive Transformers Across Time and Scale
  • Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.
  • Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.
  • GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
  • GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
  • Palmyra - Palmyra Base was primarily pre-trained with English text.
  • Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
  • h2oGPT
  • PanGu-α - PanGu-α is a 200B parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, MindSpore Team and Peng Cheng Laboratory.
  • MOSS - MOSS是一个支持中英双语和多种插件的开源对话语言模型.
  • Open-Assistant - a project meant to give everyone access to a great chat based large language model.
  • HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.
  • Baichuan - An open-source, commercially available large-scale language model developed by Baichuan Intelligent Technology following Baichuan-7B, containing 13 billion parameters. (20230715)
  • Qwen - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. (20230803)
  • Accelerate - 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
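A minimal Accelerate sketch: wrap the model, optimizer, and dataloader once with `Accelerator.prepare`, and the same script runs on CPU, one GPU, or many (toy model and data; real runs are usually launched with the `accelerate` CLI):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(32, 4), torch.randn(32, 1)),
    batch_size=8,
)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```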
  • Apache MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
  • Caffe - A fast open framework for deep learning.
  • ColossalAI - An integrated large-scale model training system with efficient parallelization techniques.
  • DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
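A minimal DeepSpeed sketch: `deepspeed.initialize` wraps the model in an engine that handles distributed training, ZeRO sharding, and mixed precision according to a JSON-style config (the config values and toy model here are illustrative):

```python
import torch
import deepspeed

model = torch.nn.Linear(4, 1)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

x = torch.randn(8, 4).to(engine.device)
y = torch.randn(8, 1).to(engine.device)
loss = torch.nn.functional.mse_loss(engine(x), y)
engine.backward(loss)  # engine handles loss scaling / gradient allreduce
engine.step()
```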
  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • Jax - Autograd and XLA for high-performance machine learning research.
  • Kedro - Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.
  • Keras - Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow.
  • LightGBM - A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
  • MegEngine - MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.
  • metric-learn - Metric Learning Algorithms in Python.
  • MindSpore - MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
  • Oneflow - OneFlow is a performance-centered and open-source deep learning framework.
  • PaddlePaddle - Machine Learning Framework from Industrial Practice.
  • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
  • PyTorch Lightning - Deep learning framework to train, deploy, and ship AI products Lightning fast.
  • XGBoost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library.
  • scikit-learn - Machine Learning in Python.
  • TensorFlow - An Open Source Machine Learning Framework for Everyone.
  • VectorFlow - A minimalist neural network library optimized for sparse data and single machine environments.
  • Byzer-LLM - Byzer-LLM is an LLM infrastructure suite supporting pretraining, fine-tuning, deployment, serving, and other LLM-related capabilities. Byzer-Retrieval is a storage infrastructure built for LLMs that supports batch import from various data sources, real-time single-record updates, full-text search, vector search, and hybrid retrieval, making data easy for Byzer-LLM to consume. Byzer-SQL/Python provides an easy-to-use human-machine interaction API with a very low barrier to entry for the above products.
  • agenta - An LLMOps platform for building robust LLM applications. Easily experiment with and evaluate different prompts, models, and workflows to build robust apps.
  • Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models.
  • BudgetML
  • CometLLM - A 100% open-source LLMOps platform to log, manage, and visualize LLM prompts and chains. Track prompt templates, prompt variables, prompt duration, token usage, and other metadata. Score prompt outputs and visualize chat history in a single UI.
  • deeplake
  • Dify
  • Dstack
  • Embedchain
  • GPTCache
  • Haystack - Quickly build applications with LLM agents, semantic search, question answering, and more.
  • langchain
  • LangFlow - An effortless way to experiment with and prototype LangChain flows using drag-and-drop components and a chat interface.
  • LangKit
  • LiteLLM 🚅
  • LlamaIndex
  • LLMApp - LLM App is a Python library that helps you build real-time LLM-enabled data pipelines in a few lines of code.
  • LLMFlows - LLMFlows is a framework for building simple, explicit, and transparent LLM applications such as chatbots, question-answering systems, and agents.
  • LLMonitor
  • magentic
  • Pezzo 🕹️
  • promptfoo
  • prompttools
  • TrueFoundry - Deploy LLMOps tooling on your own (on-prem) infrastructure, including vector DBs, embedding servers, and open-source LLM models, with deployment, fine-tuning, prompt tracking, full data security, and optimal GPU management. Train and launch your LLM applications at production scale with best software-engineering practices.
  • ReliableGPT 💪
  • Weights & Biases (Prompts)
  • xTuring
  • ZenML - An open-source framework for orchestrating, experimenting with, and deploying production-grade ML solutions, with built-in `langchain` and `llama_index` integrations.
  • Large Language Model Course - A course with a roadmap and notebooks to get into Large Language Models (LLMs).
  • Full Stack LLM Bootcamp - A collection of LLM learning and application resources.
  • Evaluating Language Models by OpenAI, DeepMind, Google, Microsoft - Language-model evaluation resources from OpenAI, DeepMind, Google, and Microsoft.
  • Efficient Finetuning of Quantized LLMs - A low-resource quantized training/deployment scheme for large language models, aiming to build and open-source instruction-following fine-tuning methods for Chinese baichuan/LLaMA/Pythia/GLM models. Training runs on a single Nvidia RTX-2080TI, and a multi-turn chatbot can be trained with a context length of 2048 on a single Nvidia RTX-3090. It uses bitsandbytes for quantization and integrates with HuggingFace's PEFT and transformers libraries.
  • Awesome LLM - A curated list of papers about large language models.
  • Awesome-Efficient-LLM - A curated list for Efficient Large Language Models.
  • Awesome-production-machine-learning - A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning.
  • Awesome-marketing-datascience - Curated list of useful LLM / Analytics / Datascience resources.
  • Awesome-llm-tools - Curated list of useful LLM tools.
  • Awesome-LLM-Compression - A curated list for Efficient LLM Compression.
  • Awesome-Multimodal-Large-Language-Models - A curated list of Multimodal Large Language Models.
  • Awesome-LLMOps - An awesome & curated list of the best LLMOps tools for developers.
  • Awesome-MLops - An awesome list of references for MLOps - Machine Learning Operations.
  • Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
  • awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
  • Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
  • Chain-of-Thoughts Papers - A trend starting from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
  • Instruction-Tuning-Papers - A trend starting from `Natural-Instructions` (ACL 2022), `FLAN` (ICLR 2022) and `T0` (ICLR 2022).
  • LLM Reading List - A paper & resource list of large language models.
  • Reasoning using Language Models - Collection of papers and resources on Reasoning using Language Models.
  • Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
  • Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
  • Awesome GPT-3 - a collection of demos and articles about the [OpenAI GPT-3 API](https://openai.com/blog/openai-api/).
  • [MIT License](https://mit-license.org/)
  • [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/)