Counting-Stars (★)
https://github.com/nick7nlp/Counting-Stars
evaluation-metrics large-language-model long-context
- Host: GitHub
- URL: https://github.com/nick7nlp/Counting-Stars
- Owner: nick7nlp
- Created: 2024-03-13T01:49:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-23T03:25:15.000Z (over 1 year ago)
- Last Synced: 2024-05-23T04:28:17.803Z (over 1 year ago)
- Topics: evaluation-metrics, large-language-model, long-context
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/pdf/2403.11802.pdf
- Size: 129 MB
- Stars: 57
- Watchers: 3
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - nick7nlp/Counting-Stars - Counting-Stars is a multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models (LLMs). It evaluates LLMs with two tasks, multi-evidence acquisition and multi-evidence reasoning, both of which contain a large amount of evidence and allow the position of the evidence within the context to be adjusted flexibly. The benchmark scales to arbitrary context lengths and arbitrary amounts of evidence. Experimental results show that Gemini 1.5 Pro achieves the best overall performance, while GPT-4 Turbo is the most stable across tasks. The project also provides Chinese and English versions of the Counting-Stars dataset, together with evaluation results for various LLMs on the benchmark. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)
README
A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
In this work, we propose **a multi-evidence, position-aware, and scalable benchmark** for evaluating long-context LLMs, named **Counting-Stars**, which evaluates long-context LLMs by using two tasks: multi-evidence acquisition and multi-evidence reasoning.
- **Multi-evidence**: *Counting-Stars is the most evidence-intensive evaluation among known long-context benchmarks*.
- **Position-aware**: *The position of the evidence in the context can be adjusted as desired and tested in a targeted manner*.
- **Scalable**: *Both the context length and the amount of evidence can be expanded arbitrarily*.

Based on the Counting-Stars test, we conduct experiments to evaluate long-context LLMs (i.e., GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1). Experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while the performance of GPT-4 Turbo is the most stable across various tasks. Furthermore, our analysis of these LLMs, which are extended to handle long-context scenarios, indicates that there is still room for improvement as the input context grows longer and the tasks become more intricate.
> Please find more details of this work in the [paper](https://arxiv.org/pdf/2403.11802).
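To make the two tasks more concrete, here is a minimal sketch of how a multi-evidence, position-aware sample could be assembled: evidence sentences are scattered at chosen positions inside a long filler context, and their values and positions are recorded for later checking. This is an illustrative assumption, not the repository's actual data-generation code; the sentence template, question wording, field names, and parameters below are all hypothetical.

```python
# Illustrative sketch only -- NOT the repository's generation code.
# The sentence template, question wording, and record layout are assumptions.
import json
import random

def build_sample(filler_text: str, num_evidence: int = 32, context_len: int = 128_000):
    """Scatter evidence sentences at chosen positions inside a long filler context."""
    # Repeat the filler text until it roughly reaches the target context length.
    filler = (filler_text * (context_len // max(len(filler_text), 1) + 1))[:context_len]

    # Hypothetical evidence template echoing the "counting stars" theme.
    counts = random.sample(range(1, 100), num_evidence)
    evidence = [f"The little penguin counted {c} stars." for c in counts]

    # Position-aware: evidence positions are set explicitly (here, spread evenly),
    # so retrieval can be tested at any depth of the context.
    positions = [int(i * len(filler) / num_evidence) for i in range(num_evidence)]

    chunks, last = [], 0
    for pos, sentence in zip(positions, evidence):
        chunks.append(filler[last:pos])
        chunks.append(" " + sentence + " ")
        last = pos
    chunks.append(filler[last:])

    return {
        "question": "List every number of stars that was counted.",  # assumed wording
        "context": "".join(chunks),
        "reference_counts": counts,       # ground truth for multi-evidence acquisition
        "evidence_positions": positions,  # enables position-wise analysis
    }

if __name__ == "__main__":
    sample = build_sample("The sky was clear and the night was quiet. ")
    print(json.dumps({k: sample[k] for k in ("question", "reference_counts")}, indent=2))
```

Because both `num_evidence` and `context_len` are parameters here, the same recipe extends to longer contexts or more evidence, which is what the scalability claim above refers to.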
## Note

We'd like to encourage you to test the Counting-Stars using the 128K English and Chinese versions of the Counting-Stars listed below (a minimal loading sketch follows the list):
- Me-Acq. (EN) means the English version of Multi-evidence Acquisition in the Counting-Stars.
- ```Counting_Stars_EN_acquisition_128000_32_32.jsonl```
- Me-Acq. (ZH) means the Chinese version of Multi-evidence Acquisition in the Counting-Stars.
- ```Counting_Stars_ZH_acquisition_128000_32_32.jsonl```
- Me-Rea. (EN) means the English version of Multi-evidence Reasoning in the Counting-Stars.
- ```Counting_Stars_EN_reasoning_128000_32_32.jsonl```
- Me-Rea. (ZH) means the Chinese version of Multi-evidence Reasoning in the Counting-Stars.
- ```Counting_Stars_ZH_reasoning_128000_32_32.jsonl```
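Below is a minimal loading sketch for inspecting one of these JSONL files. One JSON object per line is assumed; the field names inside each record are not documented here, so the sketch prints the keys of the first record rather than relying on particular names.

```python
# Minimal sketch for inspecting one of the JSONL files listed above.
# Assumes one JSON object per line; field names are printed, not assumed.
import json

path = "Counting_Stars_EN_acquisition_128000_32_32.jsonl"

records = []
with open(path, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))

print(f"{len(records)} records loaded from {path}")
print("keys in the first record:", sorted(records[0].keys()))
```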
|Rank|Models|Claimed Length|Me-Acq.(ZH)|Me-Acq.(EN)|Me-Rea.(ZH)|Me-Rea.(EN)|Avg.|
|----|----|----|----|----|----|----|----|
|1| Gemini 1.5 Pro|1M|0.775|0.833|0.575|0.371|0.639|
|2| GPT-4 Turbo (1106)|128K|0.697|0.718|0.473|0.651|0.635|
|3| Claude3 Opus|200K|0.807|0.705|0.488|0.374|0.594|
|4| GPT-4 Turbo (0125)|128K|0.663|0.662|0.386|0.610|0.580|
|5| Moonshot-v1|200K|0.606|0.559|0.344|0.460|0.492|
|6| GLM-4|128K|0.682|0.389|0.475|0.179|0.431|
|-| Claude3 Sonnet|200K|0.788|-|-|-|-|
|-| Claude3 Haiku|200K|0.698|-|-|-|-|
|-| Baichuan3-Turbo|128K|0.759|0.490|-|-|-|

## Task Description
## Evaluation Results
> Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Acquisition).

> Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Reasoning).
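For reference, the Avg. column in the leaderboard above is consistent with the plain arithmetic mean of the four sub-task scores (this is an inference from the numbers, not something stated in the README); a quick check:

```python
# Quick check (assumption: Avg. is the arithmetic mean of the four sub-task scores).
scores = {
    "Gemini 1.5 Pro":     [0.775, 0.833, 0.575, 0.371],
    "GPT-4 Turbo (1106)": [0.697, 0.718, 0.473, 0.651],
    "Claude3 Opus":       [0.807, 0.705, 0.488, 0.374],
}
for model, s in scores.items():
    print(f"{model:20s} mean = {sum(s) / len(s):.4f}")
# The printed means agree with the reported Avg. values (0.639, 0.635, 0.594) up to rounding.
```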
## Cite
If you use this benchmark, please cite the following paper:
```
@misc{song2024countingstars,
title={Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models},
author={Mingyang Song and Mao Zheng and Xuan Luo},
year={2024},
eprint={2403.11802},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

## CONTACT
For any questions, feel free to create an issue, and we will try our best to resolve it. \
**If the problem is more urgent**, you can also email me directly (I check email almost daily).
```
NAME: Mingyang Song
EMAIL: nickmysong@tencent.com
```
Our visualization code is built on the source code from [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack). Thanks for their work.