Counting-Stars (★)
https://github.com/nick7nlp/Counting-Stars
evaluation-metrics large-language-model long-context
- Host: GitHub
- URL: https://github.com/nick7nlp/Counting-Stars
- Owner: nick7nlp
- Created: 2024-03-13T01:49:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-23T03:25:15.000Z (over 1 year ago)
- Last Synced: 2024-05-23T04:28:17.803Z (over 1 year ago)
- Topics: evaluation-metrics, large-language-model, long-context
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/pdf/2403.11802.pdf
- Size: 129 MB
- Stars: 57
- Watchers: 3
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - nick7nlp/Counting-Stars - Counting-Stars is a multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models (LLMs). It evaluates LLMs with two tasks, multi-evidence acquisition and multi-evidence reasoning, both of which contain a large amount of evidence and allow the position of the evidence within the context to be adjusted flexibly. The benchmark scales to arbitrary context lengths and arbitrary amounts of evidence. Experimental results show that Gemini 1.5 Pro achieves the best overall performance, while GPT-4 Turbo is the most stable across tasks. The project also provides Chinese and English versions of the Counting-Stars dataset, together with evaluation results for various LLMs on the benchmark. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)
README
A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
In this work, we propose **a multi-evidence, position-aware, and scalable benchmark** for evaluating long-context LLMs, named **Counting-Stars**, which evaluates long-context LLMs by using two tasks: multi-evidence acquisition and multi-evidence reasoning.
- **Multi-evidence**: *Counting-Stars is the most evidence-intensive evaluation among known long-context benchmarks*.
- **Position-aware**: *The position of the evidence in the context can be adjusted as desired and tested in a targeted manner*.
- **Scalable**: *Both the context length and the amount of evidence can be expanded arbitrarily*.

Based on the Counting-Stars test, we conduct experiments to evaluate long-context LLMs (i.e., GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1). Experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while the performance of GPT-4 Turbo is the most stable across various tasks. Furthermore, our analysis of these LLMs, which are extended to handle long-context scenarios, indicates that there is still room for improvement as the input context grows longer and the tasks become more intricate.
> Please find more details of this work in the [paper](https://arxiv.org/pdf/2403.11802).
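To make the two tasks more concrete, here is a minimal sketch of how a multi-evidence, position-aware sample could be assembled: evidence sentences are scattered at chosen positions inside a long filler context, and their values and positions are recorded for later checking. This is an illustrative assumption, not the repository's actual data-generation code; the sentence template, question wording, field names, and parameters below are all hypothetical.

```python
# Illustrative sketch only -- NOT the repository's generation code.
# The sentence template, question wording, and record layout are assumptions.
import json
import random

def build_sample(filler_text: str, num_evidence: int = 32, context_len: int = 128_000):
    """Scatter evidence sentences at chosen positions inside a long filler context."""
    # Repeat the filler text until it roughly reaches the target context length.
    filler = (filler_text * (context_len // max(len(filler_text), 1) + 1))[:context_len]

    # Hypothetical evidence template echoing the "counting stars" theme.
    counts = random.sample(range(1, 100), num_evidence)
    evidence = [f"The little penguin counted {c} stars." for c in counts]

    # Position-aware: evidence positions are set explicitly (here, spread evenly),
    # so retrieval can be tested at any depth of the context.
    positions = [int(i * len(filler) / num_evidence) for i in range(num_evidence)]

    chunks, last = [], 0
    for pos, sentence in zip(positions, evidence):
        chunks.append(filler[last:pos])
        chunks.append(" " + sentence + " ")
        last = pos
    chunks.append(filler[last:])

    return {
        "question": "List every number of stars that was counted.",  # assumed wording
        "context": "".join(chunks),
        "reference_counts": counts,       # ground truth for multi-evidence acquisition
        "evidence_positions": positions,  # enables position-wise analysis
    }

if __name__ == "__main__":
    sample = build_sample("The sky was clear and the night was quiet. ")
    print(json.dumps({k: sample[k] for k in ("question", "reference_counts")}, indent=2))
```

Because both `num_evidence` and `context_len` are parameters here, the same recipe extends to longer contexts or more evidence, which is what the scalability claim above refers to.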
## Note

We'd like to encourage you to test the Counting-Stars using the 128K English and Chinese versions of the Counting-Stars listed below (a minimal loading sketch follows the list):
- Me-Acq. (EN) means the English version of Multi-evidence Acquisition in the Counting-Stars.
- ```Counting_Stars_EN_acquisition_128000_32_32.jsonl```
- Me-Acq. (ZH) means the Chinese version of Multi-evidence Acquisition in the Counting-Stars.
- ```Counting_Stars_ZH_acquisition_128000_32_32.jsonl```
- Me-Rea. (EN) means the English version of Multi-evidence Reasoning in the Counting-Stars.
- ```Counting_Stars_EN_reasoning_128000_32_32.jsonl```
- Me-Rea. (ZH) means the Chinese version of Multi-evidence Reasoning in the Counting-Stars.
- ```Counting_Stars_ZH_reasoning_128000_32_32.jsonl```
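Below is a minimal loading sketch for inspecting one of these JSONL files. One JSON object per line is assumed; the field names inside each record are not documented here, so the sketch prints the keys of the first record rather than relying on particular names.

```python
# Minimal sketch for inspecting one of the JSONL files listed above.
# Assumes one JSON object per line; field names are printed, not assumed.
import json

path = "Counting_Stars_EN_acquisition_128000_32_32.jsonl"

records = []
with open(path, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))

print(f"{len(records)} records loaded from {path}")
print("keys in the first record:", sorted(records[0].keys()))
```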
|Rank|Models|Claimed Length|Me-Acq.(ZH)|Me-Acq.(EN)|Me-Rea.(ZH)|Me-Rea.(EN)|Avg.|
|----|----|----|----|----|----|----|----|
|1| Gemini 1.5 Pro|1M|0.775|0.833|0.575|0.371|0.639|
|2| GPT-4 Turbo (1106)|128K|0.697|0.718|0.473|0.651|0.635|
|3| Claude3 Opus|200K|0.807|0.705|0.488|0.374|0.594|
|4| GPT-4 Turbo (0125)|128K|0.663|0.662|0.386|0.610|0.580|
|5| Moonshot-v1|200K|0.606|0.559|0.344|0.460|0.492|
|6| GLM-4|128K|0.682|0.389|0.475|0.179|0.431|
|-| Claude3 Sonnet|200K|0.788|-|-|-|-|
|-| Claude3 Haiku|200K|0.698|-|-|-|-|
|-| Baichuan3-Turbo|128K|0.759|0.490|-|-|-|

## Task Description
## Evaluation Results
> Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Acquisition).

> Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Reasoning).
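For reference, the Avg. column in the leaderboard above is consistent with the plain arithmetic mean of the four sub-task scores (this is an inference from the numbers, not something stated in the README); a quick check:

```python
# Quick check (assumption: Avg. is the arithmetic mean of the four sub-task scores).
scores = {
    "Gemini 1.5 Pro":     [0.775, 0.833, 0.575, 0.371],
    "GPT-4 Turbo (1106)": [0.697, 0.718, 0.473, 0.651],
    "Claude3 Opus":       [0.807, 0.705, 0.488, 0.374],
}
for model, s in scores.items():
    print(f"{model:20s} mean = {sum(s) / len(s):.4f}")
# The printed means agree with the reported Avg. values (0.639, 0.635, 0.594) up to rounding.
```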
## Cite
If you use this benchmark, please cite the following paper:
```
@misc{song2024countingstars,
title={Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models},
author={Mingyang Song and Mao Zheng and Xuan Luo},
year={2024},
eprint={2403.11802},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

## CONTACT
For any questions, feel free to create an issue, and we will try our best to resolve it. \
**If the problem is more urgent**, you can also email me directly (I check email almost daily).
```
NAME: Mingyang Song
EMAIL: nickmysong@tencent.com
```
Our visualization code is built on the source code from [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack). Thanks for their work.