# "Uncheatable" LLMs Evaluation - LatestEval

Humans receive new test questions in every exam, but LLMs? They have been evaluated with the same benchmarks for far too long. Why not assess LLMs with fresh tests, just as we test our students? In this project, we introduce LatestEval, which automatically constructs language model benchmarks from the latest materials (e.g., arXiv, BBC, Wikipedia) to prevent "cheating" and data contamination.

**News!!**

- **15 Dec, 2023** - This project was accepted to the main track of **AAAI 2024** :partying_face:! Check out the paper here: :point_right: [Dynamic Test Construction with Latest Materials](https://arxiv.org/abs/2312.12343).

# Key Features

1. We maintain a QA benchmark that is refreshed every two weeks using the latest online resources (created within the past two weeks). This aims to avoid 1) LLMs being trained on the test set (cheating); and 2) the unintentional inclusion of test questions in the training data (data contamination).
2. We analyzed real human-AI conversations to ensure the automated benchmark aligns well with real-life applications (see the [paper](https://arxiv.org/abs/2312.12343) for more details).

# The Benchmark

Access the latest benchmark directly on the [Huggingface Hub](https://huggingface.co/LatestEval)!

- Latest benchmark of GitHub: [HF Hub](https://huggingface.co/datasets/LatestEval/github-latest)
- Latest benchmark of arXiv: [HF Hub](https://huggingface.co/datasets/LatestEval/arxiv-latest)
- Latest benchmark of BBC: [HF Hub](https://huggingface.co/datasets/LatestEval/bbc-latest)
- The full benchmark with all sources: [HF Hub](https://huggingface.co/datasets/LatestEval/full-latest)

The benchmarks are created from the latest materials; you can find the raw materials/documents on the [Huggingface Hub](https://huggingface.co/RealTimeData).
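
For example, any of the splits above can be pulled with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the column layout varies by source, so inspect an example before relying on specific field names:

```python
from datasets import load_dataset

# Dataset IDs come from the links above; column names differ per source,
# so look at a sample row before building an evaluation loop on top of it.
benchmark = load_dataset("LatestEval/arxiv-latest")
print(benchmark)            # shows the available splits and columns
split = next(iter(benchmark))
print(benchmark[split][0])  # first example of the first split
```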

# Evaluate your LLM on LatestEval

We will add LatestEval to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [OpenCompass](https://github.com/open-compass/opencompass). Stay tuned.
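
Until that support lands, a rough sketch of querying a model over the benchmark might look like the following. The split name and the `passage`/`question`/`answer` field names are assumptions here; adjust them to the columns of the split you actually load:

```python
import os
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Split name and field names are assumptions -- check the dataset card.
data = load_dataset("LatestEval/full-latest", split="test")

for example in data.select(range(5)):  # small smoke test
    prompt = (
        f"{example['passage']}\n\n"
        f"Question: {example['question']}\nAnswer:"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print("Model:", reply.choices[0].message.content)
    print("Reference:", example["answer"])
```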

# Create benchmarks with your own data

1. Put your documents as `.txt` files under `./`.
2. Set your OpenAI key:

```bash
export OPENAI_API_KEY=<your-openai-api-key>
```

3. Simply run:

```bash
python data_processor.py --source customized --file_path <path_to_your_txt_files> --num_docs 100
```
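
If your documents are not already plain text, a small helper like the one below (hypothetical, not part of this repository) can dump them to `.txt` files before pointing `data_processor.py` at them:

```python
from pathlib import Path

# Hypothetical example documents; replace with your own content.
docs = {
    "report_2024": "Full text of the first document...",
    "meeting_notes": "Full text of the second document...",
}

out_dir = Path("./my_docs")
out_dir.mkdir(exist_ok=True)
for name, text in docs.items():
    (out_dir / f"{name}.txt").write_text(text, encoding="utf-8")

# Then run, e.g.:
#   python data_processor.py --source customized --file_path ./my_docs --num_docs 100
```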

If you want to reproduce LatestEval on arXiv, BBC, GitHub:

```bash
python data_processor.py --source arxiv --num_docs 100
```

# Issues

Open an issue if you run into any problems or would like to discuss the project.

# Citation

If you find this project useful, please consider citing it:

```bibtex
@misc{li2023avoiding,
  title={Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials},
  author={Yucheng Li and Frank Guerin and Chenghua Lin},
  year={2023},
  eprint={2312.12343},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```