https://github.com/lm-sys/llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
- Host: GitHub
- URL: https://github.com/lm-sys/llm-decontaminator
- Owner: lm-sys
- License: apache-2.0
- Created: 2023-10-17T04:06:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-20T22:33:26.000Z (about 1 year ago)
- Last Synced: 2025-01-07T19:12:13.121Z (8 days ago)
- Language: Python
- Homepage:
- Size: 10.5 MB
- Stars: 296
- Watchers: 3
- Forks: 23
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-data-contamination - LLM Decontaminator
- StarryDivineSky - lm-sys/llm-decontaminator - rephraser: a 13B model reaches GPT-4-level performance on major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure the validity of the results, we followed OpenAI's decontamination method and found no evidence of data contamination. The paper proposes a decontaminator based on a stronger LLM and applies it to real-world training datasets (e.g., The Stack, RedPajama), revealing significant overlap between training datasets and widely used benchmarks. Existing detection methods (e.g., n-gram overlap, embedding similarity) fail to detect this kind of contamination; embedding-similarity methods struggle to distinguish a rephrased question from other questions on the same topic (e.g., high-school US history). The paper instead proposes an "LLM decontaminator" to quantify a dataset's rephrased samples relative to a benchmark; based on the detection results, you can estimate the contamination from rephrased samples and remove them from the training set. The LLM decontaminator works in two steps: for each test case, it uses embedding-similarity search to identify the top-k most similar training items, then builds k potential rephrasing pairs from them, each evaluated by an advanced LLM such as GPT-4. Results show this method clearly outperforms existing methods at detecting rephrased samples. (A01_text generation_dialogue / large language dialogue models and data)
- awesome-LLM-resourses - LLM Decontaminator
README
# LLM Decontaminator
| [Paper](https://arxiv.org/pdf/2311.04850.pdf) | [Blog](https://lmsys.org/blog/2023-11-14-llm-decontaminator/) |
In this package, you can use the LLM decontaminator to quantify a dataset's rephrased samples relative to a benchmark. Based on the detection results, you can estimate the contamination of rephrased samples in the dataset and remove them from the training set.

## Contents
- [Install](#install)
- [Detect](#detect)
- [Pre-Process](#pre-process)
- [End2End](#end2end)
- [Real-world dataset](#real-world-dataset)
- [Dataset and training code](#dataset-and-training-code)
- [F1 Score](#f1-score)
- [Citation](#citation)

## Install
~~~bash
git clone https://github.com/lm-sys/llm-decontaminator.git
cd llm-decontaminator
conda create -n llm-detect python=3.9 -y
conda activate llm-detect
pip install -r requirement.txt
~~~

## Detect
### Pre-Process
Please process the train set and test set into JSONL format, with each line containing `{"text": data}`.

~~~py
import json
from datasets import load_dataset

# Load dataset
dataset = load_dataset('bigcode/starcoderdata', data_dir="python", split="train", streaming=True)

# Extract up to 500,000 samples
subset_size = 500000
codes = [sample['content'] for _, sample in zip(range(subset_size), dataset)]

# Write to file
with open("starcoderdata.jsonl", "w") as fout:
    for code in codes:
        fout.write(json.dumps({"text": code}) + "\n")
~~~
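The repo already ships `./data/test/HumanEval.jsonl` (used in the End2End example below), but for illustration, here is a hedged sketch of building a benchmark file in the same `{"text": ...}` format from the Hugging Face `openai_humaneval` dataset. Concatenating the prompt and canonical solution is an assumption, not the repo's documented preprocessing:

~~~py
import json
from datasets import load_dataset

# Illustrative only: the field choice below is an assumption about
# what the "text" entry should contain for HumanEval.
humaneval = load_dataset("openai_humaneval", split="test")
with open("HumanEval.jsonl", "w") as fout:
    for sample in humaneval:
        text = sample["prompt"] + sample["canonical_solution"]
        fout.write(json.dumps({"text": text}) + "\n")
~~~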
### End2End

~~~bash
# export OPENAI_API_KEY=sk-xxx
# run llm-decontaminator
python3 main.py --train_path ./data/train/CodeAlpaca-20k.jsonl \
--test_path ./data/test/HumanEval.jsonl \
--output_path ./data/database/CodeAlpacaDB.jsonl \
--data-type code \
--top_k 1
~~~
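Conceptually, the pipeline runs the two steps described in the paper: an embedding-similarity search retrieves the top-k most similar training items for each test case, and an advanced LLM (e.g., GPT-4) then judges whether each candidate pair is a rephrasing. The sketch below is a minimal illustration of that idea, not the repo's actual internals; the embedding model, judge prompt, and function names are assumptions.

~~~py
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

# Hypothetical judge prompt; the repo defines its own prompts internally.
JUDGE_PROMPT = ("Determine whether the two code snippets below are rephrasings "
                "of each other. Answer only True or False.\n\nA:\n{a}\n\nB:\n{b}")

def detect_rephrased(train_texts, test_texts, top_k=1):
    """Step 1: embedding similarity search; step 2: LLM judge on top-k pairs."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    train_emb = embedder.encode(train_texts, convert_to_tensor=True)
    flagged = []
    for test in test_texts:
        test_emb = embedder.encode(test, convert_to_tensor=True)
        # Step 1: retrieve the top-k most similar training items.
        scores = util.cos_sim(test_emb, train_emb)[0]
        top_ids = scores.topk(min(top_k, len(train_texts))).indices.tolist()
        # Step 2: ask a strong LLM whether each candidate is a rephrasing.
        for i in top_ids:
            reply = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user",
                           "content": JUDGE_PROMPT.format(a=test, b=train_texts[i])}],
            )
            if "true" in reply.choices[0].message.content.lower():
                flagged.append({"test": test, "train": train_texts[i]})
    return flagged
~~~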
## Contamination in Real-world Dataset

| Training Set | Benchmark | Train Set Size | Test Set Size | Rephrased Samples | Percentage (%) |
|-------------------------------|-----------|----------------|---------------|-------------------|----------------|
| The Stack (4G subset) | HumanEval | 500k | 164 | 31 | 18.9 |
| StarCoder-Data (2.4G subset) | HumanEval | 500k | 164 | 26 | 15.9 |
| CodeExercise-Python | HumanEval | 27k | 164 | 26 | 15.9 |
| CodeAlpaca | HumanEval | 20k | 164 | 21 | 12.8 |
| RedPajama-Data-1T (16G subset)| HumanEval | 1625k | 164 | 14 | 8.5 |
| Evol-Instruct-Code | HumanEval | 78.3k | 164 | 13 | 7.9 |
| Rosetta Code                  | HumanEval | 4.26k          | 164           | 4                 | 2.4            |
| MATHInstruct (before Sep 30)  | MATH Test | 262k           | 5000          | 769               | 15.4           |
| MATH Train | MATH Test | 7.5k | 5000 | 79 | 1.6 |
| FLAN CoT | MMLU | 184k | 14042 | 76 | 0.5 |
| WizardLM-Evol-Instruct        | MMLU      | 143k           | 14042         | 75                | 0.5            |

## Dataset and Training Code
Reproduce Llama-rephraser with this [document](train/README.md).
## F1 Score
Reproduce the paper's Tables 5 and 6:
~~~bash
# MMLU
python3 f1score/mmlu/f1_emb.py
python3 f1score/mmlu/f1_llm.py

# HumanEval
python3 f1score/humaneval/f1_emb.py
python3 f1score/humaneval/f1_llm.py
~~~

The scripts above reproduce the F1 scores reported in Table 5 (MMLU) and Table 6 (HumanEval) of the paper.
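For reference, the F1 score reported in those tables is the standard harmonic mean of the detector's precision and recall on rephrased-pair labels:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$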
## Citation
Please cite the following paper if you find the code or datasets helpful.
~~~
@misc{yang2023rethinking,
title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples},
author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2311.04850},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
~~~