{"id":13521500,"url":"https://github.com/lm-sys/llm-decontaminator","last_synced_at":"2025-08-11T21:44:31.824Z","repository":{"id":206352845,"uuid":"705972277","full_name":"lm-sys/llm-decontaminator","owner":"lm-sys","description":"Code for the paper \"Rethinking Benchmark and Contamination for Language Models with Rephrased Samples\"","archived":false,"fork":false,"pushed_at":"2023-12-20T22:33:26.000Z","size":11031,"stargazers_count":302,"open_issues_count":4,"forks_count":25,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-05-26T11:16:37.450Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lm-sys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-10-17T04:06:33.000Z","updated_at":"2025-05-16T07:54:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"b9aba8bc-0789-4eec-a27c-2833bca0cc07","html_url":"https://github.com/lm-sys/llm-decontaminator","commit_stats":null,"previous_names":["lm-sys/llm-decontaminator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lm-sys/llm-decontaminator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lm-sys%2Fllm-decontaminator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lm-sys%2Fllm-decontaminator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lm-sys%2Fllm-decontaminator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lm-sys%2Fllm-decontaminator/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lm-sys","download_url":"https://codeload.github.com/lm-sys/llm-decontaminator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lm-sys%2Fllm-decontaminator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269963116,"owners_count":24504298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-11T02:00:10.019Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:00:35.203Z","updated_at":"2025-08-11T21:44:31.754Z","avatar_url":"https://github.com/lm-sys.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","🧰 Resources","数据 Data"],"sub_categories":["大语言对话模型及数据","🛠️ Tools"],"readme":"# LLM Decontaminator\n\n| [Paper](https://arxiv.org/pdf/2311.04850.pdf) | [Blog](https://lmsys.org/blog/2023-11-14-llm-decontaminator/) |\n\n\u003cimg src=\"./assets/overview.png\" alt=\"img\" width=\"800\"/\u003e\n\nIn this package, you can use LLM decontaminator to quantify a dataset's rephrased samples relative to a benchmark.\nBased on the detection results, you can estimate the contamination of rephrased samples in the dataset and remove them from the training set.\n\n## Contents\n\n- [Install](#install)\n- [Detect](#detect)\n    - [Pre-Process](#pre-process)\n    - [End2End](#end2end)\n- [Real-world 
dataset](#contamination-in-real-world-dataset)\n- [Dataset and training code](#dataset-and-training-code)\n- [F1 Score](#f1-score)\n- [Citation](#citation)\n\n\n## Install\n\n~~~bash\ngit clone https://github.com/lm-sys/llm-decontaminator.git\ncd llm-decontaminator\nconda create -n llm-detect python=3.9 -y\nconda activate llm-detect\npip install -r requirement.txt\n~~~\n\n\n## Detect\n\n### Pre-Process\nConvert both the train set and the test set to JSONL format, with each line containing `{\"text\": data}`.\n\n~~~py\nimport json\nfrom datasets import load_dataset\n\n# Load dataset\ndataset = load_dataset('bigcode/starcoderdata', data_dir=\"python\", split=\"train\", streaming=True)\n\n# Extract up to 500,000 samples\nsubset_size = 500000\ncodes = [sample['content'] for _, sample in zip(range(subset_size), dataset)]\n\n# Write to file\nwith open(\"starcoderdata.jsonl\", \"w\") as fout:\n    for code in codes:\n        fout.write(json.dumps({\"text\": code}) + \"\\n\")\n~~~\n\n### End2End\n\n~~~bash\n# export OPENAI_API_KEY=sk-xxx\n# run llm-decontaminator\npython3 main.py --train_path ./data/train/CodeAlpaca-20k.jsonl \\\n    --test_path ./data/test/HumanEval.jsonl \\\n    --output_path ./data/database/CodeAlpacaDB.jsonl \\\n    --data-type code \\\n    --top_k 1\n~~~\n\n## Contamination in Real-world Dataset\n\n\n\n| Training Set                  | Benchmark | Train Set Size | Test Set Size | Rephrased Samples | Percentage (%) |\n|-------------------------------|-----------|----------------|---------------|-------------------|----------------|\n| The Stack (4G subset)         | HumanEval | 500k           | 164           | 31                | 18.9           |\n| StarCoder-Data (2.4G subset)  | HumanEval | 500k           | 164           | 26                | 15.9           |\n| CodeExercise-Python           | HumanEval | 27k            | 164           | 26                | 15.9           |\n| CodeAlpaca                    | HumanEval | 20k            | 164           | 21            
    | 12.8           |\n| RedPajama-Data-1T (16G subset)| HumanEval | 1625k          | 164           | 14                | 8.5            |\n| Evol-Instruct-Code            | HumanEval | 78.3k          | 164           | 13                | 7.9            |\n| rossetacode                   | HumanEval | 4.26k          | 164           | 4                 | 2.4            |\n| MATHInstruct (before Sep 30)  | MATH Test | 262k           | 5000          | 769               | 15.4           |\n| MATH Train                    | MATH Test | 7.5k           | 5000          | 79                | 1.6            |\n| FLAN CoT                      | MMLU      | 184k           | 14042         | 76                | 0.5            |\n| WizardLM-Evol-Instruct        | MMLU      | 143k           | 14042         | 75                | 0.5            |\n\n\n## Dataset and Training Code\n\nReproduce Llama-rephraser with this [document](train/README.md).\n\n## F1 Score\n\nReproduce Tables 5 and 6 from the paper:\n\n~~~bash\n# MMLU\npython3 f1score/mmlu/f1_emb.py\npython3 f1score/mmlu/f1_llm.py\n\n# HumanEval\npython3 f1score/humaneval/f1_emb.py\npython3 f1score/humaneval/f1_llm.py\n~~~\n\nTable 5:\n\n\u003cimg src=\"./assets/MMLU-f1score.png\" alt=\"img\" width=\"800\"/\u003e\n\nTable 6:\n\n\u003cimg src=\"./assets/HumanEval-f1score.png\" alt=\"img\" width=\"400\"/\u003e\n\n\n## Citation\n\nPlease cite the following paper if you find the code or datasets helpful.\n~~~\n@misc{yang2023rethinking,\n      title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples}, \n      author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. 
Gonzalez and Ion Stoica},\n      year={2023},\n      eprint={2311.04850},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n~~~","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flm-sys%2Fllm-decontaminator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flm-sys%2Fllm-decontaminator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flm-sys%2Fllm-decontaminator/lists"}