A Contamination-free Multi-task Language Understanding Benchmark
- Host: GitHub
- URL: https://github.com/microsoft/mmlu-cf
- Owner: microsoft
- License: mit
- Created: 2024-12-02T16:49:05.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-01-09T08:46:15.000Z (9 months ago)
- Last Synced: 2025-04-06T00:09:41.032Z (6 months ago)
- Topics: benchmark, contamination, llm, mmlu
- Homepage: https://arxiv.org/pdf/2412.15194
- Size: 4.82 MB
- Stars: 117
- Watchers: 3
- Forks: 2
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
  - Code of conduct: CODE_OF_CONDUCT.md
  - Security: SECURITY.md
  - Support: SUPPORT.md
# MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark



[📜 Paper](https://arxiv.org/pdf/2412.15194) • [🤗 HF Dataset](https://huggingface.co/datasets/microsoft/MMLU-CF) • [🐱 GitHub](https://github.com/microsoft/mmlu-cf)

## 📢 News and Updates
- [2024.12.01] 🔥 We have initialized the repository.
- [2024.12.16] 🔥 We have added the evaluation results of Phi-4-14B and Llama-3.3-70B-Instruct.
- [2024.12.20] 🔥 We have released the validation dataset of MMLU-CF.
- [2025.1.9] 🔥 [OpenCompass](https://github.com/open-compass/opencompass) now supports MMLU-CF. Feel free to give it a try!

## 1. The Motivation of MMLU-CF
- The open-source nature of benchmarks such as MMLU, combined with the broad sources of LLM training data, has inevitably led to benchmark contamination and unreliable evaluation results. To alleviate this issue, we propose MMLU-CF.
- (a) An instance of leakage in MMLU: when MMLU questions are used as prompts, certain LLMs, due to their memorization capabilities, directly provide **choices identical to the original ones**. (b) When MMLU-CF questions are used as prompts, LLMs can only provide guessed choices. This indicates that the MMLU test set suffers from data contamination and memorization by some LLMs, while the proposed MMLU-CF avoids such leakage.
## 2. How to Evaluate Your Models on the MMLU-CF Validation/Test Set
#### (1) We perform automated testing only on Hugging Face models. After following the steps below and obtaining the validation set results from [OpenCompass](https://github.com/open-compass/opencompass), you can request the test set results via GitHub Issues.
**Step 1**. **Validation set evaluation**: Obtain the validation results for your model with the LLM evaluation tool [OpenCompass](https://github.com/open-compass/opencompass). The validation data will be loaded automatically from [Hugging Face](https://huggingface.co/datasets/microsoft/MMLU-CF).
- For a **5-shot** evaluation with Internlm 2.5:
```
opencompass --models hf_internlm2_5_1_8b_chat --datasets mmlu_cf_few_shot --summarizer mmlu_cf
```
- For a **0-shot** evaluation with Internlm 2.5:
```
opencompass --models hf_internlm2_5_1_8b_chat --datasets mmlu_cf_zero_shot --summarizer mmlu_cf
```
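If you just want to inspect the validation data outside of OpenCompass, it can also be loaded directly. The sketch below assumes the standard Hugging Face `datasets` API and a `validation` split name; check the dataset card for the actual configuration.

```python
# Minimal sketch: peek at the MMLU-CF validation data with Hugging Face
# `datasets`. The split name "validation" is an assumption -- verify it on
# the dataset card at https://huggingface.co/datasets/microsoft/MMLU-CF.
from datasets import load_dataset

val = load_dataset("microsoft/MMLU-CF", split="validation")
print(len(val))   # the open validation set contains 10,000 questions
print(val[0])     # one multiple-choice question with options and answer
```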
**Step 2**. **Test set evaluation**: With the validation results in hand, submit a GitHub issue on the [MMLU-CF](https://github.com/microsoft/mmlu-cf) repository to request the test set results. Please follow the format below.

Example 1:
```
Title:
Test set evaluation Request - add HF model [microsoft/phi-4]
Content:
The result on validation set: 68.5%
```
**Notes**:
- Ensure you use the format with square brackets `[ ]` as shown. The model name **microsoft/phi-4** corresponds to the model's name on Hugging Face.
- We will submit your model automatically. The time to receive results depends on the number of models queued for evaluation, but it typically takes **1-2 weeks**.

#### (2) For API models, once OpenCompass supports the model's interface, you can obtain the test set results by sending a temporary key to [Email](mailto:yangyu.huang@microsoft.com) after receiving the validation set results.
## 3. What is the Difference between MMLU-CF and MMLU
MMLU focuses on breadth and reasoning without considering contamination prevention. For MMLU-CF, we apply three decontamination rules to mitigate unintentional data leakage while collecting data from a broader domain, and we keep the test set closed-source to prevent malicious data leakage.
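The paper specifies the three rules in detail. Purely as an illustration of the general idea behind overlap-based decontamination, and not the authors' actual procedure, a naive n-gram check might look like this:

```python
# Illustrative only -- NOT the MMLU-CF decontamination rules. A naive
# n-gram overlap test of the kind commonly used to flag benchmark
# questions that appear verbatim in a training corpus.
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def looks_contaminated(question, corpus_docs, n=8):
    q = ngrams(question, n)
    # Flag the question if any corpus document shares an n-gram with it.
    return any(q & ngrams(doc, n) for doc in corpus_docs)
```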
## 4. Leaderboard

MMLU scores are 5-shot; the remaining columns are MMLU-CF (Δ is the gap between test and validation scores).

| Model | MMLU 5-shot | 5-shot Test | 5-shot Validation | 5-shot Δ | 0-shot Test | 0-shot Validation | 0-shot Δ |
|---|---|---|---|---|---|---|---|
| **API** | | | | | | | |
| GPT-4o | 88.0 | 73.4 | 73.4 | +0.0 | 71.9 | 72.4 | -0.5 |
| GPT-4-Turbo | 86.5 | 70.4 | 70.1 | +0.3 | 68.9 | 68.7 | +0.1 |
| GPT-4o-mini | 81.8 | 65.5 | 65.1 | +0.4 | 66.0 | 65.3 | +0.7 |
| Gemini-1.5-Flash | 78.7 | 64.8 | 64.9 | -0.1 | 56.7 | 56.9 | -0.2 |
| GPT-3.5-Turbo | 71.4 | 58.2 | 59.0 | -0.8 | 57.2 | 58.1 | -0.9 |
| **Large** | | | | | | | |
| Qwen2.5-72B-instruct | 85.3 | 71.6 | 71.3 | +0.3 | 70.6 | 70.4 | +0.2 |
| Llama-3-70B-instruct | 82.0 | 68.9 | 68.8 | +0.1 | 68.1 | 67.4 | +0.7 |
| Llama-3.3-70B-instruct | 86.3 | 68.8 | 67.8 | +1.0 | 67.6 | 67.5 | +0.1 |
| Llama-3.1-70B-instruct | 86.0 | 68.7 | 68.1 | +0.6 | 70.4 | 69.7 | +0.7 |
| Phi-3.5-MoE-instruct | 78.9 | 64.6 | 64.5 | +0.1 | 63.1 | 62.1 | +1.0 |
| Qwen2-72B-instruct | 82.3 | 63.7 | 64.3 | -0.6 | 62.4 | 62.5 | -0.1 |
| Mixtral-8x22B-instruct | 76.2 | 62.8 | 62.5 | +0.3 | 65.3 | 64.8 | +0.5 |
| Qwen1.5-72B-chat | 75.6 | 59.8 | 60.2 | -0.4 | 59.1 | 59.6 | -0.5 |
| Llama-2-70B-chat | 68.9 | 52.2 | 51.8 | +0.4 | 51.2 | 50.9 | +0.3 |
| **Medium** | | | | | | | |
| Qwen2.5-32B-instruct | 83.9 | 69.7 | 68.8 | +0.9 | 68.9 | 68.8 | +0.1 |
| Phi-4-14B | 84.8 | 67.8 | 68.5 | -0.7 | 68.5 | 69.4 | -0.9 |
| Qwen2.5-14B-instruct | 79.9 | 66.4 | 66.1 | +0.3 | 67.0 | 66.0 | +1.0 |
| Phi-3-medium-instruct | 77.9 | 64.2 | 64.2 | +0.0 | 62.5 | 62.7 | -0.2 |
| Gemma2-27B | 75.2 | 63.9 | 63.5 | +0.4 | 64.2 | 64.0 | +0.2 |
| Yi-1.5-34B-chat | 76.8 | 61.3 | 60.5 | +0.8 | 60.6 | 59.5 | +1.1 |
| Mixtral-8x7B-instruct-v0.1 | 70.5 | 58.3 | 57.1 | -1.2 | 58.9 | 58.5 | +0.4 |
| Deepseek-v2-lite-chat | 55.7 | 49.3 | 48.7 | +0.6 | 48.2 | 47.7 | +0.5 |
| Baichuan-2-13B-chat | 57.3 | 48.3 | 48.6 | -0.3 | 47.1 | 48.1 | -1.0 |
| Llama-2-13B-chat | 54.8 | 42.8 | 42.1 | +0.7 | 44.8 | 44.6 | +0.2 |
| **Small** | | | | | | | |
| Qwen2.5-7B-instruct | 75.4 | 61.3 | 60.4 | +0.9 | 59.3 | 58.6 | +0.7 |
| Qwen2-7B-instruct | 70.5 | 58.1 | 57.9 | +0.2 | 58.3 | 57.4 | +0.9 |
| Glm-4-9B-chat | 72.4 | 57.8 | 57.9 | -0.1 | 58.6 | 58.7 | -0.1 |
| Internlm-2.5-7B-chat | 72.8 | 57.3 | 56.8 | +0.5 | 57.9 | 56.9 | +1.0 |
| Llama-3-8B-instruct | 68.4 | 57.3 | 56.5 | +0.8 | 56.4 | 55.4 | +1.0 |
| Llama-3.1-8B-instruct | 68.1 | 57.1 | 57.9 | -0.8 | 56.1 | 56.1 | +0.0 |
| Gemma-2-9B | 71.3 | 53.7 | 53.3 | +0.4 | 32.1 | 31.2 | +0.9 |
| Yi-1.5-6B-chat | 62.8 | 52.8 | 51.4 | +1.4 | 52.2 | 51.9 | +0.3 |
| Mistral-7B-instruct-v0.3 | 60.3 | 50.7 | 50.9 | -0.2 | 51.1 | 50.9 | +0.2 |
| Baichuan-2-7B-chat | 52.9 | 44.5 | 43.9 | +0.6 | 43.9 | 44.0 | -0.1 |
| Llama-2-7B-chat | 45.3 | 39.4 | 38.5 | +0.9 | 41.9 | 40.9 | +1.0 |
| **Mini** | | | | | | | |
| Phi-3-mini-instruct (3.8B) | 70.9 | 57.9 | 58.1 | -0.2 | 58.2 | 57.5 | +0.7 |
| Phi-3.5-mini-instruct (3.8B) | 69.1 | 57.9 | 57.4 | +0.5 | 58.3 | 57.7 | +0.6 |
| Qwen2.5-3B-instruct | 64.4 | 55.9 | 56.4 | -0.5 | 54.3 | 53.9 | +0.4 |
| Qwen2.5-1.5B-instruct | 50.7 | 51.2 | 51.0 | +0.2 | 50.7 | 50.4 | +0.3 |
| Qwen2-1.5B-instruct | 52.4 | 47.1 | 47.5 | -0.4 | 45.2 | 44.5 | +0.7 |
| Gemma-2-2B | 51.3 | 43.9 | 42.4 | +1.5 | 30.5 | 29.4 | +0.9 |
| Qwen2.5-0.5B-instruct | 24.1 | 41.9 | 41.1 | +0.8 | 36.0 | 34.9 | +1.1 |
| Internlm-2-chat-1.8b | 47.1 | 40.5 | 39.4 | +1.1 | 41.2 | 39.8 | +1.4 |
| Qwen2-0.5B-instruct | 37.9 | 38.3 | 38.3 | +0.0 | 33.5 | 33.5 | +0.0 |
## 5. Data Construction Pipeline

The pipeline involves:
1. **MCQ Collection** to gather a diverse set of questions;
2. **MCQ Cleaning** to ensure quality;
3. **Difficulty Sampling** to ensure an appropriate difficulty distribution for questions;
4. **LLM Checking**: LLMs including GPT-4o, Gemini, and Claude review the accuracy and safety of the data;
5. **Contamination-Free Processing** to prevent data leakage and maintain dataset purity.

Ultimately, this process yields MMLU-CF, consisting of 10,000 questions in the closed-source test set and 10,000 in the open-source validation set.
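Purely as a loose sketch of stage 3, and an assumption rather than the paper's actual procedure: given a per-question difficulty proxy in [0, 1] (e.g., the fraction of reference models answering correctly), difficulty sampling could bucket questions and draw evenly across buckets.

```python
import random
from collections import defaultdict

# Hypothetical sketch of difficulty sampling (stage 3), not the authors'
# code: bucket questions by a difficulty proxy in [0, 1] and sample a
# fixed number from each bucket to flatten the difficulty distribution.
def sample_by_difficulty(questions, difficulty, n_bins=5, per_bin=2000, seed=0):
    rng = random.Random(seed)
    bins = defaultdict(list)
    for q in questions:
        b = min(int(difficulty[q["id"]] * n_bins), n_bins - 1)
        bins[b].append(q)
    sampled = []
    for b in range(n_bins):
        sampled.extend(rng.sample(bins[b], min(per_bin, len(bins[b]))))
    return sampled
```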
## 6. Contact
For any inquiries or concerns, feel free to reach out to us via email: [Qihao Zhao](mailto:qhzhaoo@gmail.com) and [Yangyu Huang](mailto:yanghuan@microsoft.com).

## 7. Citation
```
@misc{zhao2024mmlucfcontaminationfreemultitasklanguage,
title={MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark},
author={Qihao Zhao and Yangyu Huang and Tengchao Lv and Lei Cui and Qinzheng Sun and Shaoguang Mao and Xin Zhang and Ying Xin and Qiufeng Yin and Scarlett Li and Furu Wei},
year={2024},
eprint={2412.15194},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.15194},
}
```

## 8. License
This repository is licensed under the [MIT](https://github.com/microsoft/mmlu-cf/blob/main/LICENSE) License.
The validation dataset of MMLU-CF is subject to the [CDLA-2.0](https://cdla.dev/permissive-2-0/) License.