Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/evalplus/evalplus
EvalPlus for rigorous evaluation of LLM-synthesized code
- Host: GitHub
- URL: https://github.com/evalplus/evalplus
- Owner: evalplus
- License: apache-2.0
- Created: 2023-04-15T04:20:10.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-21T19:35:24.000Z (8 months ago)
- Last Synced: 2024-04-24T03:47:51.477Z (8 months ago)
- Topics: benchmark, chatgpt, gpt-4, large-language-models, program-synthesis, testing
- Language: Python
- Homepage: https://evalplus.github.io
- Size: 4.58 MB
- Stars: 861
- Watchers: 8
- Forks: 72
- Open Issues: 26
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome - evalplus/evalplus - Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 (Python)
- awesome-ChatGPT-repositories - evalplus - EvalPlus for rigorous evaluation of LLM-synthesized code (NLP)
- awesome-chatgpt - evalplus/evalplus - EvalPlus is a Python library for rigorous evaluation of LLM-synthesized code (SDK, Libraries, Frameworks / Python)
- awesome-production-machine-learning - EvalPlus - EvalPlus is a rigorous evaluation framework for LLM4Code. (Industry Strength Evaluation)
- StarryDivineSky - evalplus/evalplus - NeurIPS 2023. EvalPlus is a rigorous evaluation framework for LLM4Code, with: HumanEval+: 80x more tests than the original HumanEval! MBPP+: 35x more tests than the original MBPP! Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks. Why EvalPlus? Precise evaluation and ranking: see our leaderboard for the latest LLM rankings before and after rigorous evaluation. Coding rigor: look at the score differences, especially before and after using EvalPlus tests! A smaller drop is better, since it means code generation is more rigorous and less lax, while a large drop means the generated code tends to be fragile. Pre-generated samples: EvalPlus open-sources samples generated by LLMs, so there is no need to re-run expensive benchmarks! (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# `EvalPlus(📖) => 📚`
📢News • 🔥Quick Start • 🚀LLM Backends • 📚Documents • 📜Citation • 🙏Acknowledgement

## About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ **HumanEval+**: 80x more tests than the original HumanEval!
- ✨ **MBPP+**: 35x more tests than the original MBPP!
- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!
- ✨ **Framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?
- ✨ **Precise evaluation**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for latest LLM rankings before & after rigorous evaluation.
- ✨ **Coding rigorousness**: Look at the score differences, especially before & after using EvalPlus tests! A smaller drop means more rigorous and less lax code generation, while a bigger drop means the generated code tends to be fragile.
- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
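Both benchmark datasets can also be consumed programmatically via the `evalplus.data` helpers. Below is a minimal sketch of producing a samples file for HumanEval+; `generate_solution` is a hypothetical stand-in for whatever model you use:

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical stand-in: return a complete solution for the given problem prompt.
    raise NotImplementedError

# Each HumanEval+ problem is keyed by task_id and carries its prompt.
samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]

# One JSON object per line; this file can then be scored with evalplus.evaluate.
write_jsonl("samples.jsonl", samples)
```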
Want to know more details? Read our papers & materials!

- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf), [Leaderboard](https://evalplus.github.io/leaderboard.html)
- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md), [Leaderboard](https://evalplus.github.io/evalperf.html)

## 📢 News
Notable EvalPlus updates are tracked below:
- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Highlights: *(i)* Code efficiency evaluation via EvalPerf, *(ii)* one command to run all: generation + post-processing + evaluation, *(iii)* support for more inference backends such as Google Gemini & Anthropic, etc.
- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena).
- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- **Earlier**:
- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6)
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140)
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  - ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

## 🔥 Quick Start
### Code Correctness Evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
```

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
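The local generation step above writes a JSONL samples file, which the Docker evaluator consumes via `--samples`. As a hedged sketch (each record is expected to carry a `task_id` and the generated `solution`), you can sanity-check that file before evaluation:

```python
import json

# Path mirrors the local-generation example above.
path = "evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl"

with open(path) as f:
    samples = [json.loads(line) for line in f]

print(len(samples), "samples")
print(samples[0]["task_id"])           # e.g., "HumanEval/0"
print(samples[0]["solution"][:200])    # start of the generated code
```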
### Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```
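EvalPerf reads hardware performance counters, which Linux gates behind `/proc/sys/kernel/perf_event_paranoid`; that is what the `sudo` line above relaxes. A small sketch (not part of the EvalPlus CLI) to verify the setting before running:

```python
from pathlib import Path

# The Quick Start above sets this knob to 0 so performance counters are accessible.
paranoid = int(Path("/proc/sys/kernel/perf_event_paranoid").read_text().strip())

if paranoid > 0:
    print(f"perf_event_paranoid={paranoid}: run the sudo command above to lower it to 0.")
else:
    print(f"perf_event_paranoid={paranoid}: performance counters should be accessible.")
```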
## 🚀 LLM Backends

### HuggingFace models
- `transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models.
> By default it is detected by `tokenizer.chat_template` when using `hf`/`vllm` as backend.
> For other backends, only chat mode is allowed.
>
> Therefore, if your base models come with a `tokenizer.chat_template`,
> please add `--force-base-prompt` to avoid being evaluated
> in chat mode.
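To see which mode will be picked for your model, you can inspect the tokenizer yourself. A hedged sketch using Hugging Face `transformers` (not part of the EvalPlus CLI):

```python
from transformers import AutoTokenizer

# The hf/vllm backends key off tokenizer.chat_template:
# if it is set, chat-style prompts are used by default.
tokenizer = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")

if tokenizer.chat_template is None:
    print("No chat template: base prompts will be used.")
else:
    print("Chat template found: chat prompts will be used "
          "(add --force-base-prompt for base-style prompting).")
```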
Enable Flash Attention 2 :: click to expand ::

```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
```

- `vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
```

- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)):
```bash
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
```
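Before pointing EvalPlus at a self-hosted endpoint, it can help to confirm the server speaks the OpenAI API. A hedged sketch with the `openai` Python client, assuming a vLLM server at `http://localhost:8000/v1` as in the example above:

```python
from openai import OpenAI

# Self-hosted OpenAI-compatible servers typically accept any API key string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the served models; the id should match the --model passed to evalplus.evaluate.
for model in client.models.list():
    print(model.id)
```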
### OpenAI models

- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)
```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
```

### Anthropic models
- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/)
```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
```

### Google Gemini models
- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/)
```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
```

You can check out the generated samples and results at `evalplus_results/[humaneval|mbpp]/`.
⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

## 📚 Documents
To learn more about how to use EvalPlus, please refer to:
- [Command Line Interface](./docs/cli.md)
- [EvalPerf](./docs/evalperf.md)
- [Program Execution](./docs/execution.md)

## 📜 Citation
```bibtex
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

## 🙏 Acknowledgement
- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)