Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/evalplus/evalplus
EvalPlus for rigorous evaluation of LLM-synthesized code
- Host: GitHub
- URL: https://github.com/evalplus/evalplus
- Owner: evalplus
- License: apache-2.0
- Created: 2023-04-15T04:20:10.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-04-21T19:35:24.000Z (8 months ago)
- Last Synced: 2024-04-24T03:47:51.477Z (8 months ago)
- Topics: benchmark, chatgpt, gpt-4, large-language-models, program-synthesis, testing
- Language: Python
- Homepage: https://evalplus.github.io
- Size: 4.58 MB
- Stars: 861
- Watchers: 8
- Forks: 72
- Open Issues: 26
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome - evalplus/evalplus - Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024 (Python)
- awesome-ChatGPT-repositories - evalplus - EvalPlus for rigorous evaluation of LLM-synthesized code (NLP)
- awesome-chatgpt - evalplus/evalplus - EvalPlus is a Python library for rigorous evaluation of LLM-synthesized code (SDK, Libraries, Frameworks / Python)
- awesome-production-machine-learning - EvalPlus - EvalPlus is a rigorous evaluation framework for LLM4Code. (Industry Strength Evaluation)
- StarryDivineSky - evalplus/evalplus - NeurIPS 2023. EvalPlus is a rigorous evaluation framework for LLM4Code, with: HumanEval+: 80x more tests than the original HumanEval! MBPP+: 35x more tests than the original MBPP! Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks. Why EvalPlus? Precise evaluation and ranking: see our leaderboard for the latest LLM rankings before and after rigorous evaluation. Coding rigor: look at the score differences, especially before and after using EvalPlus tests! A smaller drop is better, since it means code generation is more rigorous and less lax, while a large drop means the generated code tends to be fragile. Pre-generated samples: EvalPlus open-sources samples generated by LLMs, so there is no need to re-run expensive benchmarks! (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# `EvalPlus(📖) => 📚`
📢News • 🔥Quick Start • 🚀LLM Backends • 📚Documents • 📜Citation • 🙏Acknowledgement

## About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ **HumanEval+**: 80x more tests than the original HumanEval!
- ✨ **MBPP+**: 35x more tests than the original MBPP!
- ✨ **EvalPerf**: evaluating the efficiency of LLM-generated code!
- ✨ **Framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?
- ✨ **Precise evaluation**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for latest LLM rankings before & after rigorous evaluation.
- ✨ **Coding rigorousness**: Look at the score differences, especially before & after using EvalPlus tests! A smaller drop means more rigorous and less lax code generation, while a bigger drop means the generated code tends to be fragile.
- ✨ **Code efficiency**: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
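Both benchmark datasets can also be consumed programmatically via the `evalplus.data` helpers. Below is a minimal sketch of producing a samples file for HumanEval+; `generate_solution` is a hypothetical stand-in for whatever model you use:

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical stand-in: return a complete solution for the given problem prompt.
    raise NotImplementedError

# Each HumanEval+ problem is keyed by task_id and carries its prompt.
samples = [
    dict(task_id=task_id, solution=generate_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]

# One JSON object per line; this file can then be scored with evalplus.evaluate.
write_jsonl("samples.jsonl", samples)
```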
Want to know more details? Read our papers & materials!

- **EvalPlus**: [NeurIPS'23 paper](https://openreview.net/forum?id=1qvx610Cu7), [Slides](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs), [Poster](https://jw-liu.xyz/assets/pdf/EvalPlus_Poster.pdf), [Leaderboard](https://evalplus.github.io/leaderboard.html)
- **EvalPerf**: [COLM'24 paper](https://openreview.net/forum?id=IBCBMeAhmC), [Poster](https://jw-liu.xyz/assets/pdf/jiawei-colm-evalperf-poster.pdf), [Documentation](./docs/evalperf.md), [Leaderboard](https://evalplus.github.io/evalperf.html)

## 📢 News
Notable EvalPlus updates are tracked below:
- **[2024-10-20 `v0.3.1`]**: EvalPlus `v0.3.1` is officially released! Highlights: *(i)* Code efficiency evaluation via EvalPerf, *(ii)* one command to run all: generation + post-processing + evaluation, *(iii)* support for more inference backends such as Google Gemini & Anthropic, etc.
- **[2024-06-09 pre `v0.3.0`]**: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to [EvalArena](https://github.com/crux-eval/eval-arena).
- **[2024-04-17 pre `v0.3.0`]**: MBPP+ is upgraded to `v0.2.0` by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
- **Earlier**:
- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6)
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140)
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  - ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

## 🔥 Quick Start
### Code Correctness Evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
```

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset humaneval \
--backend vllm \
--greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evaluate --dataset humaneval \
--samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
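The local generation step above writes a JSONL samples file, which the Docker evaluator consumes via `--samples`. As a hedged sketch (each record is expected to carry a `task_id` and the generated `solution`), you can sanity-check that file before evaluation:

```python
import json

# Path mirrors the local-generation example above.
path = "evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl"

with open(path) as f:
    samples = [json.loads(line) for line in f]

print(len(samples), "samples")
print(samples[0]["task_id"])           # e.g., "HumanEval/0"
print(samples[0]["solution"][:200])    # start of the generated code
```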
### Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset evalperf \
--backend vllm \
--temperature 1.0 \
--n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```
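EvalPerf reads hardware performance counters, which Linux gates behind `/proc/sys/kernel/perf_event_paranoid`; that is what the `sudo` line above relaxes. A small sketch (not part of the EvalPlus CLI) to verify the setting before running:

```python
from pathlib import Path

# The Quick Start above sets this knob to 0 so performance counters are accessible.
paranoid = int(Path("/proc/sys/kernel/perf_event_paranoid").read_text().strip())

if paranoid > 0:
    print(f"perf_event_paranoid={paranoid}: run the sudo command above to lower it to 0.")
else:
    print(f"perf_event_paranoid={paranoid}: performance counters should be accessible.")
```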
## 🚀 LLM Backends

### HuggingFace models
- `transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models.
> By default it is detected by `tokenizer.chat_template` when using `hf`/`vllm` as backend.
> For other backends, only chat mode is allowed.
>
> Therefore, if your base models come with a `tokenizer.chat_template`,
> please add `--force-base-prompt` to avoid being evaluated
> in chat mode.
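To see which mode will be picked for your model, you can inspect the tokenizer yourself. A hedged sketch using Hugging Face `transformers` (not part of the EvalPlus CLI):

```python
from transformers import AutoTokenizer

# The hf/vllm backends key off tokenizer.chat_template:
# if it is set, chat-style prompts are used by default.
tokenizer = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")

if tokenizer.chat_template is None:
    print("No chat template: base prompts will be used.")
else:
    print("Chat template found: chat prompts will be used "
          "(add --force-base-prompt for base-style prompting).")
```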
Enable Flash Attention 2 :: click to expand ::

```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend hf \
--attn-implementation [flash_attention_2|sdpa] \
--greedy
```

- `vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--tp [TENSOR_PARALLEL_SIZE] \
--greedy
```

- `openai` compatible servers (e.g., [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)):
```bash
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend openai \
--base-url http://localhost:8000/v1 \
--greedy
```
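Before pointing EvalPlus at a self-hosted endpoint, it can help to confirm the server speaks the OpenAI API. A hedged sketch with the `openai` Python client, assuming a vLLM server at `http://localhost:8000/v1` as in the example above:

```python
from openai import OpenAI

# Self-hosted OpenAI-compatible servers typically accept any API key string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the served models; the id should match the --model passed to evalplus.evaluate.
for model in client.models.list():
    print(model.id)
```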
### OpenAI models

- Access OpenAI APIs from [OpenAI Console](https://platform.openai.com/)
```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
--dataset [humaneval|mbpp] \
--backend openai \
--greedy
```

### Anthropic models
- Access Anthropic APIs from [Anthropic Console](https://console.anthropic.com/)
```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
--dataset [humaneval|mbpp] \
--backend anthropic \
--greedy
```

### Google Gemini models
- Access Gemini APIs from [Google AI Studio](https://aistudio.google.com/)
```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
--dataset [humaneval|mbpp] \
--backend google \
--greedy
```

You can check out the generated samples and results at `evalplus_results/[humaneval|mbpp]/`.
⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

## 📚 Documents
To learn more about how to use EvalPlus, please refer to:
- [Command Line Interface](./docs/cli.md)
- [EvalPerf](./docs/evalperf.md)
- [Program Execution](./docs/execution.md)

## 📜 Citation
```bibtex
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
title = {Evaluating Language Models for Efficient Code Generation},
author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
booktitle = {First Conference on Language Modeling},
year = {2024},
url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

## 🙏 Acknowledgement
- [HumanEval](https://github.com/openai/human-eval)
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp)