{"id":17526983,"url":"https://github.com/LiveCodeBench/LiveCodeBench","last_synced_at":"2025-03-06T06:31:14.483Z","repository":{"id":227544467,"uuid":"771234452","full_name":"LiveCodeBench/LiveCodeBench","owner":"LiveCodeBench","description":"Official repository for the paper \"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code\"","archived":false,"fork":false,"pushed_at":"2024-04-14T01:18:50.000Z","size":3082,"stargazers_count":54,"open_issues_count":4,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-04-14T13:16:07.346Z","etag":null,"topics":["code-execution","code-generation","code-llms","code-repair","gpt-4","test-generation"],"latest_commit_sha":null,"homepage":"https://livecodebench.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LiveCodeBench.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-12T23:34:37.000Z","updated_at":"2024-04-16T07:41:34.571Z","dependencies_parsed_at":"2024-04-16T07:51:46.095Z","dependency_job_id":null,"html_url":"https://github.com/LiveCodeBench/LiveCodeBench","commit_stats":null,"previous_names":["livecodebench/livecodebench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveCodeBench%2FLiveCodeBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveCodeBench%2FLiveCodeBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveCodeBench%2FLiveCodeBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LiveCodeBench%2FLiveCodeBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LiveCodeBench","download_url":"https://codeload.github.com/LiveCodeBench/LiveCodeBench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242161580,"owners_count":20081897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-execution","code-generation","code-llms","code-repair","gpt-4","test-generation"],"created_at":"2024-10-20T15:02:41.163Z","updated_at":"2025-03-06T06:31:14.472Z","avatar_url":"https://github.com/LiveCodeBench.png","language":"Python","readme":"# LiveCodeBench\nOfficial repository for the paper \"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code\"\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://livecodebench.github.io/\"\u003e🏠 Home Page\u003c/a\u003e •\n    \u003ca href=\"https://huggingface.co/datasets/livecodebench/\"\u003e💻 Data \u003c/a\u003e •\n    \u003ca href=\"https://livecodebench.github.io/leaderboard.html\"\u003e🏆 
## Installation
You can clone the repository using the following command:

```bash
git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench
```

We recommend using `uv` for managing dependencies. Once `uv` is installed, you can create a virtual environment and install the dependencies using the following commands:

```bash
uv venv --python 3.11
source .venv/bin/activate

uv pip install -e .
```

## Data
We provide benchmarks for different code capability scenarios:
- [Code Generation](https://huggingface.co/datasets/livecodebench/code_generation_lite)
- [Code Execution](https://huggingface.co/datasets/livecodebench/execution)
- [Test Output Prediction](https://huggingface.co/datasets/livecodebench/test_generation)

## Inference and Evaluation

### Dataset Versions
Since LiveCodeBench is a continuously updated benchmark, we provide the following versions of the dataset:
- `release_v1`: the initial release, containing 400 problems released between May 2023 and Mar 2024.
- `release_v2`: updated release containing 511 problems released between May 2023 and May 2024.
- `release_v3`: updated release containing 612 problems released between May 2023 and Jul 2024.
- `release_v4`: updated release containing 713 problems released between May 2023 and Sep 2024.
- `release_v5`: updated release containing 880 problems released between May 2023 and Jan 2025.

You can use the `--release_version` flag to specify the dataset version you wish to use; it defaults to `release_latest`. Additionally, we have introduced fine-grained release versions such as `v1`, `v2`, `v1_v3`, and `v4_v5` for specific slices of the dataset. For example, the following command runs the evaluation on the `release_v2` dataset:

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v2
```
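
If you want to inspect the underlying problems of a particular release directly (outside the runner), the datasets listed above can be loaded from the Hugging Face Hub. Below is a minimal sketch; the `version_tag` argument and the use of `trust_remote_code` are assumptions about the `code_generation_lite` loader, so check the dataset card for the exact parameters:

```python
# Minimal sketch: inspect a LiveCodeBench release directly from the Hugging Face Hub.
# Assumptions (not guaranteed by this README): the code_generation_lite builder
# accepts a `version_tag` argument matching the --release_version values and ships
# a custom loading script (hence trust_remote_code=True).
from datasets import load_dataset

lcb = load_dataset(
    "livecodebench/code_generation_lite",
    version_tag="release_v2",   # assumed parameter name
    trust_remote_code=True,
)

print(lcb)                           # shows the available splits and problem counts
split_name = list(lcb.keys())[0]     # avoid hard-coding the split name
print(lcb[split_name][0].keys())     # inspect the problem fields without assuming their names
```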

### Code Generation

We use `vllm` for inference with open models. By default, we set `tensor_parallel_size=${num_gpus}` to parallelize inference across all available GPUs; this can be changed with the `--tensor_parallel_size` flag as required.

For running the inference, please provide the `model_name` based on the [./lcb_runner/lm_styles.py](./lcb_runner/lm_styles.py) file.
The scenario flag (here `codegeneration`) specifies the scenario to run.

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration
```

Additionally, the `--use_cache` flag can be used to cache the generated outputs, and the `--continue_existing` flag can be used to reuse existing dumped results. In case you wish to use a model from a local path, you can additionally provide the `--local_model_path` flag with the path to the model. We use `n=10` and `temperature=0.2` for generation. Please check the [./lcb_runner/runner/parser.py](./lcb_runner/runner/parser.py) file for more details on the flags.

For closed API models, the `--multiprocess` flag can be used to parallelize queries to API servers (adjustable according to rate limits).


#### Evaluation
We compute `pass@1` and `pass@5` metrics for model evaluations.
We use a modified version of the checker released with the [`apps` benchmark](https://github.com/hendrycks/apps/blob/main/eval/testing_util.py) to compute the metrics. In particular, we identified and fixed some unhandled edge cases in the original checker and additionally simplified it based on our collected dataset. To run the evaluation, add the `--evaluate` flag:


```bash
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate
```

Note that time limits can cause slight (`< 0.5` point) variation in the computed `pass@1` and `pass@5` metrics.
If you observe a significant variation in performance, lower the `--num_process_evaluate` flag or increase the `--timeout` flag. Please report issues caused by improper timeouts by opening an issue on this repository.

Finally, to get scores over different time windows, you can use the [./lcb_runner/evaluation/compute_scores.py](./lcb_runner/evaluation/compute_scores.py) file.
In particular, you can provide the `--start_date` and `--end_date` flags (in `YYYY-MM-DD` format) to get scores over the specified time window. In our paper, to counter contamination in the DeepSeek models, we only report results on problems released after August 2023. You can replicate those evaluations using:

```bash
python -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01
```
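
For reference, `pass@k` is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021): with `n` generations per problem of which `c` pass, `pass@k = 1 - C(n-c, k) / C(n, k)`, averaged over problems. Below is a minimal sketch of that estimator; it may differ in small details from the implementation in `lcb_runner`:

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Illustrative only; not necessarily the exact code used by lcb_runner.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled solutions passes,
    given n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: hypothetical per-problem correct counts out of n=10 generations each.
correct_counts = [10, 3, 0, 7]
for k in (1, 5):
    score = float(np.mean([pass_at_k(10, c, k) for c in correct_counts]))
    print(f"pass@{k} = {score:.3f}")
```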

**NOTE: We have pruned a large number of test cases from the original benchmark and created `code_generation_lite`, which is set as the default benchmark and offers similar performance estimates much faster. If you wish to use the original benchmark, please use the `--not_fast` flag. We are in the process of updating the leaderboard scores with this updated setting.**

**NOTE: V2 update: to run the updated LiveCodeBench please use `--release_version release_v2`. In addition, if you have existing results from `release_v1`, you can add the `--continue_existing` or, better, the `--continue_existing_with_eval` flag to reuse the old completions or evaluations respectively.**


### Self Repair
For running self repair, you need to provide an additional `--codegen_n` flag that gives the number of codes that were generated during code generation. Additionally, the `--temperature` flag is used to locate the old code generation eval file, which must be present in the `output` directory.

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --codegen_n {num_codes_codegen} --n 1 # only n=1 supported
```

In case you have results on a smaller subset or version of the benchmark, you can use the `--continue_existing` and `--continue_existing_with_eval` flags to reuse the old computations. In particular, you can run the following command to continue from existing generated solutions:

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario selfrepair --evaluate --continue_existing
```

Note that this will only reuse the generated samples and rerun evaluations. To also reuse the old evaluations, add the `--continue_existing_with_eval` flag instead.

### Test Output Prediction
For running the test output prediction scenario you can simply run

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario testoutputprediction --evaluate
```

### Code Execution
For running the code execution scenario you can simply run

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --evaluate
```

Additionally, we support the COT setting with

```bash
python -m lcb_runner.runner.main --model {model_name} --scenario codeexecution --cot_code_execution --evaluate
```

## Custom Evaluation
Alternatively, you can use [`lcb_runner/runner/custom_evaluator.py`](./lcb_runner/runner/custom_evaluator.py) to directly evaluate model generations stored in a custom file. The file should contain a list of model outputs, appropriately formatted for evaluation, in the order of the benchmark problems.

```bash
python -m lcb_runner.runner.custom_evaluator --custom_output_file {path_to_custom_outputs}
```

In particular, arrange the outputs in the following format:

```json
[
    {"question_id": "id1", "code_list": ["code1", "code2"]},
    {"question_id": "id2", "code_list": ["code1", "code2"]}
]
```


## Adding Support for New Models

To add support for new models, we have implemented an extensible framework for adding new models and customizing prompts appropriately; a sketch of the overall pattern is given at the end of this section.

Step 1: Add a new model to the [./lcb_runner/lm_styles.py](./lcb_runner/lm_styles.py) file. In particular, extend the `LMStyle` class to add a new model family and add the model to the `LanguageModelList` array.

Step 2: Since we use instruction-tuned models, we allow configuring the instruction for each model. Modify the [./lcb_runner/prompts/generation.py](./lcb_runner/prompts/generation.py) file to add a new prompt for the model in the `format_prompt_generation` function.
For example, the prompt for the `DeepSeekCodeInstruct` family of models looks as follows:

```python
# ./lcb_runner/prompts/generation.py
if LanguageModelStyle == LMStyle.DeepSeekCodeInstruct:
    prompt = f"{PromptConstants.SYSTEM_MESSAGE_DEEPSEEK}\n\n"
    prompt += f"{get_deepseekcode_question_template_answer(question)}"
    return prompt
```
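
To make the two steps concrete, here is a self-contained illustration of the registration pattern. It deliberately does not reuse the repository's actual classes: the `LanguageModel` dataclass, its field names, and the `MyNewModelChat` family are hypothetical, so consult `./lcb_runner/lm_styles.py` and `./lcb_runner/prompts/generation.py` for the real definitions.

```python
# Self-contained illustration of the Step 1 / Step 2 pattern described above.
# NOT the repository's actual code: class shapes and field names are assumptions.
from dataclasses import dataclass
from enum import Enum


class LMStyle(Enum):
    DeepSeekCodeInstruct = "DeepSeekCodeInstruct"
    MyNewModelChat = "MyNewModelChat"          # Step 1a: hypothetical new model family


@dataclass
class LanguageModel:                           # field names are illustrative assumptions
    model_name: str
    model_style: LMStyle


LanguageModelList = [
    LanguageModel("my-org/my-new-model-chat", LMStyle.MyNewModelChat),   # Step 1b
]


def format_prompt_generation(question: str, style: LMStyle) -> str:
    """Step 2: per-family prompt construction, mirroring prompts/generation.py."""
    if style == LMStyle.MyNewModelChat:
        return "You are an expert programmer. Solve the following problem.\n\n" + question
    raise NotImplementedError(f"No prompt registered for {style}")


print(format_prompt_generation("Print the sum of two integers read from stdin.",
                                LMStyle.MyNewModelChat))
```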

## Submit Models to Leaderboard
We are currently accepting submissions only for the code generation scenario. To submit a model, create a pull request on our [submissions](https://github.com/LiveCodeBench/submissions) repository: copy your model generations folder from `output` into the `submissions` folder and open the pull request. We will review the submission and add the model to the leaderboard accordingly.

## ERRATA
We maintain a list of known issues and updates in the [ERRATA.md](./ERRATA.md) file. In particular, we document issues regarding erroneous tests and problems not amenable to autograding. We are constantly using this feedback to improve our problem selection heuristics as we update LiveCodeBench.

## Results
LiveCodeBench can be used to evaluate the performance of LLMs over different time windows (using problem release dates to filter the problems).
Thus we can detect and prevent potential contamination in the evaluation process and evaluate LLMs on _new_ problems.

<div style="text-align: center;">
    <img src="./assets/images/contamination1.png" alt="Code Generation Live Evaluation" class="teaser-image"
    width="40%" />
    <img src="./assets/images/contamination2.png" alt="Test Output Prediction Live Evaluation" class="teaser-image"
    width="40%" />
</div>

Next, we evaluate models on different code capabilities and find that the relative performance of models does change across tasks (left),
highlighting the need for holistic evaluation of LLMs for code.

<div style="text-align: center;">
    <img src="./assets/images/tasks_radar.png" alt="Holistic Tasks Evaluation" class="teaser-image"
    width="36.1%" />
    <img src="./assets/images/lcb_vs_he.png" alt="Comparing LCB vs HumanEval" class="teaser-image"
    width="46%" />
</div>

We also find evidence of possible overfitting on HumanEval (right).
In particular, models that perform well on HumanEval do not necessarily perform well on LiveCodeBench.
In the scatterplot above, the models cluster into two groups, shaded in red and green.
The red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.

For more details, please refer to our website at [livecodebench.github.io](https://livecodebench.github.io).

## Citation

```bibtex
@article{jain2024livecodebench,
  author    = {Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
  title     = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
  year      = {2024},
  journal   = {arXiv preprint},
}
```