{"id":13788738,"url":"https://github.com/abacaj/code-eval","last_synced_at":"2025-04-06T07:14:59.115Z","repository":{"id":177948868,"uuid":"661130724","full_name":"abacaj/code-eval","owner":"abacaj","description":"Run evaluation on LLMs using human-eval benchmark","archived":false,"fork":false,"pushed_at":"2023-09-12T03:03:56.000Z","size":113,"stargazers_count":402,"open_issues_count":5,"forks_count":35,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-03-30T06:09:40.749Z","etag":null,"topics":["humaneval","wizardcoder"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abacaj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-07-01T22:16:02.000Z","updated_at":"2025-03-28T03:37:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"23e1b464-cacc-4d6d-a31b-13363d53053d","html_url":"https://github.com/abacaj/code-eval","commit_stats":null,"previous_names":["abacaj/code-eval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abacaj%2Fcode-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abacaj%2Fcode-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abacaj%2Fcode-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abacaj%2Fcode-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abacaj","download_url":"https://codeload.github.com/abacaj/code-eval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247445681,"owners_count":20939961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["humaneval","wizardcoder"],"created_at":"2024-08-03T21:00:52.647Z","updated_at":"2025-04-06T07:14:59.094Z","avatar_url":"https://github.com/abacaj.png","language":"Python","readme":"# code-eval\n\n## What\n\nThis is a repo I use to run human-eval on code models, adjust as needed. Some scripts were adjusted from wizardcoder repo ([process_eval.py](https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/src/process_humaneval.py)). The evaluation code is duplicated in several files, mostly to handle edge cases around model tokenizing and loading (will clean it up).\n\n## Results\n\nTable is sorted by pass@1 score.\n \n| model                                                                                                 | size | pass@1  | pass@10 | screenshot                                                                                                         |\n| ----------------------------------------------------------------------------------------------------- | ---- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------ |\n| [sahil2801/replit-code-instruct-glaive](https://huggingface.co/sahil2801/replit-code-instruct-glaive) | 3B   | 63.5%   | 67%     | ![instruct-glaive](https://github.com/abacaj/code-eval/assets/7272343/6fd7527d-0dc4-4b48-8a57-ad0373074bc5)        |\n| [WizardCoder-15B-V1.0](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)                          | 15B  | 57%     | 68.9%   | ![wizardcoder](https://github.com/abacaj/code-eval/assets/7272343/0b941ff8-b474-4236-bbc0-89d925bbd34e)            |\n| [bigcode/starcoder](https://huggingface.co/bigcode/starcoder)                                         | 15B  | 34.6%   | 48.7%   | ![starcoder](https://github.com/abacaj/code-eval/assets/7272343/eb5df978-f56b-4557-a433-8b8fa863a059)              |\n| [openchat/opencoderplus](https://huggingface.co/openchat/opencoderplus)                               | 15B  | 27.3%   | 43.9%   | ![opencoder](https://github.com/abacaj/code-eval/assets/7272343/1fa9f5ef-941b-4ea8-981e-c3f258c03fee)              |\n| [teknium/Replit-v1-CodeInstruct-3B](https://huggingface.co/teknium/Replit-v1-CodeInstruct-3B)         | 3B   | 25.8%   | 42.6%   | ![replit-codeinstruct-v1](https://github.com/abacaj/code-eval/assets/7272343/4fca98d8-2c22-43ce-9639-e998ecb4fedc) |\n| [teknium/Replit-v2-CodeInstruct-3B](https://huggingface.co/teknium/Replit-v2-CodeInstruct-3B)         | 3B   | 21.5%   | 31%     | ![replit-codeinstruct-v2](https://github.com/abacaj/code-eval/assets/7272343/655aaa1d-0715-4fcd-b9ba-a22b5fddb215) |\n| [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b)                                  | 3B   | 17.1%   | 29.8%   | ![replit-code-v1](https://github.com/abacaj/code-eval/assets/7272343/6b387aa8-db60-4f04-b458-35b010b1145c)         |\n| [mpt-7b](https://huggingface.co/mosaicml/mpt-7b)                                                      | 7B   | 15.9%   | 23.7%   | ![mpt-7b](https://github.com/abacaj/code-eval/assets/7272343/16965905-a368-4254-aeab-5e44126eba84)                 |\n| [xgen-7b-8k-base](https://huggingface.co/Salesforce/xgen-7b-8k-base)                                  | 7B   | 14.9%   | 22.5%   | ![xgen-7b-8k-base](https://github.com/abacaj/code-eval/assets/7272343/995c84a9-ee69-43bf-8502-a74eba1d927a)        |\n| [openllama-7b-v2](https://huggingface.co/openlm-research/open_llama_7b)                               | 7B   | 14%     | 23.1%   | ![openllama-7b-v2](https://github.com/abacaj/code-eval/assets/7272343/e38f08a0-ae74-4c51-b3a7-638781477e1b)        |\n| [llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf)                                                | 7B   | 13.1%   | 21.9%   | ![llama-2-7b](https://github.com/abacaj/code-eval/assets/7272343/cc86cc7c-beac-4993-9ca3-d91a48a790e4)                                                                                               |\n| [llama-7b](https://huggingface.co/huggyllama/llama-7b)                                                | 7B   | 12.1%   | 18.9%   | ![llama-7b](https://github.com/abacaj/code-eval/assets/7272343/605a3c4e-0b2b-4c10-a185-f2a4d34ec10d)                                                                                               |\n| [mpt-30b](https://huggingface.co/mosaicml/mpt-30b)                                                    | 30B  | pending | pending | pending                                                                                                            |\n\n## FAQ\n\n\u003e Why is there a discrepancy on some of the scores between official numbers? \n\nBecause it is not obvious or published what prompt or processing the official models used to conduct their evaluation on this benchmark. The goal here is to try and best reproduce those numbers, in many cases it is possible to get very close to the published numbers.\n\nAll of the scores here were run independently of any published numbers and are reproducible by cloning the repo and following the setup.\n\n\u003e Why do some models have a filter_code post generation step?\n\nBase models can in many cases repeat outputs, breaking the benchmark scores. Instruct models don't have this problem and so you won't see this step, they tend to output a end of sequence token.\n\n## Setup\n\nCreate python environment\n\n```sh\npython -m venv env \u0026\u0026 source env/bin/activate\n```\n\nInstall dependencies\n\n```sh\npip install -r requirements.txt\n```\n\nRun the eval script\n\n```sh\n# replace script file name for various models:\n# eval_wizard.py\n# eval_opencode.py\n# eval_mpt.py\n# eval_starcoder.py\n# eval_replit.py\n# eval_replit_glaive.py\n# eval_replit_instruct.py\n\npython eval_wizard.py\n```\n\nProcess the jsonl file to extract code samples from model completions.\n\n**Note**: Only wizard \u0026 opencoder require this, they return markdown output with code.\n\n```sh\n# replace args for various models:\n# --path results/wizard --out_path results/wizard/eval.jsonl\n# --path results/opencode --out_path results/opencode/eval.jsonl\n\npython process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt\n```\n\nThen get the results\n\n```sh\n# replace args for various models:\n# results/wizard/processed.jsonl\n# results/starcoder/eval.jsonl\n# results/mpt/eval.jsonl\n# results/opencode/processed.jsonl\n# results/replit_instruct/eval.jsonl\n# results/replit_glaive/eval.jsonl\n# results/replit/eval.jsonl\n\nevaluate_functional_correctness results/wizard/processed.jsonl\n```\n","funding_links":[],"categories":["💡 Evaluation Toolkit:","Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabacaj%2Fcode-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabacaj%2Fcode-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabacaj%2Fcode-eval/lists"}