{"id":13653146,"url":"https://github.com/FloatAI/HumanEval-XL","last_synced_at":"2025-04-23T06:31:18.826Z","repository":{"id":224577210,"uuid":"757918441","full_name":"floatai/humaneval-xl","owner":"floatai","description":"[LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization","archived":false,"fork":false,"pushed_at":"2025-03-07T11:38:49.000Z","size":8366,"stargazers_count":35,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-07T12:29:12.968Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/floatai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-15T08:59:40.000Z","updated_at":"2025-03-07T11:38:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"8f67819b-24ac-474d-a5a4-c7d3e1d65315","html_url":"https://github.com/floatai/humaneval-xl","commit_stats":null,"previous_names":["floatai/humaneval-xl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floatai%2Fhumaneval-xl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floatai%2Fhumaneval-xl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floatai%2Fhumaneval-xl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/floatai%2Fhumaneval-xl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/floatai","download_url":"https://codeload.github.com/floatai/humaneval-xl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250384893,"owners_count":21421813,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:06.408Z","updated_at":"2025-04-23T06:31:13.815Z","avatar_url":"https://github.com/floatai.png","language":"Python","funding_links":[],"categories":["Datasets-or-Benchmark"],"sub_categories":["代码能力"],"readme":"# [LREC-COLING 2024 | HumanEval-XL: An Execution-based Multilingual Code Generation Benchmark Across 23 Natural Languages and 12 Programming Languages](https://aclanthology.org/2024.lrec-main.735/)\n\n   \u003ca href=\"https://huggingface.co/datasets/FloatAI/HumanEval-XL\" target=\"_blank\"\u003e\n      \u003cimg alt=\"Datasets\" src=\"https://img.shields.io/badge/📚-Datasets-green\" /\u003e\n   \u003c/a\u003e\n  \u003ca href=\"https://aclanthology.org/2024.lrec-main.735/\" target=\"_blank\"\u003e\n      \u003cimg alt=\"Paper\" src=\"https://img.shields.io/badge/📜-Paper-purple\" /\u003e\n   \u003c/a\u003e\n  \u003ca href=\"https://lrec-coling-2024.org/\" target=\"_blank\"\u003e\n      \u003cimg alt=\"LREC-COLING 2024\" src=\"https://img.shields.io/badge/Proceedings-COLING 2024-red\" /\u003e\n   \u003c/a\u003e\n\n\nThis repository contains data and evaluation code for the paper \"[HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization](https://aclanthology.org/2024.lrec-main.735.pdf)\".\n\n\n## 🔥 News\n* 
**26 February, 2024:** 🎉 We release the official codebase and data! [[GitHub](https://github.com/FloatAI/humaneval-xl/tree/main?tab=readme-ov-file#dataset), [🤗 dataset](https://huggingface.co/datasets/FloatAI/HumanEval-XL)] 🔥\n* **19 February, 2024:** 🎉 Our work has been accepted to [LREC-COLING 2024](https://lrec-coling-2024.org/)! ✨\n\n## 🌟 Overview\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"src/data_process.png\"\u003e\n\u003c/div\u003e\n\nLarge language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring *parallel* data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at [https://github.com/FloatAI/HumanEval-XL](https://github.com/FloatAI/HumanEval-XL).\n\n\u003cimg width=\"70%\" alt=\"image\" src=\"https://github.com/FloatAI/humaneval-xl/assets/13767887/e5b7a96e-20a6-4f17-a380-13c8b5ffbc8a\"\u003e\n\n\n## Dataset\nThe data is stored in `data/program_language/natural_language/`. 
We have 80 parallel problems, each available in 23 natural languages and 12 programming languages. \n\nThe **23 NLs** are:\n\"English\", \"Russian\", \"Chinese\", \"German\", \"Spanish\", \"French\", \"Italian\", \"Portuguese\", \"Greek\", \"Hungarian\", \"Dutch\", \"Finnish\", \"Indonesian\", \"Turkish\", \"Arabic\", \"Vietnamese\", \"Bulgarian\", \"Persian\", \"Malay\", \"Hebrew\", \"Estonian\", \"Tagalog\", \"Afrikaans\"\n\nThe **12 PLs** are:\n\"python\", \"java\", \"javascript\", \"csharp\", \"go\", \"kotlin\", \"perl\", \"php\", \"ruby\", \"scala\", \"swift\", \"typescript\"\n\n\n\u003cimg width=\"60%\" alt=\"image\" src=\"https://github.com/FloatAI/humaneval-xl/assets/13767887/37023fcd-4c7e-41bf-8323-c5fcb5ac36a4\"\u003e\n\n\n### Usage with HuggingFace datasets🤗\nYou can also use [🤗**HuggingFace datasets**](https://huggingface.co/datasets/FloatAI/HumanEval-XL) to load the split for a specific programming language; each natural language is then available as a key of the returned `DatasetDict`:\n```python\nfrom datasets import load_dataset\ndataset = load_dataset(\"FloatAI/HumanEval-XL\", \"python\")\nDatasetDict({\n    English: Dataset({\n        features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],\n        num_rows: 80\n    })\n    Russian: Dataset({\n        features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],\n        num_rows: 80\n    })\n    Chinese: Dataset({\n        features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],\n        num_rows: 80\n    })\n\n    ⋮\n\n    Afrikaans: Dataset({\n        features: ['task_id', 'language', 'prompt', 'description', 'test', 'entry_point', 'canonical_solution', 'natural_language'],\n        num_rows: 80\n    })\n})\n\n```\n\n### Data Instances\n\nAn example of a dataset instance (from the python split with Chinese prompts, i.e. `dataset[\"Chinese\"][0]`):\n\n```python\n{\n'task_id': 'python/0',\n'language': 
'python',\n'prompt': 'from typing import List\\n\\n\\ndef below_zero(operations: List[int]) -\u003e bool:\\n    \"\"\" 你会得到一个银行账户的存款和取款操作列表，该账户从零余额开始。你的任务是检测账户余额是否在任何时候降至零以下，并在该点返回True。否则应返回False。\\n    \\n    \u003e\u003e\u003e below_zero([1, 2, 3])\\n    False\\n    \u003e\u003e\u003e below_zero([1, 2, -4, 5])\\n    True\\n    \"\"\"\\n',\n'description': '你会得到一个银行账户的存款和取款操作列表，该账户从零余额开始。你的任务是检测账户余额是否在任何时候降至零以下，并在该点返回True。否则应返回False。\\n    ',\n'test': \"\\n\\nMETADATA = {\\n    'author': 'jt',\\n    'dataset': 'test'\\n}\\n\\n\\ndef check(candidate):\\n    assert candidate([]) == False\\n    assert candidate([1, 2, -3, 1, 2, -3]) == False\\n    assert candidate([1, 2, -4, 5, 6]) == True\\n    assert candidate([1, -1, 2, -2, 5, -5, 4, -4]) == False\\n    assert candidate([1, -1, 2, -2, 5, -5, 4, -5]) == True\\n    assert candidate([1, -2, 2, -2, 5, -5, 4, -4]) == True\\n\",\n'entry_point': 'below_zero',\n'canonical_solution': '    balance = 0\\n\\n    for op in operations:\\n        balance += op\\n        if balance \u003c 0:\\n            return True\\n\\n    return False\\n',\n'natural_language': 'Chinese'\n}\n```\n\n### Data Fields\n\n- `task_id`: identifier for the data sample\n- `prompt`: input for the model containing function header and docstrings\n- `canonical_solution`: solution for the problem in the `prompt`\n- `description`: task description\n- `test`: contains function to test generated code for correctness\n- `entry_point`: entry point for test\n- `language`: programming language identifier used to select the appropriate subprocess call for program execution\n- `natural_language`: natural language identifier to show the language the prompt is in\n\n\n### Data Splits\nProgramming languages are used to specify splits:\n - python\n - java\n - javascript\n - csharp\n - go\n - kotlin\n - php\n - perl\n - ruby\n - swift\n - scala\n - typescript\n\n## Evaluation\n### Installation\n\nCheck out and install this repository:\n```\ngit clone 
git@github.com:FloatAI/humaneval-xl.git\ncd mxeval\npip install -e mxeval\n```\n\n### Dependencies\nWe provide scripts to help set up the programming language dependencies needed to execute and evaluate generated code against the dataset.\n(We use the same scripts from https://github.com/amazon-science/mxeval for code generation evaluation)\n\n#### Amazon Linux AMI\n```\nbash language_setup/amazon_linux_ami.sh\n```\n#### Ubuntu\n```\nbash language_setup/ubuntu.sh\n```\n\n## Evaluation Usage\n\n**This program exists to run untrusted model-generated code. Users are strongly\nencouraged not to do so outside of a robust security sandbox. See the comment in\n`execution.py` for more information and instructions.**\n\nEach sample is formatted into a single line:\n```\n{\"task_id\": \"Corresponding task ID\", \"completion\": \"Completion only without the prompt\", \"language\": \"programming language name\"}\n```\nWe provide `python_chinese_generated_samples.jsonl` to illustrate the format. \n\nHere is nearly functional example code (you just have to provide\n`generate_one_completion` to make it work) that saves generated completions to\n`samples.jsonl`.\n```python\nfrom mxeval.data import write_jsonl, read_problems\n\nproblems = read_problems()\n\n# generate_one_completion is not defined here; supply your own model call.\nnum_samples_per_task = 200\nsamples = [\n    dict(task_id=task_id, language=problems[task_id][\"language\"], completion=generate_one_completion(problems[task_id][\"prompt\"]))\n    for task_id in problems\n    for _ in range(num_samples_per_task)\n]\nwrite_jsonl(\"samples.jsonl\", samples)\n```\n\nTo evaluate the samples, e.g., for Python with Chinese prompts, run\n```\nevaluate_functional_correctness python_chinese_generated_samples.jsonl --problem_file data/python/Chinese.jsonl\n```\n\nNote: Because there is no unbiased way of estimating pass@k when there are fewer\nsamples than k, the script does not evaluate pass@k for these cases. 
To\nevaluate with other k values, pass `--k \u003ccomma-separated-values-here\u003e`. For\nother options, see\n```\n$ evaluate_functional_correctness --help\n```\nHowever, we recommend that you use the default values for the rest.\n\n## Credits\nWe adapted Amazon-science's mxeval package (https://github.com/amazon-science/mxeval) for the evaluation. We thank Amazon for their pioneering effort in this field including the release of the dataset and evaluation code.\n\n## Citation\n```\n@inproceedings{peng-etal-2024-humaneval-xl,\n    title = \"{H}uman{E}val-{XL}: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization\",\n    author = \"Peng, Qiwei  and\n      Chai, Yekun  and\n      Li, Xuhong\",\n    editor = \"Calzolari, Nicoletta  and\n      Kan, Min-Yen  and\n      Hoste, Veronique  and\n      Lenci, Alessandro  and\n      Sakti, Sakriani  and\n      Xue, Nianwen\",\n    booktitle = \"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)\",\n    month = may,\n    year = \"2024\",\n    address = \"Torino, Italia\",\n    publisher = \"ELRA and ICCL\",\n    url = \"https://aclanthology.org/2024.lrec-main.735\",\n    pages = \"8383--8394\",\n    abstract = \"Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. 
HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.\",\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFloatAI%2FHumanEval-XL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFloatAI%2FHumanEval-XL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFloatAI%2FHumanEval-XL/lists"}