{"id":13572140,"url":"https://github.com/nuprl/MultiPL-E","last_synced_at":"2025-04-04T09:31:43.193Z","repository":{"id":56814949,"uuid":"517691031","full_name":"nuprl/MultiPL-E","owner":"nuprl","description":"A multi-programming language benchmark for evaluating the performance of large language model of code.","archived":false,"fork":false,"pushed_at":"2024-04-13T02:01:36.000Z","size":23179,"stargazers_count":147,"open_issues_count":8,"forks_count":33,"subscribers_count":15,"default_branch":"main","last_synced_at":"2024-04-14T01:05:11.818Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://nuprl.github.io/MultiPL-E/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nuprl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-07-25T14:07:20.000Z","updated_at":"2024-04-15T05:47:49.557Z","dependencies_parsed_at":"2023-02-19T10:46:05.181Z","dependency_job_id":"e9c18fe9-bac1-4440-84aa-fa8fb5dbe8b3","html_url":"https://github.com/nuprl/MultiPL-E","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuprl%2FMultiPL-E","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuprl%2FMultiPL-E/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuprl%2FMultiPL-E/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuprl%2FMultiPL-E/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nuprl","download_url":"
https://codeload.github.com/nuprl/MultiPL-E/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246751004,"owners_count":20827817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:01:14.706Z","updated_at":"2025-04-04T09:31:43.178Z","avatar_url":"https://github.com/nuprl.png","language":"Python","readme":"# Multi-Programming Language Evaluation of Large Language Models of Code (MultiPL-E)\n\n## Introduction\n\nMultiPL-E is a system for translating unit test-driven neural code generation\nbenchmarks to new languages. We have used MultiPL-E to translate two popular\nPython benchmarks (HumanEval and MBPP) to 18 other programming languages.\n\nFor more information:\n\n- MultiPL-E is part of the [BigCode Code Generation LM Harness]. This\n  is the easiest way to use MultiPL-E.\n- The [Multilingual Code Models Evaluation] by BigCode evaluates Code LLMs\n  using several benchmarks, including MultiPL-E.\n- Read our paper [MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation].\n- The [MultiPL-E dataset] of translated prompts is available on the Hugging Face\n  Hub.\n\n## Tutorial\n\nThese are instructions on how to use MultiPL-E directly, without the\nBigCode evaluation harness.\n\nIn this tutorial, we will run a small experiment to evaluate the performance of\n[SantaCoder] on Rust with a small subset of the MBPP benchmarks.\nWe will only fetch 20 completions per problem, so that you\ncan run it quickly on a single machine.
\nYou can also run on the full suite of benchmarks or substitute your own\nbenchmark programs. Later, we'll show you how to add support for other languages\nand evaluate other models.\n\n### Prerequisites\n\n1. You will need Python 3.8 or higher.\n\n2. You will need to install some Python packages:\n\n    ```bash\n    pip3 install aiohttp numpy tqdm pytest datasets torch transformers\n    ```\n\n3. You need to install one of [Podman] or [Docker].\n\n4. Check out the repository:\n\n   ```bash\n   git clone https://github.com/nuprl/MultiPL-E\n   ```\n\n5. Enter the repository directory:\n\n   ```bash\n   cd MultiPL-E\n   ```\n\n### Background\n\nOut of the box, MultiPL-E supports several models, programming languages,\nand datasets. Using MultiPL-E is a two-step process:\n\n1. We *generate* completions, which requires a GPU.\n\n2. We *execute* the generated completions, which requires a machine that\n   supports Docker or Podman.\n\n### Generation\n\nThe following command will generate completions for the HumanEval benchmark,\nwhich is originally in Python, but translated to Rust with MultiPL-E:\n\n```bash\nmkdir tutorial\npython3 automodel.py \\\n    --name bigcode/gpt_bigcode-santacoder \\\n    --root-dataset humaneval \\\n    --lang rs \\\n    --temperature 0.2 \\\n    --batch-size 20 \\\n    --completion-limit 20 \\\n    --output-dir-prefix tutorial\n```\n\nThe model name above refers to the\n[SantaCoder](https://huggingface.co/bigcode/gpt_bigcode-santacoder) model on the\nHugging Face Hub. You can use any other text generation model instead.\n\nNotes:\n\n1. This command requires about 13 GB VRAM and takes 30 minutes with a Quadro RTX\n   6000.\n2. If you have less VRAM, you can set `--batch-size` to a smaller value.\n   E.g., with `--batch-size 10` it should work on consumer graphics cards,\n   such as the RTX series cards.\n3. If you're feeling impatient, you can kill the command early (use `Ctrl+C`)\nbefore all generations are complete.\n
Your results won't be accurate,\n   but you can move on to the evaluation step to get a partial result. Before\n   killing generation, ensure that a few files have been generated:\n\n   ```bash\n   ls tutorial/*/*.json.gz\n   ```\n\n### Execution\n\nYou can run MultiPL-E's execution with or without a container, but we strongly\nrecommend using the container that we have provided. The container includes the\ntoolchains for all languages that we support. Without it, you will need to\npainstakingly install them yourself. There is also a risk that the generated code\nmay do something that breaks your system. The container mitigates that risk.\n\n#### Execution with a Container\n\nWhen you first run evaluation, you need to pull and tag the [execution container](https://github.com/nuprl/MultiPL-E/pkgs/container/multipl-e-evaluation):\n\n```bash\npodman pull ghcr.io/nuprl/multipl-e-evaluation\npodman tag ghcr.io/nuprl/multipl-e-evaluation multipl-e-eval\n```\n\nThe following command will run execution on the generated completions:\n\n```bash\npodman run --rm --network none -v ./tutorial:/tutorial:rw multipl-e-eval --dir /tutorial --output-dir /tutorial --recursive\n```\n\nIf execution is successful, you will see several `.results.json.gz` files\nalongside the `.json.gz` files that were created during generation:\n\n```bash\nls tutorial/*/*.results.json.gz\n```\n\n#### Execution without a Container\n\nAssuming you have set up the needed language toolchains, here is how to\nrun execution without a container:\n\n```bash\ncd evaluation/src\npython3 main.py --dir ../../tutorial --output-dir ../../tutorial --recursive\n```\n\nIf execution is successful, you will see several `.results.json.gz` files\nalongside the `.json.gz` files that were created during generation:\n\n```bash\nls ../../tutorial/*/*.results.json.gz\n```\n\n### Analyzing Results\n\nFinally, you can calculate the pass rates:\n\n```bash\npython3 pass_k.py ./tutorial/*\n```\n\nThe experiment prints pass rates for k=1 as we\n
only generated 20 samples at\ntemperature 0.2. If you want to see pass@10 and pass@100 pass rates, you\ncan regenerate with `--temperature 0.8`.\n\n**Warning:** In generation, we used `--completion-limit 20` to only generate\n20 samples for each prompt. You should remove this flag to generate 200 samples\nfor temperature 0.8. We have found that 20 samples are adequate to estimate\npass@1 (there will be a little variance). However, you need more samples to estimate\npass@10 and pass@100.\n\n## Adding Support for a New Programming Language\n\nIf you want to learn by example, you can look at pull requests that have added\nsupport for several languages:\n\n- [Ada](https://github.com/nuprl/MultiPL-E/pull/162)\n- [Dart](https://github.com/nuprl/MultiPL-E/pull/153)\n- [Clojure](https://github.com/nuprl/MultiPL-E/pull/136)\n- [Elixir](https://github.com/nuprl/MultiPL-E/pull/117)\n\nIn general, you need to make three changes to support a new language *L*:\n\n1. Write an execution script that runs and tests *L* programs; it goes in\n   [evaluation/src](https://github.com/nuprl/MultiPL-E/tree/main/evaluation/src).\n\n2. Write a translator that translates benchmarks to *L*; it goes\n   in [dataset_builder](https://github.com/nuprl/MultiPL-E/tree/main/dataset_builder).\n\n3. Add terms for *L* to `dataset_builder/terms.csv` to translate comments.\n\n### Writing the Translator\n\nLet's say we had not included Perl in the set of benchmark languages and\nyou wanted to add it. In a new file `humaneval_to_perl.py` you will need to\ndefine a class called `Translator`. `Translator` contains numerous methods;\nthe interface for a generic `Translator` class is provided in `base_language_translator.py`.\n\n*Note*: You must name your translator `humaneval_to_L.py`.\n
However, the code\nworks with several other benchmarks, including MBPP.\n\nThere are three types of methods for `Translator`: (1) methods that handle\ntranslating the prompt, (2) methods that handle translating the unit tests, and\n(3) methods that handle the value-to-value translation.\n\nFirst, let's handle converting the Python prompt to a Perl prompt. This is\ndone by the `translate_prompt` method. `translate_prompt` needs to return\na string (we suggest using a formatted Python string here) that\ncontains the Perl prompt and then the Perl function signature. We suggest\naccumulating the prompt into one string as follows:\n```python\nperl_description = \"# \" + re.sub(DOCSTRING_LINESTART_RE, \"\\n# \", description.strip()) + \"\\n\"\n```\nwhere `#` is Perl's single-line comment marker. `DOCSTRING_LINESTART_RE` identifies the\nfirst line in the prompt using a regex, and `description` is a string representing\nthe rest of the prompt. This process should be pretty simple - just connect them together with\nyour comment structure of choice.\n\nThe argument `name` to `translate_prompt` takes care of the function name; you\njust need to format the function arguments (argument `args`) and delimiters to complete\nthe prompt translation.\n\nNow let's consider the three methods which help translate unit tests:\n`test_suite_prefix_lines`, `test_suite_suffix_lines`, and `deep_equality`.\nThe prefix and suffix methods return a \"wrapper\" around the set of generated unit\ntests. In most languages, as is the case in Perl, the prefix defines a function/class\nfor testing and the suffix calls that function. This may include calls to your testing library\nof choice (please look at existing `humaneval_to` files for examples!).
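As a concrete sketch of this prefix/suffix pattern, the class below is a hypothetical, simplified illustration of the `test_suite_prefix_lines` and `test_suite_suffix_lines` interface for a Perl translator; the repository's actual `humaneval_to_perl.py` may differ in details:

```python
from typing import List


class PerlTranslatorSketch:
    """Hypothetical sketch of a Perl translator's test-suite wrapper
    methods; the repository's humaneval_to_perl.py may differ."""

    def test_suite_prefix_lines(self, entry_point: str) -> List[str]:
        # Open a test sub and bind the benchmark's entry point to
        # $candidate, mirroring the original HumanEval harness.
        return [
            "sub testhumaneval {",
            f"    my $candidate = {entry_point};",
        ]

    def test_suite_suffix_lines(self) -> List[str]:
        # Close the test sub and invoke it.
        return [
            "}",
            "testhumaneval();",
        ]
```

Joining the prefix lines, the generated unit tests, and the suffix lines yields a complete Perl test wrapper.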
\nThe wrapper in Perl we use is:\n```perl\nsub testhumaneval {\n   my $candidate = entry_point;\n   # Tests go here\n}\ntesthumaneval();\n```\n\nNote the argument `entry_point` to `test_suite_prefix_lines`: this is the name\nof the function for each benchmark. In most languages, we either assign that to\na variable `candidate` (as done in the original HumanEval benchmark) or call\n`entry_point` directly.\n\nThe final unit test method is `deep_equality`, which is where you define how\nto check whether two arguments (`left` and `right`) are structurally equal. In Perl\nwe do this with `eq_deeply`. (Hint: note that sometimes the order of `left` and\n`right` can be switched in some testing frameworks; try this out to produce\nthe best error messages possible.)\n\nThird, let's tackle the value-to-value translation methods. All of them take\na Python value (or some representation of one) as an argument and return a string\nrepresenting that value's equivalent in Perl.\n\nFor instance, `gen_dict` defines what dictionaries in Python should map to in\nPerl. Our implementation is below; the only work we need to do is to use `=\u003e`\ninstead of `:` to differentiate keys and values in Perl.\n\n```python\ndef gen_dict(self, keys: List[str], values: List[str]) -\u003e str:\n    return \"{\" + \", \".join(f\"{k} =\u003e {v}\" for k, v in zip(keys, values)) + \"}\"\n```\n\nThis step should be quite straightforward for each value and its associated\nmethod. When there was a choice, we used our language knowledge or consulted\nthe style guides from the language communities (see our paper's Appendix). As we\nmention in our paper, the ease of value-to-value mapping is one of the key aspects of\nthis approach.\n\nThere are also smaller elements to `Translator` (stop tokens, `file_ext`, etc.)\nthat you will need to populate accordingly.\n\nIf you've successfully gotten to this point: great, you're done and can move\non to `eval_foo` and testing.\n
If you want to add a statically typed\nlanguage, read on!\n\n#### What about statically typed languages?\n\nStatically typed translations are notably more challenging to implement than the\nPerl example above. Rather than walk you through the steps directly, we provide a\nwell-documented version of `humaneval_to_ts.py` for TypeScript as an example. Feel free\nto also consult translations for other languages in the benchmark, although your\nmileage may vary.\n\n### Writing the Execution Script\n\nNow that you're done converting Python to your language of choice, you need\nto define how to evaluate the generated programs. As a reminder, one of the\ncontributions of this benchmark suite is actually evaluating the generated\ncode. Let's continue with the idea that you are adding Perl as a new language to our dataset.\n\nIn `eval_L.py` you should define a function, `eval_script`, with the\nfollowing signature and imports:\n```python\nfrom pathlib import Path\nfrom safe_subprocess import run\n\ndef eval_script(path: Path):\n```\n\nIn the body of `eval_script` you should call `run` with the\nrequisite arguments (please refer to its documentation and your computing environment\nto do this correctly). For our results, we use the following call to `run` for Perl:\n```python\nr = run([\"perl\", path])\n```\n\nYou should then determine how to handle what gets assigned to `r`. If you\nlook around the eval scripts we provide, there are different granularities for\nhandling program evaluation. For instance, some statically typed languages' eval scripts\nhandle compilation errors and runtime errors differently. We recommend, at minimum,\nhandling success (typically exit code 0), timeouts, syntax errors,\nand exceptions as four categories of results. You can do this using\n`try-except` statements or simply with conditionals:\n\n```python\n    if r.timeout:\n        status = \"Timeout\"\n    ... handle other errors ...\n    else:\n        status = \"OK\"\n```\n\n`eval_script` should return a dictionary of the form below - the scripts above\nrely on this output format to calculate pass@k metrics:\n\n```python\nreturn {\n    \"status\": status,\n    \"exit_code\": r.exit_code,\n    \"stdout\": r.stdout,\n    \"stderr\": r.stderr,\n}\n```\n\nThe final two steps are:\n\n1. Add a reference to your evaluator in the file `./evaluation/src/containerized_eval.py`.\n\n2. Create a Dockerfile for your language in the `evaluation` directory.\n\nTo run the completion tutorial above for your brand-new language, open\n`containerized_eval.py` and add links to your new language in two places.\n\n### Writing the Terms to Translate Comments\n\nAdd a row for $L$ to `dataset_builder/terms.csv`, which instructs how to convert\nthe prompt into your language's terminology.\n\n### Testing a New Language\n\nThe MultiPL-E benchmark lives on the Hugging Face Hub, but it is easier to test\nand iterate on your new language without uploading a new dataset every time\nyou make a change. When the translator is ready, you can test it by translating\nHumanEval to *L* with the following command:\n\n```bash\ncd MultiPL-E/dataset_builder\npython3 prepare_prompts_json.py \\\n     --lang humaneval_to_L.py \\\n     --doctests transform \\\n     --prompt-terminology reworded \\\n     --output ../L_prompts.jsonl\n```\n\nThis creates the file `L_prompts.jsonl` in the root of the repository. You can\nthen evaluate a model on these prompts with the following command:\n\n```bash\ncd MultiPL-E\npython3 automodel_vllm.py \\\n     --name MODEL_NAME \\\n     --root-dataset humaneval \\\n     --use-local \\\n     --dataset ./L_prompts.jsonl \\\n     --temperature 0.2 \\\n     --batch-size 50 \\\n     --completion-limit 20\n```\n\nYou can safely set `--completion-limit 20` and get a reasonably stable\nresult. Any lower and you'll get variations greater than 1%.\n
The command\nabove will create a directory named `humaneval-L-MODEL_NAME-0.2-reworded`.\nAt this point, you can look at the *.json.gz* files to see if the results\nlook reasonable. We recommend looking at least at problem 53. It is an easy\nproblem that every model should get right.\n\nFinally, you can test the generated code with the following command:\n\n```bash\ncd MultiPL-E\npython3 evaluation/src/main.py \\\n  --dir humaneval-L-MODEL_NAME-0.2-reworded \\\n  --output-dir humaneval-L-MODEL_NAME-0.2-reworded\n```\n\nThis creates several *.results.json.gz* files, alongside the *.json.gz* files.\n\nTo compute pass@1:\n\n```bash\ncd MultiPL-E\npython3 pass_k.py humaneval-L-MODEL_NAME-0.2-reworded\n```\n\n## Add a New Benchmark\n\nThis is the really easy part. All you need to do is create a directory of Python\nprograms that look like the following:\n\n```python\ndef my_function(a: int, b: int, c: int, k: int) -\u003e int:\n    \"\"\"\n    Given positive integers a, b, and c, return an integer n \u003e k such that\n    (a ** n) + (b ** n) = (c ** n).\n    \"\"\"\n    pass\n\n\n### Unit tests below ###\ndef check(candidate):\n    assert candidate(1, 1, 2, 0) == 1\n    assert candidate(3, 4, 5, 0) == 2\n\ndef test_check():\n    check(my_function)\n```\n\nFor an example, see `datasets/originals-with-cleaned-doctests`. These\nare the HumanEval problems (with some cleanup) that we translate to the\nMultiPL-E-supported languages.\n\nSome things to note:\n\n1. The *unit tests below* line is important, because we look for that in our\n   scripts.\n\n2. We also rely on the name `candidate`. This is not fundamental, and we may get\n   around to removing it.\n\n3. You can use `from typing import ...` and `import typing`, but you cannot\n   have any other code above the function signature.\n\n4. The type annotations are not required, but are necessary to evaluate some\n   languages.\n\n5. The assertions must be equalities with simple input and output values,\n   as shown above.\n\n6. Finally, note that you do not implement the function yourself. You can leave\n   the body as `pass`.\n\nLet's suppose that you've created a set of benchmark problems in the directory\n`datasets/new_benchmark`. You can then translate the benchmark to language $L$\nas follows:\n\n```bash\ncd MultiPL-E/dataset_builder\npython3 prepare_prompts_json.py \\\n     --originals ../datasets/new_benchmark \\\n     --lang humaneval_to_L.py \\\n     --doctests transform \\\n     --prompt-terminology reworded \\\n     --output ../L_prompts.jsonl\n```\n\nYou can then test the dataset by following the steps in\n[Testing a new language](https://github.com/nuprl/MultiPL-E?tab=readme-ov-file#testing-a-new-language).\n\n## Credits\n\nMultiPL-E was originally authored by:\n\n- Federico Cassano (Northeastern University)\n- John Gouwar (Northeastern University)\n- Daniel Nguyen (Hanover High School)\n- Sydney Nguyen (Wellesley College)\n- Luna Phipps-Costin (Northeastern University)\n- Donald Pinckney (Northeastern University)\n- Ming-Ho Yee (Northeastern University)\n- Yangtian Zi (Northeastern University)\n- Carolyn Jane Anderson (Wellesley College)\n- Molly Q Feldman (Oberlin College)\n- Arjun Guha (Northeastern University and Roblox Research)\n- Michael Greenberg (Stevens Institute of Technology)\n- Abhinav Jangda (University of Massachusetts Amherst)\n\nWe thank Steven Holtzen for loaning us his GPUs for a few weeks. We thank\n[Research Computing at Northeastern University] for supporting the\nDiscovery cluster.\n\nSeveral people have since contributed to MultiPL-E.\n
Please see the\n[changelog](https://huggingface.co/datasets/nuprl/MultiPL-E) for those acknowledgments.\n\n[BigCode Code Generation LM Harness]: https://github.com/bigcode-project/bigcode-evaluation-harness\n[MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation]: https://ieeexplore.ieee.org/abstract/document/10103177\n[SantaCoder]: https://arxiv.org/abs/2301.03988\n[MultiPL-E dataset]: https://huggingface.co/datasets/nuprl/MultiPL-E\n[StarCoder]: https://arxiv.org/abs/2305.06161\n[Multilingual Code Models Evaluation]: https://huggingface.co/spaces/bigcode/multilingual-code-evals\n[Conda]: https://conda.io/\n[Podman]: https://podman.io/\n[Docker]: https://www.docker.com/\n","funding_links":[],"categories":["Python","Anthropomorphic-Taxonomy","Benchmark"],"sub_categories":["Typical Professional Quotient (PQ)-Professional Expertise evaluation benchmarks","Code"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuprl%2FMultiPL-E","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnuprl%2FMultiPL-E","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuprl%2FMultiPL-E/lists"}