{"id":19932210,"url":"https://github.com/amazon-science/recode","last_synced_at":"2025-05-03T11:31:44.974Z","repository":{"id":139012820,"uuid":"580168242","full_name":"amazon-science/recode","owner":"amazon-science","description":"Releasing code for \"ReCode: Robustness Evaluation of Code Generation Models\"","archived":false,"fork":false,"pushed_at":"2023-12-05T05:10:32.000Z","size":10038,"stargazers_count":37,"open_issues_count":4,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-12-05T06:25:23.031Z","etag":null,"topics":["code-generation","large-language-models","nlp","robustness"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-19T22:14:07.000Z","updated_at":"2023-12-05T06:25:24.033Z","dependencies_parsed_at":"2023-12-05T06:35:39.952Z","dependency_job_id":null,"html_url":"https://github.com/amazon-science/recode","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Frecode","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Frecode/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Frecode/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Frecode/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science"
,"download_url":"https://codeload.github.com/amazon-science/recode/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224360231,"owners_count":17298319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-generation","large-language-models","nlp","robustness"],"created_at":"2024-11-12T23:09:24.026Z","updated_at":"2024-11-12T23:09:24.736Z","avatar_url":"https://github.com/amazon-science.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# ReCode: Robustness Evaluation of Code Generation Models\n\nThis is the repo for ReCode ([arXiv](https://arxiv.org/abs/2212.10264)), which provides a comprehensive evaluation of the practical robustness of code generation models such as CodeGEN, Incoder, and GPT-J. Specifically, this benchmark provides over 30 different general perturbations on docstrings, function names, and code. The perturbations are carefully selected and implemented so that the perturbed datasets remain natural and semantically close to the original non-perturbed datasets. All perturbations are generated automatically, which makes them easy to use and customize. With these perturbations, users can obtain a comprehensive analysis of model robustness.\n\nOur benchmark is general with regard to datasets and models. Given the perturbed datasets, users can evaluate any public or customized code generation model with the default inference provided by our benchmark. 
Users can also evaluate robustness on their own datasets and models by configuring `config.json` and the inference script `evaluate-public-models/run_eval_models.sh`.\n\nAfter the models have been evaluated on the perturbed datasets, we provide an overall robustness analysis so that users can easily compare different models and become aware of possible practical robustness problems.\n\nLastly, we release a standard version of the perturbed datasets `dataset-release/perturbed-finalized` for HumanEval and MBPP in this benchmark, for general robustness evaluation and for comparison across models proposed in future work.\n\n## Installation\nWe use Python 3.8 and CUDA 11.6. Anaconda is recommended. Please run the following commands for installation.\n```\nconda deactivate; conda env remove --name ReCode\nconda create --name ReCode python=3.8\nconda activate ReCode\n```\n\nInstalling huggingface for model inference\n```\npip install transformers==4.21.1\npip install -U torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html\n```\n\nInstalling humaneval. You need to enable humaneval by uncommenting the execution line `exec(check_program, exec_globals)` in `execution.py`.\n```\ncd evaluate-public-models\ngit clone https://github.com/openai/human-eval\npip install -e human-eval\ncd ..\n```\n\nInstalling nlaugmenter for perturbations\n```\ncd nlaugmenter\npip install -r requirements.txt\npip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz\ncd ..\n```\n\nInstalling treesitter for perturbations. Note that we customized our code syntax perturbations based on [natgen](https://github.com/saikat107/NatGen). 
\n```\ncd natgen/treesitter\ngit clone https://github.com/tree-sitter/tree-sitter-python # clone the tree-sitter Python grammar\npython build.py # build my-languages.so file\ncd ../transformations\nln -s ../treesitter/build/my-languages.so ./\npip install sympy\ncd ../..\n```\n\n## Running our ReCode benchmark\nWe provide general APIs for running our benchmark in `run_robust.py`. There are four main types of perturbations: (1) nlaugmenter on docstrings (nlaugmenter), (2) function rename (func_name), (3) code syntax (natgen), and (4) code format (format). Multiple variants are defined and implemented for each type of perturbation. The detailed configuration can be found in `config.json`.\n\nOverall, the benchmark consists of multiple steps, described in detail in the following sections: (1) [perturb] create perturbed datasets, (2) [exec] run the models on nominal/perturbed datasets, and (3) [report_coarse] collect and summarize the results according to our proposed robustness metrics.\n\n### Step1: Create perturbed datasets [perturb] \n\nThe [perturb] option is used to create perturbed datasets. One can run the following commands to perturb one's own nominal datasets (paths are configured in `config.json`).\n\nNote that we have also released the perturbed data used for evaluation in the paper as a general robustness benchmark (`dataset-release/perturbed_finalized`). 
To evaluate directly on our released benchmark datasets, please change `output_adv_path` in `config.json` to that path and skip all of the following perturbation commands in this [perturb] section!\n\n\n```\npython run_robust.py create_partial natgen # preparing partial code for code perturbations\npython run_robust.py perturb nlaugmenter # perturb with nlaugmenter transformations on docstrings\npython run_robust.py perturb func_name # perturb with function rename\npython run_robust.py perturb natgen # perturb with code syntax transformations\npython run_robust.py perturb format # perturb with code format transformations\n```\n\nOne can specify the augmentation method for each type of perturbation with --aug_method; the indices can be found in `config.json`. --datasets specifies which datasets to perturb.\n```\npython run_robust.py perturb func_name --aug_method 0 --datasets humaneval mbpp # perturb with function rename CamelCase (index=0 defined in config.json) on humaneval and mbpp\n```\n\nOur benchmark also provides an [analysis] option to check how each sample is perturbed, one by one.\n```\npython run_robust.py analysis func_name --aug_method 0 --models None # check perturbed data one by one with function rename CamelCase (index=0 defined in config.json)\n```\n\nTo debug and customize perturbations, one can use the low-level APIs. Turn on --print_sample to debug and check customized perturbations on each sample.\n```\npython perturb.py --method format --aug_method 0 --print_sample\n```\n\n### Step2: Run on perturbed datasets [exec] \n\nThe [exec] option is used to evaluate targeted models on perturbed datasets. To evaluate models with our benchmark, please configure the targeted nominal/perturbed datasets and the model path correctly in `config.json`. 
One can then run with:\n```\npython run_robust.py nominal normal # nominal evaluation with non-perturbed datasets\npython run_robust.py nominal natgen # nominal evaluation with non-perturbed partial code datasets\npython run_robust.py exec nlaugmenter # nlaugmenter perturbed datasets evaluation\npython run_robust.py exec func_name # function rename perturbed datasets evaluation\npython run_robust.py exec natgen # code structure perturbed datasets evaluation\npython run_robust.py exec format # code format transformation perturbed datasets evaluation\n```\n\nTo evaluate a specific augmentation method, one can run\n```\npython run_robust.py exec func_name --aug_method 0 # evaluate model on dataset with function rename CamelCase (index=0 defined in config.json)\n```\n\nTo target specific models and datasets, please use the arguments --models and --datasets. Note that the model names and paths must be configured correctly in the running shell file referenced by `run_script` in `config.json`. Detailed running hyperparameters can be configured in that shell file. Please make sure that the shell file runs correctly for nominal evaluation on your own models/datasets, since our benchmark mainly calls that file for evaluation. The default one is `evaluate-public-models/run_eval_models.sh`.\n```\npython run_robust.py perturb func_name --datasets humaneval mbpp --models codegen-350M-multi codegen-350M-mono # perturb dataset humaneval mbpp on codegen-350M-multi and codegen-350M-mono\npython run_robust.py exec func_name --datasets humaneval mbpp --models codegen-350M-multi codegen-350M-mono # evaluate model on dataset humaneval mbpp on codegen-350M-multi and codegen-350M-mono\n```\n\n### Step3: Summarize running results [report_coarse]\n\nIn our paper, we proposed three main robustness metrics: robust pass@k, robust drop@k, and robust relative@k. To collect and summarize the evaluation results, one can run the following commands. 
Specifically, the `report_coarse` option summarizes the robustness numbers for all three metrics (as shown in the main tables of the paper). The `report` option summarizes the detailed robustness results into csv (the detailed tables in the appendix of the paper). The results will be saved as tables in `csv_coarse` and `csv`.\n```\npython run_robust.py report_coarse func_name --models codegen-350M-multi codegen-350M-mono --datasets humaneval # get summarized results for dataset perturbed with function rename\npython run_robust.py report func_name --models codegen-350M-multi codegen-350M-mono --datasets humaneval # get detailed results for dataset perturbed with function rename\n```\n\nThe `analysis` option, when --models is given, prints the perturbed data and the model's completion for each prompt.\n```\npython run_robust.py analysis func_name --models codegen-350M-mono --datasets humaneval # analyze completion samples for dataset perturbed with function rename by codegen-350M-mono\n```\n\n\n## License\nThe ReCode benchmark is released under the Apache-2.0 license.\n\n\n## Cite this work\n\nPlease cite with the following bibtex\n\n```\n@article{recode_wang2022,\n  title = {ReCode: Robustness Evaluation of Code Generation Models},\n  author = {Wang, Shiqi and\n   Zheng, Li and\n   Qian, Haifeng and\n   Yang, Chenghao and\n   Wang, Zijian and\n   Kumar, Varun and\n   Shang, Mingyue and\n   Tan, Samson and\n   Ray, Baishakhi and\n   Bhatia, Parminder and\n   Nallapati, Ramesh and\n   Ramanathan, Murali Krishna and\n   Roth, Dan and\n   Xiang, Bing},\n  doi = {10.48550/arXiv.2212.10264},\n  url = {https://arxiv.org/abs/2212.10264},\n  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL)},\n  publisher = {arXiv},\n  year = {2022},\n  copyright = {Creative Commons Attribution 4.0 
International}\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Frecode","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Frecode","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Frecode/lists"}