{"id":18949886,"url":"https://github.com/salesforce/coderl","last_synced_at":"2025-05-15T17:09:01.168Z","repository":{"id":42375779,"uuid":"508912853","full_name":"salesforce/CodeRL","owner":"salesforce","description":"This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS22).","archived":false,"fork":false,"pushed_at":"2025-01-21T08:13:49.000Z","size":25121,"stargazers_count":531,"open_issues_count":39,"forks_count":65,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-05-15T17:08:55.252Z","etag":null,"topics":["ai","codegeneration","languagemodel","machinelearning","programsynthesis","reinforcementlearning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salesforce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-30T02:54:36.000Z","updated_at":"2025-05-02T05:07:22.000Z","dependencies_parsed_at":"2024-08-01T19:54:11.624Z","dependency_job_id":"49f28bc4-9038-47bd-bc3a-b2f2470ace3c","html_url":"https://github.com/salesforce/CodeRL","commit_stats":{"total_commits":40,"total_committers":4,"mean_commits":10.0,"dds":0.125,"last_synced_commit":"2c62fa26a665a43fc225509f6641007975bae291"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FCodeRL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/salesforce%2FCodeRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FCodeRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salesforce%2FCodeRL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salesforce","download_url":"https://codeload.github.com/salesforce/CodeRL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254384989,"owners_count":22062422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","codegeneration","languagemodel","machinelearning","programsynthesis","reinforcementlearning"],"created_at":"2024-11-08T13:19:26.417Z","updated_at":"2025-05-15T17:08:56.152Z","avatar_url":"https://github.com/salesforce.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/logo.jpg\" width=\"50%\"\u003e\n\u003c/p\u003e\n\n## CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning \u003ca name=\"corl\"\u003e\u003c/a\u003e\n\n\nThis is the official code for the paper **[CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/abs/2207.01780)** (accepted to [NeurIPS 2022](https://openreview.net/forum?id=WaGvb7OzySA)). 
Do check out our [blog](https://blog.salesforceairesearch.com/coderl/) and [poster](https://nips.cc/media/PosterPDFs/NeurIPS%202022/d98d76e2b5ba72023414d98e75403e79.png).

Authors:
[Hung Le](https://sites.google.com/view/henryle2018/home), [Yue Wang](https://yuewang-cuhk.github.io/), [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), [Steven C.H. Hoi](https://scholar.google.com/citations?user=JoLjflYAAAAJ&hl=en)

<p align="center">
<img src="images/ezgif-1-12f629284e.gif" width="100%" />
</p>

### Contents:
* [x] [CodeRL Overview](#coderl-overview)
* [x] [Installation](#installation)
* [x] [Datasets](#datasets)
	* [x] [Example Unit Tests](#example-unit-tests)
* [x] [Models](#models)
	* [x] CodeT5-large
	* [x] CodeT5-large-ntp-py
	* [x] CodeRL+CodeT5
	* [x] Critic models
* [ ] [Processes](#processes)
	* [x] [Generating Programs](#generating-programs)
	* [x] [Running Unit Tests](#running-unit-tests)
	* [x] [Evaluating Programs](#evaluating-programs)
	* [x] [Training Critic](#training-critic)
	* [x] [Generating Critic Scores](#generating-critic-scores)
	* [x] [Finetuning with Ground-truth Programs](#finetuning-with-ground-truth-programs)
	* [x] [Finetuning with Generated Programs](#finetuning-with-generated-programs)
	* [ ] [Generating Programs with Critic Sampling](#generating-programs-with-critic-sampling)
* [x] [Example Generated Programs](#example-generated-programs)
* [x] [Citation](#citation)
* [x] [License](#license)

## CodeRL Overview

<p align="center">
<img src="images/coderl_overview.png" width="100%" />
<br>
<b>An example program synthesis task (Right)</b>: Each task includes a problem specification in natural language, which often contains example input and output pairs.
The expected output is a program that is checked for functional correctness against some unit tests.
<b>A high-level overview of our CodeRL framework for program synthesis (Left)</b>: Our CodeRL framework treats a pretrained language model (LM) as a stochastic policy, treats token predictions as actions, and estimates rewards from the unit test results of output programs.
</p>

* During training, we treat the code-generating language model as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor.
* During inference, we introduce a new generation procedure with a critic sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores.


<!---
<p align="center">
<img src="images/coderl_training.png" width="100%" />
<b>Overview of our actor-critic framework to optimize pretrained LMs for program synthesis</b>: We treat the LM as an actor network and sample synthetic samples from this actor. Another neural network is trained as a critic model to evaluate these synthetic samples based on their probabilities of passing unit tests. The returns are estimated based on critic scores and finally factored into the RL objective to finetune the actor LM network using synthetic samples.
</p>

<p align="center">
<img src="images/coderl_inference.png" width="100%" />
<b>Overview of our Critic Sampling (CS) approach for program synthesis during inference</b>:
programs are refined and repaired based on their results on example unit tests of the corresponding problems. Program candidates are sampled by their critic-predicted scores at the token or sequence level.
Dotted lines indicate optional processes that apply during program refining or repairing.
</p>
-->


## Installation

The code requires the dependencies specified in `requirements.txt`. Install them with:

`pip install -r requirements.txt`

Install the `transformers` library from the source code included in this repository (it is developed from version 4.16.1 of the original [code](https://github.com/huggingface/transformers)):

```
cd transformers
pip install -e .
```


## Datasets

For pretraining, apart from [CodeSearchNet (CSN)](https://arxiv.org/abs/1909.09436), we use the [Python GitHub Code Dataset (GCPY)](https://huggingface.co/datasets/lvwerra/github-code).
We have compiled public, non-personal data from GitHub consisting of permissively licensed Python code (e.g. "mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc"). Please see the paper for more details on pretraining data preprocessing and pretraining.


After pretraining, we finetune/evaluate models on the following major program synthesis benchmarks:

* **APPS**: Please follow the downloading and preprocessing instructions provided [here](https://github.com/hendrycks/apps).
* **MBPP**: The dataset is available [here](https://github.com/google-research/google-research/tree/master/mbpp).

On both benchmarks, we preprocess data and construct input/output sequences in the same way as the original benchmark papers.

Download and unzip all files into the `data` folder.

### Example Unit Tests
In addition to the original hidden unit tests on APPS, we also utilize the example tests that are often embedded in problem descriptions.
After downloading and unzipping APPS, you can run the notebook `extract_example_test.ipynb` to extract and save example unit tests of APPS test samples into the corresponding sample folder, e.g.
`data/APPS/test/0000/`.
We release the example unit tests that we already extracted using this notebook in the folder `data/APPS_test_example_tests/`. The average number of example unit tests per sample is 1.9764.

## Models

We employ [CodeT5](https://github.com/salesforce/CodeT5) (a family of encoder-decoder language models for code from the [paper](https://arxiv.org/pdf/2109.00859.pdf)) as the foundation model in our work.

We pretrained CodeT5 with a larger dataset and improved learning objectives. We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py).

* [CodeT5-large](https://huggingface.co/Salesforce/codet5-large): a 770M-parameter CodeT5 model which was pretrained with the Masked Span Prediction objective on CSN and achieved new SOTA results on several CodeXGLUE benchmarks. See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
* [CodeT5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py): a 770M-parameter CodeT5 model which was first pretrained with the Masked Span Prediction objective on CSN and GCPY, followed by the Next Token Prediction objective on GCPY. _This checkpoint was especially optimized for Python code generation tasks and is employed by CodeRL_.

For finetuning on downstream code generation tasks on APPS, we adopted critic models for RL training. We released the following critic model checkpoints (on Google Cloud Storage):

* [CodeT5-finetuned_critic](https://console.cloud.google.com/storage/browser/sfr-coderl-research/codet5_finetuned_critic): a CodeT5 model which is initialized from a normal CodeT5-base and trained as a classifier to predict unit test outcomes (one of Compile Error, Runtime Error, Failed Tests, and Passed Tests).
The critic is used to estimate returns and facilitate RL finetuning.
* [CodeT5-finetuned_critic_binary](https://console.cloud.google.com/storage/browser/sfr-coderl-research/codet5_finetuned_critic_binary): similar to the prior model but trained with binary annotations (Passed Tests or not Passed Tests only). This critic is used to facilitate generation procedures during inference.

We released the following finetuned code generation model checkpoints (on Google Cloud Storage):

* [CodeT5-finetuned_CodeRL](https://console.cloud.google.com/storage/browser/sfr-coderl-research/codet5_finetuned_codeRL): a CodeT5 model which was initialized from the prior pretrained CodeT5-large-ntp-py and then finetuned on APPS following our CodeRL training framework.

Download all files into the `models` folder.

## Processes

### Generating Programs

We created `scripts/generate.sh` to generate programs on the APPS benchmark. You can directly run this file by configuring the following parameters:

|   **Parameters**  | **Description** | **Example Values** |
|:-----------------:|:---------------:|:------------------:|
| `model_path` | Path to a trained CodeT5-style model | models/codet5\_finetuned_codeRL |
| `tokenizer_path` | Path to the saved tokenizer for CodeT5 (or path to cache the tokenizer) | models/codet5_tokenizer/ |
| `test_path` | Path to the original test samples | data/APPS/test/ |
| `start` | Start index of test samples to be generated | 0 |
| `end` | End index of test samples to be generated | 5000 |
| `num_seqs` | Number of total output programs to be generated (for sampling generation) | 1000 |
| `num_seqs_per_iter` | Depending on the GPU memory limit, we can generate in multiple rounds, each with this number of output programs | 50 |
| `temp` | Temperature for sampling generation | 0.6 |
| `output_path` | Path to save generated programs | outputs/codes/ |

Other parameters are defined in the file `utils/generate_configs.py`.

Running the generation script will output programs, each of which is saved into a `json` file, including data fields `code` (list of output programs) and `prompt` (constructed input sequence to the LM model).


### Running Unit Tests

Once the programs are generated, they are evaluated against the corresponding unseen unit tests of each problem.

To execute the unit tests and obtain test outcomes, we adapt our code to the official implementation of the [APPS benchmark](https://github.com/hendrycks/apps/tree/main/eval).

We created `scripts/run_unit_tests.sh` to run unit tests on generated programs on the APPS benchmark.
You can directly run this file by configuring the following parameters:

| **Parameters** | **Description** | **Example Values** |
|:--------------:|:---------------:|:------------------:|
| `code_path` | Path to the generated programs to be evaluated | outputs/codes/ |
| `output_path` | Path to save the unit test results | outputs/test_results/ |
| `test_path` | Path to the original test samples | data/APPS/test/ |
| `example_tests` | Whether to evaluate the programs on example unit tests (for filtering, refining programs) or hidden unit tests (for final evaluation) | 0: use hidden unit tests; 1: use example unit tests |
| `start` | Start index of test samples to be evaluated | 0 |
| `end` | End index of test samples to be evaluated | 5000 |
| `threads` | Depending on the capacity of the computation resource, we can run unit tests on multiple test samples over multiple threads to speed up execution | 30 |


Running the script will output test results for each program. For each test sample, the results are saved into a `pickle` file, including data fields `results` (list of test outcomes, one of -2 = compile error, -1 = runtime error, False = failed test case, True = passed test case), `errors` (real compile error traces with details like error type and line numbers), and `sols` (corresponding programs being evaluated).

Compared to the original implementation from APPS, we adopt one trick: we exit the unit testing loop as soon as a program fails a test case. This speeds up the testing process without affecting the final passing-rate measures. Refer to the `run_test` function in `utils/testing_utils.py` for more details.


### Evaluating Programs
To compute the pass@k metrics, rather than using the APPS evaluation metrics, we follow the official implementation of the [HumanEval benchmark](https://github.com/openai/human-eval), which better measures pass@k normalized by the number of possible k-program subsets.


### Training Critic

We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes ([generating programs](#generating-programs) and [running unit tests](#running-unit-tests)) to obtain synthetic samples and their annotations of unit test outcomes.
On average, we generate 20 programs per training sample (we provide some example generated programs in `data/APPS/train/`).

Once the programs are tested, we can use their test outcomes as annotations to train a critic model initialized from an LM pretrained on source code data (we used CodeT5-base in this case).

We created `scripts/train_critic.sh` and `scripts/train_critic_deepspeed.sh` to train a critic using generated programs. You can directly run these files by configuring the following parameters:

| **Parameters** | **Description** | **Example Values** |
|:--------------:|:---------------:|:------------------:|
| `batch-size-per-replica` | Number of training samples per GPU device | 8 |
| `grad-acc-steps` | Gradient accumulation steps | 1 |
| `epochs` | Number of training epochs | 10 |
| `lr` | Learning rate | 2e-5 |
| `save-freq` | Save model checkpoints after this number of training steps | 1000 |
| `log-freq` | Save model training losses after this number of training steps | 10 |
| `save_total_limit` | Total number of checkpoints to keep (only the latest ones are kept) | 5 |
| `fp16` | Enable this to train the model in 16-bit mode to reduce memory usage | N/A |
| `deepspeed` | If using DeepSpeed, set this parameter to the configuration file for DeepSpeed training | configs/deepspeed_configs.json |
| `db` | Enable this to train in debugging mode, i.e. with a small dummy data split and only 1 data worker | N/A |

Other parameters are defined in the file `utils/train_configs.py`.

Running the script will train a critic model as a classifier that receives a problem description + a generated program as input and returns one of 4 test outcomes: compile error, runtime error, failed tests, or passed tests. The model checkpoints are saved in a folder under `exps/`.

### Generating Critic Scores

We created `scripts/generate_critic_scores.sh` to generate critic scores for synthetic programs.
We use the same parameters as defined in [the generating program process](#generating-programs) with the following additional parameters:

|   **Parameters**  | **Description** | **Example Values** |
|:-----------------:|:---------------:|:------------------:|
| `critic_scores` | Enable this to run inference on critic models and obtain critic scores | N/A |
| `gt_solutions` | Enable this to run inference on ground-truth programs; else, synthetic programs are used by default | N/A |
| `binary_prediction` | Enable this to predict in binary classification mode, i.e. passed tests or failed tests only | N/A |

Other parameters are defined in the file `utils/generate_configs.py`.

Running the generation script will output predictions of the critic model.
For each data sample, the prediction is saved into a `pkl` (pickle) file, including data fields `code` (list of programs), `prompt` (constructed input sequence to the critic model), `gt_error_type` (ground-truth test outcomes), `pred_error_type` (test outcomes predicted by the critic), and `error_hidden_states` (hidden states returned by the critic).

### Finetuning with Ground-truth Programs

We can finetune any pretrained language model as a program synthesis model that generates code from a problem description in natural language. In our approach, this stage of finetuning is a warmup stage using the ground-truth annotations (from APPS) before a further finetuning stage on synthetic/generated programs.
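As a rough illustration of the data setup for this warm-up stage, the sketch below pairs each problem description with its ground-truth solutions. It assumes the public APPS folder layout (`question.txt` and `solutions.json` per sample) and a hypothetical prompt template; the actual sequence construction used in this repo lives in the training utilities, not here.

```python
# Hypothetical sketch only: the file names follow the public APPS layout,
# and the prompt template is an assumption, not the exact CodeRL format.
import json
from pathlib import Path


def build_warmup_pairs(sample_dir: str) -> list:
    """Return (input, target) pairs: problem description -> solution program."""
    root = Path(sample_dir)
    question = (root / "question.txt").read_text()
    solutions = json.loads((root / "solutions.json").read_text())
    prompt = f"QUESTION:\n{question}\nANSWER:\n"
    # One training pair per ground-truth solution of the problem.
    return [(prompt, sol) for sol in solutions]
```

Each pair would then be tokenized as an encoder input (prompt) and decoder target (solution) for standard cross-entropy finetuning of the seq2seq model.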
We created `scripts/train_actor.sh` and `scripts/train_actor_deepspeed.sh`, which use the parameters defined above in the [critic training process](#training-critic).

Running the script will finetune a pretrained CodeT5-large model that receives a problem description as input and returns a corresponding solution program in Python.
The model checkpoints are saved in a folder under `exps/`.

### Finetuning with Generated Programs

We created `scripts/train_actor_rl.sh` and `scripts/train_actor_rl_deepspeed.sh` to train pretrained LMs with synthetic generated programs.
We use the parameters as defined above in the [critic training process](#training-critic) with the following additional parameters:

|   **Parameters**  | **Description** | **Example Values** |
|:-----------------:|:---------------:|:------------------:|
| `model_path` | Path to a finetuned model checkpoint, e.g. from warm-up training | models/codet5_finetuned_codeRL |
| `relative_returns` | Enable this to use a baseline and compute relative return estimates rather than absolute return estimates in the RL loss | N/A |

Other parameters are defined in the file `utils/train_configs.py`.


Running the script will load a finetuned CodeT5-large model and continue to train it with both generated programs and ground-truth programs in alternating training steps.
The model checkpoints are saved in a folder under `exps/`.

### Generating Programs with Critic Sampling

We will release the implementation details of our critic sampling procedure.
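Since that implementation is not yet released, the following is an illustrative-only sketch of the high-level loop described in the overview: keep candidates that pass the example unit tests, otherwise fall back to the candidates the critic rates highest as seeds for a further refine/repair round. The `generate`, `run_example_tests`, and `critic_score` callables are stand-ins supplied by the caller, not functions from this repo.

```python
# Illustrative sketch only -- NOT the released critic sampling implementation.
from typing import Callable, List


def critic_sampling(
    generate: Callable[[int], List[str]],      # sample n candidate programs
    run_example_tests: Callable[[str], bool],  # pass/fail on example unit tests
    critic_score: Callable[[str], float],      # critic's estimated P(pass)
    num_samples: int = 16,
    top_k: int = 4,
) -> List[str]:
    """Return passing programs ranked by critic score, else the top-k candidates."""
    candidates = generate(num_samples)
    passing = [p for p in candidates if run_example_tests(p)]
    if passing:
        # Programs that already pass the example tests are kept and ranked.
        return sorted(passing, key=critic_score, reverse=True)
    # Otherwise keep the critic's favorites as seeds for refining/repairing
    # in a subsequent generation round.
    return sorted(candidates, key=critic_score, reverse=True)[:top_k]
```

In the paper's full procedure, the fallback candidates would be fed back into the generator for repair conditioned on test feedback; this sketch only shows the filter-and-rank skeleton.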
## Example Generated Programs

<p align="center">
<img src="images/example_code.png" width="100%" />
The problem is from the APPS benchmark, and the solution programs are generated by CodeT5 and CodeRL.
</p>

## Citation

If you find the paper or the source code useful for your projects, please cite the following bibtex:
<pre>
@inproceedings{
	le2022coderl,
	title={Code{RL}: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
	author={Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven Hoi},
	booktitle={Advances in Neural Information Processing Systems},
	editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
	year={2022},
	url={https://openreview.net/forum?id=WaGvb7OzySA}
}
</pre>


## License

The code is released under the BSD 3-Clause license - see `LICENSE.txt` for details.

This code is developed from other open-source projects, including [APPS](https://github.com/hendrycks/apps/), [HumanEval](https://github.com/openai/human-eval), and [transformers](https://github.com/huggingface/transformers). We thank the original contributors of these works for open-sourcing their valuable source code.