{"id":17300811,"url":"https://github.com/michiyasunaga/bifi","last_synced_at":"2025-09-14T23:30:03.752Z","repository":{"id":45383522,"uuid":"371252906","full_name":"michiyasunaga/BIFI","owner":"michiyasunaga","description":"[ICML 2021] Break-It-Fix-It: Unsupervised Learning for Program Repair","archived":false,"fork":false,"pushed_at":"2023-04-20T17:19:14.000Z","size":9655,"stargazers_count":113,"open_issues_count":5,"forks_count":26,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-14T11:06:57.943Z","etag":null,"topics":["domain-adaptation","program-repair","translation","unsupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michiyasunaga.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-05-27T05:07:41.000Z","updated_at":"2025-02-06T03:01:05.000Z","dependencies_parsed_at":"2022-08-12T11:52:03.232Z","dependency_job_id":"e2557b5f-3a62-4914-aede-a19096f74203","html_url":"https://github.com/michiyasunaga/BIFI","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2FBIFI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2FBIFI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2FBIFI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michiyasunaga%2FBIFI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michiyasunaga","download_url":"ht
tps://codeload.github.com/michiyasunaga/BIFI/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248868766,"owners_count":21174758,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["domain-adaptation","program-repair","translation","unsupervised-learning"],"created_at":"2024-10-15T11:30:20.864Z","updated_at":"2025-04-14T11:07:07.772Z","avatar_url":"https://github.com/michiyasunaga.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data\n\nThis repo provides the source code \u0026 data of our paper: [Break-It-Fix-It: Unsupervised Learning for Program Repair](http://arxiv.org/abs/2106.06600) (ICML 2021).\n```bib\n@InProceedings{yasunaga2021break,\n  author =  {Michihiro Yasunaga and Percy Liang},\n  title =   {Break-It-Fix-It: Unsupervised Learning for Program Repair},\n  year =    {2021},  \n  booktitle = {International Conference on Machine Learning (ICML)},  \n}\n```\n**Problem: Repair Task**\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figs/task.png\" width=\"600\" title=\"Problem: Repair Task\" alt=\"\"\u003e\n\u003c/p\u003e\n\n**Our approach: BIFI**\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figs/bifi.png\" width=\"500\" title=\"Overview of BIFI training algorithm\" alt=\"\"\u003e\n\u003c/p\u003e\n\n\n## 0. 
Dependencies\n\nRun the following commands to create a conda environment (assuming CUDA 10.1):\n```bash\nconda create -n BIFI python=3.7.7\nconda activate BIFI\npip install tqdm\npip install torch==1.4.0 torchvision==0.5.0\ncd utils/fairseq\npip install -e .\npip install numpy==1.20.1 editdistance\n```\nAlternatively, you can use the Dockerfile in the `docker` folder of this repo to set up the environment.\n\n\n## 1. Download Data\n\nDownload all the data from [here (`data.zip`)](https://nlp.stanford.edu/projects/myasu/BIFI/data.zip) and unzip it (note: 67GB when compressed, 400GB when decompressed). This includes the GitHub-Python dataset and all the processed training data and trained models associated with BIFI.\nIf you only want the original GitHub-Python dataset, you can download it from [here (`data_minimal.zip`; 1GB)](https://nlp.stanford.edu/projects/myasu/BIFI/data_minimal.zip).\nAfter unzipping `data.zip`, the resulting file structure will look like:\n```plain\n.\n├── README.md\n└── data/\n    ├── orig_bad_code/       (GitHub-Python dataset's bad code)\n    ├── orig_good_code/      (GitHub-Python dataset's good code)\n    ├── round0/\n    │   ├── data_paired      (paired data used to train fixer in round0)\n    │   └── model-fixer      (fixer trained in round0)\n    ├── round1-BIFI-part1/\n    │   ├── data_paired      (paired data used to train breaker in BIFI round1)\n    │   └── model-breaker    (breaker trained in BIFI round1)\n    ├── round1-BIFI-part2/\n    │   ├── data_paired      (paired data used to train fixer in BIFI round1)\n    │   └── model-fixer      (fixer trained in BIFI round1)\n    ├── ...\n```\n\n### About the GitHub-Python dataset\nWe collected 3 million Python3 snippets from GitHub. 
Using the critic (Python AST parser), the code snippets are split into a set of bad code (with AST parse errors) and a set of good code (with no errors).\nThe set of bad code is located at `data/orig_bad_code/orig.bad.json` and good code at `data/orig_good_code/orig.good.json`.\nEach entry of `orig.bad.json` or `orig.good.json` is a dictionary consisting of\n  - **\"code_string\"**: raw code as a string\n  - **\"code_toks_joined\"**: the raw code split into tokens by the Python tokenizer, anonymized (strings and numbers are replaced with the special tokens `\u003cSTRING\u003e`/`\u003cNUMBER\u003e`), and then joined by whitespace. The tokenization was done by `utils/code_utils.py: tokenize_python_code()`\n  - **\"anonymize_dict\"**: mapping between raw strings/numbers and `\u003cSTRING\u003e`/`\u003cNUMBER\u003e` so that \"code_string\" can be recovered from \"code_toks_joined\". This recovery can be done by `utils/code_utils.py: code_toks_to_code_string()`\n  - **\"err_obj\"**: type of the error caught by the critic (e.g. unbalanced parentheses, indentation error). This is only applicable to `orig.bad.json`.\n\n\nThe bad code snippets in `orig.bad.json` are split into 5 chunks (`orig.0.bad` to `orig.4.bad` in `data/orig_bad_code/`), where chunks 3 and 4 are held out as the test set and chunks 0, 1, and 2 are made available for BIFI training. This splitting was done by `scripts/split_orig_bad_and_good.py`.\n\n\n\n## 2. Training and Evaluation\nFirst, train the initial fixer by running commands in `src/run-round0.py` one by one. We then consider three training algorithms on top of it: **BIFI** (our proposed method), **FixerOnly** (BIFI without breaker), and **BackTranslation** (BT; our baseline). 
For each algorithm,\n  - **BIFI**: run commands in `src/run-BIFI.py` one by one\n  - **FixerOnly**: run commands in `src/run-FixerOnly.py` one by one\n  - **BT**: run commands in `src/run-BT.py` one by one\n\nBelow is an illustration for the case of BIFI.\n\n**run-round0.sh**\n```bash\nexport PYTHONPATH=.\n\n#Train initial fixer on synthetic paired data\npython src/c001__train_fixer.py --round_name round0 --gpu_id 0 --max_epoch 2\n\n#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic\npython src/c003__run_fixer.py   --round_name round0 --gpu_ids '0,1,2,3,4'\n\n#Evaluate the fixer outputs on the test set (chunk 3,4)\npython src/c005__eval_fixer.py  --round_name round0\n```\n\n\n**run-BIFI.sh** (round 1)\n```bash\n#Use the fixer outputs on the bad code (chunk 0,1,2) to get new paired data (Equation 6 in the paper)\npython src/c006__generate_paired_data_from_fixer.py --round_name round0 --out_round_name round1-BIFI-part1\n\n#Train breaker on the new paired data (Equation 7 in the paper)\npython src/c002__train_breaker.py --round_name round1-BIFI-part1 --gpu_id 0 --max_epoch 3\n\n#Run the trained breaker on the good code and get new paired data (Equation 8 in the paper)\npython src/c004__run_breaker.py   --round_name round1-BIFI-part1 --gpu_ids '0,1,2,3,4'\npython src/c007__generate_paired_data_from_breaker.py --round_name round1-BIFI-part1 --out_round_name round1-BIFI-part2\n\n#Train fixer on the new paired data (Equation 9 in the paper)\npython src/c001__train_fixer.py --round_name round1-BIFI-part2 --gpu_id 0 --max_epoch 2 --continue_from 'data/round0/model-fixer/checkpoint.pt'\n\n#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic\npython src/c003__run_fixer.py   --round_name round1-BIFI-part2 --gpu_ids '0,1,2,3,4'\n\n#Evaluate the fixer outputs on the test set (chunk 3,4)\npython src/c005__eval_fixer.py  --round_name round1-BIFI-part2\n```\nThis is repeated similarly for round 
2.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichiyasunaga%2Fbifi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichiyasunaga%2Fbifi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichiyasunaga%2Fbifi/lists"}