{"id":13427568,"url":"https://github.com/facebookresearch/CodeGen","last_synced_at":"2025-03-16T00:31:58.580Z","repository":{"id":37693295,"uuid":"387437060","full_name":"facebookresearch/CodeGen","owner":"facebookresearch","description":"Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.","archived":false,"fork":false,"pushed_at":"2024-03-12T15:30:12.000Z","size":14570,"stargazers_count":699,"open_issues_count":37,"forks_count":141,"subscribers_count":35,"default_branch":"main","last_synced_at":"2024-08-01T01:27:34.555Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-07-19T11:14:39.000Z","updated_at":"2024-07-30T17:39:04.000Z","dependencies_parsed_at":"2022-07-14T23:30:36.293Z","dependency_job_id":"92b4ba97-0ecb-4185-a3cf-07397460375d","html_url":"https://github.com/facebookresearch/CodeGen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FCodeGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FCodeGen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FCodeGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/facebookresearch%2FCodeGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/CodeGen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221631802,"owners_count":16855011,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T01:00:31.873Z","updated_at":"2024-10-27T05:30:16.206Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["Paper List","LLM Models"],"sub_categories":["Transformer-based"],"readme":"This repository is a toolkit to do machine learning for programming languages. 
It implements tokenization, dataset preprocessing, model training and model evaluation.\n\nWe provide reference implementations of the following papers:\n- [TransCoder: Unsupervised Translation of Programming Languages](https://arxiv.org/pdf/2006.03511.pdf) (2020)\n- [DOBF: A Deobfuscation Pre-Training Objective for Programming Languages](https://arxiv.org/pdf/2102.07492.pdf) (2021)\n- [TransCoder-ST: Leveraging Automated Unit Tests for Unsupervised Code Translation](https://arxiv.org/pdf/2110.06773.pdf) (2021)\n- [TransCoder-IR: Code Translation with Compiler Representations](https://arxiv.org/pdf/2207.03578.pdf) (2022)\n\nWe also provide pre-trained models for language modeling, translation and deobfuscation.\n\nYou can find documentation for each project in the docs folder:\n- [TransCoder](docs/transcoder.md)\n- [DOBF](docs/dobf.md)\n- [TransCoder-ST](docs/TransCoder-ST.md)\n- [TransCoder-IR](docs/TransCoder-IR.md)\n\n\n## Dependencies\nRun [install_env.sh](install_env.sh).\nWe use the black code formatter.\n\n## Data\n### Source code processors\n\nThis repository contains [programming language processors](codegen_sources/preprocessing/lang_processors/lang_processor.py) for C++, Java and Python. These processors include:\n - tokenization and detokenization\n - obfuscation\n - function extraction\n\nThese processors are based on [TreeSitter](https://tree-sitter.github.io/tree-sitter/) parsers. 
As these parsers are available for more than 30 programming languages, one can easily create a new programming language processor.\n\nExample of code tokenization:\n\n```python\nfrom codegen_sources.preprocessing.lang_processors.java_processor import JavaProcessor\n\njava_code = r\"\"\"class HelloWorld {\n    public static void main(String[] args) {\n        System.out.println(\"Hello, World!\"); \n    }\n}\"\"\"\njava_processor = JavaProcessor(root_folder=\"\u003cYOUR_TREESITER_FOLDER\u003e\")\ntokenized_java_code = java_processor.tokenize_code(java_code)\nprint(tokenized_java_code)\n```\n\n### BPE\nThis repository provides wrappers for [fast BPE](codegen_sources/preprocessing/bpe_modes/fast_bpe_mode.py) and [Roberta BPE](codegen_sources/preprocessing/bpe_modes/roberta_bpe_mode.py) at file level.\n\n### Dataset Preprocessing\n\nThis repository contains a [pipeline](codegen_sources/preprocessing/preprocess.py) to create programming language datasets. It currently supports [four dataset modes](codegen_sources/preprocessing/dataset_modes):\n- Monolingual (ex: Java source code) \n- Monolingual Functions (ex: Java functions) \n- Monolingual Obfuscated (ex: Obfuscated Java source code)\n- Monolingual Obfuscated Functions (ex: Obfuscated Java functions)\n\nFirst, download C++ / Java / Python source code from [Google BigQuery](https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code). To run our preprocessing pipeline, you need to download the raw source code to your machine in JSON format. A sample of it is given [here](data/test_dataset).\n\nThe pipeline does the following:\n- Source code extraction from json (`.json.gz`) and tokenization (`.tok`)\n- Train BPE codes and vocab \n- Apply BPE (`.bpe`)\n- Binarization (`.pth`)\n- Symlink folder with appropriate file names for `.pth` (XLM-syml). 
To be given as the `data_path` argument for training.\n\nTo run the pipeline:\n\n```bash\n# \u003cDATA_PATH\u003e: folder containing the json.gz files\n# --langs: languages to process\n# --mode: dataset mode\n# --bpe_mode: BPE mode, fast (default) or roberta\n# --local: if True, run on your local machine; if False, run on a cluster (requires submitit setup)\n# --train_splits: number of training splits\npython -m codegen_sources.preprocessing.preprocess \\\n\u003cDATA_PATH\u003e \\\n--langs java cpp python \\\n--mode monolingual_functions \\\n--bpe_mode=fast \\\n--local=True \\\n--train_splits=1\n```\nIf you give several languages, the BPE codes and vocab will be learned jointly on these languages, so that you will have a common vocabulary to train one model for several languages. If you do not want that, launch the pipeline on each language separately. [These tests](codegen_sources/preprocessing/tests/pipeline/test_pipeline.py) run the pipeline in different modes and give an overview of the possible options. \n\nAlso, we provide the BPE codes and vocabulary [here](data/bpe/cpp-java-python). These are the codes and vocabulary used for TransCoder and DOBF. They were learned on concatenated C++, Java, and Python data. If you want to use them instead of learning new ones, give the corresponding paths as ```fastbpe_code_path``` and ```fastbpe_vocab_path``` arguments.\n\nIn the TransCoder and DOBF READMEs, we provide the commands to preprocess the respective datasets.\n\n\n## Model\n\n### Overview\nIn this repository, we provide [code](codegen_sources/model) to [train](codegen_sources/model/train.py) transformer-based models (code based on [XLM repository](https://github.com/facebookresearch/XLM)). 
The available training tasks are the following:\n- Masked Language Model (MLM)\n- Causal Language Model (CLM)\n- Supervised Machine translation (MT)\n- Classification\n- Deobfuscation = DOBF \n- Unsupervised Machine translation = TransCoder (Denoising auto encoding AE + Back Translation BT) \n\nWe [evaluate](codegen_sources/model/src/evaluation/evaluator.py) our models with metrics adapted to each task (e.g. computation accuracy and BLEU score for TransCoder, subtoken score for Deobfuscation).\n\nAlso, we provide [wrappers](codegen_sources/wrappers) to fine-tune and evaluate our models on the [CodeXGLUE](https://arxiv.org/pdf/2102.04664.pdf) benchmark.\n\n\n### Download models\nYou can download the following models:\n- [MLM](docs/dobf.md#pre-trained-models)\n- [TransCoder](docs/transcoder.md#pre-trained-models). Use it to translate some code [here](codegen_sources/model/translate.py).\n- [DOBF](docs/dobf.md#pre-trained-models). Use it to deobfuscate some code [here](codegen_sources/model/deobfuscate.py).\n\n### Retrain specific models\n\nFor details on how to retrain specific models, please refer to the README specific to each model.\n- [TransCoder README](docs/transcoder.md)\n- [DOBF README](docs/dobf.md)\n\n## References\n\n### TransCoder model (NeurIPS 2020)\n\n[1] B. Roziere*, M.A. Lachaux*, L. Chanussot, G. Lample [Unsupervised Translation of Programming Languages](https://research.fb.com/wp-content/uploads/2020/11/Unsupervised-Translation-of-Programming-Languages.pdf).\n\n```\n@article{roziere2020unsupervised,\n  title={Unsupervised translation of programming languages},\n  author={Roziere, Baptiste and Lachaux, Marie-Anne and Chanussot, Lowik and Lample, Guillaume},\n  journal={Advances in Neural Information Processing Systems},\n  volume={33},\n  year={2020}\n}\n```\n\n### DOBF\n\n[2] B. Roziere*, M.A. Lachaux*, M. Szafraniec, G. 
Lample [DOBF: A Deobfuscation Pre-Training Objective for Programming Languages](https://arxiv.org/abs/2102.07492).\n\n```\n@article{roziere2021dobf,\n  title={{DOBF}: A Deobfuscation Pre-Training Objective for Programming Languages},\n  author={Roziere, Baptiste and Lachaux, Marie-Anne and Szafraniec, Marc and Lample, Guillaume},\n  journal={arXiv preprint arXiv:2102.07492},\n  year={2021}\n}\n```\n\n### TransCoder-ST\n[3] B. Roziere, J.M. Zhang, F. Charton, M. Harman, G. Synnaeve, G. Lample [Leveraging Automated Unit Tests for Unsupervised Code Translation](https://arxiv.org/pdf/2110.06773.pdf).\n\n```\n@article{roziere2021leveraging,\n  title={Leveraging Automated Unit Tests for Unsupervised Code Translation},\n  author={Roziere, Baptiste and Zhang, Jie M and Charton, Francois and Harman, Mark and Synnaeve, Gabriel and Lample, Guillaume},\n  journal={ICLR},\n  year={2022}\n}\n```\n\n### TransCoder-IR\n```\n@article{szafraniec2022code,\n  title={Code Translation with Compiler Representations},\n  author={Szafraniec, Marc and Roziere, Baptiste and Leather, Hugh and Charton, Francois and Labatut, Patrick and Synnaeve, Gabriel},\n  journal={ICLR},\n  year={2023}\n}\n```\n\n\\* Equal Contribution\n\n## License\nThe validation and test parallel datasets from GeeksForGeeks, and the evaluation scripts under [data/transcoder_evaluation_gfg](data/transcoder_evaluation_gfg) are released under the Creative Commons Attribution-ShareAlike 2.0 license. See https://creativecommons.org/licenses/by-sa/2.0/ for more information.\n\nThe rest of the `CodeGen` repository is under the MIT license. See [LICENSE](LICENSE) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FCodeGen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FCodeGen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FCodeGen/lists"}