{"id":13427551,"url":"https://github.com/microsoft/CodeBERT","last_synced_at":"2025-03-16T00:32:06.891Z","repository":{"id":37392017,"uuid":"272909064","full_name":"microsoft/CodeBERT","owner":"microsoft","description":"CodeBERT","archived":false,"fork":false,"pushed_at":"2023-07-09T12:26:30.000Z","size":74348,"stargazers_count":2415,"open_issues_count":83,"forks_count":478,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-03-09T18:27:16.409Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-06-17T07:37:08.000Z","updated_at":"2025-03-09T12:12:32.000Z","dependencies_parsed_at":"2024-01-13T21:46:18.515Z","dependency_job_id":null,"html_url":"https://github.com/microsoft/CodeBERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCodeBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCodeBERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCodeBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FCodeBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/CodeBERT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243809865,"owners_count":20351403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T01:00:31.391Z","updated_at":"2025-03-16T00:32:06.885Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"# Code Pretraining Models\n\nThis repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.\n- CodeBERT (EMNLP 2020)\n- GraphCodeBERT (ICLR 2021)\n- UniXcoder (ACL 2022)\n- CodeReviewer (ESEC/FSE 2022)\n- CodeExecutor (ACL 2023)\n- LongCoder (ICML 2023)\n\n# CodeBERT\n\nThis repo provides the code for reproducing the experiments in [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/pdf/2002.08155.pdf). CodeBERT is a pre-trained model for programming language, which is a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go). \n\n### Dependency\n\n- pip install torch\n- pip install transformers\n\n### Quick Tour\nWe use huggingface/transformers framework to train the model. 
You can use our model like a pre-trained RoBERTa base model. The example below shows how to load it.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
```

### NL-PL Embeddings

Here is an example of obtaining embeddings from CodeBERT.

```python
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens = tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0]
>>> context_embeddings.shape
torch.Size([1, 23, 768])
>>> context_embeddings[0]
tensor([[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
        [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
        [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
        ...,
        [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
        [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
        [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]],
       grad_fn=<SelectBackward>)
```

### Probing

As stated in the paper, CodeBERT is not suitable for the masked-token prediction task, while CodeBERT (MLM) is.

The example below uses CodeBERT (MLM) for masked-token prediction.

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)
```

Results

```python
'and', 'or', 'if', 'then', 'AND'
```

The detailed outputs are as follows:

```python
{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
```
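If you want to inspect the prediction without the `fill-mask` pipeline helper, the same result can be reproduced from the raw masked-LM logits. This is a minimal sketch, not part of the original recipe; it assumes a recent `transformers` version where model outputs expose `.logits`.

```python
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
model.eval()

inputs = tokenizer("if (x is not None) <mask> (x>1)", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Locate the <mask> position and take the five most likely tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_pos[0]].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(tokenizer.decode([token_id]).strip(), round(score, 4))
```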
### Downstream Tasks

For the code search and code documentation generation tasks, please refer to the [CodeBERT](https://github.com/microsoft/CodeBERT/tree/master/CodeBERT) folder.

# GraphCodeBERT

This repo also provides the code for reproducing the experiments in [GraphCodeBERT: Pre-training Code Representations with Data Flow](https://openreview.net/pdf?id=jLoC4ez43PZ). GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow. Like CodeBERT, it is pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, and Go).

For downstream tasks such as code search, clone detection, code refinement, and code translation, please refer to the [GraphCodeBERT](https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT) folder.
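The pre-trained encoder is published on the Hugging Face Hub as `microsoft/graphcodebert-base`. Below is a minimal loading sketch (not from the original README): it only produces plain token embeddings, since the data-flow-specific inputs used in the paper are constructed by the task code in the GraphCodeBERT folder.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# GraphCodeBERT shares the RoBERTa architecture, so it loads like CodeBERT.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

code_tokens = tokenizer.tokenize("def max(a,b): return a if a>b else b")
tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.sep_token]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
embeddings = model(torch.tensor(token_ids)[None, :])[0]  # [1, seq_len, 768]
```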
# UniXcoder

This repo also provides the code for reproducing the experiments in [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf). UniXcoder is a unified cross-modal pre-trained model for programming languages that supports both code-related understanding and generation tasks.

Please refer to the [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) folder for tutorials and downstream tasks.

# CodeReviewer

This repo also provides the code for reproducing the experiments in [CodeReviewer: Pre-Training for Automating Code Review Activities](https://arxiv.org/abs/2203.09095). CodeReviewer is a model pre-trained on code changes and code review data to support code review tasks.

Please refer to the [CodeReviewer](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer) folder for tutorials and downstream tasks.

# CodeExecutor

This repo provides the code for reproducing the experiments in [Code Execution with Pre-trained Language Models](https://arxiv.org/pdf/2305.05383.pdf). CodeExecutor is a pre-trained model that learns to predict execution traces using a code execution pre-training task and curriculum learning.

Please refer to the [CodeExecutor](https://github.com/microsoft/CodeBERT/tree/master/CodeExecutor) folder for details.

# LongCoder

This repo also provides the code for reproducing the experiments on the LCC datasets in [LongCoder: A Long-Range Pre-trained Language Model for Code Completion](https://arxiv.org/abs/2306.14893). LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.

Please refer to the [LongCoder](https://github.com/microsoft/CodeBERT/tree/master/LongCoder) folder for details.

## Contact

Feel free to contact Daya Guo (guody5@mail2.sysu.edu.cn), Shuai Lu (shuailu@microsoft.com), or Nan Duan (nanduan@microsoft.com) if you have any further questions.

## Contributing

We appreciate all contributions and thank all the contributors!

<p align="center">
  <img src="https://contributors-img.web.app/image?repo=microsoft/CodeBERT" />
</p>