# Code Pretraining Models
This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.
- CodeBERT (EMNLP 2020)
- GraphCodeBERT (ICLR 2021)
- UniXcoder (ACL 2022)
- CodeReviewer (ESEC/FSE 2022)
- CodeExecutor (ACL 2023)
- LongCoder (ICML 2023)

# CodeBERT
This repo provides the code for reproducing the experiments in [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/pdf/2002.08155.pdf). CodeBERT is a pre-trained model for programming languages: a multi-programming-lingual model pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
### Dependency
- pip install torch
- pip install transformers

### Quick Tour
We use the huggingface/transformers framework to train the model. You can use our model like a pre-trained RoBERTa-base model. Below, we give an example of how to load the model.
```python
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
```

### NL-PL Embeddings
Here, we give an example of obtaining embeddings from CodeBERT.
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens=tokenizer.tokenize("return maximum value")
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
torch.Size([1, 23, 768])
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
```

### Probing
As stated in the paper, CodeBERT is not suitable for the mask-prediction task, while CodeBERT (MLM) is.
We give an example of how to use CodeBERT (MLM) for the mask-prediction task.
```python
from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)
```
Results
```python
'and', 'or', 'if', 'then', 'AND'
```
The detailed outputs are as follows:
```python
{'sequence': ' if (x is not None) and (x>1)', 'score': 0.6049249172210693, 'token': 8}
{'sequence': ' if (x is not None) or (x>1)', 'score': 0.30680200457572937, 'token': 50}
{'sequence': ' if (x is not None) if (x>1)', 'score': 0.02133703976869583, 'token': 114}
{'sequence': ' if (x is not None) then (x>1)', 'score': 0.018607674166560173, 'token': 172}
{'sequence': ' if (x is not None) AND (x>1)', 'score': 0.007619690150022507, 'token': 4248}
```

### Downstream Tasks
For Code Search and Code Documentation Generation tasks, please refer to the [CodeBERT](https://github.com/microsoft/CodeBERT/tree/master/CodeBERT) folder.
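The fine-tuned pipelines live in that folder. As a rough zero-shot sketch only (the [CLS] pooling and toy candidates below are our own choices, not the setup used in the paper), you can rank code snippets against a natural-language query by the cosine similarity of their CodeBERT embeddings:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: zero-shot ranking with [CLS] embeddings, not the
# fine-tuned code search pipeline from the CodeBERT folder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs[0][:, 0, :]  # [CLS] vector as a sentence-level embedding

query = "return maximum value"
candidates = [
    "def max(a, b): return a if a > b else b",
    "def min(a, b): return a if a < b else b",
]
query_vec = embed(query)
scores = [torch.cosine_similarity(query_vec, embed(c)).item() for c in candidates]
for score, code in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.4f}  {code}")
```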
# GraphCodeBERT
This repo also provides the code for reproducing the experiments in [GraphCodeBERT: Pre-training Code Representations with Data Flow](https://openreview.net/pdf?id=jLoC4ez43PZ). GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow; it is a multi-programming-lingual model pre-trained on NL-PL pairs across six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the [GraphCodeBERT](https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT) folder.
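The data-flow extraction and task pipelines are in that folder. If you only need the pre-trained weights, the checkpoint loads like any RoBERTa-style model (a minimal sketch, assuming the published microsoft/graphcodebert-base checkpoint on the Hugging Face Hub):
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the GraphCodeBERT checkpoint; reproducing the paper's results also
# requires the data-flow preprocessing found in the GraphCodeBERT folder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

inputs = tokenizer("def max(a, b): return a if a > b else b", return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)[0]  # (batch, sequence_length, hidden_size)
print(embeddings.shape)
```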
# UniXcoder
This repo will provide the code for reproducing the experiments in [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf). UniXcoder is a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.
Please refer to the [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) folder for tutorials and downstream tasks.
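As a bare-bones sketch (assuming the microsoft/unixcoder-base checkpoint on the Hugging Face Hub), the pre-trained weights can also be loaded directly through transformers; the folder above covers the encoder-only, decoder-only and encoder-decoder modes used for the actual understanding and generation tasks:
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Minimal sketch: load the UniXcoder checkpoint as a plain encoder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

inputs = tokenizer("def f(a, b): return a + b", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs)[0]
print(hidden_states.shape)
```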
# CodeReviewer
This repo also provides the code for reproducing the experiments in [CodeReviewer: Pre-Training for Automating Code Review Activities](https://arxiv.org/abs/2203.09095). CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.
Please refer to the [CodeReviewer](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer) folder for tutorials and downstream tasks.
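If you only want the pre-trained weights, they can be loaded as a sequence-to-sequence model (a sketch, assuming the microsoft/codereviewer checkpoint on the Hugging Face Hub; the diff/review input formats expected by the downstream tasks are documented in the CodeReviewer folder):
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Sketch only: load the CodeReviewer checkpoint as a seq2seq model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")
print(model.config.model_type, model.num_parameters())
```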
# CodeExecutor
This repo provides the code for reproducing the experiments in [Code Execution with Pre-trained Language Models](https://arxiv.org/pdf/2305.05383.pdf). CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.
Please refer to the [CodeExecutor](https://github.com/microsoft/CodeBERT/tree/master/CodeExecutor) folder for details.
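A minimal loading sketch (assuming a checkpoint named microsoft/codeexecutor is published on the Hugging Face Hub; the trace-prediction input format and decoding logic live in that folder):
```python
from transformers import AutoTokenizer, AutoModel

# Sketch only: load the CodeExecutor checkpoint; see the CodeExecutor folder
# for how execution traces are actually constructed and predicted.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codeexecutor")
model = AutoModel.from_pretrained("microsoft/codeexecutor")
print(model.config.model_type)
```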
# LongCoder
This repo will provide the code for reproducing the experiments on LCC datasets in [LongCoder: A Long-Range Pre-trained Language Model for Code Completion](https://arxiv.org/abs/2306.14893). LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.
Please refer to the [LongCoder](https://github.com/microsoft/CodeBERT/tree/master/LongCoder) folder for details.
## Contact
Feel free to contact Daya Guo ([email protected]), Shuai Lu ([email protected]) and Nan Duan ([email protected]) if you have any further questions.
## Contributing
We appreciate all contributions and thank all the contributors!