
# Code Pretraining Models

This repo contains code pretraining models in the CodeBERT series from Microsoft, including six models as of June 2023.
- CodeBERT (EMNLP 2020)
- GraphCodeBERT (ICLR 2021)
- UniXcoder (ACL 2022)
- CodeReviewer (ESEC/FSE 2022)
- CodeExecutor (ACL 2023)
- LongCoder (ICML 2023)

# CodeBERT

This repo provides the code for reproducing the experiments in [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/pdf/2002.08155.pdf). CodeBERT is a multi-programming-lingual pre-trained model for programming and natural languages, pre-trained on NL-PL (natural language and programming language) pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

### Dependencies

```bash
pip install torch
pip install transformers
```

### Quick Tour
We use the huggingface/transformers framework to train the model. You can use the model in the same way as a pre-trained RoBERTa base model. The following example shows how to load it.
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Load the CodeBERT tokenizer and encoder, then move the model to GPU if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
```
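
As a quick sanity check, you can tokenize an input and run a forward pass on the loaded model; a minimal sketch (the code string below is our own illustration, not from the paper):

```python
# Tokenize an example snippet and run a forward pass (illustrative input).
inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```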

### NL-PL Embeddings

Here is an example of how to obtain NL-PL embeddings from CodeBERT.

```python
>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> nl_tokens = tokenizer.tokenize("return maximum value")
>>> nl_tokens
['return', 'Ġmaximum', 'Ġvalue']
>>> code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
>>> code_tokens
['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
>>> tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.eos_token]
>>> tokens
['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
>>> tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
>>> tokens_ids
[0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
>>> context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0]
>>> context_embeddings.shape
torch.Size([1, 23, 768])
>>> context_embeddings
tensor([[[-0.1423,  0.3766,  0.0443,  ..., -0.2513, -0.3099,  0.3183],
         [-0.5739,  0.1333,  0.2314,  ..., -0.1240, -0.1219,  0.2033],
         [-0.1579,  0.1335,  0.0291,  ...,  0.2340, -0.8801,  0.6216],
         ...,
         [-0.4042,  0.2284,  0.5241,  ..., -0.2046, -0.2419,  0.7031],
         [-0.3894,  0.4603,  0.4797,  ..., -0.3335, -0.6049,  0.4730],
         [-0.1433,  0.3785,  0.0450,  ..., -0.2527, -0.3121,  0.3207]]],
       grad_fn=<NativeLayerNormBackward0>)
```
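
Continuing the session above, if you need one vector per input rather than per-token embeddings, a common convention (our assumption here; the paper's downstream tasks fine-tune task-specific heads instead) is to take the first token's embedding:

```python
>>> # Pool a sequence-level vector from the first ([CLS]/<s>) position;
>>> # this pooling choice is a common convention, not mandated by the paper.
>>> sequence_embedding = context_embeddings[:, 0]  # shape: (1, 768)
```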

### Probing

As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM), which is trained with the masked language modeling objective only, is.

The following example shows how to use CodeBERT (MLM) for mask prediction.
```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")

# The <mask> token marks the position for the model to predict.
CODE = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(CODE)
print(outputs)
```
Results
```python
'and', 'or', 'if', 'then', 'AND'
```
The detailed outputs are as follows:
```python
{'sequence': ' if (x is not None) and (x>1)', 'score': 0.6049249172210693, 'token': 8}
{'sequence': ' if (x is not None) or (x>1)', 'score': 0.30680200457572937, 'token': 50}
{'sequence': ' if (x is not None) if (x>1)', 'score': 0.02133703976869583, 'token': 114}
{'sequence': ' if (x is not None) then (x>1)', 'score': 0.018607674166560173, 'token': 172}
{'sequence': ' if (x is not None) AND (x>1)', 'score': 0.007619690150022507, 'token': 4248}
```
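
The pipeline returns the top 5 predictions by default; recent versions of transformers also accept a `top_k` argument if you want more candidates:

```python
# Ask the fill-mask pipeline for more than the default 5 candidates.
outputs = fill_mask(CODE, top_k=10)
```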

### Downstream Tasks

For Code Search and Code Documentation Generation tasks, please refer to the [CodeBERT](https://github.com/microsoft/CodeBERT/tree/master/CodeBERT) folder.
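
For intuition, here is a hedged zero-shot sketch of the code search setting: rank code snippets against a natural-language query by cosine similarity of their CodeBERT embeddings. The paper's actual experiments fine-tune the model (see the folder above); this sketch and its example strings are only an illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    # Use the first-token embedding as a sequence-level vector (a common
    # convention; the paper's fine-tuned models learn task-specific heads).
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0]

query = embed("return maximum value")
candidates = [
    "def max(a, b): return a if a > b else b",
    "def add(a, b): return a + b",
]
scores = [torch.cosine_similarity(query, embed(c)).item() for c in candidates]
print(scores)  # ideally the max() snippet scores higher than add()
```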

# GraphCodeBERT

This repo also provides the code for reproducing the experiments in [GraphCodeBERT: Pre-training Code Representations with Data Flow](https://openreview.net/pdf?id=jLoC4ez43PZ). GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow. Like CodeBERT, it is a multi-programming-lingual model pre-trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

For downstream tasks like code search, clone detection, code refinement and code translation, please refer to the [GraphCodeBERT](https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT) folder.
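
The pre-trained checkpoint is published on the Hugging Face Hub as microsoft/graphcodebert-base; a minimal loading sketch (building the data-flow inputs used in the paper requires the preprocessing code in the folder above):

```python
from transformers import AutoTokenizer, AutoModel

# Plain checkpoint loading; the data-flow graph construction from the paper
# is implemented separately in the GraphCodeBERT folder of this repo.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
```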

# UniXcoder

This repo also provides the code for reproducing the experiments in [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf). UniXcoder is a unified cross-modal pre-trained model for programming languages that supports both code-related understanding and generation tasks.

Please refer to the [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder) folder for tutorials and downstream tasks.
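
The base checkpoint is available on the Hugging Face Hub as microsoft/unixcoder-base; a minimal loading sketch (the UniXcoder folder provides a higher-level wrapper and the mode-specific attention masks described in the paper):

```python
from transformers import AutoTokenizer, AutoModel

# Plain checkpoint loading; the encoder-only, decoder-only, and
# encoder-decoder modes from the paper are configured via the code
# in the UniXcoder folder.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")
```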

# CodeReviewer

This repo also provides the code for reproducing the experiments in [CodeReviewer: Pre-Training for Automating Code Review Activities](https://arxiv.org/abs/2203.09095). CodeReviewer is a model pre-trained with code change and code review data to support code review tasks.

Please refer to the [CodeReviewer](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer) folder for tutorials and downstream tasks.
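
Assuming the checkpoint published on the Hugging Face Hub as microsoft/codereviewer, which is an encoder-decoder model, a minimal loading sketch:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# CodeReviewer is an encoder-decoder model, so it loads as a seq2seq LM.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")
```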

# CodeExecutor

This repo provides the code for reproducing the experiments in [Code Execution with Pre-trained Language Models](https://arxiv.org/pdf/2305.05383.pdf). CodeExecutor is a pre-trained model that learns to predict the execution traces using a code execution pre-training task and curriculum learning.

Please refer to the [CodeExecutor](https://github.com/microsoft/CodeBERT/tree/master/CodeExecutor) folder for details.

# LongCoder

This repo also provides the code for reproducing the experiments on the LCC datasets in [LongCoder: A Long-Range Pre-trained Language Model for Code Completion](https://arxiv.org/abs/2306.14893). LongCoder is a sparse and efficient pre-trained Transformer model for long code modeling.

Please refer to the [LongCoder](https://github.com/microsoft/CodeBERT/tree/master/LongCoder) folder for details.

## Contact

Feel free to contact Daya Guo ([email protected]), Shuai Lu ([email protected]) and Nan Duan ([email protected]) if you have any further questions.

## Contributing

We appreciate all contributions and thank all the contributors!