CoditT5: Pretraining for Source Code and Natural Language Editing
- Host: GitHub
- URL: https://github.com/engineeringsoftware/coditt5
- Owner: EngineeringSoftware
- License: mit
- Created: 2022-08-22T17:02:57.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-23T23:03:32.000Z (3 months ago)
- Last Synced: 2024-10-24T12:24:19.528Z (3 months ago)
- Topics: machine-learning, pretrained-language-model, software-engineering
- Language: Python
- Homepage:
- Size: 91.9 MB
- Stars: 28
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# CoditT5: Pretraining for Source Code and Natural Language Editing
This repo hosts the code and data for the following ASE 2022 paper:
Title: [CoditT5: Pretraining for Source Code and Natural Language Editing](https://arxiv.org/abs/2208.05446)
Authors: [Jiyang Zhang](https://jiyangzhang.github.io/), [Sheena Panthaplackel](https://panthap2.github.io/), [Pengyu Nie](https://pengyunie.github.io/), [Junyi Jessy Li](https://jessyli.com/), [Milos Gligoric](http://users.ece.utexas.edu/~gligoric/)
```bibtex
@inproceedings{ZhangETAL22CoditT5,
author = {Zhang, Jiyang and Panthaplackel, Sheena and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
title = {Codit{T}5: Pretraining for Source Code and Natural Language Editing},
booktitle = {International Conference on Automated Software Engineering},
year = {2022},
}
```

## News
**Aug 2023**: The **pretrained CoditT5** model is released on 🤗! 🔥 [link](https://huggingface.co/JiyangZhang/CoditT5)
Note: It is recommended to fine-tune the model before applying it to downstream tasks.
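If you want to adapt the released checkpoint directly through the Hugging Face API, outside of this repo's `run.sh` pipeline, a minimal fine-tuning sketch is shown below. The (source, target) pair is a made-up toy example; for real experiments, use the task datasets and training scripts provided in this repo.

```python
# Minimal fine-tuning sketch (single toy example; add batching, evaluation,
# and checkpointing for real use).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "JiyangZhang/CoditT5"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)
model.train()

# Hypothetical (source, target) pair for an edit task.
source = 'class HelloWorld { public static void main(String[] args) { System.out.println("Hello") }'
target = 'class HelloWorld { public static void main(String[] args) { System.out.println("Hello"); } }'

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=512).input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```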
## Introduction

This repo contains the code and artifacts for reproducing the experiments in [CoditT5: Pretraining for Source Code and Natural Language Editing](https://arxiv.org/abs/2208.05446).
In this work, we introduce CoditT5 for software **edit** tasks. CoditT5 is a large language model pretrained with a novel objective that explicitly models edits. CoditT5 sets the state of the art on downstream tasks including comment updating, bug fixing, and automated code review.

The code includes:
- scripts for synthesizing pretraining data for CoditT5
- scripts for processing data for downstream tasks
- scripts for training and evaluating CoditT5 on three downstream tasks
- scripts for combining CoditT5 and CodeT5 through reranking

The artifacts include:
- dataset used for pretraining CoditT5
- datasets for downstream tasks
- checkpoint for the pretrained CoditT5
- checkpoints for the CoditT5 models fine-tuned for downstream tasks

## Table of Contents
1. [How to Use][sec-howto]
2. [Dependency][sec-dependency]
3. [Data Downloads][sec-downloads]
4. [Code for Pretraining][sec-pretrain]
5. [Code for Processing Fine-tuning Data][sec-process]
6. [Code for Training and Evaluating Models][sec-traineval]
7. [Code for Combining CodeT5 and CoditT5][sec-rerank]

## How to Use
[sec-howto]: #how-to-use
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "JiyangZhang/CoditT5"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

code_input = """class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!")"""
input_ids = tokenizer(code_input, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=200)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# output: ; } } ; class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!") ; } } ;
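# The decoded output concatenates the predicted edit plan with the edited code
# (see the data-format notes under "Code for Processing Fine-tuning Data").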
```

## Dependency
[sec-dependency]: #dependency
Our code requires the following hardware and software environments.
- Operating system: Linux (tested on Ubuntu 20.04)
- Minimum disk space: 4 GB
- Python: 3.8
- Anaconda/Miniconda: appropriate versions for Python 3.8 or higher

Additional requirements for training and evaluating ML models:
- GPU: NVIDIA GTX 1080 Ti or better (with >= 11GB memory)
- CUDA: 10.2 or 11.3
- Disk space: 2 GB per trained model

[Anaconda](https://www.anaconda.com/products/individual#Downloads) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is required for installing the other Python library dependencies. Once Anaconda/Miniconda is installed, you can use the following command to set up a virtual environment, named `cdt`, with the Python library dependencies installed:
```
cd python/
./prepare_conda_env.sh
```

And then use `conda activate cdt` to activate the created virtual environment.
## Data Downloads
[sec-downloads]: #data-downloads
All our data is hosted on UTBox via [a zip file](https://utexas.box.com/s/9rkqnlp6wjhwyfmxce97pgb4ersfc1f9).
Data should be downloaded to this directory with the same directory structure (e.g., `data/` from the shared folder should be downloaded as `data/` under the current directory).
## Code for Pretraining
[sec-pretrain]: #code-for-pretraining
### Synthesize Pretraining Data
We provide sample scripts to synthesize the pretraining dataset (by corrupting programming language code snippets and natural language comments) for CoditT5.
First, prepare the programming language and natural language data for pretraining. Then specify the following variables in the function `corrupt_pretrain_data()` in `python/run.sh`:
- `source_pl_file`: the path of the data file where each line is a programming language function;
- `tokenized_pl_file`: the path of the tokenized version of `source_pl_file`;
- `corrupt_pl_file`: the corrupted version of `tokenized_pl_file`, which serves as the model's input during pretraining;
- `source_nl_file`: the path of the data file where each line is a natural language sequence;
- `tokenized_nl_file`: the path of the tokenized version of `source_nl_file`;
- `corrupt_nl_file`: the corrupted version of `tokenized_nl_file`, which serves as the model's input during pretraining.

```
cd python/
./run.sh corrupt_pretrain_data
```
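For intuition, the corruption step turns a clean tokenized line into a noised input that the model learns to repair. The sketch below shows one simple form of corruption (T5-style span masking); it is only an illustration, not the exact procedure implemented in `corrupt_pretrain_data()`.

```python
# Illustrative corruption of a tokenized line: replace a random span with a
# sentinel token. This is not the repo's exact corruption procedure.
import random

def corrupt_line(tokens, mask_ratio=0.15, rng=random.Random(0)):
    """Replace a random contiguous span (~mask_ratio of the tokens) with a sentinel."""
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))
    start = rng.randrange(0, max(1, n - span_len))
    return tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]

line = "public void setName ( String name ) { this . name = name ; }"
print(" ".join(corrupt_line(line.split())))
```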
### Pretrain CoditT5

Requires the pretraining dataset at `data/CoditT5/pretrain/`.
```
cd python/
./run.sh pretrain_CoditT5
```

## Code for Processing Fine-tuning Data
[sec-process]: #code-for-processing-fine-tuning-data
We provide a sample script to process the downstream datasets for CoditT5. Requires the raw data files at `raw_data/`.
```
cd python/
./run.sh process_coditT5_dataset --dataset ${dataset}
# Example: ./run.sh process_coditT5_dataset --dataset comment-update
```

Here, `${dataset}` is the name of the dataset (`comment-update`, `code-review`, `bf-small`, `bf-medium`). The data files are generated to `data/CoditT5/${dataset}/`.
Notes:
- CoditT5's input data file names end with `.buggy`; CoditT5's target output (edit plan + generation) file names end with `.fixed`; the target generation file names end with `.seq`.
- CoditT5's input takes the form `source_sequence context_sequence`, and its output takes the form `edit_plan target_sequence` (see the sketch after these notes).
- Raw data files are stored in `raw_data/` (we provide some examples for demo); processed data files are generated to `data/CoditT5/${dataset}/`.
- Note that for the comment-update dataset, the processed `edit_plan` consists of the edits applied to the comment without the parameter tags (@return, @param).
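To get a feel for the processed format, you can peek at the first example of each generated file. The `test.*` file names below are an assumption based on the suffix convention above; adjust them to match the actual split names produced by the script.

```python
# Peek at the first example of each processed file for one dataset split.
# The split prefix `test` is an assumption; adjust to the generated file names.
from pathlib import Path

data_dir = Path("data/CoditT5/comment-update")
for suffix in ("buggy", "fixed", "seq"):
    path = data_dir / f"test.{suffix}"
    if not path.exists():
        print(f"{path} not found")
        continue
    with path.open() as f:
        first = f.readline().rstrip("\n")
    print(f"{path.name}: {first[:120]}")
```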
## Code for Training and Evaluating Models
[sec-traineval]: #code-for-training-and-evaluating-models
### Train ML models
Requires the dataset at `data/${model}/${dataset}/`, where `${model}` is the name of the model (CodeT5, CoditT5); `${dataset}` is the name of the dataset.
```
cd python/
./run.sh ${model}_train ${dataset}
# Example: ./run.sh CoditT5_train comment-update
```

Results are generated to `models/${model}/${dataset}/`, where:
- `model/`: stores the trained model.
- `logs/`: stores logs during training.
### Evaluate ML models
Requires the dataset at `data/${model}/${dataset}/` and the trained model at `models/${model}/${dataset}/model/`.
```
cd python/
./run.sh ${model}_generate ${dataset}
# Example: ./run.sh CoditT5_generate comment-update
```

Results are generated to `models/${model}/${dataset}/`, where:
- `output.hyp`: the predictions.
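Before running the full metric script, a quick sanity check is to compare the predictions line by line against the reference generation file. The paths and the use of plain exact match below are assumptions for illustration; the provided `eval` target computes the metrics reported in the paper.

```python
# Quick line-level exact-match check between predictions and references.
# The reference path (test.seq) is an assumption; adjust to your split names.
from pathlib import Path

preds = Path("models/CoditT5/comment-update/output.hyp").read_text().splitlines()
refs = Path("data/CoditT5/comment-update/test.seq").read_text().splitlines()
matches = sum(p.strip() == r.strip() for p, r in zip(preds, refs))
print(f"exact match: {matches}/{len(refs)} ({matches / max(len(refs), 1):.2%})")
```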
### Compute automatic metrics
Requires the model's predictions at `models/${model}/${dataset}/`. Note that the provided script assumes the data file names conform to what is described in [Code for Processing Fine-tuning Data][sec-process].
```
./run.sh ${model}_eval ${dataset}
# Example: ./run.sh CoditT5_eval comment-update
```

Results are generated to `results/`:
- `results-${dataset}-${model}.json`: the average of automatic metrics.
- `scores-${dataset}-${model}.json`: the list of automatic metrics per sample.
## Code for Combining CodeT5 and CoditT5
[sec-rerank]: #code-for-combining-codet5-and-coditt5
Requires the datasets at `data/${model}/${dataset}/` and the trained models at `models/${model}/${dataset}/model/`.
### Rerank Models' outputs
```
cd python/
# Rerank CodeT5's outputs with CoditT5
./run.sh CodeT5_rerank ${dataset}
# Rerank CoditT5's outputs with CodeT5
./run.sh CoditT5_rerank ${dataset}
# Example: ./run.sh CoditT5_rerank comment-update
```

Main results are generated to `results/reranks/`:
- `test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`: `${model}`'s top 20 beam outputs and `${reranker}`'s likelihood score for each beam output.
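The `eval_rerank_*` targets consume this file directly, but the underlying idea is simply to combine the generator's and the reranker's likelihoods for each beam candidate and keep the best one. The sketch below uses made-up field names; inspect the generated JSON for the actual schema.

```python
# Toy reranking: pick the beam candidate with the highest combined score.
# Field names are hypothetical; check the generated results JSON for the real keys.
def rerank(candidates, alpha=1.0):
    """candidates: list of dicts with 'output', 'gen_score', 'rerank_score' (log-likelihoods)."""
    return max(candidates, key=lambda c: c["gen_score"] + alpha * c["rerank_score"])

beams = [
    {"output": "fix A", "gen_score": -1.2, "rerank_score": -3.0},
    {"output": "fix B", "gen_score": -1.5, "rerank_score": -0.8},
]
print(rerank(beams)["output"])  # prints "fix B"
```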
### Compute automatic metrics
Requires the model's reranking results file
`results/reranks/test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`.

```
./run.sh eval_rerank_${model}_${reranker} ${dataset}
# Example: compute metrics for the top 1 CoditT5 prediction reranked by CodeT5
./run.sh eval_rerank_CoditT5_CodeT5 comment-update
```

Results are generated to `results/`:
- `results-${dataset}-${model}-rerank-${reranker}.json`: the average of automatic metrics.
- `scores-${dataset}-${model}-rerank-${reranker}.json`: the list of automatic metrics per sample.