Multilingual Code Co-Evolution Using Large Language Models
- Host: GitHub
- URL: https://github.com/engineeringsoftware/codeditor
- Owner: EngineeringSoftware
- License: mit
- Created: 2023-08-31T17:32:19.000Z (over 1 year ago)
- Default Branch: public
- Last Pushed: 2024-05-27T20:07:40.000Z (7 months ago)
- Last Synced: 2024-05-28T05:47:55.422Z (7 months ago)
- Topics: co-evolution, code, evolution, large-language-models, llm
- Language: Python
- Homepage:
- Size: 53.7 KB
- Stars: 11
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Multilingual Code Co-Evolution Using Large Language Models
This repo hosts the code and data for the following ESEC/FSE 2023 paper:
Title: [Multilingual Code Co-Evolution Using Large Language Models](https://arxiv.org/abs/2307.14991)
Authors: [Jiyang Zhang](https://jiyangzhang.github.io/), [Pengyu Nie](https://pengyunie.github.io/), [Junyi Jessy Li](https://jessyli.com/), [Milos Gligoric](http://users.ece.utexas.edu/~gligoric/)
```bibtex
@inproceedings{ZhangETAL23Codeditor,
author = {Zhang, Jiyang and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
title = {Multilingual Code Co-Evolution Using Large Language Models},
booktitle = {Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
year = {2023},
}
```

## News
May 2024

The fine-tuned EditsTranslation models are released on 🤗! 🔥 [cs2java](https://huggingface.co/EngineeringSoftware/EditsTranslation-cs2java) and [java2cs](https://huggingface.co/EngineeringSoftware/EditsTranlation-java2cs)

## How to Use
[sec-howto]: #how-to-use
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "EngineeringSoftware/EditsTranlation-java2cs"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

code_input = """class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!")"""
input_ids = tokenizer(code_input, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=200)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# output: ; } } ; class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!") ; } } ;
```

## Introduction
This repo contains the code and artifacts for reproducing the experiments in [Multilingual Code Co-Evolution Using Large Language Models](https://arxiv.org/abs/2307.14991).
In this work, we introduce Codeditor for co-evolving software implemented in multiple programming languages.

The code includes:
- scripts for processing dataset
- scripts for training and evaluating codeditor models

The artifacts include:
- Java to C# raw paired changes
- Java to C# translation dataset processed for codeditor models

## Data Downloads
[sec-downloads]: #data-downloads
All our data is hosted on UTBox via [a shared folder](https://utexas.box.com/s/iwcvwgx23g9xvowu9joa661rz74k9eea).
## Code for Processing Fine-tuning Data
[sec-process]: #code-for-processing-fine-tuning-data
We provide a sample script to process the datasets for edit translation; it requires the raw data files to be at `raw_data/`.
```
cd python/
python -m deltr.collector.DataProcessor edit_translation_data_process --exp cs2java --src_lang cs --tgt_lang java
```
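Edit translation represents the target as a sequence of edits to apply rather than as raw code. As an illustrative sketch only (the markers `<REPLACE_OLD>`, `<REPLACE_NEW>`, and `<END>` are hypothetical and the repo's actual dataset format may differ), a word-level edit sequence between two versions of a snippet can be computed with Python's `difflib`:

```python
import difflib

def edit_sequence(old: str, new: str) -> str:
    """Encode `new` relative to `old` as keep/replace edits (illustrative only)."""
    old_toks, new_toks = old.split(), new.split()
    sm = difflib.SequenceMatcher(a=old_toks, b=new_toks)
    parts = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            # unchanged tokens are copied through
            parts.extend(old_toks[i1:i2])
        else:
            # hypothetical markers; deletions/insertions have an empty side
            parts.append("<REPLACE_OLD> " + " ".join(old_toks[i1:i2])
                         + " <REPLACE_NEW> " + " ".join(new_toks[j1:j2]) + " <END>")
    return " ".join(parts)

print(edit_sequence("int add ( int a , int b )", "long add ( long a , long b )"))
```

A model trained on such sequences only has to generate the changed spans plus their context, rather than re-emitting the whole target method.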
## Code for Training and Evaluating Models
[sec-traineval]: #code-for-training-and-evaluating-models
### Train ML models
```
cd python/
python -m deltr.coditT5.CodeT5 fit --exp_dir {MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml

# Example: python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
```

Results are written to `models/${model}/${dataset}/`, where:
- `model/`: stores the trained model.
- `logs/`: stores logs during training.
### Run ML models to do inference
Requires the dataset at `data/${model}/${dataset}/` and the trained model at `models/${model}/${dataset}/model/`.
```
cd python/
python -m deltr.coditT5.CodeT5 predict --exp_dir {MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
```
Results are written to `models/${model}/${dataset}/`, where:
- `output.hyp`: the predictions.
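To score the predictions in `output.hyp`, a minimal sketch of an exact-match metric, assuming one prediction per line and a line-aligned reference file (the filename `output.ref` and the paths in the usage comment are assumptions, not from this repo):

```python
def exact_match_accuracy(hyp_lines, ref_lines):
    """Fraction of predictions that equal the reference after whitespace normalization."""
    assert len(hyp_lines) == len(ref_lines), "files must align line-by-line"
    norm = lambda s: " ".join(s.split())  # collapse runs of whitespace
    hits = sum(norm(h) == norm(r) for h, r in zip(hyp_lines, ref_lines))
    return hits / len(hyp_lines) if hyp_lines else 0.0

# Usage (paths are assumptions):
# with open("models/edit-translation/java2cs/output.hyp") as f_hyp, \
#      open("models/edit-translation/java2cs/output.ref") as f_ref:
#     print(f"xMatch: {exact_match_accuracy(f_hyp.readlines(), f_ref.readlines()):.2%}")
```

Whitespace normalization keeps the metric insensitive to tokenizer spacing differences between the hypothesis and reference files.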