[EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code
https://github.com/terryyz/ice-score
- Host: GitHub
- URL: https://github.com/terryyz/ice-score
- Owner: terryyz
- License: mit
- Created: 2023-04-26T12:21:48.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-16T10:34:36.000Z (5 months ago)
- Last Synced: 2024-10-15T21:09:43.293Z (25 days ago)
- Topics: automatic-evaluation, code-generation, code-quality, evaluation, gpt-4, large-language-models, llm
- Language: Python
- Homepage: https://arxiv.org/abs/2304.14317
- Size: 21.8 MB
- Stars: 68
- Watchers: 2
- Forks: 8
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# ICE-Score: Instructing Large Language Models to Evaluate Code
_**January 2024**_ - ICE-Score has been accepted to EACL 2024 🎉🎉🎉
---
* [Example](#example)
* [Environment Setup](#environment-setup)
* [Folder Description](#folder-description)
* [Usage](#usage)
* [Citation](#citation)
* [Acknowledgement](#acknowledgement)

## Example
![](./assets/ice-score.png "Example")

## Environment Setup
Our experiments are mainly built on the [codegen-metrics](https://github.com/JetBrains-Research/codegen-metrics) and [code-bert-score](https://github.com/neulab/code-bert-score) repositories. To replicate all experiments, please follow their instructions to set up the environment.

To run `compute_results.ipynb` and the modules in the `llm_code_eval` folder, install all dependencies with the following command:
```bash
pip install -r requirements.txt
```

## Folder Description
- `data/` contains all processed data used in the paper.
  - `data/conala/` contains the CoNaLa dataset with all automatic evaluation results.
  - `data/humaneval/` contains the HumanEval dataset with all automatic evaluation results.
    - `data/humaneval/humaneval_java_grade.json`: Java split
    - `data/humaneval/humaneval_cpp_grade.json`: C++ split
    - `data/humaneval/humaneval_python_grade.json`: Python split
    - `data/humaneval/humaneval_js_grade.json`: JavaScript split
- `experiment_source/` contains the scripts used to collect all automatic evaluation results. They require machine-specific modifications before they can be run. Note that any of these scripts that uses `metrics_evaluation.metrics` needs the implementations in the `metrics_evaluation` folder from [codegen-metrics](https://github.com/JetBrains-Research/codegen-metrics).
- `llm_code_eval/` contains the implementation of a minimum viable product (MVP) of this project, which you can use to evaluate any generated code snippet. Please refer to the [Usage](#usage) section below for more details.
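For orientation, here is a minimal sketch (an illustration, not part of the repository) of loading one of the processed HumanEval result files listed above. The field layout of these JSON files is not documented in this README, so the snippet only reads a file and prints its top-level structure:

```python
import json

# Load one of the processed HumanEval result files from data/humaneval/.
# The exact schema is not documented in this README, so we only inspect
# the top-level structure of whatever the file contains.
with open("data/humaneval/humaneval_python_grade.json") as f:
    records = json.load(f)

print(type(records))
if isinstance(records, list):
    print(records[0])          # peek at the first record
elif isinstance(records, dict):
    first_key = next(iter(records))
    print(first_key, records[first_key])
```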
## Usage
We implement a minimum viable product (MVP) for this project. To install the project, please use the following command:
```bash
pip install -e .
```
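The examples below call OpenAI-hosted models such as `gpt-3.5-turbo`, so an API key presumably has to be available to the process before `evaluate` is invoked. This README does not state the exact mechanism; the sketch below assumes the common `OPENAI_API_KEY` environment-variable convention and should be adjusted to whatever the repository actually expects:

```python
import os

# Assumption: the evaluator picks up the standard OPENAI_API_KEY environment
# variable (not confirmed by this README). Replace the placeholder with a real
# key, or export the variable in your shell instead.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")
```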
You can use it to evaluate any generated code snippet with the inputs `problem`, `output`, `task`, `aspect`, and `model`, as in the following example:
```python
from llm_code_eval import evaluate

score = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                 output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)
```

If you want to evaluate with reference code, you can use the `reference` option, as in the following example:
```python
from llm_code_eval import evaluate

score = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                 output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 reference="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)
```

You can also use the `cot=True` option to enable zero-shot chain-of-thought evaluation, as in the following example:
```python
from llm_code_eval import evaluate

score, eval_step = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                            output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                            task="code-gen", aspect="usefulness", model="gpt-3.5-turbo", cot=True)

print(score)
print(eval_step)
```
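
Since `evaluate` returns a score, one natural use is ranking several candidate snippets for the same problem. The sketch below is only an illustration built from the arguments shown above; it assumes (which this README does not state explicitly) that the returned score is numeric and directly comparable:

```python
from llm_code_eval import evaluate

problem = "Given a list of integers, return the sum of all the integers."
candidates = [
    "sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
    "return sum(numbers)",
]

# Score every candidate on the same aspect and keep the highest-scoring one.
# Assumption: scores are numeric, so they can be compared with max().
scores = [
    evaluate(problem=problem, output=candidate,
             task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")
    for candidate in candidates
]

best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"Best candidate (score {scores[best]}):\n{candidates[best]}")
```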
## Citation

```
@inproceedings{zhuo2024ice,
title={ICE-Score: Instructing Large Language Models to Evaluate Code},
author={Zhuo, Terry Yue},
booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
pages={2232--2242},
year={2024}
}
```

## Acknowledgement
We thank [JetBrains Research](https://research.jetbrains.org/) and [NeuLab](http://www.cs.cmu.edu/~neulab/) for their open-source code and data.