# Prediction-Powered Ranking
This repository contains the code for the paper [Prediction-Powered Ranking of Large Language Models](https://arxiv.org/abs/2402.17826), published at NeurIPS 2024, by Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi and Manuel Gomez-Rodriguez.
__Contents__:
- [Dependencies](#dependencies)
- [Usage](#usage)
- [Repository structure](#repository-structure)
- [Contact & attribution](#contact--attribution)

## Dependencies
All the code is written in Python 3.11.2.

To create a virtual environment and install the project dependencies, run the following commands:
```commandline
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

In addition, to run the notebooks and produce the figures, install the extra requirements from [notebooks/requirements.txt](notebooks/requirements.txt).
## Usage
To reproduce the main experiments of the paper (Section 5 and Appendix D), run:
```
./scripts/llm-ranking.sh
```

To reproduce the synthetic experiments in Appendix E of the paper, run:
```
./scripts/synthetic.sh
```

To create the figures, run the notebooks in [notebooks](notebooks/).
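The scripts above combine a small set of human pairwise comparisons with a much larger set of comparisons judged by strong LLMs. The sketch below is a minimal illustration of the idea, not the repository's implementation: the function names, the normal-approximation intervals, and the exact rank-set construction are assumptions. The paper estimates all models' scores jointly with a covariance matrix $\widehat{\Sigma}$; here each score is treated independently for simplicity.

```python
import numpy as np

def ppi_mean_ci(y_human, f_human, f_llm, z=1.96):
    """Prediction-powered estimate of a mean with a ~95% CI.

    y_human: human labels on the small labeled set
    f_human: LLM-judge labels on the same labeled examples
    f_llm:   LLM-judge labels on the large unlabeled set
    """
    n, N = len(y_human), len(f_llm)
    rect = y_human - f_human              # "rectifier": human minus LLM judgment
    theta = f_llm.mean() + rect.mean()    # LLM average, debiased by the rectifier
    se = np.sqrt(f_llm.var(ddof=1) / N + rect.var(ddof=1) / n)
    return theta, theta - z * se, theta + z * se

def rank_sets(intervals):
    """Ranks consistent with a list of (lo, hi) confidence intervals:
    model i is surely below any model whose interval lies entirely above
    its own, and surely above any whose interval lies entirely below."""
    k = len(intervals)
    sets = []
    for lo_i, hi_i in intervals:
        above = sum(1 for lo_j, hi_j in intervals if lo_j > hi_i)  # surely better
        below = sum(1 for lo_j, hi_j in intervals if hi_j < lo_i)  # surely worse
        sets.append(set(range(1 + above, k - below + 1)))
    return sets

# Toy example: a clearly separated interval gets a singleton rank-set,
# while overlapping intervals share a set of plausible ranks.
print(rank_sets([(0.80, 0.90), (0.55, 0.70), (0.60, 0.75)]))
# → [{1}, {2, 3}, {2, 3}]
```

Disjoint intervals yield singleton rank-sets (a total order); overlap widens the sets, which is how the method expresses uncertainty about the ranking rather than forcing a single ordering.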
## Repository structure
```
├── data
│   ├── human.json
│   ├── gpt-4-0125-preview.json
│   ├── claude-3-opus-20240229.json
│   └── gpt-3.5-turbo.json
├── figures
├── notebooks
├── outputs
├── scripts
│   ├── llm-ranking.sh
│   ├── synthetic.sh
│   └── config.json
└── src
    ├── data_process.py
    ├── estimate.py
    ├── llm-ranking.py
    ├── plot_utils.py
    ├── ranksets.py
    ├── run_experiments.py
    └── synthetic.py
```
The folder [data](data/) contains the datasets used in our experiments:
- [human.json](data/human.json): pairwise comparisons by humans.
- [gpt-4-0125-preview.json](data/gpt-4-0125-preview.json): pairwise comparisons by GPT 4.
- [claude-3-opus-20240229.json](data/claude-3-opus-20240229.json): pairwise comparisons by Claude 3.
- [gpt-3.5-turbo.json](data/gpt-3.5-turbo.json): pairwise comparisons by GPT 3.5.

The folder [figures](figures/) contains all the figures presented in the paper.
The folder [notebooks](notebooks/) contains Python notebooks that generate all the figures included in the paper.
The folder [outputs](outputs/) contains the output files produced by the experiments' scripts.
The folder [scripts](scripts/) contains bash scripts used to run all the experiments presented in the paper:
- [llm-ranking.sh](scripts/llm-ranking.sh): runs the main experiments of the paper, in Section 5 and Appendix D.
- [synthetic.sh](scripts/synthetic.sh): runs the synthetic experiments in Appendix E of the paper.
- [config.json](scripts/config.json): configuration file with parameters for the experiments run by [llm-ranking.sh](scripts/llm-ranking.sh).

The folder [src](src/) contains all the code necessary to reproduce the results in the paper. Specifically:
- [data_process.py](src/data_process.py): loads and subsamples the datasets.
- [estimate.py](src/estimate.py): implements Algorithms 1, 3 and 4 from the paper to compute $\hat{\theta}$ and $\widehat{\Sigma}$.
- [llm-ranking.py](src/llm-ranking.py): reads config file and runs the experiments in Section 5 and Appendix D of the paper.
- [plot_utils.py](src/plot_utils.py): contains auxiliary functions for plotting.
- [ranksets.py](src/ranksets.py): implements Algorithm 2 from the paper to construct rank-sets.
- [run_experiments.py](src/run_experiments.py): runs experiments for all input parameters.
- [synthetic.py](src/synthetic.py): generates synthetic data and runs the synthetic experiments in Appendix E of the paper.

## Contact & attribution
If you have questions about the code, find a potential bug, or would like us to add functionality, feel free to open an issue or contact [Ivi Chatzi](mailto:[email protected]).
If you use parts of the code in this repository for your own research purposes, please consider citing:
```
@inproceedings{chatzi2024prediction,
  title     = {Prediction-Powered Ranking of Large Language Models},
  author    = {Ivi Chatzi and Eleni Straitouri and Suhas Thejaswi and Manuel Gomez Rodriguez},
  booktitle = {Advances in Neural Information Processing Systems},
  publisher = {Curran Associates, Inc.},
  year      = {2024}
}
```