# H2O Large Language Model (LLM) Evaluation

In an era where Large Language Models (LLMs) are rapidly gaining traction for diverse applications, the need for comprehensive evaluation and comparison of these models has never been more critical.
This repository is an effort in that direction, providing an evaluation method and a toolkit for assessing Large Language Models.

Please read the [Blog Post](https://h2o.ai/blog/h2o-llm-evalgpt-a-comprehensive-tool-for-evaluating-large-language-models/) for more context.

- [EvalGPT.ai](#evalgptai)
- [Elo Leaderboard](#elo-leaderboard)
- [Prompts](#prompts)
- [Responses](#responses)
- [A/B Tests](#ab-tests)
- [Docker Compose Setup](#docker-compose-setup)
- [Local Setup](#local-setup)
- [Reproducing Leaderboard](#reproducing-leaderboard-results)
- [Roadmap](#roadmap)

## EvalGPT.ai

[evalgpt.ai](https://evalgpt.ai/) hosts a leaderboard of top LLMs ranked by their Elo scores. The leaderboard is updated frequently and aims to provide a comprehensive and fair assessment of Large Language Models. The main features of the website are described below.

### Elo Leaderboard

The Elo Leaderboard provides a ranking of the top LLMs based on their Elo scores. The Elo scores are computed from the results of A/B tests, wherein the LLMs are pitted against each other in a series of games. The ranking system employed is based on the [Elo Rating System](https://en.wikipedia.org/wiki/Elo_rating_system). The procedure for Elo score computation closely follows the methodology outlined at [this resource](https://lmsys.org/blog/2023-05-25-leaderboard/).

![Elo Leaderboard](docs/images/leaderboard.png)
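
For intuition, the per-game Elo update can be sketched in a few lines of Python. This is a generic illustration of the rating system, not the exact computation behind the leaderboard; the starting rating and K-factor below are assumptions, and the leaderboard's parameters follow the lmsys methodology linked above.

```python
# Illustrative Elo update for a single A/B "game" between two models.
# The K-factor and starting rating are assumptions for this sketch.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: both models start at 1000 and model A wins one game.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```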

### Prompts

The Prompts tab lists the 60 prompts used to evaluate the LLMs. The prompts are grouped into categories based on the type of task they are designed for.

![Prompts](./docs/images/testset.png)

### Responses

In the Responses section, you can see the responses generated by the LLMs for the prompts. You can also select the LLMs and prompts to compare the responses.

![Responses](./docs/images/responses.png)

Click on the "Select Models" button to select the LLMs to compare. You can also select a different prompt using the "Previous" and "Next" buttons.

![select models and prompts](./docs/images/evalgpt_responses_toolbar.png)

For any two selected models and a prompt, you can see the evaluation by GPT-4 by clicking the "Show GPT Eval" button at the top right.

![show gpt eval button](./docs/images/evalgpt_gpt_eval_button.png)

![show eval gpt](./docs/images/gpt_eval.png)
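
The GPT-4 evaluation shown above is an instance of pairwise LLM-as-judge comparison. A minimal sketch of that pattern is below; it assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment, and the judge prompt, model name, and output format are illustrative assumptions rather than the project's actual evaluation prompt.

```python
# Rough sketch of a GPT-4 pairwise judgement (LLM-as-judge).
# The judge prompt and scoring format are assumptions for illustration;
# the project's actual evaluation prompt may differ.
from openai import OpenAI  # assumes the `openai` package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_pair(prompt: str, response_a: str, response_b: str) -> str:
    judge_prompt = (
        "You are comparing two assistant responses to the same prompt.\n"
        f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer with 'A', 'B', or 'tie', "
        "followed by a short justification."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```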

### A/B Tests

"Which is Better: A or B?" provides the interface to perform human evaluation of the LLMs. Each A/B test consists of a prompt and two responses generated by two different LLMs. The user is asked to select the better response among the two.

![A/B Tests](./docs/images/abtests.png)
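
Conceptually, each completed A/B test boils down to a small record that can later feed the Elo computation. A minimal sketch follows; the field names are assumptions for illustration, not the project's database schema.

```python
# Minimal sketch of a completed A/B test; field names are illustrative
# assumptions, not the project's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ABTestResult:
    prompt_id: int
    model_a: str
    model_b: str
    winner: Optional[str]  # "A", "B", or None for a tie

    def score_for_a(self) -> float:
        """Score fed into an Elo update (see the sketch above)."""
        if self.winner == "A":
            return 1.0
        if self.winner == "B":
            return 0.0
        return 0.5
```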

## Docker Compose Setup

### 1. Clone the repository

```bash
git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval
```

### 2. Run Docker Compose

```bash
docker compose up -d
```

Navigate to http://localhost:10101/ in your browser

## Local Setup

### 1. Clone the repository

```bash
git clone https://github.com/h2oai/h2o-LLM-eval.git
```

### 2. Setup Database

#### a. Create a docker volume for the database

```bash
docker volume create llm-eval-db-data
```

#### b. Start PostgreSQL 14 in docker

```bash
docker run -d --name=llm-eval-db -p 5432:5432 -v llm-eval-db-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=pgpassword postgres:14
```

#### c. Install PostgreSQL client

- On Ubuntu:

```bash
sudo apt update
sudo apt install postgresql-client
```

- On macOS:

```bash
brew install libpq
echo 'export PATH="/usr/local/opt/libpq/bin:$PATH"' >> ~/.zshrc
```

(On Apple Silicon Macs, Homebrew installs to `/opt/homebrew`, so use `/opt/homebrew/opt/libpq/bin` instead.)

#### d. Load the latest data dump into the database

```bash
PGPASSWORD=pgpassword psql --host=localhost --port=5432 --username=postgres < data/10_init.sql
```
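
As a quick sanity check that the dump loaded, you can list the tables it created. The snippet below is a sketch: it assumes `psycopg2` is available and that the dump creates the `llm_eval_db` database and `maker` role referenced by the run command in step 4; adjust the connection details if yours differ.

```python
# Quick sanity check that the data dump loaded. Assumes psycopg2 is
# installed and that the dump created the llm_eval_db database and the
# maker role used by the run command in step 4.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="llm_eval_db",
    user="maker",
    password="makerpassword",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public'"
    )
    for (table,) in cur.fetchall():
        print(table)
conn.close()
```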

### 3. Setup the environment

The setup has been tested with Python 3.10.

```bash
python -m venv .venv
```

```bash
. .venv/bin/activate
```

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

### 4. Run the App

```bash
POSTGRES_HOST=localhost POSTGRES_USER=maker POSTGRES_PASSWORD=makerpassword POSTGRES_DB=llm_eval_db H2O_WAVE_NO_LOG=true wave run llm_eval/app.py
```

Navigate to http://localhost:10101/ in your browser

## Reproducing Leaderboard Results

We provide [notebooks](notebooks) to generate leaderboard results and reproduce [evalgpt.ai](https://evalgpt.ai).

1. Run [run_all_evaluations.ipynb](notebooks/run_all_evaluations.ipynb) to evaluate any A/B tests that have not yet been evaluated by the chosen evaluation model and insert the outcomes into the database. An A/B test is considered unevaluated by a given model if no evaluation by that model exists for the given combination of models and prompt. After adding a new model, running this notebook evaluates all A/B tests that pit it against every other model.

2. Run all cells in [calculate_elo_rating_public_leaderboard.ipynb](notebooks/calculate_elo_rating_public_leaderboard.ipynb) to get the Elo leaderboard and relevant charts given the evaluations in the database.
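
For orientation, the overall aggregation can be sketched as sequential Elo updates over the recorded games, averaged over shuffled game orders to reduce order sensitivity. This is an illustration only; the notebook's exact procedure follows the lmsys methodology referenced earlier and may differ in details such as the K-factor and bootstrapping.

```python
# Illustrative leaderboard computation: sequential Elo updates over the
# recorded games, averaged over shuffled orderings. Parameters here are
# assumptions; the notebook's actual method may differ.
import random
from collections import defaultdict

def elo_from_games(games, k=32.0, base=1000.0):
    """games: list of (model_a, model_b, score_a) with score_a in {1.0, 0.5, 0.0}."""
    ratings = defaultdict(lambda: base)
    for a, b, score_a in games:
        e_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        ratings[a] += k * (score_a - e_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings

def leaderboard(games, rounds=100, seed=0):
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(rounds):
        shuffled = games[:]
        rng.shuffle(shuffled)
        for model, rating in elo_from_games(shuffled).items():
            totals[model] += rating
    return sorted(((m, t / rounds) for m, t in totals.items()), key=lambda x: -x[1])

print(leaderboard([("model-x", "model-y", 1.0), ("model-y", "model-x", 0.5)]))
```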

## Roadmap

### Models

1. Add [FreeWilly2](https://stability.ai/blog/freewilly-large-instruction-fine-tuned-models) to the Leaderboard

### Application

1. v2 architecture
2. Option for users to submit new models

### Eval

1. More prompts in each category
2. Document Q/A and Retrieval Category with ground truth
3. Document Summarization Category