https://github.com/illuin-tech/grouse
Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models
- Host: GitHub
- URL: https://github.com/illuin-tech/grouse
- Owner: illuin-tech
- License: mit
- Created: 2024-07-05T09:10:08.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-30T09:37:21.000Z (over 1 year ago)
- Last Synced: 2025-04-12T13:17:47.395Z (about 1 year ago)
- Language: Python
- Size: 1.26 MB
- Stars: 12
- Watchers: 6
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
- Citation: CITATION.cff
README
# GroUSE
[Paper](https://arxiv.org/abs/2409.06595)
[Dataset](https://huggingface.co/datasets/illuin/grouse)
[Space](https://huggingface.co/spaces/illuin/grouse)
[Tutorial](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_grouse.ipynb)
---
Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in [GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering](https://arxiv.org/abs/2409.06595).
- [Install](#install)
- [Command Line Usage](#command-line-usage)
  - [Evaluation of the Grounded Question Answering task](#evaluation-of-the-grounded-question-answering-task)
  - [Unit Testing of Evaluators with GroUSE](#unit-testing-of-evaluators-with-grouse)
  - [Plot Matrices of unit tests success](#plot-matrices-of-unit-tests-success)
- [Python Usage](#python-usage)
- [Links](#links)
- [Citation](#citation)
## Install
```bash
pip install grouse
```
Then set up your OpenAI credentials: copy the `.env.dist` file to `.env`, fill in your OpenAI API key and organization ID, and export the environment variables with `export $(cat .env | xargs)`.
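If you run GroUSE from a Python session or a notebook rather than a shell, the same variables can be loaded into the process environment from Python. The helper below is a minimal sketch, not part of GroUSE: `load_env_file` is a hypothetical name, and it assumes the `.env` file contains plain `KEY=VALUE` lines (the actual variable names are whatever `.env.dist` defines).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from `path` into os.environ and return them as a dict."""
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a KEY=VALUE line
        key = key.strip()
        # Drop optional surrounding quotes around the value
        value = value.strip().strip('"').strip("'")
        os.environ[key] = value
        loaded[key] = value
    return loaded
```

Calling `load_env_file(".env")` before creating an evaluator has roughly the same effect as the `export` command above, but only for the current Python process.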
## Command Line Usage
### Evaluation of the Grounded Question Answering task
Build a dataset as a `jsonl` file with one JSON object per line, each containing the following fields (the comments are for illustration only):
```json
{
"references": ["", ...], // List of references
"input": "", // Query
"actual_output": "", // Predicted answer generated by the model we want to evaluate
"expected_output": "" // Ground truth answer to the input
}
```
See `example_data/grounded_qa.jsonl` for a complete example.
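The per-line format above can be produced from Python with the standard `json` module. This is a minimal sketch under stated assumptions: `write_grounded_qa_jsonl` and `REQUIRED_KEYS` are hypothetical names, and treating all four keys as required is a choice made here, not something GroUSE is known to enforce.

```python
import json
from pathlib import Path

# The four keys listed in the dataset format
REQUIRED_KEYS = ("references", "input", "actual_output", "expected_output")


def write_grounded_qa_jsonl(rows, path):
    """Write one JSON object per line and return the number of rows written."""
    lines = []
    for index, row in enumerate(rows):
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            raise ValueError(f"row {index} is missing keys: {missing}")
        # ensure_ascii=False keeps accented characters readable in the file
        lines.append(json.dumps(row, ensure_ascii=False))
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

For instance, `write_grounded_qa_jsonl(rows, "my_dataset.jsonl")` with a list of dictionaries writes a file that can be passed to the evaluation command; the file name is only an example.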
Then, run this command:
```bash
grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o
```
We recommend using GPT-4 as the evaluator model because the prompts were optimised for it, but you can change the model and prompts with the optional arguments:
- `--evaluator_model_name`: Name of the evaluator model. It can be any LiteLLM model. The default model is GPT-4.
- `--prompts_path`: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
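When the evaluation is launched from a script, the command and its optional arguments can be assembled as an argument list. The sketch below uses only the subcommand and flags documented above; `build_evaluate_command` is a hypothetical helper, not part of the GroUSE package.

```python
def build_evaluate_command(dataset_path, output_dir,
                           evaluator_model_name=None, prompts_path=None):
    """Return the `grouse evaluate` command as a list suitable for subprocess.run."""
    command = ["grouse", "evaluate", str(dataset_path), str(output_dir)]
    # Optional flags are only added when a value is given,
    # so the CLI defaults apply otherwise.
    if evaluator_model_name is not None:
        command += ["--evaluator_model_name", evaluator_model_name]
    if prompts_path is not None:
        command += ["--prompts_path", str(prompts_path)]
    return command
```

The returned list can then be run with `subprocess.run(command, check=True)` once `grouse` is installed and the credentials are exported.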
### Unit Testing of Evaluators with GroUSE
Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.
```bash
grouse meta-evaluate gpt-4o meta-outputs/gpt-4o
```
Optional arguments:
- `--prompts_path`: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
- `--train_set`: Optional flag to meta-evaluate on the train set (16 tests) instead of the test set (144 tests). The train set is meant to be used during the prompt engineering phase.
### Plot Matrices of unit tests success
You can plot the results of unit tests in the shape of matrices:
```bash
grouse plot meta-outputs/gpt-4o
```
The resulting matrices summarise the success of the unit tests.
## Python Usage
```python
from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.[1]",
    expected_output="The capital of France is Paris.[1]",
    references=["Paris is the capital of France."],
)

evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
```
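To evaluate a whole `jsonl` dataset from Python, each line can be parsed and turned into an `EvaluationSample`. This is a minimal sketch: `read_jsonl` and `evaluate_file` are hypothetical helpers, and it assumes every row contains exactly the four keys used as keyword arguments in the example above.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the JSON objects stored one per line in `path`, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def evaluate_file(path):
    """Evaluate every sample of a JSONL file with the default evaluator."""
    # Imported here so read_jsonl stays usable without grouse installed.
    from grouse import EvaluationSample, GroundedQAEvaluator

    # Assumption: each row holds only input, actual_output,
    # expected_output and references.
    samples = [EvaluationSample(**row) for row in read_jsonl(path)]
    return GroundedQAEvaluator().evaluate(samples)
```

`evaluate_file` simply returns whatever `evaluate` returns; it calls the evaluator model, so it needs the OpenAI credentials described in the Install section.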
### Tutorial
This [tutorial](https://github.com/NirDiamant/RAG_Techniques/blob/main/evaluation/evaluation_grouse.ipynb) walks through some examples to get you started.
## Links
- [Paper](https://arxiv.org/abs/2409.06595)
- [Unit Tests](https://huggingface.co/datasets/illuin/grouse)
- [Finetuned model](https://huggingface.co/illuin/llama-3-grouse)
## Citation
```bibtex
@misc{muller2024grousebenchmarkevaluateevaluators,
    title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering},
    author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
    year={2024},
    eprint={2409.06595},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2409.06595},
}
```