# MIRAGE-BENCH: Benchmarking LLM Generation Across Multiple Languages

[:hugs: Collection](https://huggingface.co/collections/nthakur/mirage-bench-naacl25-67ddb6166a7938a37436a455) | [License](https://github.com/vectara/mirage-bench/blob/master/LICENSE) | [PyPI](https://pypi.org/project/mirage-bench/)
This repository provides an easy way to achieve the following four objectives:
1. Generate RAG-based answers to multilingual questions, with support for many open-source LLMs integrated via [vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html), as well as closed-source LLMs through APIs such as Azure OpenAI, Cohere, Anthropic, etc.
2. Evaluate multilingual RAG answers based on a variety of heuristic features (e.g., support, fluency) or automatic evaluations using open-source LLMs supported in vLLM.
3. Conduct an LLM-as-a-Judge design to compare pairwise multilingual RAG answers and train a Bradley-Terry model (with bootstrapping) to build an offline multilingual RAG arena.
4. Train a surrogate judge (linear regression) to learn from and bootstrap the expensive LLM-as-a-Judge approach using heuristic features.

For more information, check out our publication:
- [MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2410.13716) (Accepted at NAACL 2025 Main Conference :star:)

## Installation
We recommend **Python 3.9+** and installing the latest version of **[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html)**.
**Install with pip:**
```bash
pip install -U mirage-bench
```

**Install from source:**
Alternatively, you can clone the latest version from the [repository](https://github.com/vectara/mirage-bench) and install it directly from source:
```bash
git clone https://github.com/vectara/mirage-bench.git
cd mirage-bench
pip install -e .
```

## Datasets
| Resource | Description |
|:---------|:------------|
| :hugs: [mirage-bench](https://huggingface.co/datasets/nthakur/mirage-bench) | All queries & input prompts available in MIRAGE-Bench |
| :hugs: [mirage-bench-output](https://huggingface.co/datasets/nthakur/mirage-bench-output) | Pre-computed RAG answers and all feature scores for 21 models |
| :hugs: [mirage-bench-pairwise-judgments](https://huggingface.co/datasets/nthakur/mirage-bench-pairwise-judgments) | Pairwise judgments using a GPT-4o LLM judge across all 19 models |
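These datasets can also be inspected directly with the Hugging Face `datasets` library. A minimal sketch follows; the per-language configuration name (`"en"`) and the `dev` split are assumptions mirroring the `mirage_bench` loaders used in the examples below:

```python
# A minimal sketch: browse the raw MIRAGE-Bench data with the `datasets` library.
# The "en" configuration name and "dev" split are assumptions based on the
# mirage_bench loaders shown in the Getting Started examples.
from datasets import load_dataset

dataset = load_dataset("nthakur/mirage-bench", "en", split="dev")
print(dataset)            # features and number of rows
print(dataset[0].keys())  # e.g., query_id, prompt, and related fields
```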
## Getting Started

Make sure you have the latest **[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html)** installed correctly.
### 1. Multilingual RAG Answer Generation
Generate RAG answers for the multilingual queries in mirage-bench using an LLM of your choice.
> You can also generate answers with HF models on single or multiple GPU instances using [vLLM](https://github.com/vectara/mirage-bench/blob/main/examples/generation/vllm_generation.py); see the sketch after the example below.

```python
# export AZURE_OPENAI_ENDPOINT="xxxxx"
# export AZURE_OPENAI_API_KEY="xxxx"

from mirage_bench import util
from mirage_bench.generate import AzureOpenAIClient

# Many other clients are also available, e.g., Cohere or Anthropic
client = AzureOpenAIClient(model_name_or_path="gpt-4o-mini")

# prompts_dict contains query_id as key and prompt as value
prompts_dict = util.load_prompts(
dataset_name="nthakur/mirage-bench",
language_code="en", # or "ar", "bn" ... 18 languages supported
split="dev" # only dev split is available in mirage-bench
)
query_ids = list(prompts_dict.keys())
outputs = client.batch_call(
prompts=list(prompts_dict.values()),
temperature=0.1,
max_new_tokens=2048,
)
# outputs contains the list of generated RAG answers
```
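As noted above, answers can also be generated locally instead of through an API. A minimal sketch using vLLM directly on the loaded prompts; the model name and sampling settings are illustrative assumptions, and the repository's `examples/generation/vllm_generation.py` script is the reference workflow:

```python
# A minimal sketch of local generation with vLLM on the same prompts.
# Model name and sampling settings are illustrative assumptions; see
# examples/generation/vllm_generation.py in the repository for the
# supported workflow.
from vllm import LLM, SamplingParams

from mirage_bench import util

prompts_dict = util.load_prompts(
    dataset_name="nthakur/mirage-bench",
    language_code="en",
    split="dev",
)

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.1, max_tokens=2048)

outputs = llm.generate(list(prompts_dict.values()), sampling_params)
answers = [output.outputs[0].text for output in outputs]
```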
### 2. Heuristic & Automatic RAG Evaluation

After generating RAG answers, we evaluate the quality of the responses using heuristic features:
```python
from mirage_bench import util
from mirage_bench.evaluate import RougeBleuEvaluator

evaluator = RougeBleuEvaluator(language_code="en")

# Load the documents (relevant & non-relevant)
documents = util.load_documents(
dataset_name="nthakur/mirage-bench",
language_code="en",
split="dev"
)

# Load the multilingual RAG predictions available for 20+ models.
# In this example, we are evaluating: meta-llama/Meta-Llama-3-8B-Instruct
predictions = util.load_predictions(
dataset_name="nthakur/mirage-bench-output",
language_code="en",
split="dev",
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
)

# Load the reference model's predictions (i.e., ground-truth answers).
# This step is not required for all heuristic features.
reference_predictions = util.load_predictions(
dataset_name="nthakur/mirage-bench-output",
language_code="en",
split="dev",
model_name="gpt-4-azure",
)

# Evaluate the predictions
scores = evaluator.evaluate(
predictions=predictions,
reference_predictions=reference_predictions,
documents=documents
)
# => query_id: {"answer_bleu": 0.9, "answer_rougeL": 0.75}
```
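The evaluator returns per-query scores; to compare models it is often convenient to average them over the dataset. A small sketch in plain Python over the score dictionary shown above:

```python
# A small sketch: average the per-query heuristic scores returned above.
# Assumes `scores` maps query_id -> {"answer_bleu": ..., "answer_rougeL": ...}.
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
for query_scores in scores.values():
    for feature, value in query_scores.items():
        totals[feature] += value
        counts[feature] += 1

avg_scores = {feature: totals[feature] / counts[feature] for feature in totals}
print(avg_scores)  # e.g., {"answer_bleu": ..., "answer_rougeL": ...}
```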
### 3. LLM-as-a-Judge Pairwise Evaluation

After generating RAG answers, we can also use an LLM as a judge to compare two RAG outputs and decide which one is better.
```python
from mirage_bench import util
from mirage_bench.evaluate import PairwiseLLMJudgeEvaluator

evaluator = PairwiseLLMJudgeEvaluator(
client="azure_openai",
model_name_or_path="gpt-4o-mini"
)

# Load the documents (relevant & non-relevant)
documents = util.load_documents(
dataset_name="nthakur/mirage-bench",
language_code="en",
split="dev"
)
queries = util.load_queries(
dataset_name="nthakur/mirage-bench",
language_code="en",
split="dev"
)

# In this example we will evaluate two models:
models = [
"meta-llama/Meta-Llama-3-8B-Instruct",
"meta-llama/Meta-Llama-3-70B-Instruct"
]

predictions = {}
for model_name in models:
    predictions[model_name] = util.load_predictions(
        dataset_name="nthakur/mirage-bench-output",
        language_code="en",
        split="dev",
        model_name=model_name,
    )

scores = evaluator.evaluate(
predictions=predictions,
all_model_names=models, # provide all model names
documents=documents,
queries=queries
)
# Important: model_A and model_B are randomly switched
# => [{"query_id": 1,
#      "judge": "gpt-4o-mini",
#      "model_A": "meta-llama/Meta-Llama-3-8B-Instruct",
#      "model_B": "meta-llama/Meta-Llama-3-70B-Instruct",
#      "output": ,
#      "verdict": "A" / "B" / "Tie"
#     }]
```
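The pairwise verdicts can then be aggregated into a leaderboard. The repository's arena evaluation fits a Bradley-Terry model with bootstrapping, but a simple win-rate tally over the verdict records shown above already gives a rough ranking. A minimal sketch in plain Python, assuming `scores` is a list of dicts shaped like the example output:

```python
# A minimal sketch: turn the pairwise verdicts above into per-model win rates.
# Assumes `scores` is a list of dicts with "model_A", "model_B", and "verdict"
# keys, as in the example output; the repository's arena evaluation instead
# fits a Bradley-Terry model with bootstrapping.
from collections import Counter

wins, comparisons = Counter(), Counter()
for record in scores:
    model_a, model_b = record["model_A"], record["model_B"]
    comparisons[model_a] += 1
    comparisons[model_b] += 1
    if record["verdict"] == "A":
        wins[model_a] += 1
    elif record["verdict"] == "B":
        wins[model_b] += 1
    # Ties count as comparisons but add no wins to either model.

win_rates = {model: wins[model] / comparisons[model] for model in comparisons}
for model, rate in sorted(win_rates.items(), key=lambda item: -item[1]):
    print(f"{model}: {rate:.3f}")
```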
## Application Examples

You can use this framework for:
- [Multilingual RAG Generation](https://github.com/vectara/mirage-bench/tree/main/examples/generation)
- [Heuristic RAG Evaluations](https://github.com/vectara/mirage-bench/tree/main/examples/heuristic_evals)
- [Arena RAG Evaluations](https://github.com/vectara/mirage-bench/tree/main/examples/arena_evals)
- [Surrogate Judge Training & Inference](https://github.com/vectara/mirage-bench/tree/main/examples/surrogate_judge) (see the sketch below)
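As described in objective 4 above, the surrogate judge is a regression model trained to predict arena-style scores from heuristic features, so the expensive LLM judge does not have to be re-run for every new model. A minimal, illustrative sketch with scikit-learn; the feature matrix and target scores below are placeholder assumptions, not the repository's actual training pipeline (see the surrogate_judge examples for that):

```python
# An illustrative sketch of a surrogate judge: a linear regression that maps
# heuristic feature scores to LLM-judge-derived arena scores. The arrays are
# placeholder assumptions; the repository's surrogate_judge examples show the
# actual training pipeline.
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per model: averaged heuristic features (e.g., fluency, support, BLEU).
heuristic_features = np.array([
    [0.72, 0.81, 0.35],
    [0.65, 0.74, 0.28],
    [0.80, 0.88, 0.41],
    [0.58, 0.69, 0.22],
])
# One arena score per model, obtained from the LLM-as-a-Judge evaluation.
arena_scores = np.array([0.61, 0.44, 0.78, 0.31])

surrogate_judge = LinearRegression().fit(heuristic_features, arena_scores)

# Score a new model from its heuristic features alone, without calling the judge.
new_model_features = np.array([[0.70, 0.79, 0.33]])
print(surrogate_judge.predict(new_model_features))
```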
## Citing & Authors

This work was done as a collaboration between Vectara and the University of Waterloo.
If you find this repository helpful, feel free to cite our publication [MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems](https://arxiv.org/abs/2410.13716):
```bibtex
@article{thakur-mirage-bench:2024,
author = {Nandan Thakur and
Suleman Kazi and
Ge Luo and
Jimmy Lin and
Amin Ahmad},
title = {MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented
Generation Systems},
journal = {CoRR},
volume = {abs/2410.13716},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2410.13716},
doi = {10.48550/ARXIV.2410.13716},
eprinttype = {arXiv},
eprint = {2410.13716},
timestamp = {Wed, 27 Nov 2024 09:01:16 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2410-13716.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

Maintainer: [Nandan Thakur](https://github.com/thakur-nandan), PhD Student @ University of Waterloo
Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.
> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.