# Rageval

Evaluation tools for Retrieval-augmented Generation (RAG) methods.

[![python](https://img.shields.io/badge/Python-3.8.18-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
![workflow status](https://github.com/gomate-community/rageval/actions/workflows/makefile.yml/badge.svg)
[![codecov](https://codecov.io/gh/gomate-community/rageval/graph/badge.svg?token=AH4DNR46HL)](https://codecov.io/gh/gomate-community/rageval)
[![pydocstyle](https://img.shields.io/badge/pydocstyle-enabled-AD4CD3)](http://www.pydocstyle.org/en/stable/)
[![PEP8](https://img.shields.io/badge/code%20style-pep8-orange.svg)](https://www.python.org/dev/peps/pep-0008/)

Rageval is a tool that helps you evaluate RAG systems. The evaluation consists of six sub-tasks: query rewriting, document ranking, information compression, evidence verification, answer generation, and result validation.

## Definition of tasks and metrics
### 1. [The generate task](./rageval/tasks/_generate.py)
The generate task is to answer the question based on the contexts provided by the retrieval modules in RAG. Typically, the contexts are extracted or generated text snippets from the compressor, or relevant documents from the re-ranker. We divide the metrics used in the generate task into two categories: *answer correctness* and *answer groundedness*.
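
To make the data flow concrete, a hypothetical sample for this task might look as follows (`answers` and `gt_answers` follow the usage example later in this README; `questions` and `contexts` are assumed field names):

```python
# Hypothetical generate-task sample; field names other than "answers" and
# "gt_answers" (which appear in the Usage section) are assumptions.
sample = {
    "questions": ["Who won the 2016 St. Petersburg mayoral election?"],
    "contexts": [["Democrat Rick Kriseman won the 2016 mayoral election."]],
    "answers": ["Rick Kriseman won the 2016 election."],
    "gt_answers": [["Kriseman", "Rick Kriseman"]],
}
```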

(1) **Answer Correctness**: these metrics evaluate correctness by comparing the generated answer with the ground-truth answer (a usage sketch follows the list). Commonly used metrics include:

* [Answer F1 Correctness](./rageval/metrics/_answer_f1.py): is widely used in [the paper (Jiang et al.)](https://arxiv.org/abs/2305.06983), [the paper (Yu et al.)](https://arxiv.org/abs/2311.09210), [the paper (Xu et al.)](https://arxiv.org/abs/2310.04408), and others.
* [Answer NLI Correctness](./rageval/metrics/_answer_claim_recall.py): also known as *claim recall* in [the paper (Tianyu et al.)](https://arxiv.org/abs/2305.14627).
* [Answer EM Correctness](./rageval/metrics/_answer_exact_match.py): also known as *Exact Match* as used in [the paper (Ivan Stelmakh et al.)](https://arxiv.org/abs/2204.06092).
* [Answer Bleu Score](./rageval/metrics/_answer_bleu.py): also known as *Bleu* as used in [the paper (Kishore Papineni et al.)](https://www.aclweb.org/anthology/P02-1040.pdf).
* [Answer Ter Score](./rageval/metrics/_answer_ter.py): also known as *Translation Edit Rate* as used in [the paper (Snover et al.)](https://aclanthology.org/2006.amta-papers.25).
* [Answer chrF Score](./rageval/metrics/_answer_chrf.py): also known as *character n-gram F-score* as used in [the paper (Popovic et al.)](https://aclanthology.org/W15-3049).
* [Answer Disambig-F1](./rageval/metrics/_answer_disambig_f1.py): also known as *Disambig-F1* as used in [the paper (Ivan Stelmakh et al.)](https://arxiv.org/abs/2204.06092) and [the paper (Zhengbao Jiang et al.)](https://arxiv.org/abs/2305.06983).
* [Answer Rouge Correctness](./rageval/metrics/_answer_rouge_correctness.py): also known as *Rouge* as used in [the paper (Chin-Yew Lin)](https://aclanthology.org/W04-1013.pdf).
* [Answer Accuracy](./rageval/metrics/_answer_accuracy.py): also known as *Accuracy* as used in [the paper (Dan Hendrycks et al.)](https://arxiv.org/abs/2009.03300).
* [Answer LCS Ratio](./rageval/metrics/_answer_lcs_ratio.py): also known as *LCS(%)* as used in [the paper (Nashid et al.)](https://ieeexplore.ieee.org/abstract/document/10172590).
* [Answer Edit Distance](./rageval/metrics/_answer_edit_distance.py): also known as *Edit distance* as used in [the paper (Nashid et al.)](https://ieeexplore.ieee.org/abstract/document/10172590).
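
As a usage sketch for this category (assuming `AnswerEMCorrectness` mirrors the `AnswerF1Correctness` interface shown in the Usage section; verify the class name and expected fields in `rageval/metrics/_answer_exact_match.py`):

```python
from datasets import Dataset
import rageval as rl

# Hypothetical sketch modeled on the AnswerF1Correctness example below;
# the exact class name and gt_answers shape are assumptions to verify.
sample = {
    "answers": ["Rick Kriseman won the 2016 mayoral election."],
    "gt_answers": [["Kriseman", "Rick Kriseman"]],
}
dataset = Dataset.from_dict(sample)
metric = rl.metrics.AnswerEMCorrectness()
score, dataset = metric.compute(dataset)
```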

(2) **Answer Groundedness**: these metrics evaluate groundedness (also known as factual consistency) by comparing the generated answer with the provided contexts (a usage sketch follows the list). Commonly used metrics include:

* [Answer Citation Precision](./rageval/metrics/_answer_citation_precision.py): also known as *citation precision* in [the paper (Tianyu et al.)](https://arxiv.org/abs/2305.14627).
* [Answer Citation Recall](./rageval/metrics/_answer_citation_recall.py): also known as *citation recall* in [the paper (Tianyu et al.)](https://arxiv.org/abs/2305.14627).
* [Context Reject Rate](./rageval/metrics/_context_reject_rate.py): also known as *reject rate* in [the paper (Wenhao Yu et al.)](https://arxiv.org/abs/2311.09210).
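
Since groundedness metrics judge the answer against the retrieved contexts, they typically need an LLM evaluator (see "Setup Evaluator LLMs" below). A minimal sketch, assuming `ContextRejectRate` follows the same `compute(dataset)` interface as the metrics above; the real constructor arguments (including how the evaluator LLM is passed) live in `rageval/metrics/_context_reject_rate.py`:

```python
from datasets import Dataset
import rageval as rl

# Hypothetical sketch: the field names and constructor call are assumptions;
# check rageval/metrics/_context_reject_rate.py for the actual signature.
sample = {
    "questions": ["Who won the 2016 mayoral election?"],
    "contexts": [["A passage about an unrelated topic."]],
    "answers": ["Sorry, I cannot answer the question based on the context."],
}
dataset = Dataset.from_dict(sample)
metric = rl.metrics.ContextRejectRate()  # may require an evaluator LLM argument
score, dataset = metric.compute(dataset)
```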

### 2. [The rewrite task](./rageval/tasks/_rewrite.py)
The rewrite task is to reformulate the user's question into a set of queries that are more friendly to the search module in RAG.

### 3. [The search task](./rageval/tasks/_search.py)
The search task is to retrieve relevant documents from the knowledge base.

(1) **Context Adequacy**: these metrics evaluate adequacy by comparing the retrieved documents with the ground-truth contexts.

(2) **Context Relevance**: these metrics evaluate relevance by comparing the retrieved documents with the ground-truth answers. Commonly used metrics include:

* [Context Recall](./rageval/metrics/_context_recall.py): also known as *Context Recall* in [RAGAS framework](https://github.com/explodinggradients/ragas).

## Setup Evaluator LLMs

Some metrics rely on LLMs as evaluators. You can either call OpenAI's API directly or deploy an open-source model as an OpenAI-compatible RESTful API for evaluation.

- OpenAI

```python
import os

os.environ["OPENAI_API_KEY"] = ""  # your OpenAI API key
```

- Open source LLMs

Please use [vllm](https://github.com/vllm-project/vllm) to set up an API server for open-source LLMs. For example, the following command deploys a Llama-3-8B-Instruct model hosted on HuggingFace:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 8 \
    --dtype auto \
    --api-key sk-123456789 \
    --gpu-memory-utilization 0.9 \
    --port 5000
```
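
Once the server is up, OpenAI-compatible clients can be pointed at it. A quick smoke test with the official `openai` Python client (the base URL and key match the command above; whether Rageval reads these environment variables automatically is an assumption):

```python
import os
from openai import OpenAI

# Match the vLLM server started above.
os.environ["OPENAI_API_KEY"] = "sk-123456789"
os.environ["OPENAI_BASE_URL"] = "http://localhost:5000/v1"

client = OpenAI()  # picks up the env vars above
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```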

## Benchmark Results

### 1. [ASQA benchmark](benchmarks/ASQA/README.md)

[ASQA dataset](https://huggingface.co/datasets/din0s/asqa) is a question-answering dataset that contains factoid questions and long-form answers. The benchmark evaluates the correctness of answers on this dataset. All detailed results can be downloaded from [this repo](https://huggingface.co/datasets/golaxy/rag-bench/viewer/asqa), and they can be reproduced with [the script](./benchmarks/ASQA/run.sh) in this repo.





| Model | Retriever | String EM | Rouge L | Disambig F1 | D-R Score |
|---|---|---|---|---|---|
| gpt-3.5-turbo-instruct | no-retrieval | 33.8 | 30.2 | 30.7 | 30.5 |
| mistral-7b | no-retrieval | 20.6 | 31.1 | 26.6 | 28.7 |
| llama2-7b-chat | no-retrieval | 21.7 | 30.7 | 28.0 | 29.3 |
| llama3-8b-base | no-retrieval | 25.7 | 31.0 | 28.4 | 29.7 |
| llama3-8b-instruct | no-retrieval | 27.1 | 30.9 | 29.4 | 30.1 |
| solar-10.7b-instruct | no-retrieval | 23.0 | 24.9 | 28.1 | 26.5 |

### 2. [ALCE Benchmark](benchmarks/ALCE)

[ALCE](https://github.com/princeton-nlp/ALCE) is a benchmark for Automatic LLMs' Citation Evaluation. ALCE contains three datasets: ASQA, QAMPARI, and ELI5. All detailed results can be downloaded from [this repo](https://huggingface.co/datasets/golaxy/rag-bench/viewer/alce_eli5_bm25), and they can be reproduced with [the script](./benchmarks/ALCE/ASQA/run.sh) in this repo.

For more evaluation results, please view the benchmark's README: [ALCE-ASQA](benchmarks/ALCE/ASQA/README.md) and [ALCE-ELI5](benchmarks/ALCE/ELI5/README.md).







| Dataset | Model | Retriever | Prompt | MAUVE | EM Recall | Claim Recall | Citation Recall | Citation Precision |
|---|---|---|---|---|---|---|---|---|
| ASQA | llama2-7b-chat | GTR | vanilla(5-psg) | - | 33.3 | - | 55.9 | 80.0 |
| ASQA | llama2-7b-chat | DPR | vanilla(5-psg) | - | 29.2 | - | 49.2 | 81.0 |
| ASQA | llama2-7b-chat | Oracle | vanilla(5-psg) | - | 41.7 | - | 58.1 | 78.9 |
| ELI5 | llama2-7b-chat | BM25 | vanilla(5-psg) | - | - | 11.5 | 26.6 | 74.5 |
| ELI5 | llama2-7b-chat | Oracle | vanilla(5-psg) | - | - | 17.8 | 34.0 | 75.6 |
## Installation

```bash
git clone https://github.com/gomate-community/rageval.git
cd rageval
python setup.py install
```
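
To verify the installation (assuming the package exposes its metrics under `rageval.metrics`, as in the usage example below):

```python
import rageval as rl

# Should print the metric class without raising ImportError.
print(rl.metrics.AnswerF1Correctness)
```
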
## Usage

### 1. Metric

Take F1 as an example.
```python
from datasets import Dataset
import rageval as rl

sample = {
    "answers": [
        "Democrat rick kriseman won the 2016 mayoral election, while republican former mayor rick baker did so in the 2017 mayoral election."
    ],
    "gt_answers": [
        [
            "Kriseman",
            "Rick Kriseman"
        ]
    ]
}
dataset = Dataset.from_dict(sample)
metric = rl.metrics.AnswerF1Correctness()
score, dataset = metric.compute(dataset)
```
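
For reference, token-level answer F1 is conventionally defined from the token overlap between the predicted and ground-truth answers; whether Rageval applies additional normalization (casing, punctuation) is an assumption to verify in `_answer_f1.py`:

$$
P = \frac{|T_{\text{pred}} \cap T_{\text{gt}}|}{|T_{\text{pred}}|}, \qquad
R = \frac{|T_{\text{pred}} \cap T_{\text{gt}}|}{|T_{\text{gt}}|}, \qquad
F_1 = \frac{2PR}{P + R},
$$

with the sample score usually taken as the maximum over the available ground-truth answers.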

### 2. Benchmark

Benchmarks can be run directly with their scripts (take ALCE-ELI5 as an example):
```bash
bash benchmarks/ALCE/ELI5/run.sh
```

## Contribution

Please make sure to read the [Contributing Guide](./CONTRIBUTING.md) before creating a pull request.

## About

This project is currently in its preliminary stage.