https://github.com/vero-labs-ai/vero-eval

Open source framework for evaluating AI Agents
https://github.com/vero-labs-ai/vero-eval

dataset-generation datasets evals evaluation evaluation-framework evaluation-metrics langgraph llm-evaluation llm-evaluation-framework python rag-evaluation rag-testing synthetic-dataset-generation testing testing-framework testing-library user-persona

Last synced: 3 months ago
JSON representation

Open source framework for evaluating AI Agents

Host: GitHub
URL: https://github.com/vero-labs-ai/vero-eval
Owner: vero-labs-ai
License: apache-2.0
Created: 2025-09-10T09:09:25.000Z (10 months ago)
Default Branch: main
Last Pushed: 2026-02-24T17:54:26.000Z (4 months ago)
Last Synced: 2026-02-24T21:55:11.575Z (4 months ago)
Topics: dataset-generation, datasets, evals, evaluation, evaluation-framework, evaluation-metrics, langgraph, llm-evaluation, llm-evaluation-framework, python, rag-evaluation, rag-testing, synthetic-dataset-generation, testing, testing-framework, testing-library, user-persona
Language: Python
Homepage: https://vero.co.in/
Size: 1.05 MB
Stars: 28
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff

Awesome Lists containing this project

README

          # Vero

**Vero** is an open platform for evaluating and monitoring AI pipelines with real-world rigor.




It goes beyond standard benchmarking by understanding your business use-case to generate edge-case **user personas** and **stress-test** your Agent across challenging scenarios, help identify risks and build highly reliable AI systems.

> Most eval tools say "You're broken". Vero tells you where, and how to fix it

> ⭐ If you find this project helpful, consider giving it a star. It genuinely helps the project grow and encourages us to keep improving.

# Index

- [Key Features](#key-features)

- [Flowchart](#flowchart)

- [Project Structure](#project-structure)

- [Getting Started](#getting-started)

- [Metrics Overview](#metrics-overview)

- [Evaluator](#evaluator)

- [Test Dataset Generation](#test-dataset-generation)

- [Report Generation](#report-generation)

- [Containerised evaluation (Docker)](#containerised-evaluation-docker)




# Key Features

- **Trace & Log Execution**: Each query runs through the RAG pipeline is logged into an SQLite database, capturing the user query, retrieved context, reranked items, and the model’s output.

- **Component-level Metrics**: Evaluate intermediate pipeline stages using metrics like Precision, Recall, Sufficiency, Citation, Overlap, and Ranking metrics (e.g. MRR, MAP, NDCG).

- **Generation Metrics**: Measure semantic, factual, and alignment quality of generated outputs using metrics such as BERTScore, ROUGE, SEMScore, AlignScore, BLEURT, and G-Eval.

- **Modular & Extensible**: Easily plug in new metric classes or custom scoring logic; the framework is designed to grow with your needs.

- **End-to-End Evaluation**: Combine component metrics to understand the holistic performance of your RAG system — not just individual parts.

# Flowchart

![flowchart(v2).png](docs/flowchart%28v2%29.png)

# Project Structure

```

.

├── src/

|   ├── vero/

│   │   ├── evaluator         # Main package for evaluation

│   │   ├── report generation workflow  # Report generation workflow

│   │   ├── test dataset generator  # Test dataset generator

|   │   ├── metrics/          # Main package for metrics

|   └── └──  all the metrics  # All the metrics are in here

└── tests/

│   └── test_main.py/         # file for all the testing

└── docker/

│   ├── run_eval_pipeline.py  # script to run the evaluation pipeline

│   └── Dockerfile/           # dockerfile for building the image    

└── vero-deploy/

│   ├── data/                 # folder for input and output data

│   │   ├── inputs            # folder for input data

│   │   │   ├── input.csv     # example input csv file

│   │   │   └── ground_truth.csv # example ground truth csv file

│   │   └── outputs           # folder for output data

│   ├── config.yaml           # yaml file for configuration

└── └── docker-compose.yaml   # docker compose file

```

# Getting Started

### Setup

Install via pip (recommended inside a virtualenv):

```py

pip install vero-eval

```

### Example Usage

```py

from vero.rag import SimpleRAGPipeline

from vero.trace import TraceDB

from vero.eval import Evaluator

trace_db = TraceDB(db_path="runs.db")

pipeline = SimpleRAGPipeline(retriever="faiss", generator="openai", trace_db=trace_db)

# Run your pipeline

run = pipeline.run("Who invented the transistor?")

print("Answer:", run.answer)

# Later, compute metrics for all runs

evaluator = Evaluator(trace_db=trace_db)

results = evaluator.evaluate()

print(results)

```

## Metrics Overview

The RAG Evaluation Framework supports three classes of metrics:

- Generation Metrics — measure semantic/factual quality of answers.

- Ranking Metrics — measure ranking quality of rerankers.

- Retrieval Metrics — measure the quality and sufficiency of the retrieved context.

# Evaluator

## **Overview**

- The Evaluator is a convenience wrapper to run multiple metrics over model outputs and retrieval results.

- It orchestrates generation evaluation (text-generation metrics), retrieval evaluation (precision/recall/sufficiency), and reranker evaluation (NDCG, MAP, MRR). It produces CSV summaries by default.

> Quick notes

>

> - The Evaluator uses project metric classes (e.g., BartScore, BertScore, RougeScore, SemScore, PrecisionScore, RecallScore, MeanAP, MeanRR, RerankerNDCG, CumulativeNDCG, etc.). These metrics are in vero.metrics and are referenced internally.

> - Many methods expect particular CSV column names (see "Expected CSV schemas").

> - We highly recommend to install **gpu version** of torch library.

### Steps to evaluate your pipeline

**Step 1 - Generation evaluation**

- Input: a CSV with "Context Retrieved" and "Answer" columns.

- Result: Generation_Scores.csv with columns such as SemScore, BertScore, RougeLScore, BARTScore, BLUERTScore, G-Eval (Faithfulness).

Example:

```py

from vero.evaluator import Evaluator

evaluator = Evaluator()

# data_path must point to a CSV with columns "Context Retrieved" and "Answer"

df_scores = evaluator.evaluate_generation(data_path='testing.csv')

print(df_scores.head())

```

**Step 2 - Preparing reranker inputs (parse ground truth + retriever output)**

- Use parse_retriever_data to convert ground-truth chunk ids and retriever outputs into a ranked_chunks_data.csv suitable for reranker evaluation.

Example:

```py

from vero.evaluator import Evaluator

evaluator = Evaluator()

# ground_truth_path: dataset with 'Chunk IDs' and 'Less Relevant Chunk IDs' columns

# data_path: retriever output with 'Context Retrieved' containing "id='...'"

evaluator.parse_retriever_data(

    ground_truth_path='test_dataset_generator.csv',

    data_path='testing.csv'

)

# This will produce 'ranked_chunks_data.csv'

```

**Step 3 - Retrieval evaluation (precision, recall, sufficiency)**

- Inputs:

  - retriever_data_path: a CSV that contains 'Retrieved Chunk IDs' and 'True Chunk IDs' columns (lists or strings).

  - data_path: the generation CSV with 'Context Retrieved' and 'Question' (for sufficiency).

- Result: Retrieval_Scores.csv

Example:

```py

from vero.evaluator import Evaluator

evaluator = Evaluator()

df_retrieval_scores = evaluator.evaluate_retrieval(

    data_path='testing.csv',

    retriever_data_path='ranked_chunks_data.csv'

)

print(df_retrieval_scores.head())

```

**Step 4 - Reranker evaluation (MAP, MRR, NDCG)**

Example:

```py

from vero.evaluator.evaluator import Evaluator

evaluator = Evaluator()

df_reranker_scores = evaluator.evaluate_reranker(

    ground_truth_path='test_dataset_generator.csv',

    retriever_data_path='ranked_chunks_data.csv'

)

print(df_reranker_scores)

```

#### Lower-level metric usage

To run a single metric directly you can instantiate the metric class. For example, to compute BARTScore or BertScore per pair:

```py

from vero.metrics import BartScore, BertScore

with BartScore() as bs:

    bart_results = [bs.evaluate(context, answer) for context, answer in zip(contexts, answers)]

with BertScore() as bert:

    bert_results = [bert.evaluate(context, answer) for context, answer in zip(contexts, answers)]

```

# Test Dataset Generation

## **Overview**

- The Test Dataset Generation module creates high-quality question-answer pairs derived from your document collection. It generates challenging queries designed to reveal retrieval and reasoning failures in RAG systems, and considering edge-case user personas.

- Internally it chunks documents, clusters related chunks, and uses an LLM to produce QA items with ground-truth chunk IDs and metadata.

### **Example**

```py

from vero.test_dataset_generator import generate_and_save

# Generate 100 queries from PDFs stored in ./data/pdfs directory and save outputs in test_dataset directory

generate_and_save(

    data_path='./data/pdfs/',

    usecase='Vitamin chatbot catering to general users for their daily queries',

    save_path_dir='test_dataset',

    n_queries=100

)

```

# Report Generation

## Overview

- The Report Generation module consolidates evaluation outputs from generation, retrieval, and reranking into a final report.

- It orchestrates a stateful workflow that processes CSV results from various evaluators and synthesizes comprehensive insights and recommendations.

## Example Usage

```py

from vero.report_generation_workflow import ReportGenerator

# Initialize the report generator

report_generator = ReportGenerator()

# Generate the final report by providing:

# - Pipeline configuration JSON file

# - Generation, Retrieval, and Reranker evaluation CSV files

report_generator.generate_report(

    'pipe_config_data.json',

    'Generation_Scores.csv',

    'Retrieval_Scores.csv',

    'Reranked_Scores.csv'

)

```

# Containerised evaluation (Docker)

## Prequisites

- Install [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/).

- Disk spcae >= 50 GB

- CPU >= 4 cores (GPU recommended for faster metric computation)

- RAM >= 8 GB

**Step 1 - Clone the repository and switch to the deploy folder**

```bash

git clone https://github.com/vero-eval/vero-eval.git

cd vero-deploy

```

**Step 2 - Inspect example inputs**

- Open the `data/inputs` folder inside `vero-deploy`.

- Review the example CSV provided (for example `data/inputs/input.csv`) and structure your input CSV the same way. Common column names used across the project include `Context Retrieved`, `Answer`, `Question`, `Retrieved Chunk IDs`, `True Chunk IDs`, `Chunk IDs`, and `Less Relevant Chunk IDs`.

**Step-3 Configure the input filename**

- Open `config.yaml` in the `vero-deploy` directory and check the key that references the input CSV filename. Update it to point to your CSV if needed (ensure the filename you pick matches the file in `data/inputs`).

**Step-4 Set your OpenAI key in Docker Compose**

- Open `docker-compose.yaml` and set the `OPEN_AI_KEY` environment variable for the evaluation service. Example snippet:

```yaml

x-base-config: &base-config

  image: crimsonceres/demo-test

  environment:

    - HF_HOME=/root/.cache/huggingface

    - TRANSFORMERS_CACHE=/root/.cache/huggingface/transformers

    - OPENAI_API_KEY=

```

You can also use environment substitution (for example `OPEN_AI_KEY=${OPEN_AI_KEY}`) and export the variable in your shell before running Docker.

**Step-5 Run the evaluation container**

- If you have a GPU available (recommended):

```bash

docker compose run evaluation-runner-gpu

```

- If you are using CPU only:

```bash

docker compose run evaluation-runner-cpu

```

>That is all required to run a containerised evaluation. Ensure `data/inputs` contains the CSV referenced in `config.yaml` and that `docker-compose.yaml` has `OPEN_AI_KEY` set before running the appropriate `docker compose run` command.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/vero-labs-ai/vero-eval

Awesome Lists containing this project

README