# SAF-Eval

[![Run Tests](https://github.com/chandralegend/saf-eval/actions/workflows/test.yml/badge.svg)](https://github.com/chandralegend/saf-eval/actions/workflows/test.yml)
[![Lint](https://github.com/chandralegend/saf-eval/actions/workflows/lint.yml/badge.svg)](https://github.com/chandralegend/saf-eval/actions/workflows/lint.yml)

SAF-Eval (Search-Augmented Factuality Evaluator) is a modular Python package for evaluating the factuality of AI-generated responses. Based on academic research, it implements a systematic approach to measuring factual accuracy by breaking down responses into atomic facts and evaluating them against retrieved evidence.

## Features

- **Modular Pipeline**: Extract atomic facts, check relevance, retrieve supporting documents, evaluate factuality
- **Self-Containment Processing**: Automatically detect and fix non-self-contained facts by adding context
- **Few-Shot Learning**: Improve fact extraction using domain-specific examples
- **Fact Deduplication**: Identify and remove similar or duplicate facts to avoid redundant evaluations
- **Comprehensive Logging**: Detailed logging of the entire evaluation process for analysis and debugging
- **Customizable Evaluation**: Define your own categories and scoring rubrics
- **Provider-Agnostic**: Use any LLM provider through a consistent interface
- **Flexible Retrieval**: Integrate with any document retrieval system
- **Comprehensive Metrics**: Get detailed factuality scores and evaluations

## Installation

SAF-Eval requires Python 3.12 or later.

### Using Poetry (recommended)

```bash
# Clone the repository
git clone https://github.com/chandralegend/saf-eval.git
cd saf-eval

# Install dependencies with Poetry
poetry install
```

### Using pip

```bash
pip install saf-eval
```
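
The Quick Start below also loads API keys from a `.env` file via `python-dotenv`; if it isn't pulled in as a dependency of `saf-eval`, install it separately:

```bash
pip install python-dotenv
```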

## Quick Start

```python
import asyncio
import os
from dotenv import load_dotenv

from saf_eval.config import Config, LoggingConfig
from saf_eval.core.pipeline import EvaluationPipeline
from saf_eval.extraction.extractor import FactExtractor
from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM
from saf_eval.retrieval.providers.simple import SimpleRetriever
from saf_eval.evaluation.classifier import FactClassifier
from saf_eval.evaluation.scoring import FactualityScorer

# Load environment variables (for API keys)
load_dotenv()

async def evaluate_response():
    # Initialize components with shared config
    config = Config(
        scoring_rubric={
            "supported": 1.0,
            "contradicted": 0.0,
            "unverifiable": 0.5
        },
        retrieval_config={"top_k": 3},
        logging=LoggingConfig(level="INFO", console=True, file=True, log_dir="./logs")
    )

    llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))

    # Create a knowledge base
    knowledge_base = {
        "Mount Everest": "Mount Everest is the highest mountain above sea level at 29,032 feet (8,849 meters)."
    }

    # Setup the pipeline with components that share the same config
    pipeline = EvaluationPipeline(
        config=config,
        extractor=FactExtractor(config=config, llm=llm),
        retriever=SimpleRetriever(config=config, knowledge_base=knowledge_base),
        classifier=FactClassifier(config=config, llm=llm),
        scorer=FactualityScorer(config=config),
        containment_checker=ContainmentChecker(config=config, llm=llm)  # Add containment checker
    )

    # Evaluate a response
    response = "Mount Everest, at 29,032 feet, is the tallest mountain on Earth."
    context = "Information about geographical features"

    result = await pipeline.run(response, context)
    print(f"Factuality Score: {result.factuality_score:.2f}")

if __name__ == "__main__":
    asyncio.run(evaluate_response())
```
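
Under the rubric above, a fact classified as `supported` contributes 1.0, `contradicted` 0.0, and `unverifiable` 0.5; the reported factuality score is presumably an aggregate (e.g., the mean) of these per-fact values across the extracted atomic facts.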

## Project Structure

SAF-Eval follows a modular architecture with the following key components (a sketch of the package layout follows the list):

- **Core**: Pipeline coordination and data models
- **Extraction**: Breaking down responses into atomic facts
- **Containment**: Checking and fixing non-self-contained facts
- **Relevancy**: Assessing relevance of facts to the context
- **Retrieval**: Finding supporting documents for verification
- **Evaluation**: Classifying facts and calculating factuality scores
- **LLM**: Abstraction layer for language model providers
- **Utils**: Utility functions including deduplication and logging
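
Inferred from the import paths used throughout this README (a sketch, not the authoritative tree), the package layout looks roughly like:

```
saf_eval/
├── config.py        # Config, LoggingConfig
├── core/            # pipeline.py (EvaluationPipeline), models.py (AtomicFact, RetrievedDocument)
├── extraction/      # extractor.py (FactExtractor)
├── containment/     # checker.py (ContainmentChecker)
├── relevancy/       # relevance checking (no import shown in these examples)
├── retrieval/       # base.py (RetrieverBase), providers/ (SimpleRetriever)
├── evaluation/      # classifier.py (FactClassifier), scoring.py (FactualityScorer)
├── llm/             # providers/ (OpenAILLM)
└── utils/           # deduplication.py (deduplicate_facts), logging helpers
```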

## Advanced Usage

### Self-Containment Processing

The `ContainmentChecker` helps identify and fix facts that require additional context to be understood:

```python
import os

from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM

# Initialize the components (`config`, `facts`, and `response` carry over from
# the Quick Start pipeline above)
llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))
containment_checker = ContainmentChecker(config=config, llm=llm)

# Check if facts are self-contained
checked_facts = await containment_checker.check_containment(facts, response)

# Fix non-self-contained facts by adding context
self_contained_facts = await containment_checker.self_contain_facts(
    checked_facts,
    response,
    context="Optional additional context to help with self-containment"
)

# Example: Converting "He wrote it in 1851" to "Herman Melville wrote Moby Dick in 1851"
```

### Using Context in Fact Extraction

The fact extractor can use context to improve extraction quality:

```python
extractor = FactExtractor(config=config, llm=llm)
facts = await extractor.extract_facts(
    response="Melville's masterpiece is considered one of the Great American Novels.",
    context="Discussion about the novel Moby Dick by Herman Melville"
)
```

### Using Examples for Fact Extraction

You can provide examples to improve fact extraction through few-shot learning:

```python
from typing import List, Tuple, Optional

# Define an example provider function
def my_example_provider(response: str, context: Optional[str] = None, **kwargs) -> List[Tuple[str, List[str]]]:
    """Provide domain-specific examples for fact extraction."""
    # Return a list of (example_text, [fact1, fact2, ...]) tuples
    return [
        (
            "The Eiffel Tower was completed in 1889 and stands at 330 meters tall.",
            ["The Eiffel Tower was completed in 1889.", "The Eiffel Tower is 330 meters tall."]
        ),
        # More examples...
    ]

# Use the example provider in the extractor
extractor = FactExtractor(
    config=config,
    llm=llm,
    example_provider=my_example_provider
)

# Extract facts with examples to guide the LLM
facts = await extractor.extract_facts(
    response="The Golden Gate Bridge was completed in 1937.",
    context="Information about famous structures"
)
```

Example providers can dramatically improve extraction quality by demonstrating the desired level of granularity and format.

### Custom Retrieval System

You can implement your own retrieval system:

```python
from typing import List
from saf_eval.core.models import AtomicFact, RetrievedDocument
from saf_eval.retrieval.base import RetrieverBase

class MyCustomRetriever(RetrieverBase):
    async def retrieve(self, fact: AtomicFact, **kwargs) -> List[RetrievedDocument]:
        documents: List[RetrievedDocument] = []
        # Implement your retrieval logic here
        # ...
        return documents
```
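
For a concrete (if naive) illustration, here is a keyword-overlap sketch over in-memory strings. Note the assumptions: `fact.text`, the `RetrievedDocument(content=..., source=...)` constructor, and the base-class `__init__` are illustrative guesses rather than the library's confirmed API; check `saf_eval.core.models` and `saf_eval.retrieval.base` for the real field names.

```python
from typing import List

from saf_eval.core.models import AtomicFact, RetrievedDocument
from saf_eval.retrieval.base import RetrieverBase

class KeywordOverlapRetriever(RetrieverBase):
    """Naive sketch: rank in-memory documents by word overlap with the fact."""

    def __init__(self, config, documents: List[str]):
        super().__init__(config=config)  # assumption: the base class stores the shared config
        self.documents = documents

    async def retrieve(self, fact: AtomicFact, top_k: int = 3, **kwargs) -> List[RetrievedDocument]:
        # assumption: AtomicFact exposes the fact string as `.text`
        fact_words = set(fact.text.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(fact_words & set(doc.lower().split())),
            reverse=True,
        )
        # assumption: RetrievedDocument takes `content` and `source` fields
        return [RetrievedDocument(content=doc, source="in-memory") for doc in ranked[:top_k]]
```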

### Custom Evaluation Categories

Customize how facts are classified:

```python
from saf_eval.evaluation.classifier import FactClassifier

classifier = FactClassifier(
    llm=my_llm,
    categories=["accurate", "partially_accurate", "inaccurate", "uncertain"]
)
```
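
Custom categories presumably need matching keys in the `Config` scoring rubric, mirroring how the Quick Start rubric keys line up with the default categories; a sketch under that assumption:

```python
from saf_eval.config import Config

# Assumption: scoring_rubric keys must mirror the classifier's categories.
config = Config(
    scoring_rubric={
        "accurate": 1.0,
        "partially_accurate": 0.5,
        "inaccurate": 0.0,
        "uncertain": 0.25,
    }
)
```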

### Fact Deduplication

The pipeline automatically deduplicates similar facts to avoid redundant evaluations:

```python
from saf_eval.utils.deduplication import deduplicate_facts
from typing import List
from saf_eval.core.models import AtomicFact

# Default deduplication is already included in the pipeline, but can be customized:
def my_custom_deduplication(facts: List[AtomicFact]) -> List[AtomicFact]:
    # Implement your own similarity check here
    # ...
    return facts
```
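
If `EvaluationPipeline` accepts a deduplication hook, wiring in the custom function would look roughly like this (the `deduplicator` parameter name is hypothetical; check the pipeline's actual signature):

```python
# Hypothetical wiring; `deduplicator` is an assumed parameter name.
pipeline = EvaluationPipeline(
    config=config,
    extractor=FactExtractor(config=config, llm=llm),
    retriever=SimpleRetriever(config=config, knowledge_base=knowledge_base),
    classifier=FactClassifier(config=config, llm=llm),
    scorer=FactualityScorer(config=config),
    deduplicator=my_custom_deduplication,
)
```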