https://github.com/chandralegend/saf-eval
SAF-Eval (Search-Augmented Factuality Evaluator) is a modular Python package for evaluating the factuality of AI-generated responses.
- Host: GitHub
- URL: https://github.com/chandralegend/saf-eval
- Owner: chandralegend
- License: mit
- Created: 2025-03-26T19:43:34.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-03-26T21:52:18.000Z (2 months ago)
- Last Synced: 2025-03-26T22:29:32.622Z (2 months ago)
- Topics: fact-check, llm-factuality, long-form-factuality
- Language: Python
- Homepage:
- Size: 92.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# SAF-Eval
[Tests](https://github.com/chandralegend/saf-eval/actions/workflows/test.yml)
[Lint](https://github.com/chandralegend/saf-eval/actions/workflows/lint.yml)

SAF-Eval (Search-Augmented Factuality Evaluator) is a modular Python package for evaluating the factuality of AI-generated responses. Based on academic research, it implements a systematic approach to measuring factual accuracy by breaking down responses into atomic facts and evaluating them against retrieved evidence.
## Features
- **Modular Pipeline**: Extract atomic facts, check relevance, retrieve supporting documents, evaluate factuality
- **Self-Containment Processing**: Automatically detect and fix non-self-contained facts by adding context
- **Few-Shot Learning**: Improve fact extraction using domain-specific examples
- **Fact Deduplication**: Identify and remove similar or duplicate facts to avoid redundant evaluations
- **Comprehensive Logging**: Detailed logging of the entire evaluation process for analysis and debugging
- **Customizable Evaluation**: Define your own categories and scoring rubrics
- **Provider-Agnostic**: Use any LLM provider through a consistent interface
- **Flexible Retrieval**: Integrate with any document retrieval system
- **Comprehensive Metrics**: Get detailed factuality scores and evaluations

## Installation
SAF-Eval requires Python 3.12 or later.
### Using Poetry (recommended)
```bash
# Clone the repository
git clone https://github.com/chandralegend/saf-eval.git
cd saf-eval

# Install dependencies with Poetry
poetry install
```

### Using pip
```bash
pip install saf-eval
```

## Quick Start
```python
import asyncio
import os
from dotenv import load_dotenv

from saf_eval.config import Config, LoggingConfig
from saf_eval.core.pipeline import EvaluationPipeline
from saf_eval.extraction.extractor import FactExtractor
from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM
from saf_eval.retrieval.providers.simple import SimpleRetriever
from saf_eval.evaluation.classifier import FactClassifier
from saf_eval.evaluation.scoring import FactualityScorer

# Load environment variables (for API keys)
load_dotenv()

async def evaluate_response():
    # Initialize components with shared config
    config = Config(
        scoring_rubric={
            "supported": 1.0,
            "contradicted": 0.0,
            "unverifiable": 0.5
        },
        retrieval_config={"top_k": 3},
        logging=LoggingConfig(level="INFO", console=True, file=True, log_dir="./logs")
    )

    llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))

    # Create a knowledge base
    knowledge_base = {
        "Mount Everest": "Mount Everest is the highest mountain above sea level at 29,032 feet (8,849 meters)."
    }

    # Setup the pipeline with components that share the same config
    pipeline = EvaluationPipeline(
        config=config,
        extractor=FactExtractor(config=config, llm=llm),
        retriever=SimpleRetriever(config=config, knowledge_base=knowledge_base),
        classifier=FactClassifier(config=config, llm=llm),
        scorer=FactualityScorer(config=config),
        containment_checker=ContainmentChecker(config=config, llm=llm)  # Add containment checker
    )

    # Evaluate a response
    response = "Mount Everest, at 29,032 feet, is the tallest mountain on Earth."
    context = "Information about geographical features"

    result = await pipeline.run(response, context)
    print(f"Factuality Score: {result.factuality_score:.2f}")

if __name__ == "__main__":
    asyncio.run(evaluate_response())
```

## Project Structure
SAF-Eval follows a modular architecture with the following key components:
- **Core**: Pipeline coordination and data models
- **Extraction**: Breaking down responses into atomic facts
- **Containment**: Checking and fixing non-self-contained facts
- **Relevancy**: Assessing relevance of facts to the context
- **Retrieval**: Finding supporting documents for verification
- **Evaluation**: Classifying facts and calculating factuality scores
- **LLM**: Abstraction layer for language model providers
- **Utils**: Utility functions including deduplication and logging
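For orientation, the import paths used in this README's examples map onto those components as follows (the relevancy module is part of the pipeline but has no standalone example below):

```python
from saf_eval.core.pipeline import EvaluationPipeline            # Core
from saf_eval.core.models import AtomicFact, RetrievedDocument   # Core data models
from saf_eval.extraction.extractor import FactExtractor          # Extraction
from saf_eval.containment.checker import ContainmentChecker      # Containment
from saf_eval.retrieval.providers.simple import SimpleRetriever  # Retrieval
from saf_eval.evaluation.classifier import FactClassifier        # Evaluation
from saf_eval.evaluation.scoring import FactualityScorer         # Evaluation
from saf_eval.llm.providers.openai import OpenAILLM              # LLM
from saf_eval.utils.deduplication import deduplicate_facts       # Utils
```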
## Advanced Usage

### Self-Containment Processing
The `ContainmentChecker` helps identify and fix facts that require additional context to be understood:
```python
from saf_eval.containment.checker import ContainmentChecker
from saf_eval.llm.providers.openai import OpenAILLM

# Initialize the components
llm = OpenAILLM(model="gpt-4", api_key=os.getenv("OPENAI_API_KEY"))
containment_checker = ContainmentChecker(config=config, llm=llm)

# Check if facts are self-contained
checked_facts = await containment_checker.check_containment(facts, response)

# Fix non-self-contained facts by adding context
self_contained_facts = await containment_checker.self_contain_facts(
    checked_facts,
    response,
    context="Optional additional context to help with self-containment"
)

# Example: Converting "He wrote it in 1851" to "Herman Melville wrote Moby Dick in 1851"
```

### Using Context in Fact Extraction
The fact extractor can now use context to improve extraction quality:
```python
extractor = FactExtractor(config=config, llm=llm)
facts = await extractor.extract_facts(
    response="Melville's masterpiece is considered one of the Great American Novels.",
    context="Discussion about the novel Moby Dick by Herman Melville"
)
```

### Using Examples for Fact Extraction
You can provide examples to improve fact extraction through few-shot learning:
```python
from typing import List, Tuple, Optional

# Define an example provider function
def my_example_provider(response: str, context: Optional[str] = None, **kwargs) -> List[Tuple[str, List[str]]]:
    """Provide domain-specific examples for fact extraction."""
    # Return a list of (example_text, [fact1, fact2, ...]) tuples
    return [
        (
            "The Eiffel Tower was completed in 1889 and stands at 330 meters tall.",
            ["The Eiffel Tower was completed in 1889.", "The Eiffel Tower is 330 meters tall."]
        ),
        # More examples...
    ]

# Use the example provider in the extractor
extractor = FactExtractor(
    config=config,
    llm=llm,
    example_provider=my_example_provider
)

# Extract facts with examples to guide the LLM
facts = await extractor.extract_facts(
    response="The Golden Gate Bridge was completed in 1937.",
    context="Information about famous structures"
)
```

Example providers can dramatically improve extraction quality by demonstrating the desired level of granularity and format.
### Custom Retrieval System
You can implement your own retrieval system:
```python
from typing import List
from saf_eval.core.models import AtomicFact, RetrievedDocument
from saf_eval.retrieval.base import RetrieverBase

class MyCustomRetriever(RetrieverBase):
    async def retrieve(self, fact: AtomicFact, **kwargs) -> List[RetrievedDocument]:
        # Implement your retrieval logic here and collect the supporting documents
        documents: List[RetrievedDocument] = []
        # ...
        return documents
```
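A custom retriever plugs into the same pipeline shown in the Quick Start. The sketch below assumes `MyCustomRetriever` accepts the shared `config` the way `SimpleRetriever` does; adapt the constructor to whatever state your retriever actually needs:

```python
# Reuses config, llm, and the other components from the Quick Start example;
# the MyCustomRetriever constructor shown here is an assumption, not a documented API.
pipeline = EvaluationPipeline(
    config=config,
    extractor=FactExtractor(config=config, llm=llm),
    retriever=MyCustomRetriever(config=config),
    classifier=FactClassifier(config=config, llm=llm),
    scorer=FactualityScorer(config=config),
)
```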
### Custom Evaluation Categories

Customize how facts are classified:
```python
from saf_eval.evaluation.classifier import FactClassifier

classifier = FactClassifier(
    llm=my_llm,
    categories=["accurate", "partially_accurate", "inaccurate", "uncertain"]
)
```
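If you change the categories, the scoring rubric in `Config` presumably needs matching keys so the scorer can map each classification to a score; that contract isn't spelled out in this README, so treat the following as a sketch:

```python
from saf_eval.config import Config

# Assumption: scoring_rubric keys mirror the classifier's custom categories.
config = Config(
    scoring_rubric={
        "accurate": 1.0,
        "partially_accurate": 0.5,
        "uncertain": 0.25,
        "inaccurate": 0.0
    }
)
```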
### Fact Deduplication

The pipeline automatically deduplicates similar facts to avoid redundant evaluations:
```python
from saf_eval.utils.deduplication import deduplicate_facts
from typing import List
from saf_eval.core.models import AtomicFact

# Default deduplication is already included in the pipeline, but can be customized:
def my_custom_deduplication(facts: List[AtomicFact]) -> List[AtomicFact]:
    # Implement your deduplication strategy here and return the reduced list of facts
    return facts
```