Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Open-Source Evaluation for GenAI Application Pipelines
https://github.com/relari-ai/continuous-eval
- Host: GitHub
- URL: https://github.com/relari-ai/continuous-eval
- Owner: relari-ai
- License: apache-2.0
- Created: 2023-12-08T21:30:39.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-22T15:57:38.000Z (8 months ago)
- Last Synced: 2024-05-22T19:16:54.072Z (8 months ago)
- Topics: evaluation-framework, evaluation-metrics, information-retrieval, llm-evaluation, llmops, rag, retrieval-augmented-generation
- Language: Python
- Homepage: https://docs.relari.ai/
- Size: 2.44 MB
- Stars: 330
- Watchers: 4
- Forks: 16
- Open Issues: 11
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- StarryDivineSky - relari-ai/continuous-eval - continuous-eval is an open-source project that provides data-driven evaluation for applications powered by large language models (LLMs). It aims to improve the performance and reliability of LLM applications through continuous monitoring and evaluation. At its core, it uses real user data to build evaluation datasets and automatically assess the quality of LLM outputs. It supports a variety of evaluation metrics that can be customized for different application scenarios. It works by collecting user interaction data, converting it into evaluation data, then running evaluations and providing feedback. It offers a flexible framework that can be integrated into existing LLM application development workflows. The goal of continuous-eval is to help developers better understand how their LLM applications perform and improve them based on evaluation results. The project also provides examples and documentation so users can get started quickly. In short, it is a powerful tool for continuously evaluating and improving the performance of LLM applications. (A01_Text Generation_Dialogue / Large language dialogue models and data)
- awesome-production-machine-learning - relari-ai/continuous-eval - continuous-eval is a framework for data-driven evaluation of LLM-powered applications. (Evaluation and Monitoring)
README
[![PyPI - Python versions](https://img.shields.io/pypi/pyversions/continuous-eval.svg)](https://pypi.python.org/pypi/continuous-eval/)
[![GitHub release](https://img.shields.io/github/release/relari-ai/continuous-eval)](https://GitHub.com/relari-ai/continuous-eval/releases)
[![Open Source](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)](https://github.com/Naereen/badges/)
[![PyPI - License](https://img.shields.io/pypi/l/continuous-eval.svg)](https://pypi.python.org/pypi/continuous-eval/)
Data-Driven Evaluation for LLM-Powered Applications
## Overview
`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered applications.
## How is continuous-eval different?
- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.
- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.
- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.
- **Synthetic Dataset Generation**: Generate large-scale synthetic dataset to test your pipeline.
## Getting Started
This code is provided as a PyPI package. To install it, run the following command:
```bash
python3 -m pip install continuous-eval
```
If you want to install from source:
```bash
git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras
```
To run LLM-based metrics, the code requires at least one of the LLM API keys in `.env`. Take a look at the example env file `.env.example`.
## Run a single metric
Here's how you run a single metric on a datum.
Check all available metrics here: [link](https://continuous-eval.docs.relari.ai/)

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

metric = PrecisionRecallF1()
print(metric(**datum))
```

### Available Metrics
| Module | Category | Metrics |
|--------|----------|---------|
| Retrieval | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics, TokenCount |
| | LLM-based | LLMBasedContextPrecision, LLMBasedContextCoverage |
| Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability |
| | Semantic | DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity |
| | LLM-based | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency |
| Classification | Deterministic | ClassificationAccuracy |
| Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity |
| | LLM-based | LLMBasedCodeGeneration |
| Agent Tools | Deterministic | ToolSelectionAccuracy |
| Custom | | Define your own metrics |
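For intuition about the deterministic retrieval metrics listed above, context-level precision/recall/F1 can be sketched in plain Python. This is a simplified illustration that assumes exact-match chunks; the library's actual matching logic may be more lenient:

```python
from typing import Dict, List


def precision_recall_f1(retrieved: List[str], ground_truth: List[str]) -> Dict[str, float]:
    # A retrieved chunk counts as relevant if it appears in the ground-truth contexts.
    relevant = [chunk for chunk in retrieved if chunk in ground_truth]
    precision = len(relevant) / len(retrieved) if retrieved else 0.0
    recall = len(relevant) / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"context_precision": precision, "context_recall": recall, "context_f1": f1}


scores = precision_recall_f1(
    retrieved=[
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    ground_truth=["Paris is the capital of France and its largest city."],
)
print(scores)  # precision 0.5, recall 1.0, f1 ≈ 0.67
```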
To define your own metrics, you only need to extend the [Metric](continuous_eval/metrics/base.py#L23C7-L23C13) class and implement the `__call__` method.
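As an illustration, here is a hypothetical word-count metric written as a bare class; in practice you would subclass `Metric` from `continuous_eval/metrics/base.py`, but the shape is the same (`__call__` per sample, plus an optional `aggregate`):

```python
from statistics import mean
from typing import Dict, List


class AnswerNumWords:  # hypothetical; in practice: class AnswerNumWords(Metric)
    def __call__(self, answer: str, **kwargs) -> Dict[str, int]:
        # Compute the metric for a single sample.
        return {"answer_num_words": len(answer.split())}

    def aggregate(self, results: List[Dict[str, int]]) -> Dict[str, float]:
        # Optionally combine per-sample results into a summary.
        return {"mean_answer_num_words": mean(r["answer_num_words"] for r in results)}


metric = AnswerNumWords()
per_sample = [metric(answer=a) for a in ["Paris", "Paris is the capital of France"]]
summary = metric.aggregate(per_sample)
print(summary)  # {'mean_answer_num_words': 3.5}
```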
Optional methods are `batch` (if it is possible to implement optimizations for batch processing) and `aggregate` (to aggregate metric results over multiple samples).
## Run evaluation on a pipeline
Define modules in your pipeline and select corresponding metrics.
```python
from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset, EvaluationRunner
from continuous_eval.eval.logger import PipelineLogger
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import (
    DeterministicAnswerCorrectness,
    FleschKincaidReadability,
)
from typing import List, Dict

dataset = Dataset("dataset_folder")

# Simple 3-step RAG pipeline with Retriever->Reranker->Generation
retriever = Module(
    name="Retriever",
    input=dataset.question,
    output=List[str],
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

reranker = Module(
    name="reranker",
    input=retriever,
    output=List[Dict[str, str]],
    eval=[
        RankedRetrievalMetrics().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
    ],
)

pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
print(pipeline.graph_repr())  # optional: visualize the pipeline
```

Now you can run the evaluation on your pipeline:
```python
pipelog = PipelineLogger(pipeline=pipeline)

# Now run your LLM application pipeline and, for each module, log the results:
pipelog.log(uid=sample_uid, module="module_name", value=data)

# Once you have finished logging the data, use the EvaluationRunner to evaluate the logs:
evalrunner = EvaluationRunner(pipeline)
metrics = evalrunner.evaluate(pipelog)
metrics.results() # returns a dictionary with the results
```

To run evaluation over an existing dataset (BYODataset), you can run the following:
```python
dataset = Dataset(...)
evalrunner = EvaluationRunner(pipeline)
metrics = evalrunner.evaluate(dataset)
```

## Synthetic Data Generation
Ground truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.
We have created a synthetic data pipeline that can generate custom user interaction data for a variety of use cases, such as RAG, agents, and copilots. These datasets can serve as a starting point for a golden evaluation dataset or for other training purposes. To generate custom synthetic data, please visit [Relari](https://www.relari.ai/) to create a free account; you can then generate custom synthetic golden datasets through the Relari Cloud.
## 💡 Contributing
Interested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.
## Resources
- **Docs:** [link](https://continuous-eval.docs.relari.ai/)
- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)
- **Blog Posts:**
- Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
- How important is a Golden Dataset for LLM evaluation?
[(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)
- How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)
- How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)
- Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)
- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)
- **Reach out to founders:** [Email](mailto:[email protected]) or [Schedule a chat](https://cal.com/relari/intro)

## License
This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
## Open Analytics
We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.
You can see exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py). To disable usage tracking, set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.
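Since the flag is read from the environment, one way to opt out is to set it before the package is imported; a minimal sketch (the flag name comes from the paragraph above, and the assumption is that telemetry settings are read at import time):

```python
import os

# Opt out of anonymous usage tracking; set this before importing continuous_eval.
os.environ["CONTINUOUS_EVAL_DO_NOT_TRACK"] = "true"
```

Equivalently, run `export CONTINUOUS_EVAL_DO_NOT_TRACK=true` in your shell before launching your application.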