Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/confident-ai/deepeval
The LLM Evaluation Framework
- Host: GitHub
- URL: https://github.com/confident-ai/deepeval
- Owner: confident-ai
- License: apache-2.0
- Created: 2023-08-10T05:35:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-09T10:56:30.000Z (2 months ago)
- Last Synced: 2024-11-09T23:45:13.503Z (2 months ago)
- Topics: evaluation-framework, evaluation-metrics, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics
- Language: Python
- Homepage: https://docs.confident-ai.com/
- Size: 58 MB
- Stars: 3,548
- Watchers: 20
- Forks: 280
- Open Issues: 105
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome - confident-ai/deepeval - The LLM Evaluation Framework (Python)
- awesome-ChatGPT-repositories - deepeval - The Evaluation Framework for LLMs (NLP)
- awesome-generative-ai - confident-ai/deepeval
- awesome-chatgpt - confident-ai/deepeval - DeepEval is a Python library that provides an evaluation framework for LLM applications, allowing for unit testing and performance evaluation based on various metrics. (SDK, Libraries, Frameworks / Python)
- awesome-production-machine-learning - DeepEval - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications. (Evaluation and Monitoring)
- StarryDivineSky - confident-ai/deepeval - Evaluates LLM outputs using metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, with LLMs and other NLP models that run on your local machine. DeepEval supports a variety of applications, including RAG, fine-tuning, LangChain, and LlamaIndex. It helps you easily determine the best hyperparameters to improve RAG pipelines, prevent prompt drift, and even transition from OpenAI to confidently hosting your own Llama2. DeepEval offers a range of ready-to-use LLM evaluation metrics, including G-Eval, summarization, answer relevancy, faithfulness, contextual recall, contextual precision, RAGAS, hallucination, and more, and supports custom metrics. It can evaluate entire datasets in parallel and integrates seamlessly with any CI/CD environment. DeepEval also provides benchmarking of any LLM on popular LLM benchmarks, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, and more. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)
- Awesome-LLM-RAG-Application - deepeval
README
The LLM Evaluation Framework
Documentation | Metrics and Features | Getting Started | Integrations | Confident AI
**DeepEval** is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, using LLMs and various other NLP models that run **locally on your machine** for evaluation.
Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.
> Want to talk LLM evaluation? [Come join our discord.](https://discord.com/invite/a3K9c8GRGt)
# 🔥 Metrics and Features
> ‼️ You can now run DeepEval's metrics on the cloud for free directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)'s infrastructure 🥳
- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine**:
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- etc.
- [Red team your LLM application](https://docs.confident-ai.com/docs/red-teaming-introduction) for 40+ safety vulnerabilities in a few lines of code, including:
- Toxicity
- Bias
- SQL Injection
- etc., using 10+ advanced attack enhancement strategies such as prompt injections.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch after this list).
- Integrates seamlessly with **ANY** CI/CD environment.
- Easily benchmark **ANY** LLM on popular LLM benchmarks in [under 10 lines of code](https://docs.confident-ai.com/docs/benchmarks-introduction?utm_source=GitHub), including:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- [Automatically integrated with Confident AI](https://app.confident-ai.com?utm_source=GitHub) for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze metrics pass / fails
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
- debug evaluation results via LLM traces
- manage evaluation test cases / datasets in one place
- track events to identify live LLM responses in production
- real-time evaluation in production
- add production events to existing evaluation datasets to strengthen evals over time

(Note that while some metrics are for RAG, others are better for a fine-tuning use case. Make sure to consult our docs to pick the right metric.)
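For example, a custom metric is just a class that inherits from DeepEval's base metric class, computes a score, and reports success against a threshold. Below is a minimal, illustrative sketch assuming the `BaseMetric` interface described in the custom metrics docs (`measure`, `a_measure`, `is_successful`, and a `__name__` property); `ConcisenessMetric`, `max_chars`, and the scoring rule are hypothetical, and the exact abstract methods may vary between versions, so consult the docs before relying on this shape.
```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

# Hypothetical metric: rewards shorter outputs. Purely illustrative;
# real custom metrics can wrap any LLM, statistical method, or NLP model.
class ConcisenessMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5, max_chars: int = 280):
        self.threshold = threshold
        self.max_chars = max_chars

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 when the output fits the character budget, scaled down otherwise.
        self.score = min(1.0, self.max_chars / max(len(test_case.actual_output), 1))
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async variant; here it simply reuses the synchronous logic.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Conciseness"
```
Once defined, such a metric can be passed to `assert_test(...)` or `evaluate(...)` alongside the built-in metrics shown in the QuickStart below.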
# 🔌 Integrations
- 🦄 LlamaIndex, to [**unit test RAG applications in CI/CD**](https://docs.confident-ai.com/docs/integrations-llamaindex?utm_source=GitHub)
- 🤗 Hugging Face, to [**enable real-time evaluations during LLM fine-tuning**](https://docs.confident-ai.com/docs/integrations-huggingface?utm_source=GitHub)
# 🚀 QuickStart
Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.
## Installation
```
pip install -U deepeval
```
## Create an account (highly recommended)
Creating an account on our platform is optional, but it allows you to log test results, making it easy to track changes and performance over iterations. You can run test cases without logging in, but we highly recommend giving it a try.
To log in, run:
```
deepeval login
```
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy [here](https://docs.confident-ai.com/docs/data-privacy?utm_source=GitHub)).
## Writing your first test case
Create a test file:
```bash
touch test_chatbot.py
```
Open `test_chatbot.py` and write your first test case using DeepEval:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
assert_test(test_case, [answer_relevancy_metric])
```
Set your `OPENAI_API_KEY` as an environment variable (you can also evaluate using your own custom model; for more details, visit [this part of our docs](https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm?utm_source=GitHub)):
```
export OPENAI_API_KEY="..."
```
And finally, run `test_chatbot.py` in the CLI:
```
deepeval test run test_chatbot.py
```
**Your test should have passed ✅** Let's break down what happened.
- The variable `input` mimics user input, and `actual_output` is a placeholder for your chatbot's intended output based on this query.
- The variable `retrieval_context` contains the relevant information from your knowledge base, and `AnswerRelevancyMetric(threshold=0.5)` is an out-of-the-box metric provided by DeepEval. It helps evaluate the relevancy of your LLM output based on the provided context.
- The metric score ranges from 0 to 1. The `threshold=0.5` value ultimately determines whether your test has passed or not.

[Read our documentation](https://docs.confident-ai.com/docs/getting-started?utm_source=GitHub) for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.
## Evaluating Without Pytest Integration
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
```

## Using Standalone Metrics
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# Most metrics also offer an explanation
print(answer_relevancy_metric.reason)
```
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
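As another example, here is a minimal sketch of running the RAG-oriented `HallucinationMetric` standalone (it appears again in the dataset example below). Unlike `AnswerRelevancyMetric`, it grades `actual_output` against a ground-truth `context` rather than `retrieval_context`; the threshold and test case values simply mirror the other snippets in this README.
```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# HallucinationMetric compares the output against ground-truth context
# rather than retrieval_context.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

hallucination_metric = HallucinationMetric(threshold=0.3)
hallucination_metric.measure(test_case)
print(hallucination_metric.score)
print(hallucination_metric.reason)
```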
## Evaluating a Dataset / Test Cases in Bulk
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

```bash
# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_.py -n 4
```
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:
```python
from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
```

# Real-time Evaluations on Confident AI
We offer a [web platform](https://app.confident-ai.com?utm_source=Github) for you to:
1. Log and view all the test results / metrics data from DeepEval's test runs.
2. Debug evaluation results via LLM traces.
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation.
6. Track events in production, and view evaluation results and historical insights.

Everything on Confident AI, including how to use Confident AI, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction?utm_source=GitHub).
To begin, login from the CLI:
```bash
deepeval login
```
Follow the instructions to log in, create your account, and paste your API key into the CLI.
Now, run your test file again:
```bash
deepeval test run test_chatbot.py
```
You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!
![ok](https://d2lsxfc3p6r9rv.cloudfront.net/confident-test-cases.png)
# Contributing
Please read [CONTRIBUTING.md](https://github.com/confident-ai/deepeval/blob/main/CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
# Roadmap
Features:
- [x] Implement G-Eval
- [x] Referenceless Evaluation
- [x] Production Evaluation & Logging
- [x] Evaluation Dataset Creation

Integrations:
- [x] LlamaIndex
- [ ] LangChain
- [ ] Guidance
- [ ] Guardrails
- [ ] EmbedChain
# Authors
Built by the founders of Confident AI. Contact [email protected] for all enquiries.
# License
DeepEval is licensed under Apache 2.0 - see the [LICENSE.md](https://github.com/confident-ai/deepeval/blob/main/LICENSE.md) file for details.