Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/confident-ai/deepeval
The LLM Evaluation Framework
- Host: GitHub
- URL: https://github.com/confident-ai/deepeval
- Owner: confident-ai
- License: apache-2.0
- Created: 2023-08-10T05:35:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-09T10:56:30.000Z (2 months ago)
- Last Synced: 2024-11-09T23:45:13.503Z (2 months ago)
- Topics: evaluation-framework, evaluation-metrics, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics
- Language: Python
- Homepage: https://docs.confident-ai.com/
- Size: 58 MB
- Stars: 3,548
- Watchers: 20
- Forks: 280
- Open Issues: 105
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome - confident-ai/deepeval - The LLM Evaluation Framework (Python)
- awesome-ChatGPT-repositories - deepeval - The Evaluation Framework for LLMs (NLP)
- awesome-generative-ai - confident-ai/deepeval
- awesome-chatgpt - confident-ai/deepeval - DeepEval is a Python library that provides an evaluation framework for LLM applications, allowing for unit testing and performance evaluation based on various metrics. (SDK, Libraries, Frameworks / Python)
- awesome-production-machine-learning - DeepEval - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications. (Evaluation and Monitoring)
- StarryDivineSky - confident-ai/deepeval - Evaluates LLM outputs using metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, with LLMs and other NLP models that run on your local machine. DeepEval supports a variety of applications, including RAG, fine-tuning, LangChain, and LlamaIndex. It helps you easily determine the best hyperparameters to improve RAG pipelines, prevent prompt drift, and even transition from OpenAI to confidently hosting your own Llama2. DeepEval offers a range of ready-to-use LLM evaluation metrics, including G-Eval, summarization, answer relevancy, faithfulness, contextual recall, contextual precision, RAGAS, hallucination, and more, and supports custom metrics. It can evaluate entire datasets in parallel and integrates seamlessly with any CI/CD environment. DeepEval also provides benchmarking of any LLM on popular LLM benchmarks, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, and more. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)
- Awesome-LLM-RAG-Application - deepeval
README
The LLM Evaluation Framework
Documentation | Metrics and Features | Getting Started | Integrations | Confident AI
**DeepEval** is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, using LLMs and various other NLP models that run **locally on your machine** for evaluation.
Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.
> Want to talk LLM evaluation? [Come join our discord.](https://discord.com/invite/a3K9c8GRGt)
# 🔥 Metrics and Features
> ‼️ You can now run DeepEval's metrics on the cloud for free directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)'s infrastructure 🥳
- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine**:
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- etc.
- [Red team your LLM application](https://docs.confident-ai.com/docs/red-teaming-introduction) for 40+ safety vulnerabilities in a few lines of code, including:
- Toxicity
- Bias
- SQL Injection
- etc., using 10+ advanced attack enhancement strategies such as prompt injections.
- Evaluate your entire dataset in bulk in under 20 lines of Python code **in parallel**. Do this via the CLI in a Pytest-like manner, or through our `evaluate()` function.
- Create your own custom metrics that are automatically integrated with DeepEval's ecosystem by inheriting DeepEval's base metric class (see the sketch after this list).
- Integrates seamlessly with **ANY** CI/CD environment.
- Easily benchmark **ANY** LLM on popular LLM benchmarks in [under 10 lines of code](https://docs.confident-ai.com/docs/benchmarks-introduction?utm_source=GitHub), including:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- [Automatically integrated with Confident AI](https://app.confident-ai.com?utm_source=GitHub) for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze metrics pass / fails
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
- debug evaluation results via LLM traces
- manage evaluation test cases / datasets in one place
- track events to identify live LLM responses in production
- real-time evaluation in production
- add production events to existing evaluation datasets to strengthen evals over time

(Note that while some metrics are for RAG, others are better for a fine-tuning use case. Make sure to consult our docs to pick the right metric.)
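For example, a custom metric is just a class that inherits from DeepEval's base metric class, computes a score, and reports success against a threshold. Below is a minimal, illustrative sketch assuming the `BaseMetric` interface described in the custom metrics docs (`measure`, `a_measure`, `is_successful`, and a `__name__` property); `ConcisenessMetric`, `max_chars`, and the scoring rule are hypothetical, and the exact abstract methods may vary between versions, so consult the docs before relying on this shape.
```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

# Hypothetical metric: rewards shorter outputs. Purely illustrative;
# real custom metrics can wrap any LLM, statistical method, or NLP model.
class ConcisenessMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5, max_chars: int = 280):
        self.threshold = threshold
        self.max_chars = max_chars

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 when the output fits the character budget, scaled down otherwise.
        self.score = min(1.0, self.max_chars / max(len(test_case.actual_output), 1))
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async variant; here it simply reuses the synchronous logic.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Conciseness"
```
Once defined, such a metric can be passed to `assert_test(...)` or `evaluate(...)` alongside the built-in metrics shown in the QuickStart below.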
# 🔌 Integrations
- 🦄 LlamaIndex, to [**unit test RAG applications in CI/CD**](https://docs.confident-ai.com/docs/integrations-llamaindex?utm_source=GitHub)
- 🤗 Hugging Face, to [**enable real-time evaluations during LLM fine-tuning**](https://docs.confident-ai.com/docs/integrations-huggingface?utm_source=GitHub)
# 🚀 QuickStart
Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.
## Installation
```
pip install -U deepeval
```
## Create an account (highly recommended)
Creating an account on our platform is optional, but it allows you to log test results, making it easy to track changes and performance over iterations. You can run test cases without logging in, but we highly recommend giving it a try.
To log in, run:
```
deepeval login
```
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy [here](https://docs.confident-ai.com/docs/data-privacy?utm_source=GitHub)).
## Writing your first test case
Create a test file:
```bash
touch test_chatbot.py
```
Open `test_chatbot.py` and write your first test case using DeepEval:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_case():
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
assert_test(test_case, [answer_relevancy_metric])
```
Set your `OPENAI_API_KEY` as an environment variable (you can also evaluate using your own custom model; for more details, visit [this part of our docs](https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm?utm_source=GitHub)):
```
export OPENAI_API_KEY="..."
```
And finally, run `test_chatbot.py` in the CLI:
```
deepeval test run test_chatbot.py
```
**Your test should have passed ✅** Let's break down what happened.
- The variable `input` mimics user input, and `actual_output` is a placeholder for your chatbot's intended output based on this query.
- The variable `retrieval_context` contains the relevant information from your knowledge base, and `AnswerRelevancyMetric(threshold=0.5)` is an out-of-the-box metric provided by DeepEval. It helps evaluate the relevancy of your LLM output based on the provided context.
- The metric score ranges from 0 to 1. The `threshold=0.5` value ultimately determines whether your test has passed or not.

[Read our documentation](https://docs.confident-ai.com/docs/getting-started?utm_source=GitHub) for more information on how to use additional metrics, create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.
## Evaluating Without Pytest Integration
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
```

## Using Standalone Metrics
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
# Replace this with the actual output from your LLM application
actual_output="We offer a 30-day full refund at no extra costs.",
retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# Most metrics also offer an explanation
print(answer_relevancy_metric.reason)
```
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
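As another example, here is a minimal sketch of running the RAG-oriented `HallucinationMetric` standalone (it appears again in the dataset example below). Unlike `AnswerRelevancyMetric`, it grades `actual_output` against a ground-truth `context` rather than `retrieval_context`; the threshold and test case values simply mirror the other snippets in this README.
```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# HallucinationMetric compares the output against ground-truth context
# rather than retrieval_context.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

hallucination_metric = HallucinationMetric(threshold=0.3)
hallucination_metric.measure(test_case)
print(hallucination_metric.score)
print(hallucination_metric.reason)
```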
## Evaluating a Dataset / Test Cases in Bulk
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])
@pytest.mark.parametrize(
"test_case",
dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

```bash
# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_.py -n 4
```
Alternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:
```python
from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
```

# Real-time Evaluations on Confident AI
We offer a [web platform](https://app.confident-ai.com?utm_source=Github) for you to:
1. Log and view all the test results / metrics data from DeepEval's test runs.
2. Debug evaluation results via LLM traces.
3. Compare and pick the optimal hyperparameters (prompt templates, models, chunk size, etc.).
4. Create, manage, and centralize your evaluation datasets.
5. Track events in production and augment your evaluation dataset for continuous evaluation.
6. Track events in production, and view evaluation results and historical insights.

Everything on Confident AI, including how to use Confident AI, is available [here](https://docs.confident-ai.com/docs/confident-ai-introduction?utm_source=GitHub).
To begin, login from the CLI:
```bash
deepeval login
```
Follow the instructions to log in, create your account, and paste your API key into the CLI.
Now, run your test file again:
```bash
deepeval test run test_chatbot.py
```
You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!
![ok](https://d2lsxfc3p6r9rv.cloudfront.net/confident-test-cases.png)
# Contributing
Please read [CONTRIBUTING.md](https://github.com/confident-ai/deepeval/blob/main/CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
# Roadmap
Features:
- [x] Implement G-Eval
- [x] Referenceless Evaluation
- [x] Production Evaluation & Logging
- [x] Evaluation Dataset Creation

Integrations:
- [x] LlamaIndex
- [ ] LangChain
- [ ] Guidance
- [ ] Guardrails
- [ ] EmbedChain
# Authors
Built by the founders of Confident AI. Contact [email protected] for all enquiries.
# License
DeepEval is licensed under Apache 2.0 - see the [LICENSE.md](https://github.com/confident-ai/deepeval/blob/main/LICENSE.md) file for details.