{"id":13456085,"url":"https://github.com/confident-ai/deepeval","last_synced_at":"2025-05-13T15:03:24.380Z","repository":{"id":188480660,"uuid":"676829188","full_name":"confident-ai/deepeval","owner":"confident-ai","description":"The LLM Evaluation Framework","archived":false,"fork":false,"pushed_at":"2025-05-12T23:50:27.000Z","size":86930,"stargazers_count":6273,"open_issues_count":172,"forks_count":547,"subscribers_count":29,"default_branch":"main","last_synced_at":"2025-05-13T00:31:14.872Z","etag":null,"topics":["evaluation-framework","evaluation-metrics","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics"],"latest_commit_sha":null,"homepage":"https://deepeval.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/confident-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-08-10T05:35:04.000Z","updated_at":"2025-05-13T00:17:37.000Z","dependencies_parsed_at":"2023-09-26T04:34:17.419Z","dependency_job_id":"3dce8c01-7782-4b07-9330-f2a70e93cb70","html_url":"https://github.com/confident-ai/deepeval","commit_stats":null,"previous_names":["mr-gpt/deepeval","confident-ai/deepeval","mr-gpt/evals"],"tags_count":243,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/confident-ai%2Fdeepeval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/confident-ai%2Fdeepeval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/confident-ai%2Fdeepeva
l/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/confident-ai%2Fdeepeval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/confident-ai","download_url":"https://codeload.github.com/confident-ai/deepeval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253968338,"owners_count":21992253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation-framework","evaluation-metrics","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics"],"created_at":"2024-07-31T08:01:15.956Z","updated_at":"2025-05-13T15:03:24.373Z","avatar_url":"https://github.com/confident-ai.png","language":"Python","funding_links":[],"categories":["🤖 LLM \u0026 Chatbot Testing","🛠️ Popular Open-Source Libraries for LLM Development","📊 Evaluation \u0026 Benchmarking","📊 エージェント評価とオブザーバビリティ","评估 Evaluation","Open-source LLM-backed app evaluation products","Python","A01_文本生成_文本对话","NLP","Production ([home](#awesome-llm))","Large Language Models (LLMs)","SDK, Libraries, Frameworks","9. Evaluation, Benchmarks \u0026 Datasets","🏗️ Reference Implementations \u0026 Case Studies","python","Evaluation and Monitoring","others","开源工具","Evaluation \u0026 Quality Control","Evaluation \u0026 Observability","Testing \u0026 Evaluation","HarmonyOS","Evals \u0026 Verification","LLM and Agent Observability","Tools","\u003ca id=\"tools\"\u003e\u003c/a\u003e🛠️ Tools","5. 
数据集","LLM-as-Judge Evaluation","Repos","Testing Frameworks","Tools \u0026 Platforms","Frameworks","Tools \u0026 Services","Evaluation Frameworks","Librerías para usar NLP en español","3）参考实现与开源工具（GitHub）","LLM Evaluation Framework","🤖 AI \u0026 Machine Learning","3. Prompt Optimization","Observability \u0026 Monitoring","Tools and Code","Agent Observability and Testing","🧠 AI Applications \u0026 Platforms","Research Feeds, Benchmarks, and Model/Data Hubs","Advanced Techniques","📋 List of Open-Source Projects","Evaluation Metrics and Benchmarks","AI \u0026 LLM","Supporting Infrastructure","Catalog","Orchestration","🛠️ Hands-on Projects and Examples","🧱 Infrastructure and Building Blocks","Tools and Platforms","*Ops for AI"],"sub_categories":["3. The Enterprise / High-Scale Stack (The 1%)","自動運転","大语言对话模型及数据","Courses","LLM Evaluation","Python","T13 · Evaluation","评测框架","Windows Manager","Adjacent Collections","Typical Emotional Quotient (EQ)-Alignment Ability evaluation benchmarks","Model Evaluation","5.1 评测基准","LLM Evaluations and Benchmarks","Language-Specific Tools","Open Source Frameworks","Eval \u0026 Observability","Individual Episodes","Vector Store Tutorials","Herramientas de observabilidad","沙箱、可观测与评测","Evaluators and Test Harnesses","Rust","Resources","LLM Evaluation Tools","Benchmark Reality Check (real-world tool use)","Tools","Models, Datasets, and Evaluation","Evaluation","Evals and Benchmarks","Comparison Guides","LLM Apps \u0026 Interfaces","Evaluation Harnesses \u0026 Benchmarks","Application Framework","🛠️ Key Frameworks \u0026 Code Samples","🔍 Evaluation Frameworks and Judge Models","Scanners","LLMOps"],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/confident-ai/deepeval/blob/main/docs/static/img/deepeval.png\" alt=\"DeepEval Logo\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ch1 align=\"center\"\u003eThe LLM Evaluation Framework\u003c/h1\u003e\n\u003c/p\u003e\n\n\u003cp 
align=\"center\"\u003e\n    \u003ca href=\"https://discord.com/invite/a3K9c8GRGt\"\u003e\n        \u003cimg alt=\"discord-invite\" src=\"https://dcbadge.vercel.app/api/server/a3K9c8GRGt?style=flat\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch4 align=\"center\"\u003e\n    \u003cp\u003e\n        \u003ca href=\"https://deepeval.com/docs/getting-started?utm_source=GitHub\"\u003eDocumentation\u003c/a\u003e |\n        \u003ca href=\"#-metrics-and-features\"\u003eMetrics and Features\u003c/a\u003e |\n        \u003ca href=\"#-quickstart\"\u003eGetting Started\u003c/a\u003e |\n        \u003ca href=\"#-integrations\"\u003eIntegrations\u003c/a\u003e |\n        \u003ca href=\"https://confident-ai.com?utm_source=GitHub\"\u003eDeepEval Platform\u003c/a\u003e\n    \u003cp\u003e\n\u003c/h4\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/confident-ai/deepeval/releases\"\u003e\n        \u003cimg alt=\"GitHub release\" src=\"https://img.shields.io/github/release/confident-ai/deepeval.svg?color=violet\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://colab.research.google.com/drive/1PPxYEBa6eu__LquGoFFJZkhYgWVYE6kh?usp=sharing\"\u003e\n        \u003cimg alt=\"Try Quickstart in Colab\" src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/confident-ai/deepeval/blob/master/LICENSE.md\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/confident-ai/deepeval.svg?color=yellow\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n**DeepEval** is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. 
DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which use LLMs and various other NLP models that run **locally on your machine** for evaluation.\n\nWhether your LLM applications are RAG pipelines, chatbots, or AI agents, implemented via LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipeline and agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own DeepSeek R1 with confidence.\n\n\u003e [!IMPORTANT]\n\u003e Need a place for your DeepEval testing data to live 🏡❤️? [Sign up to the DeepEval platform](https://confident-ai.com?utm_source=GitHub) to compare iterations of your LLM app, generate \u0026 share testing reports, and more.\n\u003e\n\u003e ![Demo GIF](assets/demo.gif)\n\n\u003e Want to talk LLM evaluation, need help picking metrics, or just to say hi? 
[Come join our discord.](https://discord.com/invite/a3K9c8GRGt)\n\n\u003cbr /\u003e\n\n# 🔥 Metrics and Features\n\n\u003e 🥳 You can now share DeepEval's test results on the cloud directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)'s infrastructure\n\n- Supports both end-to-end and component-level LLM evaluation.\n- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine**:\n  - G-Eval\n  - DAG ([deep acyclic graph](https://deepeval.com/docs/metrics-dag))\n  - **RAG metrics:**\n    - Answer Relevancy\n    - Faithfulness\n    - Contextual Recall\n    - Contextual Precision\n    - Contextual Relevancy\n    - RAGAS\n  - **Agentic metrics:**\n    - Task Completion\n    - Tool Correctness\n  - **Others:**\n    - Hallucination\n    - Summarization\n    - Bias\n    - Toxicity\n  - **Conversational metrics:**\n    - Knowledge Retention\n    - Conversation Completeness\n    - Conversation Relevancy\n    - Role Adherence\n  - etc.\n- Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.\n- Generate synthetic datasets for evaluation.\n- Integrates seamlessly with **ANY** CI/CD environment.\n- [Red team your LLM application](https://deepeval.com/docs/red-teaming-introduction) for 40+ safety vulnerabilities in a few lines of code, including:\n  - Toxicity\n  - Bias\n  - SQL Injection\n  - etc., using 10+ advanced attack enhancement strategies such as prompt injections.\n- Easily benchmark **ANY** LLM on popular LLM benchmarks in [under 10 lines of code](https://deepeval.com/docs/benchmarks-introduction?utm_source=GitHub), including:\n  - MMLU\n  - HellaSwag\n  - DROP\n  - BIG-Bench Hard\n  - TruthfulQA\n  - HumanEval\n  - GSM8K\n- [100% integrated with Confident AI](https://confident-ai.com?utm_source=GitHub) for the full evaluation lifecycle:\n  - Curate/annotate evaluation datasets on 
the cloud\n  - Benchmark your LLM app using a dataset, and compare with previous iterations to determine which models/prompts work best\n  - Fine-tune metrics for custom results\n  - Debug evaluation results via LLM traces\n  - Monitor \u0026 evaluate LLM responses in production to improve datasets with real-world data\n  - Repeat until perfection\n\n\u003e [!NOTE]\n\u003e Confident AI is the DeepEval platform. Create an account [here.](https://app.confident-ai.com?utm_source=GitHub)\n\n\u003cbr /\u003e\n\n# 🔌 Integrations\n\n- 🦄 LlamaIndex, to [**unit test RAG applications in CI/CD**](https://www.deepeval.com/integrations/frameworks/llamaindex?utm_source=GitHub)\n- 🤗 Hugging Face, to [**enable real-time evaluations during LLM fine-tuning**](https://www.deepeval.com/integrations/frameworks/huggingface?utm_source=GitHub)\n\n\u003cbr /\u003e\n\n# 🚀 QuickStart\n\nLet's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.\n\n## Installation\n\n```\npip install -U deepeval\n```\n\n## Create an account (highly recommended)\n\nUsing the `deepeval` platform will allow you to generate sharable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.\n\nTo log in, run:\n\n```\ndeepeval login\n```\n\nFollow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. 
All test cases will automatically be logged (find more information on data privacy [here](https://deepeval.com/docs/data-privacy?utm_source=GitHub)).\n\n## Writing your first test case\n\nCreate a test file:\n\n```bash\ntouch test_chatbot.py\n```\n\nOpen `test_chatbot.py` and write your first test case to run an **end-to-end** evaluation using DeepEval, which treats your LLM app as a black-box:\n\n```python\nimport pytest\nfrom deepeval import assert_test\nfrom deepeval.metrics import GEval\nfrom deepeval.test_case import LLMTestCase, LLMTestCaseParams\n\ndef test_case():\n    correctness_metric = GEval(\n        name=\"Correctness\",\n        criteria=\"Determine if the 'actual output' is correct based on the 'expected output'.\",\n        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],\n        threshold=0.5\n    )\n    test_case = LLMTestCase(\n        input=\"What if these shoes don't fit?\",\n        # Replace this with the actual output from your LLM application\n        actual_output=\"You have 30 days to get a full refund at no extra cost.\",\n        expected_output=\"We offer a 30-day full refund at no extra costs.\",\n        retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n    )\n    assert_test(test_case, [correctness_metric])\n```\n\nSet your `OPENAI_API_KEY` as an environment variable (you can also evaluate using your own custom model, for more details visit [this part of our docs](https://deepeval.com/docs/metrics-introduction#using-a-custom-llm?utm_source=GitHub)):\n\n```\nexport OPENAI_API_KEY=\"...\"\n```\n\nAnd finally, run `test_chatbot.py` in the CLI:\n\n```\ndeepeval test run test_chatbot.py\n```\n\n**Congratulations! 
Your test case should have passed ✅** Let's break down what happened.\n\n- The variable `input` mimics a user input, and `actual_output` is a placeholder for what your application is supposed to output based on this input.\n- The variable `expected_output` represents the ideal answer for a given `input`, and [`GEval`](https://deepeval.com/docs/metrics-llm-evals) is a research-backed metric provided by `deepeval` for you to evaluate your LLM outputs on any custom criteria with human-like accuracy.\n- In this example, the metric `criteria` is the correctness of the `actual_output` based on the provided `expected_output`.\n- All metric scores range from 0 to 1, and the `threshold=0.5` value ultimately determines whether your test has passed.\n\n[Read our documentation](https://deepeval.com/docs/getting-started?utm_source=GitHub) for more information on other ways to run end-to-end evaluations, how to use additional metrics and create your own custom metrics, and tutorials on how to integrate with other tools like LangChain and LlamaIndex.\n\n\u003cbr /\u003e\n\n## Evaluating Nested Components\n\nIf you wish to evaluate individual components within your LLM app, you need to run **component-level** evals - a powerful way to evaluate any component within an LLM system.\n\nSimply trace \"components\" such as LLM calls, retrievers, tool calls, and agents within your LLM application using the `@observe` decorator to apply metrics at the component level. 
Tracing with `deepeval` is non-intrusive (learn more [here](https://deepeval.com/docs/evaluation-llm-tracing#dont-be-worried-about-tracing)) and helps you avoid rewriting your codebase just for evals:\n\n```python\nfrom deepeval.tracing import observe, update_current_span\nfrom deepeval.test_case import LLMTestCase, LLMTestCaseParams\nfrom deepeval.dataset import Golden\nfrom deepeval.metrics import GEval\nfrom deepeval import evaluate\n\ncorrectness = GEval(name=\"Correctness\", criteria=\"Determine if the 'actual output' is correct based on the 'expected output'.\", evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT])\n\n@observe(metrics=[correctness])\ndef inner_component():\n    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.\n    update_current_span(test_case=LLMTestCase(input=\"...\", actual_output=\"...\"))\n    return\n\n@observe\ndef llm_app(input: str):\n    inner_component()\n    return\n\nevaluate(observed_callback=llm_app, goldens=[Golden(input=\"Hi!\")])\n```\n\nYou can learn everything about component-level evaluations [here.](https://www.deepeval.com/docs/evaluation-component-level-llm-evals)\n\n\u003cbr /\u003e\n\n## Evaluating Without Pytest Integration\n\nAlternatively, you can evaluate without Pytest, which is more suited for a notebook environment.\n\n```python\nfrom deepeval import evaluate\nfrom deepeval.metrics import AnswerRelevancyMetric\nfrom deepeval.test_case import LLMTestCase\n\nanswer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)\ntest_case = LLMTestCase(\n    input=\"What if these shoes don't fit?\",\n    # Replace this with the actual output from your LLM application\n    actual_output=\"We offer a 30-day full refund at no extra costs.\",\n    retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n)\nevaluate([test_case], [answer_relevancy_metric])\n```\n\n## Using Standalone Metrics\n\nDeepEval is extremely modular, making it easy 
for anyone to use any of our metrics. Continuing from the previous example:\n\n```python\nfrom deepeval.metrics import AnswerRelevancyMetric\nfrom deepeval.test_case import LLMTestCase\n\nanswer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)\ntest_case = LLMTestCase(\n    input=\"What if these shoes don't fit?\",\n    # Replace this with the actual output from your LLM application\n    actual_output=\"We offer a 30-day full refund at no extra costs.\",\n    retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n)\n\nanswer_relevancy_metric.measure(test_case)\nprint(answer_relevancy_metric.score)\n# All metrics also offer an explanation\nprint(answer_relevancy_metric.reason)\n```\n\nNote that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.\n\n## Evaluating a Dataset / Test Cases in Bulk\n\nIn DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:\n\n```python\nimport pytest\nfrom deepeval import assert_test\nfrom deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric\nfrom deepeval.test_case import LLMTestCase\nfrom deepeval.dataset import EvaluationDataset\n\nfirst_test_case = LLMTestCase(input=\"...\", actual_output=\"...\", context=[\"...\"])\nsecond_test_case = LLMTestCase(input=\"...\", actual_output=\"...\", context=[\"...\"])\n\ndataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])\n\n@pytest.mark.parametrize(\n    \"test_case\",\n    dataset,\n)\ndef test_customer_chatbot(test_case: LLMTestCase):\n    hallucination_metric = HallucinationMetric(threshold=0.3)\n    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)\n    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])\n```\n\n```bash\n# Run this in the CLI, you can also add an optional -n flag to run tests in parallel\ndeepeval test run 
test_\u003cfilename\u003e.py -n 4\n```\n\n\u003cbr/\u003e\n\nAlternatively, although we recommend using `deepeval test run`, you can evaluate a dataset/test cases without using our Pytest integration:\n\n```python\nfrom deepeval import evaluate\n...\n\nevaluate(dataset, [answer_relevancy_metric])\n# or\ndataset.evaluate([answer_relevancy_metric])\n```\n\n# LLM Evaluation With Confident AI\n\nThe correct LLM evaluation lifecycle is only achievable with [the DeepEval platform](https://confident-ai.com?utm_source=Github). It allows you to:\n\n1. Curate/annotate evaluation datasets on the cloud\n2. Benchmark your LLM app using a dataset, and compare with previous iterations to determine which models/prompts work best\n3. Fine-tune metrics for custom results\n4. Debug evaluation results via LLM traces\n5. Monitor \u0026 evaluate LLM responses in production to improve datasets with real-world data\n6. Repeat until perfection\n\nEverything on Confident AI, including how to use it, is available [here](https://documentation.confident-ai.com?utm_source=GitHub).\n\nTo begin, log in from the CLI:\n\n```bash\ndeepeval login\n```\n\nFollow the instructions to log in, create your account, and paste your API key into the CLI.\n\nNow, run your test file again:\n\n```bash\ndeepeval test run test_chatbot.py\n```\n\nYou should see a link displayed in the CLI once the test has finished running. 
Paste it into your browser to view the results!\n\n![Demo GIF](assets/demo.gif)\n\n\u003cbr /\u003e\n\n# Contributing\n\nPlease read [CONTRIBUTING.md](https://github.com/confident-ai/deepeval/blob/main/CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.\n\n\u003cbr /\u003e\n\n# Roadmap\n\nFeatures:\n\n- [x] Integration with Confident AI\n- [x] Implement G-Eval\n- [x] Implement RAG metrics\n- [x] Implement Conversational metrics\n- [x] Evaluation Dataset Creation\n- [x] Red-Teaming\n- [ ] DAG custom metrics\n- [ ] Guardrails\n\n\u003cbr /\u003e\n\n# Authors\n\nBuilt by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.\n\n\u003cbr /\u003e\n\n# License\n\nDeepEval is licensed under Apache 2.0 - see the [LICENSE.md](https://github.com/confident-ai/deepeval/blob/main/LICENSE.md) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconfident-ai%2Fdeepeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconfident-ai%2Fdeepeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconfident-ai%2Fdeepeval/lists"}