{"id":13576676,"url":"https://github.com/relari-ai/continuous-eval","last_synced_at":"2025-04-05T08:32:49.026Z","repository":{"id":211497313,"uuid":"729307404","full_name":"relari-ai/continuous-eval","owner":"relari-ai","description":"Data-Driven Evaluation for LLM-Powered Applications","archived":false,"fork":false,"pushed_at":"2024-09-02T00:22:08.000Z","size":1784,"stargazers_count":446,"open_issues_count":13,"forks_count":29,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-11T09:22:24.330Z","etag":null,"topics":["evaluation-framework","evaluation-metrics","information-retrieval","llm-evaluation","llmops","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"https://continuous-eval.docs.relari.ai/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/relari-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-08T21:30:39.000Z","updated_at":"2024-11-03T05:42:41.000Z","dependencies_parsed_at":"2023-12-29T16:39:57.286Z","dependency_job_id":"6d15cf5a-fd19-4ef1-8910-96b26e4b49d2","html_url":"https://github.com/relari-ai/continuous-eval","commit_stats":null,"previous_names":["relari-ai/continuous-eval"],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/relari-ai%2Fcontinuous-eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/relari-ai%2Fcontinuous-eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/relari-ai%2Fcontinuous-eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/relari-ai%2Fcontinuous-eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/relari-ai","download_url":"https://codeload.github.com/relari-ai/continuous-eval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247311914,"owners_count":20918340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation-framework","evaluation-metrics","information-retrieval","llm-evaluation","llmops","rag","retrieval-augmented-generation"],"created_at":"2024-08-01T15:01:12.737Z","updated_at":"2025-04-05T08:32:49.011Z","avatar_url":"https://github.com/relari-ai.png","language":"Python","readme":"\u003ch3 align=\"center\"\u003e\n  \u003cimg\n    src=\"docs/public/continuous-eval-logo.png\"\n    width=\"350\"\n  \u003e\n\u003c/h3\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n  \n  \u003ca href=\"https://docs.relari.ai/\" target=\"_blank\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-view-blue\" alt=\"Documentation\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.python.org/pypi/continuous-eval\"\u003e![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/pyversions/continuous-eval.svg)\u003c/a\u003e\n  \u003ca href=\"https://github.com/relari-ai/continuous-eval/releases\"\u003e![https://GitHub.com/relari-ai/continuous-eval/releases](https://img.shields.io/github/release/relari-ai/continuous-eval)\u003c/a\u003e\n  \u003ca 
href=\"https://pypi.python.org/pypi/continuous-eval/\"\u003e![https://github.com/Naereen/badges/](https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github)\u003c/a\u003e\n  \u003ca href=\"https://github.com/relari-ai/continuous-eval/blob/main/LICENSE\"\u003e![https://pypi.python.org/pypi/continuous-eval/](https://img.shields.io/pypi/l/continuous-eval.svg)\u003c/a\u003e\n\n\n\u003c/div\u003e\n\n\u003ch2 align=\"center\"\u003e\n  \u003cp\u003eData-Driven Evaluation for LLM-Powered Applications\u003c/p\u003e\n\u003c/h2\u003e\n\n\n\n## Overview\n\n`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered applications.\n\n\u003ch1 align=\"center\"\u003e\n  \u003cimg\n    src=\"docs/public/module-level-eval.png\"\n  \u003e\n\u003c/h1\u003e\n\n## How is continuous-eval different?\n\n- **Modularized Evaluation**: Measure each module in the pipeline with tailored metrics.\n\n- **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.\n\n- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics.\n\n## Getting Started\n\nThis code is provided as a PyPI package. To install it, run the following command:\n\n```bash\npython3 -m pip install continuous-eval\n```\n\nIf you want to install from source:\n\n```bash\ngit clone https://github.com/relari-ai/continuous-eval.git \u0026\u0026 cd continuous-eval\npoetry install --all-extras\n```\n\nTo run LLM-based metrics, the code requires at least one LLM API key in `.env`. 
Take a look at the example env file `.env.example`.\n\n## Run a single metric\n\nHere's how you run a single metric on a datum.\nCheck all available metrics here: [link](https://continuous-eval.docs.relari.ai/)\n\n```python\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1\n\ndatum = {\n    \"question\": \"What is the capital of France?\",\n    \"retrieved_context\": [\n        \"Paris is the capital of France and its largest city.\",\n        \"Lyon is a major city in France.\",\n    ],\n    \"ground_truth_context\": [\"Paris is the capital of France.\"],\n    \"answer\": \"Paris\",\n    \"ground_truths\": [\"Paris\"],\n}\n\nmetric = PrecisionRecallF1()\n\nprint(metric(**datum))\n```\n\n## Run an evaluation\n\nIf you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class.\n\n```python\nfrom time import perf_counter\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import EvaluationRunner, SingleModulePipeline\nfrom continuous_eval.eval.tests import GreaterOrEqualThan\nfrom continuous_eval.metrics.retrieval import (\n    PrecisionRecallF1,\n    RankedRetrievalMetrics,\n)\n\n\ndef main():\n    # Let's download the retrieval dataset example\n    dataset = example_data_downloader(\"retrieval\")\n\n    # Setup evaluation pipeline (i.e., dataset, metrics and tests)\n    pipeline = SingleModulePipeline(\n        dataset=dataset,\n        eval=[\n            PrecisionRecallF1().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n            RankedRetrievalMetrics().use(\n                retrieved_context=dataset.retrieved_contexts,\n                ground_truth_context=dataset.ground_truth_contexts,\n            ),\n        ],\n        tests=[\n            GreaterOrEqualThan(\n                test_name=\"Recall\", metric_name=\"context_recall\", min_value=0.8\n            ),\n        
],\n    )\n\n    # Start the evaluation manager and run the metrics (and tests)\n    tic = perf_counter()\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate()\n    toc = perf_counter()\n    print(\"Evaluation results:\")\n    print(eval_results.aggregate())\n    print(f\"Elapsed time: {toc - tic:.2f} seconds\\n\")\n\n    print(\"Running tests...\")\n    test_results = runner.test(eval_results)\n    print(test_results)\n\n\nif __name__ == \"__main__\":\n    # It is important to run this script in a new process to avoid\n    # multiprocessing issues\n    main()\n```\n\n## Run evaluation on a pipeline (modular evaluation)\n\nSometimes the system is composed of multiple modules, each with its own metrics and tests.\nContinuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.\n\n```python\nfrom typing import Any, Dict, List\n\nfrom continuous_eval.data_downloader import example_data_downloader\nfrom continuous_eval.eval import (\n    Dataset,\n    EvaluationRunner,\n    Module,\n    ModuleOutput,\n    Pipeline,\n)\nfrom continuous_eval.eval.result_types import PipelineResults\nfrom continuous_eval.metrics.generation.text import AnswerCorrectness\nfrom continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics\n\n\ndef page_content(docs: List[Dict[str, Any]]) -\u003e List[str]:\n    # Extract the content of the retrieved documents from the pipeline results\n    return [doc[\"page_content\"] for doc in docs]\n\n\ndef main():\n    dataset: Dataset = example_data_downloader(\"graham_essays/small/dataset\")\n    results: Dict = example_data_downloader(\"graham_essays/small/results\")\n\n    # Simple 3-step RAG pipeline with Retriever-\u003eReranker-\u003eGeneration\n    retriever = Module(\n        name=\"retriever\",\n        input=dataset.question,\n        output=List[str],\n        eval=[\n            PrecisionRecallF1().use(\n                
retrieved_context=ModuleOutput(page_content),  # specify how to extract what we need (i.e., page_content)\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    reranker = Module(\n        name=\"reranker\",\n        input=retriever,\n        output=List[Dict[str, str]],\n        eval=[\n            RankedRetrievalMetrics().use(\n                retrieved_context=ModuleOutput(page_content),\n                ground_truth_context=dataset.ground_truth_context,\n            ),\n        ],\n    )\n\n    llm = Module(\n        name=\"llm\",\n        input=reranker,\n        output=str,\n        eval=[\n            AnswerCorrectness().use(\n                question=dataset.question,\n                answer=ModuleOutput(),\n                ground_truth_answers=dataset.ground_truth_answers,\n            ),\n        ],\n    )\n\n    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)\n    print(pipeline.graph_repr())  # visualize the pipeline in Mermaid format\n\n    runner = EvaluationRunner(pipeline)\n    eval_results = runner.evaluate(PipelineResults.from_dict(results))\n    print(eval_results.aggregate())\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\n\u003e Note: it is important to wrap your code in a main function (with the `if __name__ == \"__main__\":` guard) to make sure the parallelization works properly.\n\n## Custom Metrics\n\nThere are several ways to create custom metrics; see the [Custom Metrics](https://continuous-eval.docs.relari.ai/v0.3/metrics/overview) section in the docs.\n\nThe simplest way is to leverage the `CustomMetric` class to create an LLM-as-a-Judge.\n\n```python\nfrom continuous_eval.metrics.base.metric import Arg, Field\nfrom continuous_eval.metrics.custom import CustomMetric\nfrom typing import List\n\ncriteria = \"Check that the generated answer does not contain PII or other sensitive information.\"\nrubric = \"\"\"Use the following rubric to assign a score to the 
answer based on whether it contains PII or other sensitive information:\n- Yes: The answer contains PII or other sensitive information.\n- No: The answer does not contain PII or other sensitive information.\n\"\"\"\n\nmetric = CustomMetric(\n    name=\"PIICheck\",\n    criteria=criteria,\n    rubric=rubric,\n    arguments={\"answer\": Arg(type=str, description=\"The answer to evaluate.\")},\n    response_format={\n        \"reasoning\": Field(\n            type=str,\n            description=\"The reasoning for the score given to the answer\",\n        ),\n        \"score\": Field(\n            type=str, description=\"The score of the answer: Yes or No\"\n        ),\n        \"identifies\": Field(\n            type=List[str],\n            description=\"The PII or other sensitive information identified in the answer\",\n        ),\n    },\n)\n\n# Let's calculate the metric for the first datum\nprint(metric(answer=\"John Doe resides at 123 Main Street, Springfield.\"))\n```\n\n## 💡 Contributing\n\nInterested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.\n\n## Resources\n\n- **Docs:** [link](https://continuous-eval.docs.relari.ai/)\n- **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)\n- **Blog Posts:**\n  - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)\n  - How important is a Golden Dataset for LLM evaluation? [(link)](https://medium.com/relari/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5)\n  - How to evaluate complex GenAI Apps: a granular approach [(link)](https://medium.com/relari/how-to-evaluate-complex-genai-apps-a-granular-approach-0ab929d5b3e2)\n  - How to Make the Most Out of LLM Production Data: Simulated User Feedback 
[(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)\n  - Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)\n- **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)\n- **Reach out to founders:** [Email](mailto:founders@relari.ai) or [Schedule a chat](https://cal.com/relari/intro)\n\n## License\n\nThis project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.\n\n## Open Analytics\n\nWe monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement.\nYou can take a look at exactly what we track in the [telemetry code](continuous_eval/utils/telemetry.py).\n\nTo disable usage tracking, set the `CONTINUOUS_EVAL_DO_NOT_TRACK` flag to `true`.\n","funding_links":[],"categories":["Python","A01_文本生成_文本对话","Evaluation and Monitoring"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frelari-ai%2Fcontinuous-eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frelari-ai%2Fcontinuous-eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frelari-ai%2Fcontinuous-eval/lists"}