{"id":48462820,"url":"https://github.com/sap/agent-quality-inspect","last_synced_at":"2026-04-17T08:05:37.253Z","repository":{"id":339536947,"uuid":"1147797107","full_name":"SAP/agent-quality-inspect","owner":"SAP","description":"Evaluation package that allows benchmarking of agentic AIs from various sources and frameworks by producing statistical results which can be compared across different use cases and datasets.","archived":false,"fork":false,"pushed_at":"2026-04-06T09:54:17.000Z","size":17572,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-06T10:26:12.585Z","etag":null,"topics":["agentic-ai","error-analysis","evaluation","llm","llm-as-a-judge","metrics","user-proxy"],"latest_commit_sha":null,"homepage":"https://sap.github.io/agent-quality-inspect/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SAP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-02T08:09:01.000Z","updated_at":"2026-04-06T08:13:55.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/SAP/agent-quality-inspect","commit_stats":null,"previous_names":["sap-samples/agent-quality-inspect","sap/agent-quality-inspect"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SAP/agent-quality-inspect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP%2Fagent-quality-inspect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP%2Fagent-quality-inspect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP%2Fagent-quality-inspect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP%2Fagent-quality-inspect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SAP","download_url":"https://codeload.github.com/SAP/agent-quality-inspect/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SAP%2Fagent-quality-inspect/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31498070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-06T17:22:55.647Z","status":"online","status_checked_at":"2026-04-07T02:00:07.164Z","response_time":105,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","error-analysis","evaluation","llm","llm-as-a-judge","metrics","user-proxy"],"created_at":"2026-04-07T03:00:43.978Z","updated_at":"2026-04-17T08:05:37.222Z","avatar_url":"https://github.com/SAP.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Agent Quality Inspect\n\n## Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis\n\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n[![REUSE status](https://api.reuse.software/badge/github.com/SAP/agent-quality-inspect)](https://api.reuse.software/info/github.com/SAP/agent-quality-inspect)\n[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-red.svg)](https://iclr.cc/Conferences/2026)\n\nPaper Link: https://openreview.net/pdf?id=fHsVNklKOc\n\nDocumentation Link: https://sap.github.io/agent-quality-inspect/\n\n## Table of Contents\n- [Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis](#talk-evaluate-diagnose-user-aware-agent-evaluation-with-automated-error-analysis)\n  - [Table of Contents](#table-of-contents)\n  - [Overview](#overview)\n    - [Features](#features)\n  - [Installation](#installation)\n    - [Prerequisites](#prerequisites)\n    - [Setup](#setup)\n  - [Quick Start](#quick-start)\n    - [Option 1. Using it as a Metrics Package](#option-1-using-it-as-a-metrics-package)\n    - [Option 2. Using it via the provided runners](#option-2-using-it-via-the-provided-runners)\n    - [Viewing Results](#viewing-results)\n    - [Error Diagnosis UI](#error-diagnosis-ui)\n  - [Bring Your Own Agent](#bring-your-own-agent)\n    - [Creating your own evaluation dataset](#creating-your-own-evaluation-dataset)\n  - [Known Issues](#known-issues)\n  - [How to obtain support](#how-to-obtain-support)\n  - [Contributing](#contributing)\n  - [Citation](#citation)\n  - [License](#license)\n\n## Overview\n\n![Two-step automated error discovery approach. Identical error colors indicate\nthat similar low-level errors are clustered into the same high-level category.](error_analysis_framework.png)\n\n\nThis repository contains the implementation of **Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis (TED)**.\n\nThe agent-quality-inspect toolkit evaluates agentic systems under different user personas (expert and non-expert), reports metrics such as **Area Under the Curve (AUC)**, **Progress Per Turn (PPT)**, **pass@k**, **pass^k**, etc., and provides detailed error analysis to identify specific areas for improvement in the agent.\n\nAt the core of TED is a **subgoal-based evaluation**: users specify a set of natural-language subgoals (e.g., \"Agent should call `search_messages` after getting the current timestamp\" or \"Agent should explicitly state the final answer and justification\") in the evaluation dataset. During evaluation, TED:\n\n- Treats these subgoals (the `SubGoal` objects in the `EvaluationSample` schema) as the ground truth of what the agent should achieve over the course of the interaction.\n- Compares each subgoal against the **agent trace** (`AgentDialogueTrace`), including turns, intermediate tool calls, and agent responses.\n- Uses an LLM-as-a-judge to decide, for each turn, whether the behavior observed in the trace satisfies the relevant subgoals, and aggregates these judgments into progress curves and downstream metrics such as AUC and PPT.\n\n### Features\n\n- **User personas**: User proxy simulating users with different levels of domain expertise (expert and non-expert).\n- **Metrics**:\n  - **Area Under the Curve (AUC)**: Measures the area under the progress curve and evaluates how quickly the agent makes progress toward the goal.\n  - **Progress Per Turn (PPT)**: Measures the average progress made by the agent in each turn.\n  - **Task success and reliability**: Metrics such as pass@k and pass^k.\n- **Error analysis**: Automatic categorization of failure modes to highlight where and why agents fail.\n- **Agent-runner support**: Supports and is extensible to different agent runners. This repository currently includes integrations for **Tau2Bench** and **Toolsandbox** and provides a pattern to extend to other multi-turn benchmarks.\n\n## Installation\n\n### Prerequisites\n\n- Python 3.10 or higher\n- Azure OpenAI API access (for evaluation metrics and the user proxy)\n\n### Setup\n\n1. Clone the repository:\n\n```bash\ngit clone https://github.com/SAP/agent-quality-inspect\ncd agent-quality-inspect\n```\n\n2. Install the package in editable mode:\n\n```bash\npip install -e .\n```\n\n3. Configure Azure OpenAI API credentials by creating a `.env` file in the project root:\n\n```bash\nAZURE_API_VERSION=your_api_version         # e.g., 2024-02-15-preview\nAZURE_API_BASE=https://your-resource.openai.azure.com/\nAZURE_API_KEY=your_api_key\n```\n\n4. (Optional) Set up agent runners:\n\n- Tau2Bench runner setup: [agent_runners/README_tau2_bench_setup.md](agent_runners/README_tau2_bench_setup.md)\n- Toolsandbox runner setup: [agent_runners/README_tool_sandbox_setup.md](agent_runners/README_tool_sandbox_setup.md)\n\n\u003e **Note:** The RapidAPI services used in these agent runners are third-party, user-subscribed services. SAP does not, provide, license or authorize their use.\n\n## Quick Start\n\nThere are two primary ways to use this repository:\n\n1. **As a metrics / evaluation package** inside your own code.\n2. **Via the provided experiment runners** to reproduce paper results and run new evaluations.\n\n### Option 1. Using it as a Metrics Package\n\nThe standard flow of using it as a metrics package is as follows:\n\n1. To use the package as an importable dependency, enter your command terminal and use this command.\n\n```bash\npip install git+https://github.com/SAP/agent-quality-inspect.git\n\nor\n\npip install agent-quality-inspect\n```\n\n2. Define your evaluation sample and agent trace.\n3. Evaluate the progress rates using the evaluation sample on your agent trace.\n4. Using the output of the Step 3, calculate any of the metric scores (AUC, PPT, pass@k, etc.).\n5. Optionally, run the error analysis on the outputs of previous steps.\n6. Visualize the error analysis results in the Streamlit UI.\n   \nExample: constructing a minimal trace and computing an AUC score with the metrics package: Run the script using `streamlit run \u003cscript\u003e.py`.\n\n**Important Note:** To import our package library use `from agent_inspect import ..`\n\n```python\nfrom typing import List\nfrom agent_inspect.clients import AzureOpenAIClient\nfrom agent_inspect.metrics.scorer import AUC, ProgressScoresThroughTurns\nfrom agent_inspect.metrics.constants import (\n    INCLUDE_VALIDATION_RESULTS,\n    INCLUDE_JUDGE_EXPLANATION,\n    OPTIMIZE_JUDGE_TRIALS\n)\nfrom agent_inspect.models.metrics.agent_trace import (\n    AgentDialogueTrace,\n    TurnTrace,\n    AgentResponse,\n)\nfrom agent_inspect.models.metrics.agent_data_sample import EvaluationSample, SubGoal\nfrom agent_inspect.models.tools import ErrorAnalysisDataSample\nfrom agent_inspect.tools import ErrorAnalysis\nfrom demo.ui_for_agent_diagnosis.app import launch_ui\n\n\n# Create LLM client (requires env vars: AZURE_API_VERSION, AZURE_API_BASE, AZURE_API_KEY)\nclient = AzureOpenAIClient(model=\"gpt-4.1\", max_tokens=4096)\n\n# Build a minimal agent trace with a single turn\nagent_trace = AgentDialogueTrace(\n    turns=[\n        TurnTrace(\n            id=\"turn_1\",\n            agent_input=\"What is my current account balance?\",\n            agent_response=AgentResponse(\n                response=\"Your current balance is 100 USD.\",\n            ),\n        )\n    ]\n)\n\n# 2. Define the evaluation data sample and subgoals\ndata_sample = EvaluationSample(\n    sub_goals=[\n        SubGoal(\n            details=\"Agent should correctly state the user's current account balance.\",\n        )\n    ]\n)\n\n# Step 1: Calculate progress rates using the evaluation sample and agent trace\nprogress_metric = ProgressScoresThroughTurns(\n    llm_client=client,\n    config={\n        INCLUDE_VALIDATION_RESULTS: True,\n        INCLUDE_JUDGE_EXPLANATION: True,\n        OPTIMIZE_JUDGE_TRIALS: False\n    }\n)\nprogress_scores = progress_metric.evaluate(\n    agent_trace=agent_trace,\n    evaluation_data_sample=data_sample\n)\n\nprint(f\"Progress scores calculated for {len(progress_scores)} turn(s)\")\nfor i, score in enumerate(progress_scores, 1):\n    print(f\"  Turn {i}: {score.score:.2f}\")\n\n\n# Step 2: Calculate AUC from progress scores\nauc_result = AUC.get_auc_score_from_progress_scores(progress_scores)\nprint(f\"AUC score: {auc_result.score:.2f}\")\n\n\n# Step 3: Run error analysis\n\n# Extract validation results from the final turn\nsubgoal_validations = progress_scores[-1].validation_results\n\n# Prepare data for error analysis\nerror_analysis_data_samples: List[ErrorAnalysisDataSample] = [\n    ErrorAnalysisDataSample(\n        data_sample_id=1,\n        agent_run_id=1,\n        subgoal_validations=subgoal_validations,\n    )\n]\n\n# Run error analysis\nerror_analyzer = ErrorAnalysis(llm_client=client, max_workers=3)\nerror_analysis_result = error_analyzer.analyze_batch(error_analysis_data_samples)\n\n# Display results\nerror_categories = list(error_analysis_result.analyzed_validations_clustered_by_errors.keys())\nprint(f\"\\nIdentified {len(error_categories)} error categories:\")\nfor i, category in enumerate(error_categories, 1):\n    count = len(error_analysis_result.analyzed_validations_clustered_by_errors[category])\n    print(f\"  {i}. {category} ({count} occurrences)\")\nprint(f\"Completed validations: {len(error_analysis_result.completed_subgoal_validations)}\")\n\n\n# Step 4: Launch UI for visualization\nlaunch_ui(\n    error_analysis_result=error_analysis_result,\n    data_samples=error_analysis_data_samples\n)\n```\n\nFor more information on viewing error analysis UI [demo/ui_for_agent_diagnosis/readme.md](demo/ui_for_agent_diagnosis/readme.md).\n\n\n### Option 2. Using it via the provided runners\n\nBenchmarks are orchestrated via the runners in [paper_experiments](paper_experiments/readme.md) and external agent runners (for example, Tau2Bench or ToolSandbox).\n\n1. **Start your agent runner** (for example, Tau2Bench):\n  - Set up the Tau2Bench environment as described in [agent_runners/README_tau2_bench_setup.md](agent_runners/README_tau2_bench_setup.md).\n\n2. **Run the evaluation experiments** from the project root using the paper experiments runner, for example:\n\n```bash\npython -m paper_experiments.runner \\\n  --agent tau2bench \\\n  --samples-file paper_experiments/datasets/tau2bench_dataset_easy.json \\\n  --user-proxy-persona expert\n```\n\nAdditional options (agent type, datasets, number of trials, max turns, etc.) are documented in [paper_experiments/readme.md](paper_experiments/readme.md).\n\n### Viewing Results\n\nAfter running evaluations, results and error analysis are written to timestamped folders under `paper_experiments/`.\n\u003c!-- \nFor each run, the [paper_experiments](paper_experiments/readme.md) runner creates an output directory such as `paper_experiments/experiment_outputs_\u003ctimestamp\u003e/` containing, among others:\n\n- `trial_\u003cN\u003e_results.json`: Per-trial, per-sample trajectories and metrics (AUC, PPT, turn counts, success flags).\n- `aggregate_metrics_results.json`: Aggregate metrics (e.g., MaxAUC@k, MaxPPT@k) across all trials.\n- `evaluation_results.pkl`: Serialized evaluation results.\n- `error_analysis.pkl`: Serialized error analysis inputs and outputs used by the diagnosis UI.\n- `evaluation.log`: Detailed logs for debugging and auditing. --\u003e\n\nSee [paper_experiments/readme.md](paper_experiments/readme.md) for a full description of the output format.\n\n### Error Diagnosis UI\n\nTo explore error analysis for a specific experiment run in a browser UI, you can launch the Streamlit viewer.\n\n```bash\npython -m streamlit run paper_experiments/view_results.py -- --output-dir paper_experiments/experiment_outputs_\u003ctimestamp\u003e\n```\n\nReplace `\u003ctimestamp\u003e` with the actual timestamp of your output directory. This loads the pickled results and starts a Streamlit app at `http://localhost:8501` that visualizes error categories and per-sample diagnostics. More details are in [paper_experiments/readme.md](paper_experiments/readme.md).\n\n### Download Pre-computed Results from HuggingFace\n\nWe provide pre-computed experiment results on HuggingFace so you can explore the error diagnosis UI without running the full evaluation pipeline yourself.\n\n**1. Install the HuggingFace `huggingface_hub` library** (if not already installed):\n\n```bash\npip install huggingface_hub\n```\n\n**2. Download the dataset using the provided script:**\n\n```bash\npython paper_experiments/download_hf_dataset.py --output-dir \u003cpath-to-output-folder\u003e\n```\n\nTo download a specific file:\n\n```bash\npython paper_experiments/download_hf_dataset.py --filename \u003cfilename\u003e --output-dir \u003cpath-to-output-folder\u003e\n```\n\nSee [paper_experiments/download_hf_dataset.py](paper_experiments/download_hf_dataset.py) for all available options (`--repo-id`, `--repo-type`, etc.).\n\n**3. Run error diagnosis on the downloaded results:**\n\nOnce you have downloaded the results, you can launch the Error Diagnosis UI to explore the pre-computed error analysis. Point the Streamlit viewer at the downloaded output directory:\n\n```bash\npython -m streamlit run paper_experiments/view_results.py -- --output-dir \u003cpath-to-downloaded-results\u003e\n```\n\nExample command:\n\n```bash\npython -m streamlit run paper_experiments/view_results.py -- --output-dir dataset/tau2bench/airline/gpt_4_1/expert\n```\n\nThis loads the `error_analysis.pkl` file from the downloaded results and starts a Streamlit app at `http://localhost:8501` where you can interactively browse error categories and per-sample diagnostics.\n\nIf you want to re-run the error analysis programmatically on the downloaded data, you can do so in Python:\n\n```python\nfrom agent_inspect.tools import ErrorAnalysis\nfrom agent_inspect.models.tools import ErrorAnalysisDataSample\n\n# Load the downloaded data samples\nerror_analyzer = ErrorAnalysis(llm_client=client, max_workers=3)\nerror_analysis_result = error_analyzer.analyze_batch(error_analysis_data_samples)\n```\n\n## Bring Your Own Agent\n\nYou can plug in your own agentic system as long as it exposes a suitable interface and you can convert its interaction traces into the data structures expected by the metrics.\n\nTypical steps:\n\n1. **Define an adapter** that maps your agent's conversation / tool-calling traces into `AgentDialogueTrace`. Extend the `BaseAdapter` class in `agent_inspect.metrics.adapters` to implement this mapping.\n2. **Define a session** that will orchestrate the connection of your agent to the evaluation framework. Extend the `BaseSession` class in `paper_experiments/session.py`.\n3. **Create your dataset** of evaluation samples with the required subgoals.\n\nThe code in [paper_experiments](paper_experiments/readme.md) and [agent_runners](agent_runners/README.md) provides concrete examples you can follow when integrating a new agent.\n\n### Creating your own evaluation dataset\n\nAfter connecting your agent to our evaluation framework, you will need to define your `EvaluationSample`, which contains the subgoals and user proxy instruction.\n\nWe provide a helper in [paper_experiments/convert_to_data_sample.py](paper_experiments/convert_to_data_sample.py) to convert your JSON dataset into the `EvaluationSample` format expected by our framework. This helper assumes your dataset follows the schema below (array of samples):\n\n```json\n[\n  {\n    \"id\": \"\u003cstring\u003e\",\n    \"input\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"\u003ctask description and instructions\u003e\",\n        \"terminating_condition\": \"\u003cnatural-language condition describing when the task is considered complete\u003e\"\n      }\n    ],\n    \"metadata\": {\n      \"subgoals\": [\n        {\n          \"type\": \"\u003cstring\u003e\",          \n          \"details\": \"\u003cnatural-language subgoal describing expected agent behavior\u003e\",\n          \"turn\": \"\u003cturn index or 'all'\u003e\"\n        },\n        ...\n      ],\n      \"expected_tools\": [\n        \"[{'tool_code': '\u003ctool_name\u003e(param1=value1, ...)', 'output': '$AnyValue'}]\",\n        \"... additional tool specifications ...\"\n      ],\n      \"trace_type\": \"\u003cstring\u003e\"         \n    },\n    \"target\": \"\u003coptional expected response or list of responses\u003e\",\n    \"domain\": \"\u003cstring\u003e\"               \n  }\n]\n```\n\nConcretely, each element in the top-level array represents one evaluation sample. The `subgoals` array defines the `SubGoal` objects used during evaluation, `input[0].content` is used as the `user_instruction`, and `metadata.expected_tools` (if present) encodes expected tool calls that are mapped into `ExpectedToolCall` and `ToolInputParameter` objects.\n\nNote: `expected_tools` is optional for now, in the future we plan to support tool call related metrics.\n\n## Known Issues\n\u003c!-- You may simply state \"No known issues. --\u003e\nNo known issues.\n\n## How to obtain support\n[Create an issue](https://github.com/SAP/agent-quality-inspect/issues) in this repository if you find a bug or have questions about the content.\n \nFor additional support, [ask a question in SAP Community](https://answers.sap.com/questions/ask.html).\n\n\n## Contributing\nPlease refer to [CONTRIBUTING.md](CONTRIBUTING.md) for more information.\n\n## Citation\n\nIf you use this repository or the TED evaluation methodology in your research, please consider citing us:\n\n```bibtex\n@inproceedings{\n  chong2026talk,\n  title={Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis},\n  author={Penny Chong and Harshavardhan Abichandani and Jiyuan Shen and Atin Ghosh and Min Pyae Moe and Yifan Mai and Daniel Dahlmeier},\n  booktitle={The Fourteenth International Conference on Learning Representations},\n  year={2026},\n  url={https://openreview.net/forum?id=fHsVNklKOc}\n}\n```\n\n## License\nCopyright (c) 2026 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the [LICENSE](LICENSE) file. \n\nDisclaimer: This repository uses third‑party APIs that are subject to their own terms, fees, and compliance obligations.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsap%2Fagent-quality-inspect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsap%2Fagent-quality-inspect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsap%2Fagent-quality-inspect/lists"}