{"id":38701302,"url":"https://github.com/strands-agents/evals","last_synced_at":"2026-01-29T03:31:16.462Z","repository":{"id":328031972,"uuid":"1029970068","full_name":"strands-agents/evals","owner":"strands-agents","description":"A comprehensive evaluation framework for AI agents and LLM applications.","archived":false,"fork":false,"pushed_at":"2026-01-21T21:05:38.000Z","size":352,"stargazers_count":61,"open_issues_count":15,"forks_count":14,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-01-22T09:49:33.592Z","etag":null,"topics":["agentic","agentic-ai","ai","evaluation","machine-learning","python","strands-agents"],"latest_commit_sha":null,"homepage":"https://strandsagents.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/strands-agents.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-31T21:53:35.000Z","updated_at":"2026-01-21T20:42:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"662277e6-4ecc-4271-be92-20d0fb7f2843","html_url":"https://github.com/strands-agents/evals","commit_stats":null,"previous_names":["strands-agents/evals"],"tags_count":4,"template":false,"template_full_name":"amazon-archives/__template_Apache-2.0","purl":"pkg:github/strands-agents/evals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strands-agents%2Fevals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
strands-agents%2Fevals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strands-agents%2Fevals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strands-agents%2Fevals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/strands-agents","download_url":"https://codeload.github.com/strands-agents/evals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strands-agents%2Fevals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28862125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T22:56:21.783Z","status":"online","status_checked_at":"2026-01-29T02:00:06.714Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic","agentic-ai","ai","evaluation","machine-learning","python","strands-agents"],"created_at":"2026-01-17T10:48:57.719Z","updated_at":"2026-01-29T03:31:16.446Z","avatar_url":"https://github.com/strands-agents.png","language":"Python","funding_links":[],"categories":["Community Projects"],"sub_categories":["For PyPI Packages"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cdiv\u003e\n    \u003ca href=\"https://strandsagents.com\"\u003e\n      \u003cimg src=\"https://strandsagents.com/latest/assets/logo-github.svg\" alt=\"Strands Agents\" width=\"55px\" height=\"105px\"\u003e\n    \u003c/a\u003e\n  
\u003c/div\u003e\n\n  \u003ch1\u003e\n    Strands Evals SDK\n  \u003c/h1\u003e\n  \u003ch2\u003e\n    A comprehensive evaluation framework for AI agents and LLM applications.\n  \u003c/h2\u003e\n\n  \u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://github.com/strands-agents/evals/graphs/commit-activity\"\u003e\u003cimg alt=\"GitHub commit activity\" src=\"https://img.shields.io/github/commit-activity/m/strands-agents/evals\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/strands-agents/evals/issues\"\u003e\u003cimg alt=\"GitHub open issues\" src=\"https://img.shields.io/github/issues/strands-agents/evals\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/strands-agents/evals/pulls\"\u003e\u003cimg alt=\"GitHub open pull requests\" src=\"https://img.shields.io/github/issues-pr/strands-agents/evals\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/strands-agents/evals/blob/main/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/strands-agents/evals\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/strands-agents-evals/\"\u003e\u003cimg alt=\"PyPI version\" src=\"https://img.shields.io/pypi/v/strands-agents-evals\"/\u003e\u003c/a\u003e\n    \u003ca href=\"https://python.org\"\u003e\u003cimg alt=\"Python versions\" src=\"https://img.shields.io/pypi/pyversions/strands-agents-evals\"/\u003e\u003c/a\u003e\n  \u003c/div\u003e\n  \n  \u003cp\u003e\n    \u003ca href=\"https://strandsagents.com/\"\u003eDocumentation\u003c/a\u003e\n    ◆ \u003ca href=\"https://github.com/strands-agents/samples\"\u003eSamples\u003c/a\u003e\n    ◆ \u003ca href=\"https://github.com/strands-agents/sdk-python\"\u003ePython SDK\u003c/a\u003e\n    ◆ \u003ca href=\"https://github.com/strands-agents/sdk-typescript\"\u003eTypescript SDK\u003c/a\u003e\n    ◆ \u003ca href=\"https://github.com/strands-agents/tools\"\u003eTools\u003c/a\u003e\n    ◆ \u003ca 
href=\"https://github.com/strands-agents/evals\"\u003eEvaluations\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\nStrands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.\n\n## Feature Overview\n\n- **Multiple Evaluation Types**: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation\n- **Dynamic Simulators**: Multi-turn conversation simulation with realistic user behavior and goal-oriented interactions\n- **LLM-as-a-Judge**: Built-in evaluators using language models for sophisticated assessment with structured scoring\n- **Trace-based Evaluation**: Analyze agent behavior through OpenTelemetry execution traces\n- **Automated Experiment Generation**: Generate comprehensive test suites from context descriptions\n- **Custom Evaluators**: Extensible framework for domain-specific evaluation logic\n- **Experiment Management**: Save, load, and version your evaluation experiments with JSON serialization\n- **Built-in Scoring Tools**: Helper functions for exact, in-order, and any-order trajectory matching\n\n## Quick Start\n\n```bash\n# Install Strands Evals SDK\npip install strands-agents-evals\n```\n\n```python\nfrom strands import Agent\nfrom strands_evals import Case, Experiment\nfrom strands_evals.evaluators import OutputEvaluator\n\n# Create test cases\ntest_cases = [\n    Case[str, str](\n        name=\"knowledge-1\",\n        input=\"What is the capital of France?\",\n        expected_output=\"The capital of France is Paris.\",\n        metadata={\"category\": \"knowledge\"}\n    )\n]\n\n# Create evaluators with custom rubric\nevaluators = [\n    OutputEvaluator(\n        rubric=\"\"\"\n        Evaluate based on:\n        1. Accuracy - Is the information correct?\n        2. 
Completeness - Does it fully answer the question?\n        3. Clarity - Is it easy to understand?\n        \n        Score 1.0 if all criteria are met excellently.\n        Score 0.5 if some criteria are partially met.\n        Score 0.0 if the response is inadequate.\n        \"\"\"\n    )\n]\n\n# Create experiment and run evaluation\nexperiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)\n\ndef get_response(case: Case) -\u003e str:\n    agent = Agent(callback_handler=None)\n    return str(agent(case.input))\n\n# Run evaluations\nreports = experiment.run_evaluations(get_response)\nreports[0].run_display()\n```\n\n## Installation\n\nEnsure you have Python 3.10+ installed, then:\n\n```bash\n# Create and activate virtual environment\npython -m venv .venv\nsource .venv/bin/activate  # On Windows use: .venv\\Scripts\\activate\n\n# Install in development mode\npip install -e .\n\n# Install with test dependencies\npip install -e \".[test]\"\n\n# Install with both test and dev dependencies\npip install -e \".[test,dev]\"\n```\n\n## Features at a Glance\n\n### Output Evaluation with Custom Rubrics\n\nEvaluate agent responses using LLM-as-a-judge with flexible scoring criteria:\n\n```python\nfrom strands_evals.evaluators import OutputEvaluator\n\nevaluator = OutputEvaluator(\n    rubric=\"Score 1.0 for accurate, complete responses. Score 0.5 for partial answers. 
Score 0.0 for incorrect or unhelpful responses.\",\n    include_inputs=True,  # Include context in evaluation\n    model=\"us.anthropic.claude-sonnet-4-20250514-v1:0\"  # Custom judge model\n)\n```\n\n### Trajectory Evaluation with Built-in Scoring\n\nAnalyze agent tool usage and action sequences with helper scoring functions:\n\n```python\nfrom strands import Agent\nfrom strands_evals import Case\nfrom strands_evals.evaluators import TrajectoryEvaluator\nfrom strands_evals.extractors import tools_use_extractor\nfrom strands_tools import calculator\n\ndef get_response_with_tools(case: Case) -\u003e dict:\n    agent = Agent(tools=[calculator])\n    response = agent(case.input)\n    \n    # Extract trajectory efficiently to prevent context overflow\n    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)\n    \n    # Update the evaluator (defined below) with tool descriptions\n    evaluator.update_trajectory_description(\n        tools_use_extractor.extract_tools_description(agent, is_short=True)\n    )\n    \n    return {\"output\": str(response), \"trajectory\": trajectory}\n\n# Evaluator includes built-in scoring tools: exact_match_scorer, in_order_match_scorer, any_order_match_scorer\nevaluator = TrajectoryEvaluator(\n    rubric=\"Score 1.0 if correct tools used in proper sequence. 
Use scoring tools to verify trajectory matches.\"\n)\n```\n\n### Trace-based Helpfulness Evaluation\n\nEvaluate agent helpfulness using OpenTelemetry traces with seven-level scoring:\n\n```python\nfrom strands import Agent\nfrom strands_evals import Case, Experiment\nfrom strands_evals.evaluators import HelpfulnessEvaluator\nfrom strands_evals.telemetry import StrandsEvalsTelemetry\nfrom strands_evals.mappers import StrandsInMemorySessionMapper\n\n# Setup telemetry for trace capture\ntelemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()\n\ndef user_task_function(case: Case) -\u003e dict:\n    telemetry.in_memory_exporter.clear()\n    \n    agent = Agent(\n        trace_attributes={\"session.id\": case.session_id},\n        callback_handler=None\n    )\n    response = agent(case.input)\n    \n    # Map spans to session for evaluation\n    spans = telemetry.in_memory_exporter.get_finished_spans()\n    mapper = StrandsInMemorySessionMapper()\n    session = mapper.map_to_session(spans, session_id=case.session_id)\n    \n    return {\"output\": str(response), \"trajectory\": session}\n\n# Seven-level scoring: Not helpful (0.0) to Above and beyond (1.0)\nevaluators = [HelpfulnessEvaluator()]\nexperiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)\n\n# Run evaluations\nreports = experiment.run_evaluations(user_task_function)\nreports[0].run_display()\n```\n\n### Multi-turn Conversation Simulation\n\nSimulate realistic user interactions with dynamic, goal-oriented conversations using ActorSimulator:\n\n```python\nfrom strands import Agent\nfrom strands_evals import Case, Experiment, ActorSimulator\nfrom strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator\nfrom strands_evals.mappers import StrandsInMemorySessionMapper\nfrom strands_evals.telemetry import StrandsEvalsTelemetry\n\n# Setup telemetry\ntelemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()\nmemory_exporter = telemetry.in_memory_exporter\n\ndef task_function(case: Case) -\u003e dict:\n    # Create simulator to drive 
conversation\n    simulator = ActorSimulator.from_case_for_user_simulator(\n        case=case,\n        max_turns=10\n    )\n\n    # Create agent to evaluate\n    agent = Agent(\n        trace_attributes={\n            \"gen_ai.conversation.id\": case.session_id,\n            \"session.id\": case.session_id\n        },\n        callback_handler=None\n    )\n\n    # Run multi-turn conversation\n    all_spans = []\n    user_message = case.input\n\n    while simulator.has_next():\n        memory_exporter.clear()\n        agent_response = agent(user_message)\n        turn_spans = list(memory_exporter.get_finished_spans())\n        all_spans.extend(turn_spans)\n\n        user_result = simulator.act(str(agent_response))\n        user_message = str(user_result.structured_output.message)\n\n    # Map to session for evaluation\n    mapper = StrandsInMemorySessionMapper()\n    session = mapper.map_to_session(all_spans, session_id=case.session_id)\n\n    return {\"output\": str(agent_response), \"trajectory\": session}\n\n# Use evaluators to assess simulated conversations\nevaluators = [\n    HelpfulnessEvaluator(),\n    GoalSuccessRateEvaluator()\n]\n\nexperiment = Experiment(cases=test_cases, evaluators=evaluators)\nreports = experiment.run_evaluations(task_function)\n```\n\n**Key Benefits:**\n- **Dynamic Interactions**: Simulator adapts responses based on agent behavior\n- **Goal-Oriented Testing**: Verify agents can complete user objectives through dialogue\n- **Realistic Conversations**: Generate authentic multi-turn interaction patterns\n- **No Predefined Scripts**: Test agents without hardcoded conversation paths\n- **Comprehensive Evaluation**: Combine with trace-based evaluators for full assessment\n\n### Automated Experiment Generation\n\nGenerate comprehensive test suites automatically from context descriptions:\n\n```python\nfrom strands_evals.generators import ExperimentGenerator\nfrom strands_evals.evaluators import TrajectoryEvaluator\n\n# Define available 
tools and context\ntool_context = \"\"\"\nAvailable tools:\n- calculator(expression: str) -\u003e float: Evaluate mathematical expressions\n- web_search(query: str) -\u003e str: Search the web for information\n- file_read(path: str) -\u003e str: Read file contents\n\"\"\"\n\n# Generate experiment with multiple test cases\n# (top-level await works in a notebook; in a script, call this inside an async function run via asyncio.run)\ngenerator = ExperimentGenerator[str, str](str, str)\nexperiment = await generator.from_context_async(\n    context=tool_context,\n    num_cases=10,\n    evaluator=TrajectoryEvaluator,\n    task_description=\"Math and research assistant with tool usage\",\n    num_topics=3  # Distribute cases across multiple topics\n)\n\n# Save generated experiment\nexperiment.to_file(\"generated_experiment\", \"json\")\n```\n\n### Custom Evaluators with Structured Output\n\nCreate domain-specific evaluation logic with standardized output format:\n\n```python\nfrom strands_evals.evaluators import Evaluator\nfrom strands_evals.types import EvaluationData, EvaluationOutput\n\nclass PolicyComplianceEvaluator(Evaluator[str, str]):\n    def evaluate(self, evaluation_case: EvaluationData[str, str]) -\u003e EvaluationOutput:\n        # Custom evaluation logic\n        response = evaluation_case.actual_output\n        \n        # Check for policy violations\n        violations = self._check_policy_violations(response)\n        \n        if not violations:\n            return EvaluationOutput(\n                score=1.0,\n                test_pass=True,\n                reason=\"Response complies with all policies\",\n                label=\"compliant\"\n            )\n        else:\n            return EvaluationOutput(\n                score=0.0,\n                test_pass=False,\n                reason=f\"Policy violations: {', '.join(violations)}\",\n                label=\"non_compliant\"\n            )\n    \n    def _check_policy_violations(self, response: str) -\u003e list[str]:\n        # Implementation details...\n        return []\n```\n\n### Tool Usage and 
Parameter Evaluation\n\nEvaluate specific aspects of tool usage with specialized evaluators:\n\n```python\nfrom strands_evals.evaluators import ToolSelectionAccuracyEvaluator, ToolParameterAccuracyEvaluator\n\n# Evaluate if correct tools were selected\ntool_selection_evaluator = ToolSelectionAccuracyEvaluator(\n    rubric=\"Score 1.0 if optimal tools selected, 0.5 if suboptimal but functional, 0.0 if wrong tools\"\n)\n\n# Evaluate if tool parameters were correct\ntool_parameter_evaluator = ToolParameterAccuracyEvaluator(\n    rubric=\"Score based on parameter accuracy and appropriateness for the task\"\n)\n```\n\n## Available Evaluators\n\n### Core Evaluators\n- **OutputEvaluator**: Flexible LLM-based evaluation with custom rubrics\n- **TrajectoryEvaluator**: Action sequence evaluation with built-in scoring tools\n- **HelpfulnessEvaluator**: Seven-level helpfulness assessment from user perspective\n- **FaithfulnessEvaluator**: Evaluates if responses are grounded in conversation history\n- **GoalSuccessRateEvaluator**: Measures if user goals were achieved\n\n### Specialized Evaluators\n- **ToolSelectionAccuracyEvaluator**: Evaluates appropriateness of tool choices\n- **ToolParameterAccuracyEvaluator**: Evaluates correctness of tool parameters\n- **InteractionsEvaluator**: Multi-agent interaction and handoff evaluation\n- **Custom Evaluators**: Extensible base class for domain-specific logic\n\n## Experiment Management and Serialization\n\nSave, load, and version experiments for reproducibility:\n\n```python\n# Save experiment with metadata\nexperiment.to_file(\"customer_service_eval\", \"json\")\n\n# Load experiment from file\nloaded_experiment = Experiment.from_file(\"./experiment_files/customer_service_eval.json\", \"json\")\n\n# Experiment files include:\n# - Test cases with metadata\n# - Evaluator configuration\n# - Expected outputs and trajectories\n# - Versioning information\n```\n\n## Evaluation Metrics and Analysis\n\nTrack comprehensive metrics across 
multiple dimensions:\n\n```python\n# Built-in metrics to consider:\nmetrics = {\n    \"accuracy\": \"Factual correctness of responses\",\n    \"task_completion\": \"Whether agent completed the task\",\n    \"tool_selection\": \"Appropriateness of tool choices\", \n    \"response_time\": \"Agent response latency\",\n    \"hallucination_rate\": \"Frequency of fabricated information\",\n    \"token_usage\": \"Efficiency of token consumption\",\n    \"user_satisfaction\": \"Subjective helpfulness ratings\"\n}\n\n# Generate analysis reports\nreports = experiment.run_evaluations(task_function)\nreports[0].run_display()  # Interactive display with metrics breakdown\n```\n\n## Best Practices\n\n### Evaluation Strategy\n1. **Diversify Test Cases**: Cover knowledge, reasoning, tool usage, conversation, edge cases, and safety scenarios\n2. **Use Statistical Baselines**: Run multiple evaluations to account for LLM non-determinism\n3. **Combine Multiple Evaluators**: Use output, trajectory, and helpfulness evaluators together\n4. **Regular Evaluation Cadence**: Implement consistent evaluation schedules for continuous improvement\n\n### Performance Optimization\n1. **Use Extractors**: Always use `tools_use_extractor` functions to prevent context overflow\n2. **Update Descriptions Dynamically**: Call `update_trajectory_description()` with tool descriptions\n3. **Choose Appropriate Judge Models**: Use stronger models for complex evaluations\n4. **Batch Evaluations**: Process multiple test cases efficiently\n\n### Experiment Design\n1. **Write Clear Rubrics**: Include explicit scoring criteria and examples\n2. **Include Expected Trajectories**: Define exact sequences for trajectory evaluation\n3. **Use Appropriate Matching**: Choose between exact, in-order, or any-order matching\n4. 
**Version Control**: Track agent configurations alongside evaluation results\n\n## Documentation\n\nFor detailed guidance \u0026 examples, explore our documentation:\n\n- [User Guide](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/quickstart/)\n- [Evaluator Reference](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/evaluators/)\n- [Simulators Guide](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/simulators/)\n\n## Contributing ❤️\n\nWe welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for details on:\n- Development setup\n- Contributing via Pull Requests\n- Code of Conduct\n- Reporting of security issues\n\n## License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrands-agents%2Fevals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstrands-agents%2Fevals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrands-agents%2Fevals/lists"}