https://github.com/strands-agents/evals
A comprehensive evaluation framework for AI agents and LLM applications.
https://github.com/strands-agents/evals
agentic agentic-ai ai evaluation machine-learning python strands-agents
Last synced: 5 months ago
JSON representation
A comprehensive evaluation framework for AI agents and LLM applications.
- Host: GitHub
- URL: https://github.com/strands-agents/evals
- Owner: strands-agents
- License: apache-2.0
- Created: 2025-07-31T21:53:35.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2026-01-21T21:05:38.000Z (5 months ago)
- Last Synced: 2026-01-22T09:49:33.592Z (5 months ago)
- Topics: agentic, agentic-ai, ai, evaluation, machine-learning, python, strands-agents
- Language: Python
- Homepage: https://strandsagents.com
- Size: 344 KB
- Stars: 61
- Watchers: 2
- Forks: 14
- Open Issues: 15
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Notice: NOTICE
Awesome Lists containing this project
- awesome-strands-agents - Evals Framework - agents/evals](https://github.com/strands-agents/evals) | Official Resources | (Community Projects / For PyPI Packages)
README
Strands Evals SDK
A comprehensive evaluation framework for AI agents and LLM applications.
Documentation
◆ Samples
◆ Python SDK
◆ Typescript SDK
◆ Tools
◆ Evaluations
Strands Evaluation is a powerful framework for evaluating AI agents and LLM applications. From simple output validation to complex multi-agent interaction analysis, trajectory evaluation, and automated experiment generation, Strands Evaluation provides comprehensive tools to measure and improve your AI systems.
## Feature Overview
- **Multiple Evaluation Types**: Output evaluation, trajectory analysis, tool usage assessment, and interaction evaluation
- **Dynamic Simulators**: Multi-turn conversation simulation with realistic user behavior and goal-oriented interactions
- **LLM-as-a-Judge**: Built-in evaluators using language models for sophisticated assessment with structured scoring
- **Trace-based Evaluation**: Analyze agent behavior through OpenTelemetry execution traces
- **Automated Experiment Generation**: Generate comprehensive test suites from context descriptions
- **Custom Evaluators**: Extensible framework for domain-specific evaluation logic
- **Experiment Management**: Save, load, and version your evaluation experiments with JSON serialization
- **Built-in Scoring Tools**: Helper functions for exact, in-order, and any-order trajectory matching
## Quick Start
```bash
# Install Strands Evals SDK
pip install strands-agents-evals
```
```python
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator
# Create test cases
test_cases = [
Case[str, str](
name="knowledge-1",
input="What is the capital of France?",
expected_output="The capital of France is Paris.",
metadata={"category": "knowledge"}
)
]
# Create evaluators with custom rubric
evaluators = [
OutputEvaluator(
rubric="""
Evaluate based on:
1. Accuracy - Is the information correct?
2. Completeness - Does it fully answer the question?
3. Clarity - Is it easy to understand?
Score 1.0 if all criteria are met excellently.
Score 0.5 if some criteria are partially met.
Score 0.0 if the response is inadequate.
"""
)
]
# Create experiment and run evaluation
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)
def get_response(case: Case) -> str:
agent = Agent(callback_handler=None)
return str(agent(case.input))
# Run evaluations
reports = experiment.run_evaluations(get_response)
reports[0].run_display()
```
## Installation
Ensure you have Python 3.10+ installed, then:
```bash
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows use: .venv\Scripts\activate
# Install in development mode
pip install -e .
# Install with test dependencies
pip install -e ".[test]"
# Install with both test and dev dependencies
pip install -e ".[test,dev]"
```
## Features at a Glance
### Output Evaluation with Custom Rubrics
Evaluate agent responses using LLM-as-a-judge with flexible scoring criteria:
```python
from strands_evals.evaluators import OutputEvaluator
evaluator = OutputEvaluator(
rubric="Score 1.0 for accurate, complete responses. Score 0.5 for partial answers. Score 0.0 for incorrect or unhelpful responses.",
include_inputs=True, # Include context in evaluation
model="us.anthropic.claude-sonnet-4-20250514-v1:0" # Custom judge model
)
```
### Trajectory Evaluation with Built-in Scoring
Analyze agent tool usage and action sequences with helper scoring functions:
```python
from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor
from strands_tools import calculator
def get_response_with_tools(case: Case) -> dict:
agent = Agent(tools=[calculator])
response = agent(case.input)
# Extract trajectory efficiently to prevent context overflow
trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(agent.messages)
# Update evaluator with tool descriptions
evaluator.update_trajectory_description(
tools_use_extractor.extract_tools_description(agent, is_short=True)
)
return {"output": str(response), "trajectory": trajectory}
# Evaluator includes built-in scoring tools: exact_match_scorer, in_order_match_scorer, any_order_match_scorer
evaluator = TrajectoryEvaluator(
rubric="Score 1.0 if correct tools used in proper sequence. Use scoring tools to verify trajectory matches."
)
```
### Trace-based Helpfulness Evaluation
Evaluate agent helpfulness using OpenTelemetry traces with seven-level scoring:
```python
from strands_evals.evaluators import HelpfulnessEvaluator
from strands_evals.telemetry import StrandsEvalsTelemetry
from strands_evals.mappers import StrandsInMemorySessionMapper
# Setup telemetry for trace capture
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
def user_task_function(case: Case) -> dict:
telemetry.memory_exporter.clear()
agent = Agent(
trace_attributes={"session.id": case.session_id},
callback_handler=None
)
response = agent(case.input)
# Map spans to session for evaluation
spans = telemetry.memory_exporter.get_finished_spans()
mapper = StrandsInMemorySessionMapper()
session = mapper.map_to_session(spans, session_id=case.session_id)
return {"output": str(response), "trajectory": session}
# Seven-level scoring: Not helpful (0.0) to Above and beyond (1.0)
evaluators = [HelpfulnessEvaluator()]
experiment = Experiment[str, str](cases=test_cases, evaluators=evaluators)
# Run evaluations
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
```
### Multi-turn Conversation Simulation
Simulate realistic user interactions with dynamic, goal-oriented conversations using ActorSimulator:
```python
from strands import Agent
from strands_evals import Case, Experiment, ActorSimulator
from strands_evals.evaluators import HelpfulnessEvaluator, GoalSuccessRateEvaluator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry
# Setup telemetry
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter
def task_function(case: Case) -> dict:
# Create simulator to drive conversation
simulator = ActorSimulator.from_case_for_user_simulator(
case=case,
max_turns=10
)
# Create agent to evaluate
agent = Agent(
trace_attributes={
"gen_ai.conversation.id": case.session_id,
"session.id": case.session_id
},
callback_handler=None
)
# Run multi-turn conversation
all_spans = []
user_message = case.input
while simulator.has_next():
memory_exporter.clear()
agent_response = agent(user_message)
turn_spans = list(memory_exporter.get_finished_spans())
all_spans.extend(turn_spans)
user_result = simulator.act(str(agent_response))
user_message = str(user_result.structured_output.message)
# Map to session for evaluation
mapper = StrandsInMemorySessionMapper()
session = mapper.map_to_session(all_spans, session_id=case.session_id)
return {"output": str(agent_response), "trajectory": session}
# Use evaluators to assess simulated conversations
evaluators = [
HelpfulnessEvaluator(),
GoalSuccessRateEvaluator()
]
experiment = Experiment(cases=test_cases, evaluators=evaluators)
reports = experiment.run_evaluations(task_function)
```
**Key Benefits:**
- **Dynamic Interactions**: Simulator adapts responses based on agent behavior
- **Goal-Oriented Testing**: Verify agents can complete user objectives through dialogue
- **Realistic Conversations**: Generate authentic multi-turn interaction patterns
- **No Predefined Scripts**: Test agents without hardcoded conversation paths
- **Comprehensive Evaluation**: Combine with trace-based evaluators for full assessment
### Automated Experiment Generation
Generate comprehensive test suites automatically from context descriptions:
```python
from strands_evals.generators import ExperimentGenerator
from strands_evals.evaluators import TrajectoryEvaluator
# Define available tools and context
tool_context = """
Available tools:
- calculator(expression: str) -> float: Evaluate mathematical expressions
- web_search(query: str) -> str: Search the web for information
- file_read(path: str) -> str: Read file contents
"""
# Generate experiment with multiple test cases
generator = ExperimentGenerator[str, str](str, str)
experiment = await generator.from_context_async(
context=tool_context,
num_cases=10,
evaluator=TrajectoryEvaluator,
task_description="Math and research assistant with tool usage",
num_topics=3 # Distribute cases across multiple topics
)
# Save generated experiment
experiment.to_file("generated_experiment", "json")
```
### Custom Evaluators with Structured Output
Create domain-specific evaluation logic with standardized output format:
```python
from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput
class PolicyComplianceEvaluator(Evaluator[str, str]):
def evaluate(self, evaluation_case: EvaluationData[str, str]) -> EvaluationOutput:
# Custom evaluation logic
response = evaluation_case.actual_output
# Check for policy violations
violations = self._check_policy_violations(response)
if not violations:
return EvaluationOutput(
score=1.0,
test_pass=True,
reason="Response complies with all policies",
label="compliant"
)
else:
return EvaluationOutput(
score=0.0,
test_pass=False,
reason=f"Policy violations: {', '.join(violations)}",
label="non_compliant"
)
def _check_policy_violations(self, response: str) -> list[str]:
# Implementation details...
return []
```
### Tool Usage and Parameter Evaluation
Evaluate specific aspects of tool usage with specialized evaluators:
```python
from strands_evals.evaluators import ToolSelectionAccuracyEvaluator, ToolParameterAccuracyEvaluator
# Evaluate if correct tools were selected
tool_selection_evaluator = ToolSelectionAccuracyEvaluator(
rubric="Score 1.0 if optimal tools selected, 0.5 if suboptimal but functional, 0.0 if wrong tools"
)
# Evaluate if tool parameters were correct
tool_parameter_evaluator = ToolParameterAccuracyEvaluator(
rubric="Score based on parameter accuracy and appropriateness for the task"
)
```
## Available Evaluators
### Core Evaluators
- **OutputEvaluator**: Flexible LLM-based evaluation with custom rubrics
- **TrajectoryEvaluator**: Action sequence evaluation with built-in scoring tools
- **HelpfulnessEvaluator**: Seven-level helpfulness assessment from user perspective
- **FaithfulnessEvaluator**: Evaluates if responses are grounded in conversation history
- **GoalSuccessRateEvaluator**: Measures if user goals were achieved
### Specialized Evaluators
- **ToolSelectionAccuracyEvaluator**: Evaluates appropriateness of tool choices
- **ToolParameterAccuracyEvaluator**: Evaluates correctness of tool parameters
- **InteractionsEvaluator**: Multi-agent interaction and handoff evaluation
- **Custom Evaluators**: Extensible base class for domain-specific logic
## Experiment Management and Serialization
Save, load, and version experiments for reproducibility:
```python
# Save experiment with metadata
experiment.to_file("customer_service_eval", "json")
# Load experiment from file
loaded_experiment = Experiment.from_file("./experiment_files/customer_service_eval.json", "json")
# Experiment files include:
# - Test cases with metadata
# - Evaluator configuration
# - Expected outputs and trajectories
# - Versioning information
```
## Evaluation Metrics and Analysis
Track comprehensive metrics across multiple dimensions:
```python
# Built-in metrics to consider:
metrics = {
"accuracy": "Factual correctness of responses",
"task_completion": "Whether agent completed the task",
"tool_selection": "Appropriateness of tool choices",
"response_time": "Agent response latency",
"hallucination_rate": "Frequency of fabricated information",
"token_usage": "Efficiency of token consumption",
"user_satisfaction": "Subjective helpfulness ratings"
}
# Generate analysis reports
reports = experiment.run_evaluations(task_function)
reports[0].run_display() # Interactive display with metrics breakdown
```
## Best Practices
### Evaluation Strategy
1. **Diversify Test Cases**: Cover knowledge, reasoning, tool usage, conversation, edge cases, and safety scenarios
2. **Use Statistical Baselines**: Run multiple evaluations to account for LLM non-determinism
3. **Combine Multiple Evaluators**: Use output, trajectory, and helpfulness evaluators together
4. **Regular Evaluation Cadence**: Implement consistent evaluation schedules for continuous improvement
### Performance Optimization
1. **Use Extractors**: Always use `tools_use_extractor` functions to prevent context overflow
2. **Update Descriptions Dynamically**: Call `update_trajectory_description()` with tool descriptions
3. **Choose Appropriate Judge Models**: Use stronger models for complex evaluations
4. **Batch Evaluations**: Process multiple test cases efficiently
### Experiment Design
1. **Write Clear Rubrics**: Include explicit scoring criteria and examples
2. **Include Expected Trajectories**: Define exact sequences for trajectory evaluation
3. **Use Appropriate Matching**: Choose between exact, in-order, or any-order matching
4. **Version Control**: Track agent configurations alongside evaluation results
## Documentation
For detailed guidance & examples, explore our documentation:
- [User Guide](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/quickstart/)
- [Evaluator Reference](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/evaluators/)
- [Simulators Guide](https://strandsagents.com/latest/documentation/docs/user-guide/evals-sdk/simulators/)
## Contributing ❤️
We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for details on:
- Development setup
- Contributing via Pull Requests
- Code of Conduct
- Reporting of security issues
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.