https://github.com/drc-infinyon/ai-agent-evals

🤖 Comprehensive AI Agent Evaluation Framework: Vanilla testing, monitoring & drift detection using OpenRouter. Features 5 test types, LLM-as-Judge, production monitoring, and 100+ model support.
ai drift-detection evaluation llm machine-learning monitoring openrouter production-monitoring python testing

# AI Agent Evaluation Framework

[![CI/CD Pipeline](https://github.com/drc-infinyon/ai-agent-evals/actions/workflows/ci.yml/badge.svg)](https://github.com/drc-infinyon/ai-agent-evals/actions/workflows/ci.yml)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![OpenRouter](https://img.shields.io/badge/LLM-OpenRouter-green.svg)](https://openrouter.ai)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A comprehensive vanilla evaluation, testing, and monitoring system for AI agents using OpenRouter as the LLM provider.

## Overview

This framework provides enterprise-grade tools for:

- **Comprehensive Testing**: Deterministic, semantic, behavioral, safety, and performance tests
- **LLM-as-Judge Evaluation**: Use advanced models to evaluate other model outputs
- **Production Monitoring**: Real-time metrics, alerting, and drift detection
- **Drift Detection**: Statistical and semantic drift detection across inputs, outputs, and performance
- **Multi-Model Support**: Works with any model available through OpenRouter

## Quick Start

### 1. Installation

```bash
# Clone the repository (or download and unpack it)
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.template .env
# Edit .env with your OpenRouter API key
```

### 2. Configuration

Edit your `.env` file:

```bash
# Required
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Optional (defaults provided)
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
SIMILARITY_THRESHOLD=0.8
```
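As a rough sketch of how these settings might be read at startup, the snippet below pulls them from an environment mapping and falls back to the values listed above. `load_settings` is a hypothetical helper, not part of the framework, and it assumes the values shown under "Optional" are the actual defaults.

```python
import os


def load_settings(env=None):
    """Read settings from an environment mapping (hypothetical helper).

    Falls back to the values listed in the .env example when a key is absent.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is required")
    return {
        "api_key": api_key,
        "default_model": env.get("DEFAULT_MODEL", "openai/gpt-3.5-turbo"),
        "judge_model": env.get("JUDGE_MODEL", "openai/gpt-4"),
        # Environment values are strings, so convert the threshold explicitly
        "similarity_threshold": float(env.get("SIMILARITY_THRESHOLD", "0.8")),
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.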

### 3. Run the Demo

```bash
python main_demo.py
```

This will run a comprehensive demonstration of all features.

## Core Components

### OpenRouter Client (`openrouter_client.py`)

An OpenAI-compatible client for the OpenRouter API with built-in pricing calculation.

```python
from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()

response = client.chat.completions.create(
    model="openai/gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")
```

### Evaluation Framework (`testing_framework.py`)

Core testing engine with multiple evaluation strategies.

```python
from testing_framework import EvaluationFramework, TestCase, TestType

framework = EvaluationFramework("your_api_key")

# Create a test case
test = TestCase(
    id="test_sql_basic",
    type=TestType.SEMANTIC,
    input="Write SQL to select all users",
    expected="SELECT * FROM users;",
    metadata={"similarity_threshold": 0.8}
)

# Run single test
result = framework.run_test(test, your_llm_function, api_key="your_key")

# Create test suite and run
framework.create_test_suite("my_tests", [test])
results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")
```

## Test Types

### 1. Deterministic Tests
Exact string matching for precise outputs.

```python
TestCase(
    id="exact_match",
    type=TestType.DETERMINISTIC,
    input="What is 2+2?",
    expected="4"
)
```

### 2. Semantic Tests
Meaning-based comparison using embeddings.

```python
TestCase(
    id="semantic_test",
    type=TestType.SEMANTIC,
    input="Explain SQL",
    expected="SQL is a language for managing databases",
    metadata={"similarity_threshold": 0.75}
)
```
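To illustrate what a similarity threshold means in practice, here is a minimal sketch of a cosine-similarity check. It is not the framework's implementation: the function names are hypothetical, and the short vectors are toy stand-ins for the sentence embeddings a real model (such as the one named in `EMBEDDINGS_MODEL`) would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes_semantic_test(expected_vec, actual_vec, threshold=0.75):
    """Pass when the two embeddings are at least `threshold` similar."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold


# Toy 3-dimensional vectors standing in for real embeddings
expected_vec = [0.9, 0.1, 0.3]
close_vec = [0.8, 0.2, 0.35]   # points in nearly the same direction
far_vec = [-0.2, 0.9, 0.0]     # points in a very different direction

print(passes_semantic_test(expected_vec, close_vec))
print(passes_semantic_test(expected_vec, far_vec))
```

A higher threshold demands closer paraphrases; a lower one tolerates looser wording at the cost of more false passes.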

### 3. Behavioral Tests
Check if outputs meet behavioral constraints.

```python
TestCase(
    id="sql_safety",
    type=TestType.BEHAVIORAL,
    input="Write a database query",
    expected=None,
    constraints={
        "must_include": ["SELECT", "FROM"],
        "must_exclude": ["DELETE", "DROP"],
        "format": "sql"
    }
)
```
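One plausible way to evaluate `must_include` and `must_exclude` is a case-insensitive substring check, sketched below. `check_constraints` is a hypothetical helper written for illustration; how the framework actually matches terms (and how it handles `format`) is not specified here, so the sketch ignores that key.

```python
def check_constraints(output, constraints):
    """Return a list of violation messages; an empty list means the output passes.

    Assumption: terms are matched as case-insensitive substrings.
    """
    violations = []
    upper = output.upper()
    for term in constraints.get("must_include", []):
        if term.upper() not in upper:
            violations.append(f"missing required term: {term}")
    for term in constraints.get("must_exclude", []):
        if term.upper() in upper:
            violations.append(f"contains forbidden term: {term}")
    return violations


constraints = {
    "must_include": ["SELECT", "FROM"],
    "must_exclude": ["DELETE", "DROP"],
}
print(check_constraints("SELECT id FROM users;", constraints))
print(check_constraints("DROP TABLE users;", constraints))
```

Substring matching is deliberately simple and will flag a column named `dropped_at`; a word-boundary regular expression would be a stricter alternative.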

### 4. Safety Tests
Detect harmful content, PII, and security issues.

```python
TestCase(
    id="safety_check",
    type=TestType.SAFETY,
    input="User input with potential issues",
    expected=None,
    metadata={"check_harmful": True}
)
```
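PII detection is often approximated with regular expressions. The sketch below shows that idea for two patterns only (email addresses and US Social Security numbers); the pattern set and the `find_pii` helper are illustrative assumptions, not the checks the framework runs.

```python
import re

# Illustrative patterns only; real PII detection needs a much broader set
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return a dict mapping each matched PII type to the strings found."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


print(find_pii("Contact jane@example.com, SSN 123-45-6789"))
print(find_pii("Nothing sensitive here"))
```

A safety test would fail whenever `find_pii` returns a non-empty dict for a model output.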

### 5. Performance Tests
Check latency and token usage against configured limits.

```python
TestCase(
    id="performance_test",
    type=TestType.PERFORMANCE,
    input="Complex query requiring fast response",
    expected=None,
    constraints={"max_latency_ms": 2000, "max_tokens": 500}
)
```
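The following sketch shows one way latency and token limits like those above could be measured and enforced. `measure_call` and `check_performance` are hypothetical helpers; the constraint keys come from the example, but the framework's own timing logic may differ.

```python
import time


def measure_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


def check_performance(latency_ms, tokens, constraints):
    """Return a list of failed limits; an empty list means the call passed."""
    failures = []
    max_latency = constraints.get("max_latency_ms")
    max_tokens = constraints.get("max_tokens")
    if max_latency is not None and latency_ms > max_latency:
        failures.append(f"latency {latency_ms:.0f}ms exceeds {max_latency}ms")
    if max_tokens is not None and tokens > max_tokens:
        failures.append(f"{tokens} tokens exceeds {max_tokens}")
    return failures


limits = {"max_latency_ms": 2000, "max_tokens": 500}
result, latency_ms = measure_call(lambda: "stub response")
print(check_performance(latency_ms, tokens=150, constraints=limits))
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring short intervals.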

## Production Monitoring

### Basic Monitoring

```python
from production_monitoring import ProductionMonitor

monitor = ProductionMonitor("production.db")

# Log requests
monitor.log_request(
    request_id="req_123",
    input_text="User question",
    output_text="AI response",
    latency_ms=450,
    tokens_used=150,
    model="openai/gpt-3.5-turbo",
    success=True
)

# Get metrics
metrics = monitor.get_metrics(hours=24)
print(f"Success rate: {metrics['success_rate']:.1%}")
print(f"Average latency: {metrics['avg_latency']:.0f}ms")
```

### Drift Detection

```python
from production_monitoring import DriftDetector

detector = DriftDetector()

# Detect input drift
baseline_inputs = ["Normal queries from last week"]
current_inputs = ["Recent queries"]

has_drift, score, details = detector.detect_input_drift(
    baseline_inputs,
    current_inputs,
    method='embedding'
)

if has_drift:
    alert = detector.create_drift_alert(
        'input', 'query_distribution', 1.0, 1.0 + score, details
    )
    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")
```
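For intuition about the statistical side of drift detection, the sketch below compares input lengths between a baseline and a current window using a standardized mean shift. This is a deliberately simple stand-in, not what `detect_input_drift` does internally; the function names and the threshold of 2.0 are assumptions.

```python
import statistics


def length_drift_score(baseline_inputs, current_inputs):
    """Shift in mean input length, in units of baseline standard deviations."""
    base = [len(text) for text in baseline_inputs]
    cur = [len(text) for text in current_inputs]
    base_mean = statistics.mean(base)
    base_std = statistics.pstdev(base)
    cur_mean = statistics.mean(cur)
    if base_std == 0:
        # No spread in the baseline: any change in the mean counts as maximal drift
        return 0.0 if cur_mean == base_mean else float("inf")
    return abs(cur_mean - base_mean) / base_std


def has_length_drift(baseline_inputs, current_inputs, threshold=2.0):
    """Flag drift when the mean length moved more than `threshold` deviations."""
    return length_drift_score(baseline_inputs, current_inputs) > threshold


baseline = ["a" * 10, "a" * 12, "a" * 8, "a" * 10]
print(has_length_drift(baseline, ["a" * 10, "a" * 11]))
print(has_length_drift(baseline, ["a" * 40, "a" * 50]))
```

Length is only one cheap signal; embedding-based comparison catches topic changes that leave lengths untouched.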

## LLM-as-Judge

Use advanced models to evaluate other model outputs:

```python
from test_suites import LLMJudgeEvaluator

judge = LLMJudgeEvaluator("your_api_key")

evaluation = judge.evaluate_response(
    prompt="Explain machine learning",
    response="ML is a subset of AI that learns from data...",
    criteria={
        "accuracy": "Is the information correct?",
        "clarity": "Is it easy to understand?",
        "completeness": "Does it fully answer the question?"
    }
)

print(f"Overall score: {evaluation['overall_score']}")
```
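The general LLM-as-judge pattern is to render the criteria into a grading prompt, ask the judge model for per-criterion scores, and then aggregate them. The sketch below shows that pattern without calling any model. Everything in it is an assumption for illustration: the prompt wording, the 1 to 10 scale, the JSON reply format, and the unweighted mean are not taken from `LLMJudgeEvaluator`.

```python
import json


def build_judge_prompt(prompt, response, criteria):
    """Render a grading prompt asking for one 1-10 score per criterion as JSON."""
    lines = [
        "You are grading an AI response.",
        f"Original prompt: {prompt}",
        f"Response to grade: {response}",
        "Score each criterion from 1 to 10 and reply with a JSON object:",
    ]
    for name, question in criteria.items():
        lines.append(f"- {name}: {question}")
    return "\n".join(lines)


def aggregate_scores(judge_reply):
    """Parse the judge's JSON reply and compute an unweighted mean score."""
    scores = json.loads(judge_reply)
    overall = sum(scores.values()) / len(scores)
    return {"scores": scores, "overall_score": overall}


criteria = {
    "accuracy": "Is the information correct?",
    "clarity": "Is it easy to understand?",
}
print(build_judge_prompt("Explain machine learning", "ML learns from data", criteria))

# A hand-written reply standing in for real judge-model output
print(aggregate_scores('{"accuracy": 8, "clarity": 9}'))
```

In practice the judge's reply may not be valid JSON, so production code would need to handle `json.JSONDecodeError` and retry or fall back.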

## Test Suites

Pre-built test suites for common use cases:

```python
from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")

# Create domain-specific test suites
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()
test_creator.create_edge_case_tests()

# Run specific suite
results = test_creator.framework.run_suite(
    "sql_generation",
    your_llm_function,
    api_key="your_key"
)
```

Available test suites:
- **SQL Generation**: Test database query generation
- **Data Quality**: Test data validation capabilities
- **Data Analysis**: Test analytical reasoning
- **Pipeline Analysis**: Test code analysis skills
- **Regression**: Catch model performance regressions
- **Edge Cases**: Handle unusual inputs
- **Domain Expertise**: Test specialized knowledge

## Advanced Usage

### Custom Test Functions

```python
def my_llm_function(prompt, **kwargs):
    """Your custom LLM function"""
    # Process prompt with your model (your_model and calculate_cost are placeholders)
    response = your_model.generate(prompt)

    return {
        'response': response.text,
        'tokens': response.token_count,
        'cost': calculate_cost(response.token_count)
    }

# Use with framework
framework.run_test(test_case, my_llm_function, custom_param="value")
```
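When developing tests, it can help to swap the real model for a stub that returns canned answers in the same dict shape (`response`, `tokens`, `cost`). The sketch below is one way to do that; `make_stub_llm` is a hypothetical helper, and counting whitespace-separated words is only a rough placeholder for real token counts.

```python
def make_stub_llm(canned_responses, default="I don't know"):
    """Build an LLM function that answers from a fixed prompt-to-response mapping."""

    def stub_llm(prompt, **kwargs):
        text = canned_responses.get(prompt, default)
        return {
            "response": text,
            "tokens": len(text.split()),  # crude word count, not real tokenization
            "cost": 0.0,                  # no API call is made
        }

    return stub_llm


stub = make_stub_llm({"What is 2+2?": "4"})
print(stub("What is 2+2?"))
print(stub("Unknown question", custom_param="value"))
```

Because the stub accepts `**kwargs`, it can be passed anywhere the framework expects an LLM function with extra keyword arguments.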

### Continuous Integration

```python
import os
import sys

from testing_framework import EvaluationFramework


def ci_test_pipeline():
    """CI/CD pipeline test function"""
    framework = EvaluationFramework(os.getenv("OPENROUTER_API_KEY"))

    # Run critical tests
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()

    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        sys.exit(1)
    else:
        print("✅ DEPLOYMENT APPROVED")
```

### Custom Drift Detection

```python
from production_monitoring import DriftDetector


class CustomDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics):
        """Custom drift detection for business metrics"""
        # Your custom logic
        pass


detector = CustomDriftDetector()
```

## Metrics & Reporting

### Available Metrics

- **Request Metrics**: Volume, success rate, latency percentiles
- **Cost Metrics**: Token usage, cost per request, total spend
- **Quality Metrics**: User feedback, test pass rates
- **Drift Metrics**: Input/output distribution changes
- **Error Metrics**: Error types and frequencies
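To make the request metrics concrete, the sketch below computes volume, success rate, and latency percentiles from a list of logged requests using the nearest-rank method. The record fields mirror the `log_request` arguments shown earlier, but `percentile` and `summarize_requests` are hypothetical helpers and the framework may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(records):
    """Summarize a non-empty list of request records."""
    latencies = [r["latency_ms"] for r in records]
    successes = sum(1 for r in records if r["success"])
    return {
        "volume": len(records),
        "success_rate": successes / len(records),
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
    }


records = [
    {"latency_ms": 100, "success": True},
    {"latency_ms": 200, "success": True},
    {"latency_ms": 300, "success": True},
    {"latency_ms": 400, "success": False},
]
print(summarize_requests(records))
```

Percentiles are usually more informative than the average for latency, since a few slow requests can hide behind a healthy mean.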

### Exporting Data

```python
from datetime import datetime

# Assumes `framework`, `monitor`, and `results` exist from the earlier examples

# Get test history
history = framework.get_test_history(hours=24)

# Export to CSV
history.to_csv("test_results.csv", index=False)

# Get production metrics
metrics = monitor.get_metrics(hours=24)

# Custom reporting
report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
```

## Configuration

### Environment Variables

```bash
# OpenRouter API
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

# Models
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
EMBEDDINGS_MODEL=all-MiniLM-L6-v2

# Database
TEST_RESULTS_DB=test_results.db
PRODUCTION_METRICS_DB=production_metrics.db

# Thresholds
ALERT_LATENCY_THRESHOLD_MS=2000
ALERT_ERROR_RATE_THRESHOLD=0.05
SIMILARITY_THRESHOLD=0.8
```

### Programmatic Configuration

```python
# Custom thresholds
monitor.alert_thresholds.update({
    'latency_p95': 1500,       # 1.5s
    'error_rate': 0.02,        # 2%
    'cost_per_request': 0.05   # $0.05
})

# Custom pricing
pricing = OpenRouterPricing()
pricing.pricing['custom/model'] = 0.001 / 1000
```
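The custom pricing entry above stores a per-token rate (here $0.001 per 1,000 tokens). As a rough sketch of how such a table turns into a cost figure, the snippet below multiplies token usage by the rate. `estimate_cost` is a hypothetical function written for illustration and may not match how `calculate_cost` works internally, for example if prompt and completion tokens are priced separately.

```python
def estimate_cost(total_tokens, model, price_per_token):
    """Estimated cost in dollars for a request, given a per-token price table."""
    if model not in price_per_token:
        raise KeyError(f"no pricing configured for model: {model}")
    return total_tokens * price_per_token[model]


# Same rate as the custom pricing example: $0.001 per 1,000 tokens
price_table = {"custom/model": 0.001 / 1000}
print(f"${estimate_cost(1500, 'custom/model', price_table):.6f}")
```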

## Alerting & Monitoring

### Real-time Alerts

The framework automatically generates alerts for:
- High latency (>2s by default)
- High error rates (>5% by default)
- High costs (>$0.10 per request by default)
- Drift detection across inputs/outputs/performance
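A threshold-based alert check can be sketched as a comparison of current metrics against a limits table. The default values below come from the list above, but the metric key names and the `check_thresholds` helper are assumptions made for this example rather than the framework's internals.

```python
# Defaults taken from the list above; key names are illustrative
DEFAULT_THRESHOLDS = {
    "latency_ms": 2000,        # 2s
    "error_rate": 0.05,        # 5%
    "cost_per_request": 0.10,  # $0.10
}


def check_thresholds(metrics, thresholds=None):
    """Return one alert dict per metric that exceeds its threshold."""
    limits = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    return [
        {"metric": name, "value": metrics[name], "threshold": limit}
        for name, limit in limits.items()
        if name in metrics and metrics[name] > limit
    ]


current = {"latency_ms": 2500, "error_rate": 0.01, "cost_per_request": 0.20}
for alert in check_thresholds(current):
    print(f"{alert['metric']}: {alert['value']} > {alert['threshold']}")
```

Metrics missing from the input are skipped rather than treated as failures, which keeps the check usable when only some values are available.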

### Custom Alerts

```python
def custom_alert_check(monitor):
    metrics = monitor.get_metrics(1)

    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars
        monitor._create_alert(
            'low_satisfaction',
            'WARNING',
            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',
            metrics['avg_user_feedback'],
            3.0
        )
```

## Best Practices

### Test Design

1. **Start with Critical Tests**: Focus on core functionality first
2. **Use Multiple Test Types**: Combine deterministic, semantic, and behavioral tests
3. **Set Appropriate Thresholds**: Tune similarity thresholds based on your use case
4. **Regular Test Updates**: Update test cases as your model evolves

### Production Monitoring

1. **Log Everything**: Capture inputs, outputs, latency, and user feedback
2. **Set Smart Alerts**: Avoid alert fatigue with meaningful thresholds
3. **Monitor Trends**: Look at metrics over time, not just point values
4. **Regular Drift Checks**: Run drift detection daily or weekly

### Performance Optimization

1. **Batch Testing**: Run tests in parallel when possible
2. **Cache Embeddings**: Reuse embeddings for semantic comparisons
3. **Database Indexing**: Ensure proper indexes on timestamp fields
4. **Cleanup Old Data**: Archive old test results and metrics
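The embedding-cache tip can be sketched with `functools.lru_cache`, which memoizes results keyed by the input text. In the example below, `expensive_embed` is a placeholder that fakes an embedding and counts how often it runs; a real setup would call an embedding model there instead.

```python
from functools import lru_cache

# Counts how many times the (placeholder) model is actually invoked
stats = {"model_calls": 0}


def expensive_embed(text):
    """Placeholder for a real embedding model call; returns a fake vector."""
    stats["model_calls"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])


@lru_cache(maxsize=10_000)
def cached_embed(text):
    """Return the embedding for text, computing it only on first use."""
    return expensive_embed(text)
```

Calling `cached_embed` twice with the same expected answer triggers only one model call, which matters when many test cases share the same `expected` text. The embedding is returned as a tuple so cached values cannot be mutated by accident.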

## Troubleshooting

### Common Issues

**API Key Issues**
```bash
# Check if API key is set
python -c "import os; print('API Key:', os.getenv('OPENROUTER_API_KEY', 'NOT SET'))"

# Test API connection
python openrouter_client.py
```

**Database Issues**
```bash
# Check database permissions
ls -la *.db

# Reset databases
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"
```

**Import Issues**
```bash
# Check dependencies
pip install -r requirements.txt

# Check Python path
python -c "import sys; print(sys.path)"
```

### Debug Mode

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# Detailed test execution
framework.run_test(test_case, my_function, debug=True)
```

## Contributing

We welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.

### Quick Start for Contributors
1. **Fork the repository** on GitHub
2. **Clone your fork**: `git clone https://github.com/your-username/ai-agent-evals.git`
3. **Create a feature branch**: `git checkout -b feature/amazing-feature`
4. **Install dependencies**: `pip install -r requirements.txt`
5. **Make your changes** and add tests
6. **Run tests**: `python test_imports.py`
7. **Commit changes**: `git commit -m 'Add amazing feature'`
8. **Push to branch**: `git push origin feature/amazing-feature`
9. **Open a Pull Request** on GitHub

### Development Setup
```bash
# Clone the repository
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals

# Install dependencies
pip install -r requirements.txt

# Run validation
python test_imports.py

# Run demo (requires OpenRouter API key)
python main_demo.py
```

### Code Style
- Follow PEP 8 guidelines
- Use type hints where appropriate
- Add docstrings for all public functions
- Run `black` for code formatting
- Add tests for new functionality

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support

For questions and issues:

1. Check the troubleshooting section above
2. Review the demo script (`main_demo.py`) for examples
3. Check that your OpenRouter API key has sufficient credits
4. Ensure all dependencies are properly installed

## Roadmap

- [ ] Web dashboard for monitoring
- [ ] Integration with MLflow/Weights & Biases
- [ ] A/B testing framework
- [ ] Multi-model comparison tools
- [ ] Advanced anomaly detection
- [ ] Custom evaluation metrics
- [ ] Automated model retraining triggers

---

**Built for reliable AI systems in production**