https://github.com/drc-infinyon/ai-agent-evals
🤖 Comprehensive AI Agent Evaluation Framework: Vanilla testing, monitoring & drift detection using OpenRouter. Features 5 test types, LLM-as-Judge, production monitoring, and 100+ model support.
ai drift-detection evaluation llm machine-learning monitoring openrouter production-monitoring python testing
- Host: GitHub
- URL: https://github.com/drc-infinyon/ai-agent-evals
- Owner: drc-infinyon
- License: mit
- Created: 2025-09-11T18:39:52.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-11T20:25:56.000Z (10 months ago)
- Last Synced: 2025-09-11T22:59:49.260Z (10 months ago)
- Topics: ai, drift-detection, evaluation, llm, machine-learning, monitoring, openrouter, production-monitoring, python, testing
- Language: Python
- Homepage: https://github.com/drc-infinyon/ai-agent-evals
- Size: 135 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AI Agent Evaluation Framework
A comprehensive vanilla evaluation, testing, and monitoring system for AI agents using OpenRouter as the LLM provider.
## Overview
This framework provides enterprise-grade tools for:
- **Comprehensive Testing**: Deterministic, semantic, behavioral, safety, and performance tests
- **LLM-as-Judge Evaluation**: Use advanced models to evaluate other model outputs
- **Production Monitoring**: Real-time metrics, alerting, and drift detection
- **Drift Detection**: Statistical and semantic drift detection across inputs, outputs, and performance
- **Multi-Model Support**: Works with any model available through OpenRouter
## Quick Start
### 1. Installation
```bash
# Clone the framework (or download and unpack it)
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Set up environment
cp .env.template .env
# Edit .env with your OpenRouter API key
```
### 2. Configuration
Edit your `.env` file:
```bash
# Required
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Optional (defaults provided)
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
SIMILARITY_THRESHOLD=0.8
```
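For illustration, the settings above could be read in Python roughly as follows. This is a minimal sketch: `load_settings` is a hypothetical helper, not part of the framework, and the fallback values simply mirror the defaults shown in the sample `.env`.

```python
import os


def load_settings(env=None):
    """Hypothetical helper: read settings from an environment mapping."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        # The API key is the only required value
        raise RuntimeError("OPENROUTER_API_KEY is required")
    return {
        "api_key": api_key,
        "default_model": env.get("DEFAULT_MODEL", "openai/gpt-3.5-turbo"),
        "judge_model": env.get("JUDGE_MODEL", "openai/gpt-4"),
        # Environment values are strings, so convert the threshold explicitly
        "similarity_threshold": float(env.get("SIMILARITY_THRESHOLD", "0.8")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests.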
### 3. Run the Demo
```bash
python main_demo.py
```
This will run a comprehensive demonstration of all features.
## Core Components
### OpenRouter Client (`openrouter_client.py`)
An OpenAI-compatible client for the OpenRouter API with built-in pricing calculation.
```python
from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")
pricing = OpenRouterPricing()

response = client.chat.completions.create(
    model="openai/gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")
```
### Evaluation Framework (`testing_framework.py`)
Core testing engine with multiple evaluation strategies.
```python
from testing_framework import EvaluationFramework, TestCase, TestType

framework = EvaluationFramework("your_api_key")

# Create a test case
test = TestCase(
    id="test_sql_basic",
    type=TestType.SEMANTIC,
    input="Write SQL to select all users",
    expected="SELECT * FROM users;",
    metadata={"similarity_threshold": 0.8}
)

# Run a single test
result = framework.run_test(test, your_llm_function, api_key="your_key")

# Create a test suite and run it
framework.create_test_suite("my_tests", [test])
results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")
```
## Test Types
### 1. Deterministic Tests
Exact string matching for precise outputs.
```python
TestCase(
    id="exact_match",
    type=TestType.DETERMINISTIC,
    input="What is 2+2?",
    expected="4"
)
```
### 2. Semantic Tests
Meaning-based comparison using embeddings.
```python
TestCase(
    id="semantic_test",
    type=TestType.SEMANTIC,
    input="Explain SQL",
    expected="SQL is a language for managing databases",
    metadata={"similarity_threshold": 0.75}
)
```
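To show what an embedding comparison against `similarity_threshold` involves, here is a minimal sketch using cosine similarity on plain Python lists. The function names are hypothetical and the vectors are toy values; in practice they would come from an embedding model such as the one named by `EMBEDDINGS_MODEL`. The framework's actual scoring may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def passes_semantic_check(expected_vec, actual_vec, threshold=0.8):
    """Hypothetical pass rule: similarity must reach the threshold."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold
```

A lower threshold accepts looser paraphrases; a higher one demands near-identical meaning.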
### 3. Behavioral Tests
Check if outputs meet behavioral constraints.
```python
TestCase(
    id="sql_safety",
    type=TestType.BEHAVIORAL,
    input="Write a database query",
    expected=None,
    constraints={
        "must_include": ["SELECT", "FROM"],
        "must_exclude": ["DELETE", "DROP"],
        "format": "sql"
    }
)
```
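One way such `must_include` / `must_exclude` constraints could be checked is sketched below. `check_constraints` is a hypothetical helper, and the case-insensitive whole-word matching is an assumption rather than the framework's documented behavior; the `format` key is not handled here.

```python
import re


def _contains_word(text, term):
    """True if `term` appears in `text` as a whole word, ignoring case."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def check_constraints(output, constraints):
    """Return a list of violation messages; an empty list means the output passes."""
    violations = []
    for term in constraints.get("must_include", []):
        if not _contains_word(output, term):
            violations.append(f"missing required term: {term}")
    for term in constraints.get("must_exclude", []):
        if _contains_word(output, term):
            violations.append(f"contains forbidden term: {term}")
    return violations
```

Whole-word matching avoids flagging identifiers such as `dropdown` when `DROP` is excluded.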
### 4. Safety Tests
Detect harmful content, PII, and security issues.
```python
TestCase(
    id="safety_check",
    type=TestType.SAFETY,
    input="User input with potential issues",
    expected=None,
    metadata={"check_harmful": True}
)
```
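As a rough illustration of the PII part of a safety check, the sketch below scans text with two regular expressions. The patterns and the `find_pii` name are assumptions made for this example; they are deliberately simple and would miss many real-world formats.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return a dict of pattern name -> list of matches found in the text."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found
```

An empty result means none of the listed patterns matched, not that the text is free of personal data.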
### 5. Performance Tests
Measure latency and cost constraints.
```python
TestCase(
    id="performance_test",
    type=TestType.PERFORMANCE,
    input="Complex query requiring fast response",
    expected=None,
    constraints={"max_latency_ms": 2000, "max_tokens": 500}
)
```
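Latency for a check like this can be measured with a monotonic timer around the model call. The sketch below assumes nothing about the framework's internals; `timed_call` and `within_limits` are hypothetical names.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


def within_limits(latency_ms, tokens, constraints):
    """Compare measurements against optional max_latency_ms / max_tokens limits."""
    max_latency = constraints.get("max_latency_ms", float("inf"))
    max_tokens = constraints.get("max_tokens", float("inf"))
    return latency_ms <= max_latency and tokens <= max_tokens
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring short durations.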
## Production Monitoring
### Basic Monitoring
```python
from production_monitoring import ProductionMonitor

monitor = ProductionMonitor("production.db")

# Log requests
monitor.log_request(
    request_id="req_123",
    input_text="User question",
    output_text="AI response",
    latency_ms=450,
    tokens_used=150,
    model="openai/gpt-3.5-turbo",
    success=True
)

# Get metrics
metrics = monitor.get_metrics(hours=24)
print(f"Success rate: {metrics['success_rate']:.1%}")
print(f"Average latency: {metrics['avg_latency']:.0f}ms")
```
### Drift Detection
```python
from production_monitoring import DriftDetector

detector = DriftDetector()

# Detect input drift
baseline_inputs = ["Normal queries from last week"]
current_inputs = ["Recent queries"]
has_drift, score, details = detector.detect_input_drift(
    baseline_inputs,
    current_inputs,
    method='embedding'
)
if has_drift:
    alert = detector.create_drift_alert(
        'input', 'query_distribution', 1.0, 1.0 + score, details
    )
    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")
```
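One simple way to score embedding-based drift is to compare the centroids of the baseline and current embedding sets. The sketch below illustrates that idea under stated assumptions: it is not necessarily what `DriftDetector` does, the function names are hypothetical, and the `0.1` threshold is an arbitrary placeholder.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline_vecs, current_vecs, threshold=0.1):
    """Return (has_drift, score) based on the distance between centroids."""
    score = cosine_distance(centroid(baseline_vecs), centroid(current_vecs))
    return score > threshold, score
```

Centroid distance only captures shifts in the average input; changes in spread or the appearance of new clusters need a distribution-level test.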
## LLM-as-Judge
Use advanced models to evaluate other model outputs:
```python
from test_suites import LLMJudgeEvaluator

judge = LLMJudgeEvaluator("your_api_key")

evaluation = judge.evaluate_response(
    prompt="Explain machine learning",
    response="ML is a subset of AI that learns from data...",
    criteria={
        "accuracy": "Is the information correct?",
        "clarity": "Is it easy to understand?",
        "completeness": "Does it fully answer the question?"
    }
)
print(f"Overall score: {evaluation['overall_score']}")
```
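The general LLM-as-judge pattern is to turn the criteria into a grading prompt, ask the judge model for structured scores, and aggregate them. The sketch below shows the prompt-building and aggregation halves without making an API call. The helper names, the 1 to 10 scale, the JSON reply format, and the plain average are all assumptions for illustration, not the behavior of `LLMJudgeEvaluator`.

```python
import json


def build_judge_prompt(prompt, response, criteria):
    """Assemble a grading prompt that asks for one score per criterion."""
    lines = [
        "Rate the response on each criterion from 1 to 10.",
        "Reply with a JSON object mapping each criterion name to its score.",
        f"Prompt: {prompt}",
        f"Response: {response}",
        "Criteria:",
    ]
    lines += [f"- {name}: {question}" for name, question in criteria.items()]
    return "\n".join(lines)


def parse_judge_reply(reply_text, criteria):
    """Parse the judge's JSON reply and average the per-criterion scores."""
    scores = json.loads(reply_text)
    missing = [name for name in criteria if name not in scores]
    if missing:
        raise ValueError(f"judge omitted criteria: {missing}")
    per_criterion = {name: float(scores[name]) for name in criteria}
    overall = sum(per_criterion.values()) / len(per_criterion)
    return {"scores": per_criterion, "overall_score": overall}
```

Validating that every criterion was scored guards against a judge reply that silently skips one.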
## Test Suites
Pre-built test suites for common use cases:
```python
from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")

# Create domain-specific test suites
test_creator.create_sql_generation_tests()
test_creator.create_data_quality_tests()
test_creator.create_safety_tests()
test_creator.create_edge_case_tests()

# Run a specific suite
results = test_creator.framework.run_suite(
    "sql_generation",
    your_llm_function,
    api_key="your_key"
)
```
Available test suites:
- **SQL Generation**: Test database query generation
- **Data Quality**: Test data validation capabilities
- **Data Analysis**: Test analytical reasoning
- **Pipeline Analysis**: Test code analysis skills
- **Regression**: Catch model performance regressions
- **Edge Cases**: Handle unusual inputs
- **Domain Expertise**: Test specialized knowledge
## Advanced Usage
### Custom Test Functions
```python
def my_llm_function(prompt, **kwargs):
    """Your custom LLM function"""
    # Process the prompt with your model (`your_model` is a placeholder)
    response = your_model.generate(prompt)
    return {
        'response': response.text,
        'tokens': response.token_count,
        'cost': calculate_cost(response.token_count)  # placeholder cost helper
    }

# Use with the framework
framework.run_test(test_case, my_llm_function, custom_param="value")
```
### Continuous Integration
```python
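import os
import sys

from testing_framework import EvaluationFramework

def ci_test_pipeline():
    """CI/CD pipeline test function"""
    # Uses the same variable name as the .env file
    framework = EvaluationFramework(os.getenv("OPENROUTER_API_KEY"))
    # Run critical tests (the suite must already be created on this framework)
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()
    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        sys.exit(1)  # a non-zero exit code fails the CI job
    else:
        print("✅ DEPLOYMENT APPROVED")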
```
### Custom Drift Detection
```python
from production_monitoring import DriftDetector

class CustomDriftDetector(DriftDetector):
    def detect_business_metric_drift(self, baseline_metrics, current_metrics):
        """Custom drift detection for business metrics"""
        # Your custom logic
        pass

detector = CustomDriftDetector()
```
## Metrics & Reporting
### Available Metrics
- **Request Metrics**: Volume, success rate, latency percentiles
- **Cost Metrics**: Token usage, cost per request, total spend
- **Quality Metrics**: User feedback, test pass rates
- **Drift Metrics**: Input/output distribution changes
- **Error Metrics**: Error types and frequencies
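Latency percentiles such as p95 can be computed from logged latencies with the standard library. This is a minimal sketch; `latency_summary` is a hypothetical helper and the choice of the "inclusive" interpolation method is an assumption, so the framework may report slightly different values.

```python
import statistics


def latency_summary(latencies_ms):
    """Return p50, p95, and p99 for a list of at least two latency samples."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Percentiles are usually more informative than the mean here, because a few very slow requests can hide behind a healthy average.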
### Exporting Data
```python
from datetime import datetime

# Get test history
history = framework.get_test_history(hours=24)
# Export to CSV
history.to_csv("test_results.csv", index=False)
# Get production metrics
metrics = monitor.get_metrics(hours=24)
# Custom reporting (`results` is the output of an earlier run_suite call)
report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
```
## Configuration
### Environment Variables
```bash
# OpenRouter API
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# Models
DEFAULT_MODEL=openai/gpt-3.5-turbo
JUDGE_MODEL=openai/gpt-4
EMBEDDINGS_MODEL=all-MiniLM-L6-v2
# Database
TEST_RESULTS_DB=test_results.db
PRODUCTION_METRICS_DB=production_metrics.db
# Thresholds
ALERT_LATENCY_THRESHOLD_MS=2000
ALERT_ERROR_RATE_THRESHOLD=0.05
SIMILARITY_THRESHOLD=0.8
```
### Programmatic Configuration
```python
# Custom thresholds
monitor.alert_thresholds.update({
    'latency_p95': 1500,       # 1.5s
    'error_rate': 0.02,        # 2%
    'cost_per_request': 0.05   # $0.05
})

# Custom pricing
pricing = OpenRouterPricing()
pricing.pricing['custom/model'] = 0.001 / 1000
```
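If the values in `pricing.pricing` are flat per-token rates in US dollars (an assumption; the pricing model is not spelled out here), then a rate of `0.001 / 1000` means $0.001 per 1,000 tokens, and the cost arithmetic reduces to a single multiplication. `estimate_cost` below is a hypothetical stand-in, not the framework's `calculate_cost`.

```python
def estimate_cost(total_tokens, model, price_per_token):
    """Hypothetical flat-rate cost: tokens multiplied by the per-token price."""
    if model not in price_per_token:
        raise KeyError(f"no price configured for {model}")
    return total_tokens * price_per_token[model]


# Illustrative rate taken from the custom pricing example above
prices = {"custom/model": 0.001 / 1000}
cost = estimate_cost(150, "custom/model", prices)  # 150 tokens at $0.000001 each
```

Providers often charge different rates for prompt and completion tokens, which a single flat rate does not capture.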
## Alerting & Monitoring
### Real-time Alerts
The framework automatically generates alerts for:
- High latency (>2s by default)
- High error rates (>5% by default)
- High costs (>$0.10 per request by default)
- Drift detection across inputs/outputs/performance
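The default limits listed above amount to a simple threshold comparison. The sketch below shows that logic in isolation; the metric key names and the `check_thresholds` helper are hypothetical and do not reflect the monitor's internal schema.

```python
# Default limits as listed above; the key names are illustrative
DEFAULT_THRESHOLDS = {
    "latency_ms": 2000,        # 2 s
    "error_rate": 0.05,        # 5%
    "cost_per_request": 0.10,  # $0.10
}


def check_thresholds(metrics, thresholds=None):
    """Return (name, value, limit) for every metric above its limit."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    return [
        (name, metrics[name], limit)
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    ]
```

Metrics that are missing from the input are skipped rather than treated as breaches.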
### Custom Alerts
```python
def custom_alert_check(monitor):
    metrics = monitor.get_metrics(1)
    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars
        monitor._create_alert(
            'low_satisfaction',
            'WARNING',
            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',
            metrics['avg_user_feedback'],
            3.0
        )
```
## Best Practices
### Test Design
1. **Start with Critical Tests**: Focus on core functionality first
2. **Use Multiple Test Types**: Combine deterministic, semantic, and behavioral tests
3. **Set Appropriate Thresholds**: Tune similarity thresholds based on your use case
4. **Regular Test Updates**: Update test cases as your model evolves
### Production Monitoring
1. **Log Everything**: Capture inputs, outputs, latency, and user feedback
2. **Set Smart Alerts**: Avoid alert fatigue with meaningful thresholds
3. **Monitor Trends**: Look at metrics over time, not just point values
4. **Regular Drift Checks**: Run drift detection daily or weekly
### Performance Optimization
1. **Batch Testing**: Run tests in parallel when possible
2. **Cache Embeddings**: Reuse embeddings for semantic comparisons
3. **Database Indexing**: Ensure proper indexes on timestamp fields
4. **Cleanup Old Data**: Archive old test results and metrics
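Embedding caching, for example, can be as simple as memoizing the embedding function. This is a minimal sketch assuming an `embed_fn` that maps a string to a sequence of floats; `make_cached_embedder` is a hypothetical helper, not part of the framework.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap embed_fn so repeated texts are embedded only once."""

    @lru_cache(maxsize=maxsize)
    def cached(text):
        # Return a tuple so cached values cannot be mutated by callers
        return tuple(embed_fn(text))

    return cached
```

This pays off most for the `expected` side of semantic tests, which stays the same from run to run.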
## Troubleshooting
### Common Issues
**API Key Issues**
```bash
# Check if API key is set
python -c "import os; print('API Key:', os.getenv('OPENROUTER_API_KEY', 'NOT SET'))"
# Test API connection
python openrouter_client.py
```
**Database Issues**
```bash
# Check database permissions
ls -la *.db
# Reset databases (this deletes every .db file in the current directory)
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"
```
**Import Issues**
```bash
# Check dependencies
pip install -r requirements.txt
# Check Python path
python -c "import sys; print(sys.path)"
```
### Debug Mode
```python
import logging
logging.basicConfig(level=logging.DEBUG)
# Detailed test execution
framework.run_test(test_case, my_function, debug=True)
```
## Contributing
We welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.
### Quick Start for Contributors
1. **Fork the repository** on GitHub
2. **Clone your fork**: `git clone https://github.com/your-username/ai-agent-evals.git`
3. **Create a feature branch**: `git checkout -b feature/amazing-feature`
4. **Install dependencies**: `pip install -r requirements.txt`
5. **Make your changes** and add tests
6. **Run tests**: `python test_imports.py`
7. **Commit changes**: `git commit -m 'Add amazing feature'`
8. **Push to branch**: `git push origin feature/amazing-feature`
9. **Open a Pull Request** on GitHub
### Development Setup
```bash
# Clone the repository
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals
# Install dependencies
pip install -r requirements.txt
# Run validation
python test_imports.py
# Run demo (requires OpenRouter API key)
python main_demo.py
```
### Code Style
- Follow PEP 8 guidelines
- Use type hints where appropriate
- Add docstrings for all public functions
- Run `black` for code formatting
- Add tests for new functionality
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support
For questions and issues:
1. Check the troubleshooting section above
2. Review the demo script (`main_demo.py`) for examples
3. Check that your OpenRouter API key has sufficient credits
4. Ensure all dependencies are properly installed
## Roadmap
- [ ] Web dashboard for monitoring
- [ ] Integration with MLflow/Weights & Biases
- [ ] A/B testing framework
- [ ] Multi-model comparison tools
- [ ] Advanced anomaly detection
- [ ] Custom evaluation metrics
- [ ] Automated model retraining triggers
---
**Built for reliable AI systems in production**