https://github.com/drc-infinyon/ai-agent-evals

🤖 Comprehensive AI Agent Evaluation Framework: Vanilla testing, monitoring & drift detection using OpenRouter. Features 5 test types, LLM-as-Judge, production monitoring, and 100+ model support.
Host: GitHub
URL: https://github.com/drc-infinyon/ai-agent-evals
Owner: drc-infinyon
License: mit
Created: 2025-09-11T18:39:52.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-09-11T20:25:56.000Z (10 months ago)
Last Synced: 2025-09-11T22:59:49.260Z (10 months ago)
Topics: ai, drift-detection, evaluation, llm, machine-learning, monitoring, openrouter, production-monitoring, python, testing
Language: Python
Homepage: https://github.com/drc-infinyon/ai-agent-evals
Size: 135 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AI Agent Evaluation Framework

[![CI/CD Pipeline](https://github.com/drc-infinyon/ai-agent-evals/actions/workflows/ci.yml/badge.svg)](https://github.com/drc-infinyon/ai-agent-evals/actions/workflows/ci.yml)

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![OpenRouter](https://img.shields.io/badge/LLM-OpenRouter-green.svg)](https://openrouter.ai)

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A comprehensive vanilla evaluation, testing, and monitoring system for AI agents using OpenRouter as the LLM provider.

## Overview

This framework provides enterprise-grade tools for:

- **Comprehensive Testing**: Deterministic, semantic, behavioral, safety, and performance tests

- **LLM-as-Judge Evaluation**: Use advanced models to evaluate other model outputs

- **Production Monitoring**: Real-time metrics, alerting, and drift detection  

- **Drift Detection**: Statistical and semantic drift detection across inputs, outputs, and performance

- **Multi-Model Support**: Works with any model available through OpenRouter

## Quick Start

### 1. Installation

```bash
# Clone or download the framework
git clone https://github.com/drc-infinyon/ai-agent-evals.git
cd ai-agent-evals

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.template .env

# Edit .env with your OpenRouter API key
```

### 2. Configuration

Edit your `.env` file:

```bash

# Required

OPENROUTER_API_KEY=your_openrouter_api_key_here

# Optional (defaults provided)

DEFAULT_MODEL=openai/gpt-3.5-turbo

JUDGE_MODEL=openai/gpt-4

SIMILARITY_THRESHOLD=0.8

```
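
How the framework itself loads these values is not shown here. As a minimal sketch, assuming the variables are available in the process environment, they could be read with the documented defaults as follows. The `Settings` dataclass and `load_settings` helper are hypothetical names used for illustration and are not part of the repository.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical container for the values documented in .env."""
    api_key: str
    default_model: str
    judge_model: str
    similarity_threshold: float


def load_settings(env=None):
    """Read settings from a mapping; falls back to os.environ when omitted."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY", "")
    if not api_key:
        # The key is the only required value; fail early with a clear message.
        raise ValueError("OPENROUTER_API_KEY is required")
    return Settings(
        api_key=api_key,
        default_model=env.get("DEFAULT_MODEL", "openai/gpt-3.5-turbo"),
        judge_model=env.get("JUDGE_MODEL", "openai/gpt-4"),
        similarity_threshold=float(env.get("SIMILARITY_THRESHOLD", "0.8")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in unit tests without touching the real environment.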

### 3. Run the Demo

```bash

python main_demo.py

```

This will run a comprehensive demonstration of all features.

## Core Components

### OpenRouter Client (`openrouter_client.py`)

OpenAI-compatible client for OpenRouter API with built-in pricing calculation.

```python

from openrouter_client import OpenRouterClient, OpenRouterPricing

client = OpenRouterClient(api_key="your_key")

pricing = OpenRouterPricing()

response = client.chat.completions.create(

    model="openai/gpt-3.5-turbo",

    messages=[{"role": "user", "content": "Hello!"}]

)

cost = pricing.calculate_cost(response.usage.total_tokens, "openai/gpt-3.5-turbo")

```

### Evaluation Framework (`testing_framework.py`)

Core testing engine with multiple evaluation strategies.

```python

from testing_framework import EvaluationFramework, TestCase, TestType

framework = EvaluationFramework("your_api_key")

# Create a test case

test = TestCase(

    id="test_sql_basic",

    type=TestType.SEMANTIC,

    input="Write SQL to select all users",

    expected="SELECT * FROM users;",

    metadata={"similarity_threshold": 0.8}

)

# Run single test

result = framework.run_test(test, your_llm_function, api_key="your_key")

# Create test suite and run

framework.create_test_suite("my_tests", [test])

results = framework.run_suite("my_tests", your_llm_function, api_key="your_key")

```

## Test Types

### 1. Deterministic Tests

Exact string matching for precise outputs.

```python

TestCase(

    id="exact_match",

    type=TestType.DETERMINISTIC,

    input="What is 2+2?",

    expected="4"

)

```
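
Exact matching is strict, so a model that answers "The answer is 4" would fail the test above. The sketch below illustrates the idea only; `deterministic_pass` is a hypothetical helper, and whether the framework trims whitespace before comparing is an assumption.

```python
def deterministic_pass(output, expected, strip=True):
    """Return True only when the output equals the expected string exactly."""
    if strip:
        # Assumption: surrounding whitespace is not significant.
        output, expected = output.strip(), expected.strip()
    return output == expected
```

Deterministic tests therefore suit prompts that constrain the output format tightly; free-form answers are usually better served by semantic tests.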

### 2. Semantic Tests  

Meaning-based comparison using embeddings.

```python

TestCase(

    id="semantic_test",

    type=TestType.SEMANTIC,

    input="Explain SQL",

    expected="SQL is a language for managing databases",

    metadata={"similarity_threshold": 0.75}

)

```
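
A common way to compare meaning with embeddings is cosine similarity against a threshold. The following is a sketch of that technique on plain Python lists; it assumes the embedding vectors have already been produced by some embedding model, and `cosine_similarity` and `semantic_pass` are illustrative names rather than the framework's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_pass(output_vec, expected_vec, threshold=0.75):
    """Pass when similarity reaches the configured threshold."""
    return cosine_similarity(output_vec, expected_vec) >= threshold
```

Raising `similarity_threshold` makes the test stricter; values that are too high tend to reject valid paraphrases.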

### 3. Behavioral Tests

Check if outputs meet behavioral constraints.

```python

TestCase(

    id="sql_safety",

    type=TestType.BEHAVIORAL,

    input="Write a database query",

    expected=None,

    constraints={

        "must_include": ["SELECT", "FROM"],

        "must_exclude": ["DELETE", "DROP"],

        "format": "sql"

    }

)

```
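
To make the constraint semantics concrete, here is a minimal sketch of how `must_include` and `must_exclude` could be checked. It assumes case-insensitive substring matching and ignores the `format` key; `check_constraints` is a hypothetical helper, not the framework's implementation.

```python
def check_constraints(output, constraints):
    """Return a list of violations; an empty list means the output passes."""
    text = output.upper()
    violations = []
    for term in constraints.get("must_include", []):
        if term.upper() not in text:
            violations.append(f"missing required term: {term}")
    for term in constraints.get("must_exclude", []):
        if term.upper() in text:
            violations.append(f"contains forbidden term: {term}")
    return violations
```

Substring matching is deliberately simple: a column named `dropped_at` would trip the `DROP` rule, so word-boundary matching may be preferable for real SQL checks.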

### 4. Safety Tests

Detect harmful content, PII, and security issues.

```python

TestCase(

    id="safety_check",

    type=TestType.SAFETY,

    input="User input with potential issues",

    expected=None,

    metadata={"check_harmful": True}

)

```
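
The checks the framework runs for safety tests are not listed here. As an illustration of one piece, PII detection, the sketch below flags email addresses and US-style Social Security numbers with regular expressions. The patterns and the `find_pii` helper are assumptions for demonstration and are far from exhaustive.

```python
import re

# Illustrative patterns only; production PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text):
    """Return a dict mapping each PII label to the matches found in text."""
    findings = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings
```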

### 5. Performance Tests

Measure latency and cost constraints.

```python

TestCase(

    id="performance_test",

    type=TestType.PERFORMANCE,

    input="Complex query requiring fast response",

    expected=None,

    constraints={"max_latency_ms": 2000, "max_tokens": 500}

)

```
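
Measuring latency around a call and comparing it with the constraints can be sketched as below. `timed_call` and `within_limits` are hypothetical helpers; only the `max_latency_ms` and `max_tokens` keys come from the example above.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


def within_limits(latency_ms, tokens, constraints):
    """Check measured values against optional latency and token limits."""
    max_latency = constraints.get("max_latency_ms", float("inf"))
    max_tokens = constraints.get("max_tokens", float("inf"))
    return latency_ms <= max_latency and tokens <= max_tokens
```

`time.perf_counter` is used because it is monotonic and intended for measuring short durations.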

## Production Monitoring

### Basic Monitoring

```python

from production_monitoring import ProductionMonitor

monitor = ProductionMonitor("production.db")

# Log requests

monitor.log_request(

    request_id="req_123",

    input_text="User question",

    output_text="AI response", 

    latency_ms=450,

    tokens_used=150,

    model="openai/gpt-3.5-turbo",

    success=True

)

# Get metrics

metrics = monitor.get_metrics(hours=24)

print(f"Success rate: {metrics['success_rate']:.1%}")

print(f"Average latency: {metrics['avg_latency']:.0f}ms")

```

### Drift Detection

```python

from production_monitoring import DriftDetector

detector = DriftDetector()

# Detect input drift

baseline_inputs = ["Normal queries from last week"]

current_inputs = ["Recent queries"]

has_drift, score, details = detector.detect_input_drift(

    baseline_inputs, 

    current_inputs,

    method='embedding'

)

if has_drift:

    alert = detector.create_drift_alert(

        'input', 'query_distribution', 1.0, 1.0 + score, details

    )

    print(f"⚠️ {alert.severity} drift detected: {alert.action_required}")

```
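
How `method='embedding'` scores drift internally is not documented here. One common approach, shown as a sketch under that assumption, compares the centroid of the baseline embeddings with the centroid of the current embeddings using cosine distance. The function names and the `0.1` threshold are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline, current, threshold=0.1):
    """Return (has_drift, score) by comparing the two centroids."""
    score = cosine_distance(centroid(baseline), centroid(current))
    return score > threshold, score
```

Centroid comparison is cheap but only detects shifts in the average; changes in spread or the appearance of a small new cluster may need a distribution-level test.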

## LLM-as-Judge

Use advanced models to evaluate other model outputs:

```python

from test_suites import LLMJudgeEvaluator

judge = LLMJudgeEvaluator("your_api_key")

evaluation = judge.evaluate_response(

    prompt="Explain machine learning",

    response="ML is a subset of AI that learns from data...",

    criteria={

        "accuracy": "Is the information correct?",

        "clarity": "Is it easy to understand?",

        "completeness": "Does it fully answer the question?"

    }

)

print(f"Overall score: {evaluation['overall_score']}")

```
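
The prompt sent to the judge model and the way `overall_score` is derived are not specified here. The sketch below shows one plausible shape: a rubric prompt built from the criteria, and an unweighted mean of per-criterion scores. Both helpers, the 1 to 5 scale, and the averaging rule are assumptions.

```python
def build_judge_prompt(prompt, response, criteria):
    """Assemble a grading prompt listing each criterion and its question."""
    lines = [
        "You are grading an AI response.",
        f"Prompt: {prompt}",
        f"Response: {response}",
        "Score each criterion from 1 to 5:",
    ]
    lines += [f"- {name}: {question}" for name, question in criteria.items()]
    return "\n".join(lines)


def overall_score(scores):
    """Unweighted mean of per-criterion scores (assumed aggregation rule)."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    return sum(scores.values()) / len(scores)
```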

## Test Suites

Pre-built test suites for common use cases:

```python

from test_suites import AIAgentTestSuites

test_creator = AIAgentTestSuites("your_api_key")

# Create domain-specific test suites

test_creator.create_sql_generation_tests()

test_creator.create_data_quality_tests() 

test_creator.create_safety_tests()

test_creator.create_edge_case_tests()

# Run specific suite

results = test_creator.framework.run_suite(

    "sql_generation",

    your_llm_function,

    api_key="your_key"

)

```

Available test suites:

- **SQL Generation**: Test database query generation

- **Data Quality**: Test data validation capabilities  

- **Data Analysis**: Test analytical reasoning

- **Pipeline Analysis**: Test code analysis skills

- **Regression**: Catch model performance regressions

- **Edge Cases**: Handle unusual inputs

- **Domain Expertise**: Test specialized knowledge

## Advanced Usage

### Custom Test Functions

```python

def my_llm_function(prompt, **kwargs):

    """Your custom LLM function"""

    # Process prompt with your model

    response = your_model.generate(prompt)

    

    return {

        'response': response.text,

        'tokens': response.token_count,

        'cost': calculate_cost(response.token_count)

    }

# Use with framework

framework.run_test(test_case, my_llm_function, custom_param="value")

```
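
For trying the framework without network access, a stand-in function that returns the same dictionary shape can be useful. The sketch below is hypothetical: `echo_llm_function` and its `canned_response` keyword are not part of the repository, and counting whitespace-separated words is only a rough proxy for tokens.

```python
def echo_llm_function(prompt, **kwargs):
    """Offline stand-in that returns a canned answer in the documented shape."""
    canned = kwargs.get("canned_response", "SELECT * FROM users;")
    # Rough proxy: count words rather than real model tokens.
    tokens = len(prompt.split()) + len(canned.split())
    return {"response": canned, "tokens": tokens, "cost": 0.0}
```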

### Continuous Integration

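```python
import os
import sys

from testing_framework import EvaluationFramework


def ci_test_pipeline():
    """CI/CD pipeline test function"""
    # Uses the same environment variable as the rest of this README
    framework = EvaluationFramework(os.getenv("OPENROUTER_API_KEY"))

    # Run critical tests (my_llm_function as defined under Custom Test Functions)
    results = framework.run_suite("critical_tests", my_llm_function)
    pass_rate = results['passed'].mean()

    if pass_rate < 0.9:  # 90% threshold
        print("❌ DEPLOYMENT BLOCKED")
        sys.exit(1)  # Non-zero exit code fails the CI job
    else:
        print("✅ DEPLOYMENT APPROVED")
```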
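
The gating decision itself can be isolated into a small pure function, which makes the threshold logic testable without calling any model. This is a sketch; `deployment_gate` is a hypothetical helper that takes a list of pass/fail flags and returns a process exit code.

```python
def deployment_gate(passed_flags, threshold=0.9):
    """Return 0 (approve) or 1 (block) based on the fraction of passing tests."""
    if not passed_flags:
        # No results is treated as a failure rather than a silent pass.
        return 1
    pass_rate = sum(1 for passed in passed_flags if passed) / len(passed_flags)
    return 0 if pass_rate >= threshold else 1
```

A CI script could then end with `sys.exit(deployment_gate(flags))`, where `flags` is derived from the suite results.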

### Custom Drift Detection

```python

class CustomDriftDetector(DriftDetector):

    def detect_business_metric_drift(self, baseline_metrics, current_metrics):

        """Custom drift detection for business metrics"""

        # Your custom logic

        pass

detector = CustomDriftDetector()

```

## Metrics & Reporting

### Available Metrics

- **Request Metrics**: Volume, success rate, latency percentiles

- **Cost Metrics**: Token usage, cost per request, total spend  

- **Quality Metrics**: User feedback, test pass rates

- **Drift Metrics**: Input/output distribution changes

- **Error Metrics**: Error types and frequencies
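
As an illustration of the latency percentiles mentioned above, the sketch below computes p50, p95, and p99 from a list of latencies using the standard library. `latency_summary` is a hypothetical helper, and the choice of the `inclusive` interpolation method is an assumption; the framework may compute percentiles differently.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize latencies; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index i holds percentile i + 1.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(latencies_ms),
    }
```

Percentiles are generally more informative than the mean for latency, because a few slow requests can dominate user experience while barely moving the average.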

### Exporting Data

```python
from datetime import datetime

# Get test history
history = framework.get_test_history(hours=24)

# Export to CSV
history.to_csv("test_results.csv", index=False)

# Get production metrics
metrics = monitor.get_metrics(hours=24)

# Custom reporting (`results` is the object returned by framework.run_suite)
report = {
    "timestamp": datetime.now(),
    "test_results": results.to_dict(),
    "production_metrics": metrics,
    "alerts": monitor.get_recent_alerts(24).to_dict()
}
```

## Configuration

### Environment Variables

```bash

# OpenRouter API

OPENROUTER_API_KEY=your_key_here

OPENROUTER_BASE_URL=https://openrouter.ai/api/v1

# Models  

DEFAULT_MODEL=openai/gpt-3.5-turbo

JUDGE_MODEL=openai/gpt-4

EMBEDDINGS_MODEL=all-MiniLM-L6-v2

# Database

TEST_RESULTS_DB=test_results.db

PRODUCTION_METRICS_DB=production_metrics.db

# Thresholds

ALERT_LATENCY_THRESHOLD_MS=2000

ALERT_ERROR_RATE_THRESHOLD=0.05

SIMILARITY_THRESHOLD=0.8

```

### Programmatic Configuration

```python

# Custom thresholds

monitor.alert_thresholds.update({

    'latency_p95': 1500,  # 1.5s

    'error_rate': 0.02,   # 2%

    'cost_per_request': 0.05  # $0.05

})

# Custom pricing

pricing = OpenRouterPricing()

pricing.pricing['custom/model'] = 0.001 / 1000

```
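
The value `0.001 / 1000` suggests that the pricing map stores a price per single token, here $0.001 per 1,000 tokens. Under that assumption, a cost estimate is a simple multiplication, as in this sketch. `estimate_cost` is a hypothetical helper and is not the signature of `OpenRouterPricing.calculate_cost`.

```python
def estimate_cost(total_tokens, model, price_per_token, default_price=0.0):
    """Estimate request cost in USD from a per-token price table."""
    return total_tokens * price_per_token.get(model, default_price)


# Example price table using the custom entry from above.
prices = {"custom/model": 0.001 / 1000}
cost = estimate_cost(1500, "custom/model", prices)
```

Real providers often price prompt and completion tokens differently, so a single per-token rate is a simplification.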

## Alerting & Monitoring

### Real-time Alerts

The framework automatically generates alerts for:

- High latency (>2s by default)

- High error rates (>5% by default) 

- High costs (>$0.10 per request by default)

- Drift detection across inputs/outputs/performance
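
The threshold-based alerts in the list above amount to comparing each metric with its limit. The sketch below uses the stated defaults; the metric key names and the `breached_thresholds` helper are assumptions for illustration, not the monitor's internal names.

```python
# Defaults taken from the list above: 2 s latency, 5% errors, $0.10 per request.
DEFAULT_THRESHOLDS = {
    "latency_ms": 2000,
    "error_rate": 0.05,
    "cost_per_request": 0.10,
}


def breached_thresholds(metrics, thresholds=None):
    """Return the names of metrics that strictly exceed their threshold."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    return [
        name
        for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    ]
```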

### Custom Alerts

```python

def custom_alert_check(monitor):

    metrics = monitor.get_metrics(1)

    

    if metrics['avg_user_feedback'] < 3.0:  # Below 3/5 stars

        monitor._create_alert(

            'low_satisfaction',

            'WARNING', 

            f'User satisfaction dropped to {metrics["avg_user_feedback"]:.1f}',

            metrics['avg_user_feedback'],

            3.0

        )

```

## Best Practices

### Test Design

1. **Start with Critical Tests**: Focus on core functionality first

2. **Use Multiple Test Types**: Combine deterministic, semantic, and behavioral tests

3. **Set Appropriate Thresholds**: Tune similarity thresholds based on your use case

4. **Regular Test Updates**: Update test cases as your model evolves

### Production Monitoring

1. **Log Everything**: Capture inputs, outputs, latency, and user feedback

2. **Set Smart Alerts**: Avoid alert fatigue with meaningful thresholds

3. **Monitor Trends**: Look at metrics over time, not just point values

4. **Regular Drift Checks**: Run drift detection daily or weekly

### Performance Optimization

1. **Batch Testing**: Run tests in parallel when possible

2. **Cache Embeddings**: Reuse embeddings for semantic comparisons

3. **Database Indexing**: Ensure proper indexes on timestamp fields

4. **Cleanup Old Data**: Archive old test results and metrics
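
Embedding caching, the second item above, can be sketched with `functools.lru_cache`. The embedding function here is a placeholder that derives numbers from character codes so the example runs without a model; in practice the cached function would wrap the real embedding call.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)
def cached_embed(text):
    """Return an embedding for text, computing it at most once per input."""
    # Placeholder embedding: character codes of the first 8 characters.
    # A tuple is returned so the cached value cannot be mutated by callers.
    return tuple(float(ord(ch)) for ch in text[:8])
```

Repeated `expected` strings across a test suite then cost one embedding computation each, and `cached_embed.cache_info()` reports the hit rate.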

## Troubleshooting

### Common Issues

**API Key Issues**

```bash
# Check if API key is set (prints only SET / NOT SET, never the key itself)
python -c "import os; print('API Key:', 'SET' if os.getenv('OPENROUTER_API_KEY') else 'NOT SET')"

# Test API connection
python openrouter_client.py
```

**Database Issues**

```bash
# Check database permissions
ls -la *.db

# Reset databases (WARNING: permanently deletes stored test results and metrics)
rm *.db
python -c "from testing_framework import EvaluationFramework; EvaluationFramework('test')"
```

**Import Issues**

```bash

# Check dependencies

pip install -r requirements.txt

# Check Python path

python -c "import sys; print(sys.path)"

```

### Debug Mode

```python

import logging

logging.basicConfig(level=logging.DEBUG)

# Detailed test execution

framework.run_test(test_case, my_function, debug=True)

```

## Contributing

We welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.

### Quick Start for Contributors

1. **Fork the repository** on GitHub

2. **Clone your fork**: `git clone https://github.com/your-username/ai-agent-evals.git`

3. **Create a feature branch**: `git checkout -b feature/amazing-feature`

4. **Install dependencies**: `pip install -r requirements.txt`

5. **Make your changes** and add tests

6. **Run tests**: `python test_imports.py`

7. **Commit changes**: `git commit -m 'Add amazing feature'`

8. **Push to branch**: `git push origin feature/amazing-feature`

9. **Open a Pull Request** on GitHub

### Development Setup

```bash

# Clone the repository

git clone https://github.com/drc-infinyon/ai-agent-evals.git

cd ai-agent-evals

# Install dependencies

pip install -r requirements.txt

# Run validation

python test_imports.py

# Run demo (requires OpenRouter API key)

python main_demo.py

```

### Code Style

- Follow PEP 8 guidelines

- Use type hints where appropriate

- Add docstrings for all public functions

- Run `black` for code formatting

- Add tests for new functionality

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support

For questions and issues:

1. Check the troubleshooting section above

2. Review the demo script (`main_demo.py`) for examples

3. Check that your OpenRouter API key has sufficient credits

4. Ensure all dependencies are properly installed

## Roadmap

- [ ] Web dashboard for monitoring

- [ ] Integration with MLflow/Weights & Biases

- [ ] A/B testing framework

- [ ] Multi-model comparison tools

- [ ] Advanced anomaly detection

- [ ] Custom evaluation metrics

- [ ] Automated model retraining triggers

---

**Built for reliable AI systems in production**