https://github.com/acodercat/agent-performance-evaluator
- Host: GitHub
- URL: https://github.com/acodercat/agent-performance-evaluator
- Owner: acodercat
- Created: 2025-07-03T02:13:39.000Z (12 months ago)
- Default Branch: master
- Last Pushed: 2025-07-03T03:33:23.000Z (12 months ago)
- Last Synced: 2025-07-03T04:32:54.188Z (12 months ago)
- Language: Python
- Size: 189 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Agent Performance Evaluator
A comprehensive benchmarking framework for evaluating AI agent implementations and their tool-calling capabilities across multiple scenarios.
## Features
- **Multi-Agent Framework Support**: PyCallingAgent, LangChain, LlamaIndex, OpenAI agents
- **Diverse Scenarios**: Weather queries, flight booking, event planning, healthcare data processing, investment analysis, product recommendations
- **Comprehensive Metrics**: Success rates, function call accuracy, argument validation, step efficiency
- **Multi-Model Testing**: GPT-4o, Claude, DeepSeek, Qwen models, and more
## Development
### Adding New Scenarios
1. Create a new scenario file in `scenarios/`
2. Define your functions with proper type hints
3. Add test cases to `ground_truths.json`
4. Update the evaluation pipeline
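
As a rough illustration of steps 2 and 3, a scenario function and a matching test case might look like the sketch below. The function `get_weather`, the canned temperature table, and every key in `example_case` are hypothetical; the real schema of `ground_truths.json` is not documented here and may differ.

```python
import json
from typing import TypedDict


class WeatherReport(TypedDict):
    city: str
    unit: str
    temperature: float


# Hypothetical canned data so the tool is deterministic and needs no network.
_FAKE_TEMPERATURES_C = {"Paris": 18.0, "Tokyo": 24.5}


def get_weather(city: str, unit: str = "celsius") -> WeatherReport:
    """Return a canned weather report for `city` in the requested unit."""
    if unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"unsupported unit: {unit!r}")
    celsius = _FAKE_TEMPERATURES_C[city]
    temperature = celsius if unit == "celsius" else celsius * 9 / 5 + 32
    return {"city": city, "unit": unit, "temperature": temperature}


# Hypothetical ground-truth entry; the key names are assumptions, not the
# project's actual schema.
example_case = {
    "scenario": "weather_query",
    "query": "What is the temperature in Paris in Fahrenheit?",
    "expected_calls": [
        {"name": "get_weather", "arguments": {"city": "Paris", "unit": "fahrenheit"}}
    ],
}

print(json.dumps(example_case, indent=2))
```

Explicit type hints matter because tool-calling frameworks typically derive the function schema shown to the model from the signature, so an untyped parameter gives the model less to work with.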
## Quick Start
### Installation
```bash
# Clone the repository
git clone https://github.com/acodercat/agent-performance-evaluator
cd agent-performance-evaluator
# Install dependencies using uv
uv sync
```
### Using PyCallingAgent Framework
```python
import asyncio
import json

from adapters.py_calling_agent_adapter import PyCallingAgentFactory
from evaluation import evaluate
from py_calling_agent.models import LiteLLMModel

# Load test scenarios (the context manager closes the file afterwards)
with open("./scenarios/ground_truths.json") as f:
    ground_truths = json.load(f)

# Create agent factory
model = LiteLLMModel(
    model_id="gpt-4o",
    api_key="your-api-key",
    base_url="your-base-url",
    custom_llm_provider="openai",
)
agent_factory = PyCallingAgentFactory(model)


# Run evaluation
async def main():
    await evaluate(agent_factory, ground_truths, "./results/results.json")


asyncio.run(main())
```
### Using Other Tool-Calling Agent Frameworks
```python
import asyncio
import json

from adapters.langchain_agent_adapter import LangChainAgentFactory
from adapters.llama_index_agent_adapter import LLamaIndexAgentFactory
from evaluation import evaluate
from langchain_openai import ChatOpenAI
from llama_index.llms.openai import OpenAI

# Load test scenarios, as in the PyCallingAgent example
with open("./scenarios/ground_truths.json") as f:
    ground_truths = json.load(f)


async def main():
    # LangChain agent
    llm = ChatOpenAI(model="gpt-4o")
    agent_factory = LangChainAgentFactory(llm)
    await evaluate(agent_factory, ground_truths, "./results/langchain_results.json")

    # LlamaIndex agent
    llm = OpenAI(model="gpt-4o")
    agent_factory = LLamaIndexAgentFactory(llm)
    await evaluate(agent_factory, ground_truths, "./results/llama_index_results.json")


asyncio.run(main())
```
## Project Structure
```
├── adapters/ # Agent implementation adapters
│ ├── py_calling_agent_adapter.py
│ ├── langchain_agent_adapter.py
│ ├── llama_index_agent_adapter.py
│ └── openai_agent_adapter.py
├── core/ # Core evaluation logic
│ ├── evaluator.py # Main evaluation engine
│ ├── agent.py # Agent interface definitions
│ └── validation.py # Function call validation
├── scenarios/ # Test scenarios and ground truths
│ ├── weather_query.py
│ ├── flight_booking.py
│ ├── event_planning.py
│ └── ground_truths.json
├── results/ # Evaluation results
└── evaluation.py # Main evaluation runner
```
## Evaluation Metrics
- **Success Rate**: Percentage of successful turns
- **Function Call Accuracy**: Missing calls, wrong arguments, type mismatches
- **Step Efficiency**: Number of steps per conversation
- **Argument Validation**: Missing, wrong type, or incorrect values
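
To make these metrics concrete, here is a minimal sketch of how they could be computed from recorded turns. It is not the project's implementation: `Call`, `TurnResult`, `argument_errors`, `turn_errors`, and `summarize` are all hypothetical names, and the sketch assumes a turn succeeds only when every expected call is present with matching arguments.

```python
from dataclasses import dataclass


@dataclass
class Call:
    name: str
    arguments: dict


@dataclass
class TurnResult:
    expected: list[Call]  # calls the ground truth asks for
    actual: list[Call]    # calls the agent actually made
    steps: int            # number of agent steps taken in this turn


def argument_errors(expected: Call, actual: Call) -> list[str]:
    """Classify each expected argument as missing, wrong type, or wrong value."""
    errors = []
    for key, want in expected.arguments.items():
        if key not in actual.arguments:
            errors.append(f"missing:{key}")
        elif type(actual.arguments[key]) is not type(want):
            errors.append(f"wrong_type:{key}")
        elif actual.arguments[key] != want:
            errors.append(f"wrong_value:{key}")
    return errors


def turn_errors(turn: TurnResult) -> list[str]:
    """Match expected calls to actual calls by name and collect all errors."""
    errors = []
    remaining = list(turn.actual)
    for exp in turn.expected:
        match = next((c for c in remaining if c.name == exp.name), None)
        if match is None:
            errors.append(f"missing_call:{exp.name}")
            continue
        remaining.remove(match)  # each actual call satisfies one expectation
        errors.extend(argument_errors(exp, match))
    return errors


def summarize(turns: list[TurnResult]) -> dict:
    """Success rate as a percentage of turns, plus average steps per turn."""
    successes = sum(1 for t in turns if not turn_errors(t))
    return {
        "success_rate": 100.0 * successes / len(turns),
        "avg_steps": sum(t.steps for t in turns) / len(turns),
    }


turns = [
    TurnResult([Call("get_weather", {"city": "Paris"})],
               [Call("get_weather", {"city": "Paris"})], steps=2),
    TurnResult([Call("book_flight", {"seats": 2})],
               [Call("book_flight", {"seats": "2"})], steps=4),
]
print(turn_errors(turns[1]))
print(summarize(turns))
```

In this sketch the second turn fails because `seats` arrives as a string rather than an integer, which is the kind of type mismatch the argument validation metric is meant to surface.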