https://github.com/aaronlab/agent-bench-lite
Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI
https://github.com/aaronlab/agent-bench-lite
Last synced: 1 day ago
JSON representation
Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI
- Host: GitHub
- URL: https://github.com/aaronlab/agent-bench-lite
- Owner: aaronlab
- License: mit
- Created: 2026-04-01T11:19:44.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-01T11:19:47.000Z (2 months ago)
- Last Synced: 2026-05-09T15:53:22.459Z (25 days ago)
- Language: Python
- Size: 40 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# agent-bench-lite
A lightweight AI Agent evaluation benchmark toolkit.
## Overview
**agent-bench-lite** provides a modular, extensible framework for evaluating AI agents across six core dimensions:
| Dimension | What it measures |
|---|---|
| **Tool Calling Accuracy** | Can the agent call the right tools with correct parameters? |
| **Planning & Decomposition** | Can the agent break complex tasks into logical steps? |
| **Context Retention** | Can the agent remember and use earlier context? |
| **Error Recovery** | Can the agent handle and recover from errors gracefully? |
| **Instruction Following** | Does the agent follow exact specifications? |
| **Multi-step Reasoning** | Can the agent chain logical steps to reach a conclusion? |
## Installation
```bash
# Base install (no LLM adapters)
pip install -e .
# With Anthropic adapter
pip install -e ".[anthropic]"
# With OpenAI adapter
pip install -e ".[openai]"
# Everything
pip install -e ".[all]"
# Development
pip install -e ".[dev]"
```
## Quick Start
```python
import asyncio
from agent_bench_lite import BenchmarkRunner, EchoAdapter
async def main():
adapter = EchoAdapter()
runner = BenchmarkRunner(adapter=adapter)
report = await runner.run()
report.print_summary()
report.save_json("results.json")
asyncio.run(main())
```
Or use the example script:
```bash
python examples/run_benchmark.py
```
## Architecture
- **Adapters** wrap LLM APIs into a common interface (`BaseAdapter`)
- **Dimensions** define evaluation tasks and scoring logic (`BaseDimension`)
- **Runner** orchestrates task execution across dimensions
- **Evaluator** computes scores from raw results
- **Reporter** formats and exports results
### Adding a new dimension
```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult
class MyDimension(BaseDimension):
name = "my_dimension"
description = "Evaluates something new"
def get_tasks(self):
return [...]
async def evaluate_task(self, task, agent_response):
return TaskResult(...)
```
### Adding a new adapter
```python
from agent_bench_lite.adapters.base import BaseAdapter
class MyAdapter(BaseAdapter):
async def send_message(self, messages, tools=None):
...
async def send_message_with_tools(self, messages, tools, tool_handler):
...
```
## License
MIT