An open API service indexing awesome lists of open source software.

https://github.com/aaronlab/agent-bench-lite

Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI
https://github.com/aaronlab/agent-bench-lite

Last synced: 1 day ago
JSON representation

Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI

Awesome Lists containing this project

README

          

# agent-bench-lite

A lightweight AI Agent evaluation benchmark toolkit.

## Overview

**agent-bench-lite** provides a modular, extensible framework for evaluating AI agents across six core dimensions:

| Dimension | What it measures |
|---|---|
| **Tool Calling Accuracy** | Can the agent call the right tools with correct parameters? |
| **Planning & Decomposition** | Can the agent break complex tasks into logical steps? |
| **Context Retention** | Can the agent remember and use earlier context? |
| **Error Recovery** | Can the agent handle and recover from errors gracefully? |
| **Instruction Following** | Does the agent follow exact specifications? |
| **Multi-step Reasoning** | Can the agent chain logical steps to reach a conclusion? |

## Installation

```bash
# Base install (no LLM adapters)
pip install -e .

# With Anthropic adapter
pip install -e ".[anthropic]"

# With OpenAI adapter
pip install -e ".[openai]"

# Everything
pip install -e ".[all]"

# Development
pip install -e ".[dev]"
```

## Quick Start

```python
import asyncio
from agent_bench_lite import BenchmarkRunner, EchoAdapter

async def main():
adapter = EchoAdapter()
runner = BenchmarkRunner(adapter=adapter)
report = await runner.run()
report.print_summary()
report.save_json("results.json")

asyncio.run(main())
```

Or use the example script:

```bash
python examples/run_benchmark.py
```

## Architecture

- **Adapters** wrap LLM APIs into a common interface (`BaseAdapter`)
- **Dimensions** define evaluation tasks and scoring logic (`BaseDimension`)
- **Runner** orchestrates task execution across dimensions
- **Evaluator** computes scores from raw results
- **Reporter** formats and exports results

### Adding a new dimension

```python
from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class MyDimension(BaseDimension):
name = "my_dimension"
description = "Evaluates something new"

def get_tasks(self):
return [...]

async def evaluate_task(self, task, agent_response):
return TaskResult(...)
```

### Adding a new adapter

```python
from agent_bench_lite.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):
async def send_message(self, messages, tools=None):
...

async def send_message_with_tools(self, messages, tools, tool_handler):
...
```

## License

MIT