https://github.com/aaronlab/agent-bench-lite

Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI
https://github.com/aaronlab/agent-bench-lite

Last synced: about 2 months ago
JSON representation

Lightweight AI Agent evaluation benchmark toolkit — 6 dimensions, 30 tasks, plugin architecture, async execution, beautiful CLI

Host: GitHub
URL: https://github.com/aaronlab/agent-bench-lite
Owner: aaronlab
License: mit
Created: 2026-04-01T11:19:44.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-01T11:19:47.000Z (4 months ago)
Last Synced: 2026-05-09T15:53:22.459Z (2 months ago)
Language: Python
Size: 40 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # agent-bench-lite

A lightweight AI Agent evaluation benchmark toolkit.

## Overview

**agent-bench-lite** provides a modular, extensible framework for evaluating AI agents across six core dimensions:

| Dimension | What it measures |

|---|---|

| **Tool Calling Accuracy** | Can the agent call the right tools with correct parameters? |

| **Planning & Decomposition** | Can the agent break complex tasks into logical steps? |

| **Context Retention** | Can the agent remember and use earlier context? |

| **Error Recovery** | Can the agent handle and recover from errors gracefully? |

| **Instruction Following** | Does the agent follow exact specifications? |

| **Multi-step Reasoning** | Can the agent chain logical steps to reach a conclusion? |

## Installation

```bash

# Base install (no LLM adapters)

pip install -e .

# With Anthropic adapter

pip install -e ".[anthropic]"

# With OpenAI adapter

pip install -e ".[openai]"

# Everything

pip install -e ".[all]"

# Development

pip install -e ".[dev]"

```

## Quick Start

```python

import asyncio

from agent_bench_lite import BenchmarkRunner, EchoAdapter

async def main():

    adapter = EchoAdapter()

    runner = BenchmarkRunner(adapter=adapter)

    report = await runner.run()

    report.print_summary()

    report.save_json("results.json")

asyncio.run(main())

```

Or use the example script:

```bash

python examples/run_benchmark.py

```

## Architecture

- **Adapters** wrap LLM APIs into a common interface (`BaseAdapter`)

- **Dimensions** define evaluation tasks and scoring logic (`BaseDimension`)

- **Runner** orchestrates task execution across dimensions

- **Evaluator** computes scores from raw results

- **Reporter** formats and exports results

### Adding a new dimension

```python

from agent_bench_lite.dimensions.base import BaseDimension, TaskResult

class MyDimension(BaseDimension):

    name = "my_dimension"

    description = "Evaluates something new"

    def get_tasks(self):

        return [...]

    async def evaluate_task(self, task, agent_response):

        return TaskResult(...)

```

### Adding a new adapter

```python

from agent_bench_lite.adapters.base import BaseAdapter

class MyAdapter(BaseAdapter):

    async def send_message(self, messages, tools=None):

        ...

    async def send_message_with_tools(self, messages, tools, tool_handler):

        ...

```

## License

MIT

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aaronlab/agent-bench-lite

Awesome Lists containing this project

README