https://github.com/acodercat/cave-bench

A benchmarking framework for evaluating CaveAgent tool calling, stateful management, and JSON-based tool calling.
https://github.com/acodercat/cave-bench

Last synced: 4 months ago
JSON representation

A benchmarking framework for evaluating CaveAgent tool calling, stateful management, and JSON-based tool calling.

Host: GitHub
URL: https://github.com/acodercat/cave-bench
Owner: acodercat
License: mit
Created: 2025-11-25T09:11:39.000Z (7 months ago)
Default Branch: master
Last Pushed: 2026-01-21T13:58:18.000Z (5 months ago)
Last Synced: 2026-01-22T01:48:49.195Z (5 months ago)
Language: Python
Homepage:
Size: 15.2 MB
Stars: 3
Watchers: 0
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          # cave-bench

Benchmarking framework for evaluating [CaveAgent](https://github.com/acodercat/cave-agent) tool calling, stateful management, and JSON-based tool calling.

## Installation

```bash

uv sync

```

## Configuration

Copy `.env.example` to `.env` and add your API keys:

```bash

cp .env.example .env

```

```bash

# .env

DEEPSEEK_API_KEY=your-api-key

DEEPSEEK_MODEL_ID=deepseek-chat

DEEPSEEK_BASE_URL=https://api.deepseek.com/v1

DEEPSEEK_TEMPERATURE=0.3

```

## Quick Start

Run benchmarks using module syntax:

```bash

# Function calling benchmarks

python -m scripts.function_calling           # Run both agent types

python -m scripts.function_calling -a cave   # CaveAgent (Python code execution)

python -m scripts.function_calling -a json   # LiteLLM (JSON function calling)

# Other benchmarks

python -m scripts.data_analysis     # Data analysis benchmarks

python -m scripts.smart_home        # Smart home benchmarks

```

Edit the `BENCHMARKS` list in each script to select which benchmarks to run.

## Benchmark Structure

### JSON Schema

```json

{

  "name": "scenario_name",

  "module": "evals.data_analysis.MyDataset.my_analysis",

  "requirements": "Optional task requirements",

  "conversations": [

    {

      "id": "test_1",

      "turns": [

        {

          "query": "Analyze the dataset...",

          "validator": "validate_q1",

          "expected_variable_reads": ["df"],

          "expected_variable_writes": ["result"]

        }

      ]

    }

  ]

}

```

### Python Module

```python

from typing import List

from cave_agent.python_runtime import Variable, PythonRuntime

from core.validation import ValidatorResult

from core.types import Turn, ToolCall

import pandas as pd

df = pd.read_csv("path/to/dataset.csv")

def validate_q1(

    response: str,

    runtime: PythonRuntime,

    turn: Turn,

    actual_calls: List[ToolCall]

) -> ValidatorResult:

    result = runtime.retrieve("result")

    if result == expected_value:

        return ValidatorResult(True, "Correct!")

    return ValidatorResult(False, f"Expected {expected_value}, got {result}")

tools = []

variables = [Variable("df", df, "Dataset description")]

validators = {"validate_q1": validate_q1}

```

## Metrics

- **Success Rate**: Percentage of successful turns

- **Function Calls**: Missing calls, wrong argument types/values

- **Variables**: Missing reads/writes

- **Steps**: Total steps taken

- **Tokens**: Consumed Tokens

## Contributing

Contributions are welcome! Please feel free to submit a PR.

## License

MIT License - see [LICENSE](LICENSE) for details.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/acodercat/cave-bench

Awesome Lists containing this project

README