https://github.com/yotambraun/toolscore

Python framework for evaluating LLM tool-calling behavior with comprehensive metrics on accuracy, efficiency, and correctness
https://github.com/yotambraun/toolscore

ai-agents ai-agents-and-tools anthropic function-calling langchain llm llm-evaluation llm-metrics metrics openai testing-framework tool-use

Last synced: 4 months ago
JSON representation

Python framework for evaluating LLM tool-calling behavior with comprehensive metrics on accuracy, efficiency, and correctness

Host: GitHub
URL: https://github.com/yotambraun/toolscore
Owner: yotambraun
License: apache-2.0
Created: 2025-10-10T08:44:58.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-01-09T13:20:40.000Z (5 months ago)
Last Synced: 2026-01-14T11:36:45.787Z (5 months ago)
Topics: ai-agents, ai-agents-and-tools, anthropic, function-calling, langchain, llm, llm-evaluation, llm-metrics, metrics, openai, testing-framework, tool-use
Language: Python
Homepage: https://pypi.org/project/tool-scorer/
Size: 1.71 MB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md

Awesome Lists containing this project

README

          


  



Toolscore




  Lightweight tool-call testing for LLM agents — deterministic, local, zero API cost



[![PyPI version](https://badge.fury.io/py/tool-scorer.svg)](https://badge.fury.io/py/tool-scorer)

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

[![Downloads](https://static.pepy.tech/badge/tool-scorer)](https://pepy.tech/project/tool-scorer)

[![Python Versions](https://img.shields.io/pypi/pyversions/tool-scorer.svg)](https://pypi.org/project/tool-scorer/)

[![CI](https://github.com/yotambraun/toolscore/workflows/CI/badge.svg)](https://github.com/yotambraun/toolscore/actions)

---

## 3-Line Quick Start

```python

from toolscore import evaluate

result = evaluate(

    expected=[

        {"tool": "get_weather", "args": {"city": "NYC"}},

        {"tool": "send_email", "args": {"to": "user@example.com"}},

    ],

    actual=[

        {"tool": "get_weather", "args": {"city": "New York"}},

        {"tool": "send_email", "args": {"to": "user@example.com"}},

    ],

)

print(result.score)              # 0.85 (composite score)

print(result.selection_accuracy) # 1.0

print(result.argument_f1)       # 0.7

```

No files, no config, no API keys. Just Python objects in, score out.

## What is Toolscore?

Toolscore is the simplest way to **unit-test LLM tool calls**. It compares the tool calls your agent _actually_ made against what you _expected_, and gives you a score.

- **Deterministic** - no LLM calls needed for evaluation (optional LLM judge available)

- **Local-first** - runs entirely on your machine, zero cloud dependencies

- **Zero API cost** - evaluation itself is free, always

- **Works with any provider** - OpenAI, Anthropic, Gemini, LangChain, MCP, or custom formats

## Installation

```bash

pip install tool-scorer

```

## Usage

### Python API (recommended)

```python

from toolscore import evaluate, assert_tools

# Get a detailed result

result = evaluate(

    expected=[{"tool": "search", "args": {"q": "test"}}],

    actual=[{"tool": "search", "args": {"q": "test"}}],

)

assert result.score == 1.0

# Or use the one-liner for pytest

assert_tools(

    expected=[{"tool": "search", "args": {"q": "test"}}],

    actual=[{"tool": "search", "args": {"q": "test"}}],

    min_score=0.9,

)

```

### With OpenAI / Anthropic / Gemini responses

No need to manually extract tool calls from API responses:

```python

from openai import OpenAI

from toolscore import evaluate

from toolscore.integrations import from_openai

client = OpenAI()

response = client.chat.completions.create(

    model="gpt-4o",

    messages=[...],

    tools=[...],

)

actual = from_openai(response)

result = evaluate(expected=[...], actual=actual)

```

Also available: `from_anthropic()` and `from_gemini()`.

### CLI

```bash

# Both `toolscore` and `tool-scorer` work

toolscore eval gold_calls.json trace.json

# Simplified output by default, use --verbose for full detail

toolscore eval gold_calls.json trace.json --verbose

# HTML report

toolscore eval gold_calls.json trace.json --html report.html

# Compare multiple models

toolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3

```

### Pytest Integration

```python

# test_my_agent.py

def test_agent_accuracy(toolscore_eval, toolscore_assertions):

    """Test that agent achieves high accuracy."""

    result = toolscore_eval("gold_calls.json", "trace.json")

    toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)

    toolscore_assertions.assert_argument_f1(result, min_f1=0.8)

```

Or use the simpler `assert_tools`:

```python

from toolscore import assert_tools

def test_agent():

    assert_tools(

        expected=[{"tool": "search", "args": {"q": "test"}}],

        actual=my_agent_output,

        min_score=0.9,

    )

```

## When to Use Toolscore vs. Alternatives

| Use case | Recommendation |

|----------|---------------|

| **Fast, deterministic tool-call checks in CI without API costs** | **Toolscore** |

| **Comprehensive LLM evaluation across multiple dimensions** (hallucination, toxicity, RAG, tool calls, etc.) | [DeepEval](https://github.com/confident-ai/deepeval) |

| **RAG pipeline evaluation** (retrieval quality, answer faithfulness) | [Ragas](https://github.com/explodinggradients/ragas) |

| **Government/safety-focused AI evaluation** | [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) |

| **Tracing and observability for LangChain apps** | [LangSmith](https://smith.langchain.com/) |

Toolscore does **one thing well**: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.

## Metrics

The composite `result.score` is a weighted average of these core metrics:

| Metric | Weight | Description |

|--------|--------|-------------|

| **Selection Accuracy** | 40% | Did the agent call the right tools? |

| **Argument F1** | 30% | Did it pass the right arguments? |

| **Sequence Accuracy** | 20% | Were tools called in the right order? |

| **Redundancy** (inverted) | 10% | Were there unnecessary duplicate calls? |

Access individual metrics via properties:

```python

result.score               # Weighted composite (0.0 - 1.0)

result.selection_accuracy  # Tool name matching

result.argument_f1         # Argument precision/recall

result.sequence_accuracy   # Order correctness

```

Custom weights are supported:

```python

result = evaluate(

    expected=[...],

    actual=[...],

    weights={"selection_accuracy": 0.5, "argument_f1": 0.5, "sequence_accuracy": 0.0, "redundant_rate": 0.0},

)

```

### Additional Metrics (verbose mode)

When using `--verbose` or the file-based API, Toolscore also reports:

- Tool Invocation Accuracy

- Tool Correctness (coverage of expected tools)

- Trajectory Accuracy (multi-step path analysis)

- Detailed failure analysis with actionable error messages

## Regression Testing

Catch performance degradation in CI:

```bash

# Save a baseline

toolscore eval gold.json trace.json --save-baseline baseline.json

# Check for regressions (fails if accuracy drops >5%)

toolscore regression baseline.json new_trace.json --gold-file gold.json

```

## GitHub Action

```yaml

- uses: yotambraun/toolscore@v1

  with:

    gold-file: tests/gold_standard.json

    trace-file: tests/agent_trace.json

    threshold: '0.90'

```

## Supported Formats

| Provider | Format | Auto-detected |

|----------|--------|---------------|

| OpenAI | `tool_calls` / `function_call` | Yes |

| Anthropic | `tool_use` content blocks | Yes |

| Google Gemini | `functionCall` parts | Yes |

| MCP | JSON-RPC 2.0 | Yes |

| LangChain | `tool` / `tool_input` | Yes |

| Custom | `{"calls": [{"tool": ..., "args": ...}]}` | Yes |

## Advanced Features

These features are available for users who need them:

- **LLM-as-a-Judge**: `--llm-judge` flag for semantic tool name matching (requires OpenAI API key)

- **Schema Validation**: Validate argument types, ranges, and patterns against schemas

- **Side-Effect Validation**: Check HTTP responses, filesystem state, database rows

- **Cost Tracking**: Token usage and pricing estimation for OpenAI/Anthropic/Gemini

- **Multiple report formats**: HTML, JSON, CSV, Markdown

- **Synthetic test generation**: `toolscore generate --from-openai functions.json`

## File-Based API

The original file-based API is still fully supported:

```python

from toolscore import evaluate_trace

result = evaluate_trace(

    gold_file="gold_calls.json",

    trace_file="trace.json",

    format="auto",

)

print(result.score)

print(result.selection_accuracy)

```

## Development

```bash

pip install -e ".[dev]"

pytest

ruff check toolscore

mypy toolscore

```

## License

Apache License 2.0 - see [LICENSE](LICENSE) for details.

## Citation

```bibtex

@software{toolscore,

  title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},

  author = {Yotam Braun},

  year = {2025},

  url = {https://github.com/yotambraun/toolscore}

}

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/yotambraun/toolscore

Awesome Lists containing this project

README

Toolscore