https://github.com/yotambraun/toolscore
Python framework for evaluating LLM tool-calling behavior with comprehensive metrics on accuracy, efficiency, and correctness
https://github.com/yotambraun/toolscore
ai-agents ai-agents-and-tools anthropic function-calling langchain llm llm-evaluation llm-metrics metrics openai testing-framework tool-use
Last synced: 4 months ago
JSON representation
Python framework for evaluating LLM tool-calling behavior with comprehensive metrics on accuracy, efficiency, and correctness
- Host: GitHub
- URL: https://github.com/yotambraun/toolscore
- Owner: yotambraun
- License: apache-2.0
- Created: 2025-10-10T08:44:58.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-01-09T13:20:40.000Z (5 months ago)
- Last Synced: 2026-01-14T11:36:45.787Z (5 months ago)
- Topics: ai-agents, ai-agents-and-tools, anthropic, function-calling, langchain, llm, llm-evaluation, llm-metrics, metrics, openai, testing-framework, tool-use
- Language: Python
- Homepage: https://pypi.org/project/tool-scorer/
- Size: 1.71 MB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
README
Toolscore
Lightweight tool-call testing for LLM agents — deterministic, local, zero API cost
[](https://badge.fury.io/py/tool-scorer)
[](LICENSE)
[](https://pepy.tech/project/tool-scorer)
[](https://pypi.org/project/tool-scorer/)
[](https://github.com/yotambraun/toolscore/actions)
---
## 3-Line Quick Start
```python
from toolscore import evaluate
result = evaluate(
expected=[
{"tool": "get_weather", "args": {"city": "NYC"}},
{"tool": "send_email", "args": {"to": "user@example.com"}},
],
actual=[
{"tool": "get_weather", "args": {"city": "New York"}},
{"tool": "send_email", "args": {"to": "user@example.com"}},
],
)
print(result.score) # 0.85 (composite score)
print(result.selection_accuracy) # 1.0
print(result.argument_f1) # 0.7
```
No files, no config, no API keys. Just Python objects in, score out.
## What is Toolscore?
Toolscore is the simplest way to **unit-test LLM tool calls**. It compares the tool calls your agent _actually_ made against what you _expected_, and gives you a score.
- **Deterministic** - no LLM calls needed for evaluation (optional LLM judge available)
- **Local-first** - runs entirely on your machine, zero cloud dependencies
- **Zero API cost** - evaluation itself is free, always
- **Works with any provider** - OpenAI, Anthropic, Gemini, LangChain, MCP, or custom formats
## Installation
```bash
pip install tool-scorer
```
## Usage
### Python API (recommended)
```python
from toolscore import evaluate, assert_tools
# Get a detailed result
result = evaluate(
expected=[{"tool": "search", "args": {"q": "test"}}],
actual=[{"tool": "search", "args": {"q": "test"}}],
)
assert result.score == 1.0
# Or use the one-liner for pytest
assert_tools(
expected=[{"tool": "search", "args": {"q": "test"}}],
actual=[{"tool": "search", "args": {"q": "test"}}],
min_score=0.9,
)
```
### With OpenAI / Anthropic / Gemini responses
No need to manually extract tool calls from API responses:
```python
from openai import OpenAI
from toolscore import evaluate
from toolscore.integrations import from_openai
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[...],
tools=[...],
)
actual = from_openai(response)
result = evaluate(expected=[...], actual=actual)
```
Also available: `from_anthropic()` and `from_gemini()`.
### CLI
```bash
# Both `toolscore` and `tool-scorer` work
toolscore eval gold_calls.json trace.json
# Simplified output by default, use --verbose for full detail
toolscore eval gold_calls.json trace.json --verbose
# HTML report
toolscore eval gold_calls.json trace.json --html report.html
# Compare multiple models
toolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3
```
### Pytest Integration
```python
# test_my_agent.py
def test_agent_accuracy(toolscore_eval, toolscore_assertions):
"""Test that agent achieves high accuracy."""
result = toolscore_eval("gold_calls.json", "trace.json")
toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)
toolscore_assertions.assert_argument_f1(result, min_f1=0.8)
```
Or use the simpler `assert_tools`:
```python
from toolscore import assert_tools
def test_agent():
assert_tools(
expected=[{"tool": "search", "args": {"q": "test"}}],
actual=my_agent_output,
min_score=0.9,
)
```
## When to Use Toolscore vs. Alternatives
| Use case | Recommendation |
|----------|---------------|
| **Fast, deterministic tool-call checks in CI without API costs** | **Toolscore** |
| **Comprehensive LLM evaluation across multiple dimensions** (hallucination, toxicity, RAG, tool calls, etc.) | [DeepEval](https://github.com/confident-ai/deepeval) |
| **RAG pipeline evaluation** (retrieval quality, answer faithfulness) | [Ragas](https://github.com/explodinggradients/ragas) |
| **Government/safety-focused AI evaluation** | [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) |
| **Tracing and observability for LangChain apps** | [LangSmith](https://smith.langchain.com/) |
Toolscore does **one thing well**: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.
## Metrics
The composite `result.score` is a weighted average of these core metrics:
| Metric | Weight | Description |
|--------|--------|-------------|
| **Selection Accuracy** | 40% | Did the agent call the right tools? |
| **Argument F1** | 30% | Did it pass the right arguments? |
| **Sequence Accuracy** | 20% | Were tools called in the right order? |
| **Redundancy** (inverted) | 10% | Were there unnecessary duplicate calls? |
Access individual metrics via properties:
```python
result.score # Weighted composite (0.0 - 1.0)
result.selection_accuracy # Tool name matching
result.argument_f1 # Argument precision/recall
result.sequence_accuracy # Order correctness
```
Custom weights are supported:
```python
result = evaluate(
expected=[...],
actual=[...],
weights={"selection_accuracy": 0.5, "argument_f1": 0.5, "sequence_accuracy": 0.0, "redundant_rate": 0.0},
)
```
### Additional Metrics (verbose mode)
When using `--verbose` or the file-based API, Toolscore also reports:
- Tool Invocation Accuracy
- Tool Correctness (coverage of expected tools)
- Trajectory Accuracy (multi-step path analysis)
- Detailed failure analysis with actionable error messages
## Regression Testing
Catch performance degradation in CI:
```bash
# Save a baseline
toolscore eval gold.json trace.json --save-baseline baseline.json
# Check for regressions (fails if accuracy drops >5%)
toolscore regression baseline.json new_trace.json --gold-file gold.json
```
## GitHub Action
```yaml
- uses: yotambraun/toolscore@v1
with:
gold-file: tests/gold_standard.json
trace-file: tests/agent_trace.json
threshold: '0.90'
```
## Supported Formats
| Provider | Format | Auto-detected |
|----------|--------|---------------|
| OpenAI | `tool_calls` / `function_call` | Yes |
| Anthropic | `tool_use` content blocks | Yes |
| Google Gemini | `functionCall` parts | Yes |
| MCP | JSON-RPC 2.0 | Yes |
| LangChain | `tool` / `tool_input` | Yes |
| Custom | `{"calls": [{"tool": ..., "args": ...}]}` | Yes |
## Advanced Features
These features are available for users who need them:
- **LLM-as-a-Judge**: `--llm-judge` flag for semantic tool name matching (requires OpenAI API key)
- **Schema Validation**: Validate argument types, ranges, and patterns against schemas
- **Side-Effect Validation**: Check HTTP responses, filesystem state, database rows
- **Cost Tracking**: Token usage and pricing estimation for OpenAI/Anthropic/Gemini
- **Multiple report formats**: HTML, JSON, CSV, Markdown
- **Synthetic test generation**: `toolscore generate --from-openai functions.json`
## File-Based API
The original file-based API is still fully supported:
```python
from toolscore import evaluate_trace
result = evaluate_trace(
gold_file="gold_calls.json",
trace_file="trace.json",
format="auto",
)
print(result.score)
print(result.selection_accuracy)
```
## Development
```bash
pip install -e ".[dev]"
pytest
ruff check toolscore
mypy toolscore
```
## License
Apache License 2.0 - see [LICENSE](LICENSE) for details.
## Citation
```bibtex
@software{toolscore,
title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},
author = {Yotam Braun},
year = {2025},
url = {https://github.com/yotambraun/toolscore}
}
```