{"id":37079088,"url":"https://github.com/yotambraun/toolscore","last_synced_at":"2026-02-06T21:00:37.371Z","repository":{"id":318723195,"uuid":"1073530226","full_name":"yotambraun/Toolscore","owner":"yotambraun","description":"Python framework for evaluating LLM tool-calling behavior with comprehensive metrics on accuracy, efficiency, and correctness","archived":false,"fork":false,"pushed_at":"2026-01-09T13:20:40.000Z","size":1793,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-14T11:36:45.787Z","etag":null,"topics":["ai-agents","ai-agents-and-tools","anthropic","function-calling","langchain","llm","llm-evaluation","llm-metrics","metrics","openai","testing-framework","tool-use"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/tool-scorer/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yotambraun.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-10T08:44:58.000Z","updated_at":"2026-01-09T13:20:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"83e2a8c7-3660-451e-89ef-298704e57126","html_url":"https://github.com/yotambraun/Toolscore","commit_stats":null,"previous_names":["yotambraun/toolscore"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/yotambraun/Toolscore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yotambraun%2FToolscore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yotambraun%2FToolscore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yotambraun%2FToolscore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yotambraun%2FToolscore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yotambraun","download_url":"https://codeload.github.com/yotambraun/Toolscore/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yotambraun%2FToolscore/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29175821,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T20:14:21.878Z","status":"ssl_error","status_checked_at":"2026-02-06T20:14:21.443Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","ai-agents-and-tools","anthropic","function-calling","langchain","llm","llm-evaluation","llm-metrics","metrics","openai","testing-framework","tool-use"],"created_at":"2026-01-14T09:34:34.647Z","updated_at":"2026-02-06T21:00:37.360Z","avatar_url":"https://github.com/yotambraun.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/logo.png\" alt=\"Toolscore Logo\" width=\"200\"/\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eToolscore\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cem\u003eLightweight tool-call testing for LLM agents \u0026mdash; deterministic, local, zero API cost\u003c/em\u003e\n\u003c/p\u003e\n\n[![PyPI version](https://badge.fury.io/py/tool-scorer.svg)](https://badge.fury.io/py/tool-scorer)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n[![Downloads](https://static.pepy.tech/badge/tool-scorer)](https://pepy.tech/project/tool-scorer)\n[![Python Versions](https://img.shields.io/pypi/pyversions/tool-scorer.svg)](https://pypi.org/project/tool-scorer/)\n[![CI](https://github.com/yotambraun/toolscore/workflows/CI/badge.svg)](https://github.com/yotambraun/toolscore/actions)\n\n---\n\n## 3-Line Quick Start\n\n```python\nfrom toolscore import evaluate\n\nresult = evaluate(\n    expected=[\n        {\"tool\": \"get_weather\", \"args\": {\"city\": \"NYC\"}},\n        {\"tool\": \"send_email\", \"args\": {\"to\": \"user@example.com\"}},\n    ],\n    actual=[\n        {\"tool\": \"get_weather\", \"args\": {\"city\": \"New York\"}},\n        {\"tool\": \"send_email\", \"args\": {\"to\": \"user@example.com\"}},\n    ],\n)\n\nprint(result.score)              # 0.85 (composite score)\nprint(result.selection_accuracy) # 1.0\nprint(result.argument_f1)       # 0.7\n```\n\nNo files, no config, no API keys. Just Python objects in, score out.\n\n## What is Toolscore?\n\nToolscore is the simplest way to **unit-test LLM tool calls**. It compares the tool calls your agent _actually_ made against what you _expected_, and gives you a score.\n\n- **Deterministic** - no LLM calls needed for evaluation (optional LLM judge available)\n- **Local-first** - runs entirely on your machine, zero cloud dependencies\n- **Zero API cost** - evaluation itself is free, always\n- **Works with any provider** - OpenAI, Anthropic, Gemini, LangChain, MCP, or custom formats\n\n## Installation\n\n```bash\npip install tool-scorer\n```\n\n## Usage\n\n### Python API (recommended)\n\n```python\nfrom toolscore import evaluate, assert_tools\n\n# Get a detailed result\nresult = evaluate(\n    expected=[{\"tool\": \"search\", \"args\": {\"q\": \"test\"}}],\n    actual=[{\"tool\": \"search\", \"args\": {\"q\": \"test\"}}],\n)\nassert result.score == 1.0\n\n# Or use the one-liner for pytest\nassert_tools(\n    expected=[{\"tool\": \"search\", \"args\": {\"q\": \"test\"}}],\n    actual=[{\"tool\": \"search\", \"args\": {\"q\": \"test\"}}],\n    min_score=0.9,\n)\n```\n\n### With OpenAI / Anthropic / Gemini responses\n\nNo need to manually extract tool calls from API responses:\n\n```python\nfrom openai import OpenAI\nfrom toolscore import evaluate\nfrom toolscore.integrations import from_openai\n\nclient = OpenAI()\nresponse = client.chat.completions.create(\n    model=\"gpt-4o\",\n    messages=[...],\n    tools=[...],\n)\n\nactual = from_openai(response)\nresult = evaluate(expected=[...], actual=actual)\n```\n\nAlso available: `from_anthropic()` and `from_gemini()`.\n\n### CLI\n\n```bash\n# Both `toolscore` and `tool-scorer` work\ntoolscore eval gold_calls.json trace.json\n\n# Simplified output by default, use --verbose for full detail\ntoolscore eval gold_calls.json trace.json --verbose\n\n# HTML report\ntoolscore eval gold_calls.json trace.json --html report.html\n\n# Compare multiple models\ntoolscore compare gold.json gpt4.json claude.json -n gpt-4 -n claude-3\n```\n\n### Pytest Integration\n\n```python\n# test_my_agent.py\ndef test_agent_accuracy(toolscore_eval, toolscore_assertions):\n    \"\"\"Test that agent achieves high accuracy.\"\"\"\n    result = toolscore_eval(\"gold_calls.json\", \"trace.json\")\n    toolscore_assertions.assert_selection_accuracy(result, min_accuracy=0.9)\n    toolscore_assertions.assert_argument_f1(result, min_f1=0.8)\n```\n\nOr use the simpler `assert_tools`:\n\n```python\nfrom toolscore import assert_tools\n\ndef test_agent():\n    assert_tools(\n        expected=[{\"tool\": \"search\", \"args\": {\"q\": \"test\"}}],\n        actual=my_agent_output,\n        min_score=0.9,\n    )\n```\n\n## When to Use Toolscore vs. Alternatives\n\n| Use case | Recommendation |\n|----------|---------------|\n| **Fast, deterministic tool-call checks in CI without API costs** | **Toolscore** |\n| **Comprehensive LLM evaluation across multiple dimensions** (hallucination, toxicity, RAG, tool calls, etc.) | [DeepEval](https://github.com/confident-ai/deepeval) |\n| **RAG pipeline evaluation** (retrieval quality, answer faithfulness) | [Ragas](https://github.com/explodinggradients/ragas) |\n| **Government/safety-focused AI evaluation** | [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) |\n| **Tracing and observability for LangChain apps** | [LangSmith](https://smith.langchain.com/) |\n\nToolscore does **one thing well**: it checks whether your agent called the right tools with the right arguments, deterministically, with zero cost. If you need broader LLM evaluation, the tools above are excellent choices.\n\n## Metrics\n\nThe composite `result.score` is a weighted average of these core metrics:\n\n| Metric | Weight | Description |\n|--------|--------|-------------|\n| **Selection Accuracy** | 40% | Did the agent call the right tools? |\n| **Argument F1** | 30% | Did it pass the right arguments? |\n| **Sequence Accuracy** | 20% | Were tools called in the right order? |\n| **Redundancy** (inverted) | 10% | Were there unnecessary duplicate calls? |\n\nAccess individual metrics via properties:\n\n```python\nresult.score               # Weighted composite (0.0 - 1.0)\nresult.selection_accuracy  # Tool name matching\nresult.argument_f1         # Argument precision/recall\nresult.sequence_accuracy   # Order correctness\n```\n\nCustom weights are supported:\n\n```python\nresult = evaluate(\n    expected=[...],\n    actual=[...],\n    weights={\"selection_accuracy\": 0.5, \"argument_f1\": 0.5, \"sequence_accuracy\": 0.0, \"redundant_rate\": 0.0},\n)\n```\n\n### Additional Metrics (verbose mode)\n\nWhen using `--verbose` or the file-based API, Toolscore also reports:\n- Tool Invocation Accuracy\n- Tool Correctness (coverage of expected tools)\n- Trajectory Accuracy (multi-step path analysis)\n- Detailed failure analysis with actionable error messages\n\n## Regression Testing\n\nCatch performance degradation in CI:\n\n```bash\n# Save a baseline\ntoolscore eval gold.json trace.json --save-baseline baseline.json\n\n# Check for regressions (fails if accuracy drops \u003e5%)\ntoolscore regression baseline.json new_trace.json --gold-file gold.json\n```\n\n## GitHub Action\n\n```yaml\n- uses: yotambraun/toolscore@v1\n  with:\n    gold-file: tests/gold_standard.json\n    trace-file: tests/agent_trace.json\n    threshold: '0.90'\n```\n\n## Supported Formats\n\n| Provider | Format | Auto-detected |\n|----------|--------|---------------|\n| OpenAI | `tool_calls` / `function_call` | Yes |\n| Anthropic | `tool_use` content blocks | Yes |\n| Google Gemini | `functionCall` parts | Yes |\n| MCP | JSON-RPC 2.0 | Yes |\n| LangChain | `tool` / `tool_input` | Yes |\n| Custom | `{\"calls\": [{\"tool\": ..., \"args\": ...}]}` | Yes |\n\n## Advanced Features\n\nThese features are available for users who need them:\n\n- **LLM-as-a-Judge**: `--llm-judge` flag for semantic tool name matching (requires OpenAI API key)\n- **Schema Validation**: Validate argument types, ranges, and patterns against schemas\n- **Side-Effect Validation**: Check HTTP responses, filesystem state, database rows\n- **Cost Tracking**: Token usage and pricing estimation for OpenAI/Anthropic/Gemini\n- **Multiple report formats**: HTML, JSON, CSV, Markdown\n- **Synthetic test generation**: `toolscore generate --from-openai functions.json`\n\n## File-Based API\n\nThe original file-based API is still fully supported:\n\n```python\nfrom toolscore import evaluate_trace\n\nresult = evaluate_trace(\n    gold_file=\"gold_calls.json\",\n    trace_file=\"trace.json\",\n    format=\"auto\",\n)\nprint(result.score)\nprint(result.selection_accuracy)\n```\n\n## Development\n\n```bash\npip install -e \".[dev]\"\npytest\nruff check toolscore\nmypy toolscore\n```\n\n## License\n\nApache License 2.0 - see [LICENSE](LICENSE) for details.\n\n## Citation\n\n```bibtex\n@software{toolscore,\n  title = {Toolscore: Lightweight Tool-Call Testing for LLM Agents},\n  author = {Yotam Braun},\n  year = {2025},\n  url = {https://github.com/yotambraun/toolscore}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyotambraun%2Ftoolscore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyotambraun%2Ftoolscore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyotambraun%2Ftoolscore/lists"}