{"id":45939737,"url":"https://github.com/acodercat/cave-bench","last_synced_at":"2026-02-28T10:31:04.253Z","repository":{"id":332351340,"uuid":"1103735301","full_name":"acodercat/cave-bench","owner":"acodercat","description":"A benchmarking framework for evaluating CaveAgent tool calling, stateful management, and JSON-based tool calling.","archived":false,"fork":false,"pushed_at":"2026-01-21T13:58:18.000Z","size":15923,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-01-22T01:48:49.195Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/acodercat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-25T09:11:39.000Z","updated_at":"2026-01-21T14:02:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/acodercat/cave-bench","commit_stats":null,"previous_names":["acodercat/cave-bench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/acodercat/cave-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/acodercat%2Fcave-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/acodercat%2Fcave-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/acodercat%2Fcave-bench/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/acodercat%2Fcave-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/acodercat","download_url":"https://codeload.github.com/acodercat/cave-bench/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/acodercat%2Fcave-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29930344,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T09:58:13.507Z","status":"ssl_error","status_checked_at":"2026-02-28T09:57:57.047Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-28T10:31:00.775Z","updated_at":"2026-02-28T10:31:04.245Z","avatar_url":"https://github.com/acodercat.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cave-bench\n\nBenchmarking framework for evaluating [CaveAgent](https://github.com/acodercat/cave-agent) tool calling, stateful management, and JSON-based tool calling.\n\n## Installation\n\n```bash\nuv sync\n```\n\n## Configuration\n\nCopy `.env.example` to `.env` and add your API keys:\n\n```bash\ncp .env.example .env\n```\n\n```bash\n# 
.env\nDEEPSEEK_API_KEY=your-api-key\nDEEPSEEK_MODEL_ID=deepseek-chat\nDEEPSEEK_BASE_URL=https://api.deepseek.com/v1\nDEEPSEEK_TEMPERATURE=0.3\n```\n\n## Quick Start\n\nRun benchmarks using module syntax:\n\n```bash\n# Function calling benchmarks\npython -m scripts.function_calling           # Run both agent types\npython -m scripts.function_calling -a cave   # CaveAgent (Python code execution)\npython -m scripts.function_calling -a json   # LiteLLM (JSON function calling)\n\n# Other benchmarks\npython -m scripts.data_analysis     # Data analysis benchmarks\npython -m scripts.smart_home        # Smart home benchmarks\n```\n\nEdit the `BENCHMARKS` list in each script to select which benchmarks to run.\n\n## Benchmark Structure\n\n### JSON Schema\n\n```json\n{\n  \"name\": \"scenario_name\",\n  \"module\": \"evals.data_analysis.MyDataset.my_analysis\",\n  \"requirements\": \"Optional task requirements\",\n  \"conversations\": [\n    {\n      \"id\": \"test_1\",\n      \"turns\": [\n        {\n          \"query\": \"Analyze the dataset...\",\n          \"validator\": \"validate_q1\",\n          \"expected_variable_reads\": [\"df\"],\n          \"expected_variable_writes\": [\"result\"]\n        }\n      ]\n    }\n  ]\n}\n```\n\n### Python Module\n\n```python\nfrom typing import List\nfrom cave_agent.python_runtime import Variable, PythonRuntime\nfrom core.validation import ValidatorResult\nfrom core.types import Turn, ToolCall\nimport pandas as pd\n\ndf = pd.read_csv(\"path/to/dataset.csv\")\n\ndef validate_q1(\n    response: str,\n    runtime: PythonRuntime,\n    turn: Turn,\n    actual_calls: List[ToolCall]\n) -\u003e ValidatorResult:\n    result = runtime.retrieve(\"result\")\n    if result == expected_value:\n        return ValidatorResult(True, \"Correct!\")\n    return ValidatorResult(False, f\"Expected {expected_value}, got {result}\")\n\ntools = []\nvariables = [Variable(\"df\", df, \"Dataset description\")]\nvalidators = {\"validate_q1\": 
validate_q1}\n```\n\n## Metrics\n\n- **Success Rate**: Percentage of successful turns\n- **Function Calls**: Missing calls, wrong argument types/values\n- **Variables**: Missing reads/writes\n- **Steps**: Total steps taken\n- **Tokens**: Total tokens consumed\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a PR.\n\n## License\n\nMIT License - see [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Facodercat%2Fcave-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Facodercat%2Fcave-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Facodercat%2Fcave-bench/lists"}