{"id":47683625,"url":"https://github.com/zenprocess/pawbench","last_synced_at":"2026-04-02T14:23:17.101Z","repository":{"id":347180920,"uuid":"1193130442","full_name":"zenprocess/pawbench","owner":"zenprocess","description":"PawBench - 4-dimensional LLM inference benchmark. Multi-turn, multi-agent, parallel dispatch with tool calling. Inspired by Lola (@_justlolathings).","archived":false,"fork":false,"pushed_at":"2026-03-27T09:29:18.000Z","size":255,"stargazers_count":0,"open_issues_count":7,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-03-27T11:14:49.195Z","etag":null,"topics":["ai","benchmark","inference","llm","machine-learning","openai","tool-calling","vllm"],"latest_commit_sha":null,"homepage":"https://www.instagram.com/_justlolathings/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zenprocess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-26T22:51:25.000Z","updated_at":"2026-03-27T09:29:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zenprocess/pawbench","commit_stats":null,"previous_names":["zenprocess/pawbench"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/zenprocess/pawbench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zenprocess%2Fpawbench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zenprocess%2Fpawbench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zenprocess%2Fpawbench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zenprocess%2Fpawbench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zenprocess","download_url":"https://codeload.github.com/zenprocess/pawbench/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zenprocess%2Fpawbench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31307853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","benchmark","inference","llm","machine-learning","openai","tool-calling","vllm"],"created_at":"2026-04-02T14:23:14.939Z","updated_at":"2026-04-02T14:23:17.086Z","avatar_url":"https://github.com/zenprocess.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/pawbench.png\" alt=\"PawBench\" width=\"200\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003ePawBench\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eBecause your model deserves a benchmark with more bark than bite.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  4-dimensional LLM inference benchmark.\u003cbr\u003e\n  Multi-turn, multi-agent, parallel dispatch with tool calling.\n\u003c/p\u003e\n\n\u003cbr\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/zenprocess/pawbench/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/zenprocess/pawbench/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://codecov.io/gh/zenprocess/pawbench\"\u003e\u003cimg src=\"https://codecov.io/gh/zenprocess/pawbench/graph/badge.svg\" alt=\"codecov\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://sonarcloud.io/summary/new_code?id=zenprocess_pawbench\"\u003e\u003cimg src=\"https://sonarcloud.io/api/project_badges/measure?project=zenprocess_pawbench\u0026metric=alert_status\" alt=\"Quality Gate\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/pawbench/\"\u003e\u003cimg src=\"https://badge.fury.io/py/pawbench.svg\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/downloads/\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.10+-blue.svg\" alt=\"Python\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://zenprocess.github.io/pawbench/\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-GitHub%20Pages-blue\" alt=\"Docs\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cbr\u003e\n\n---\n\n\u003cbr\u003e\n\n## About\n\nPawBench tests LLMs with **realistic coding agent workloads** — not synthetic single-turn completions.\n\nIt simulates what actually happens when you deploy coding agents: multi-turn conversations, parallel tool calling, mid-task steering events, and cross-agent coordination. Then it measures four dimensions: **throughput**, **quality**, **efficiency**, and **adaptability**.\n\nWorks against any OpenAI-compatible endpoint — vLLM, TGI, OpenAI, Ollama, LMStudio.\n\n\u003cbr\u003e\n\n## Meet Lola\n\nPawBench is inspired by **Lola** ([@_justlolathings](https://www.instagram.com/_justlolathings/)) — the most fashionable pup on Instagram.\n\nThe built-in scenarios revolve around building her boutique dog apparel store, *PawStyle by Lola*. Every product, every size guide, every \"Lola's Pick\" badge traces back to this style icon on four legs.\n\nFollow Lola: [instagram.com/_justlolathings](https://www.instagram.com/_justlolathings/)\n\n\u003cbr\u003e\n\n## Install\n\n```bash\npip install pawbench\n# or\nuv pip install pawbench\n```\n\n\u003cbr\u003e\n\n## Quick Start\n\n```bash\n# Benchmark your local vLLM\npawbench --endpoint http://localhost:8000\n\n# Against any OpenAI-compatible endpoint\npawbench --endpoint https://api.openai.com/v1 --tag gpt4o\n\n# Just throughput saturation (no scenarios)\npawbench --saturation-only --concurrency 1,2,4,8,16\n\n# JSON output for CI/autoresearch\npawbench --json --output results/\n\n# Custom scenario\npawbench --scenario my_scenario.json\n```\n\n\u003cbr\u003e\n\n## What It Measures\n\n### 4 Dimensions\n\n| Dimension | Metrics |\n|---|---|\n| **Throughput** | Single-agent tok/s, parallel saturation curve (1-\u003eN), TTFT, peak concurrency |\n| **Quality** | Tool call accuracy, instruction following, format compliance, keyword matching |\n| **Efficiency** | Useful token ratio (code in tool args vs filler preamble), tokens per turn |\n| **Adaptability** | Steering event response, mid-conversation context injection, nudge quality delta |\n\n\u003cbr\u003e\n\n### Built-in Scenarios: PawStyle by Lola\n\nTwo parallel agents build Lola's boutique dog apparel e-commerce store — *\"Where every pup is a fashionista\"*:\n\n- **`pawstyle-independent`** — Frontend and backend work independently on Lola's shop. Pure parallel throughput + quality baseline.\n- **`pawstyle`** — Backend gets a steering event mid-task (\"frontend added a Size Guide button — implement Lola's breed-specific sizing endpoint\").\n- **`pawstyle-nudge`** — Frontend adds Lola's Favorites (wishlist) and Compare features that require backend changes. Backend receives nudges and adapts.\n\nEach scenario is 3 turns x 2 agents, with tool calls (`write_file`, `read_file`, `run_command`) and injected tool results. Products include Lola's Signature Bandana, Cozy Knit Sweater, Rainy Day Raincoat, Adventure Booties, Dapper Bow Tie, and Walk-in-Style Harness — with \"Lola's Pick\" badges on her personal favorites.\n\n\u003cbr\u003e\n\n### Server Metrics (optional)\n\nIf the endpoint exposes `/metrics` (vLLM, TGI), PawBench scrapes:\n\n- KV cache usage and prefix cache hit rate\n- Speculative decoding acceptance rate\n- GPU cache pressure\n\n\u003cbr\u003e\n\n## Custom Scenarios\n\nScenarios are JSON files:\n\n```json\n{\n  \"id\": \"my-scenario\",\n  \"name\": \"My Custom Scenario\",\n  \"agents\": [\n    {\n      \"id\": \"agent-1\",\n      \"name\": \"My Agent\",\n      \"turns\": [\n        {\n          \"turn\": 1,\n          \"role\": \"user\",\n          \"content\": \"Build a REST API with Flask...\",\n          \"tools\": [\"write_file\"],\n          \"expect\": {\n            \"tool_calls_min\": 1,\n            \"tool_name_any\": [\"write_file\"],\n            \"output_mentions\": [\"flask\", \"api\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"tools_schema\": [...]\n}\n```\n\n\u003cbr\u003e\n\n## Comparing Configs\n\n```bash\npawbench --tag baseline --output results/\n# ... change model config ...\npawbench --tag eagle3 --output results/\n\npawbench-compare results/pawbench_baseline_*.json results/pawbench_eagle3_*.json\n```\n\n\u003cbr\u003e\n\n## Output Format\n\nJSON results include full model card (architecture, quantization, GPU, serving params) for reproducibility:\n\n```json\n{\n  \"tag\": \"fp8-eagle3-spec3\",\n  \"model_card\": {\n    \"model_name\": \"qwen3-coder\",\n    \"model_config\": {\"architectures\": [\"Qwen3NextForCausalLM\"], \"num_experts\": 512},\n    \"tuning\": {\"kv_cache_dtype\": \"fp8_e4m3\", \"speculative_config\": \"eagle3\"},\n    \"gpu\": {\"name\": \"NVIDIA GB10\"}\n  },\n  \"dim1_throughput\": {\"avg_single_tok_s\": 69.0, \"raw_peak_tok_s\": 469.3},\n  \"dim2_quality\": {\"avg_quality\": 0.81, \"tool_accuracy\": 0.96},\n  \"saturation_curve\": [{\"concurrency\": 1, \"tok_s\": 69.3}, {\"concurrency\": 8, \"tok_s\": 469.3}],\n  \"server_metrics\": {\"spec_acceptance_rate\": 0.72, \"gpu_prefix_cache_hit_rate\": 0.92}\n}\n```\n\n\u003cbr\u003e\n\n## Why PawBench Exists\n\nKarpathy's [autoresearch](https://github.com/karpathy/autoresearch) showed that an AI agent can autonomously run ML experiments overnight — modify, train, evaluate, repeat. PawBench extends that idea to **inference serving**: what if an agent could autonomously tune your model config, benchmark it, and keep the best result?\n\nThe problem is that LLM serving optimization is gatekept. The best configs — speculative decoding heads, MoE kernel tuning, KV cache quantization strategies — live in private Discord channels and undocumented tribal knowledge. A team with an H100 cluster can spend weeks finding the right settings. A solo dev with a single GPU doesn't have that luxury.\n\nPawBench is the benchmark harness for that loop. Run it, change your config, run it again, compare. The [Serving Card](https://servingcard.dev) initiative takes it further — standardizing how model serving configs are documented and shared, so the community can build on each other's work instead of rediscovering the same optimizations in isolation.\n\nDemocratize the configs. Benchmark everything. Share what works.\n\n\u003cbr\u003e\n\n## Disclaimer\n\nThis project has been entirely vibe coded. Two humans, several AI agents, one very fashionable dog, and a mass of mass energy that mass produced some mass code. If something breaks, it was probably the cat's fault (see [commit history](https://github.com/zenprocess/pawbench/commit/9d36c56)).\n\n\u003cbr\u003e\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzenprocess%2Fpawbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzenprocess%2Fpawbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzenprocess%2Fpawbench/lists"}