{"id":50450654,"url":"https://github.com/outsourc-e/bench-loop","last_synced_at":"2026-06-01T00:01:29.903Z","repository":{"id":357457654,"uuid":"1237054274","full_name":"outsourc-e/bench-loop","owner":"outsourc-e","description":"Local-first CLI for benchmarking LLMs on real hardware — quality, speed, reliability, and a real multi-turn agent loop.","archived":false,"fork":false,"pushed_at":"2026-05-23T06:03:39.000Z","size":749,"stargazers_count":25,"open_issues_count":0,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T08:11:09.455Z","etag":null,"topics":["agent","benchmark","cli","evaluation","llm","local-llm","mlx","ollama","vllm"],"latest_commit_sha":null,"homepage":"https://bench-loop.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/outsourc-e.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":"AUDIT-2026-04-24.md","citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-12T20:41:28.000Z","updated_at":"2026-05-23T06:03:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/outsourc-e/bench-loop","commit_stats":null,"previous_names":["outsourc-e/bench-loop"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/outsourc-e/bench-loop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/outsourc-e%2Fbench-loop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/outsourc-e%2Fbench-loop/tags","releases_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/outsourc-e%2Fbench-loop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/outsourc-e%2Fbench-loop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/outsourc-e","download_url":"https://codeload.github.com/outsourc-e/bench-loop/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/outsourc-e%2Fbench-loop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33753925,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","benchmark","cli","evaluation","llm","local-llm","mlx","ollama","vllm"],"created_at":"2026-06-01T00:01:16.572Z","updated_at":"2026-06-01T00:01:29.884Z","avatar_url":"https://github.com/outsourc-e.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BenchLoop\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/outsourc-e/bench-loop-web/main/site/public/og-image.png\" alt=\"BenchLoop\" width=\"640\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://bench-loop.com\"\u003e\u003cimg src=\"https://img.shields.io/badge/site-bench--loop.com-2dd47f?style=flat-square\" 
alt=\"site\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/benchloop-cli/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/benchloop-cli?style=flat-square\u0026color=2dd47f\" alt=\"pypi\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/outsourc-e/bench-loop/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-2dd47f?style=flat-square\" alt=\"MIT\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/status-beta-eab308?style=flat-square\" alt=\"beta\" /\u003e\n\u003c/p\u003e\n\n**Benchmark local LLMs by what actually matters.**\n\nBenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware or cloud providers. It scores models across seven repeatable suites (speed, tool calling, coding, data extraction, instruction following, reasoning and math, and multi-turn agentic tool use) and gives you receipts: per-task outputs, latency, token counts, machine info, scores.\n\nNo accounts, no telemetry. Local models need no API keys; cloud providers use standard OpenAI-compatible auth. Your model, your machine (or your provider), your numbers.\n\n```\n$ benchloop run --model qwen3:8b --suites speed,toolcall,agent\n... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...\n\nOverall  73.4  ████████░░\nQuality  73.6  ████████░░\nSpeed    78.9  █████████░\nAgent    96.9  █████████▌\n```\n\nPublished runs live at \u003chttps://bench-loop.com/leaderboard\u003e. Every completed local benchmark auto-publishes there unless `BENCHLOOP_NO_SUBMIT=1` is set.\n\n## Why\n\nHosted LLM leaderboards answer *\"which model wins on a server farm someone else paid for?\"* BenchLoop answers *\"which model + harness + hardware combination actually works for me right now?\"* — the question you have when picking a local stack.\n\nIt is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. 
If you say \"qwen3:8b scored 89 on my 4090\", anyone can install BenchLoop and verify it.\n\n## Install\n\n### pipx (recommended)\n\n```bash\npipx install benchloop-cli\nbenchloop --version\n```\n\n\u003e The PyPI distribution is named `benchloop-cli` (the bare `benchloop` name was taken by an unrelated dataset library). The installed commands are still `benchloop` and `bench-loop`.\n\n### pip\n\n```bash\npip install benchloop-cli\n```\n\n### From source\n\n```bash\ngit clone https://github.com/outsourc-e/bench-loop\ncd bench-loop\npip install -e .\n```\n\n## Run your first benchmark\n\nMake sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:\n\n- Ollama at `http://localhost:11434` (default)\n- LM Studio at `http://localhost:1234` (`--provider openai_compat`)\n- MLX / Osaurus at `http://localhost:8000` (`--provider openai_compat`)\n- vLLM, Jan, llama-server, etc.\n\nThen:\n\n```bash\nbenchloop run \\\n  --model qwen3:8b \\\n  --endpoint http://localhost:11434 \\\n  --provider ollama\n```\n\nThis runs every default suite, scores them, prints a console report, and persists the full run to `~/.bench-loop/runs/`.\n\n### Run a subset\n\n```bash\nbenchloop run --model qwen3:8b --suites speed,agent\n```\n\n### Different prompting harness\n\nSame model, four ways to talk to it:\n\n```bash\nbenchloop run --model qwen3:8b --harness raw      # native tool calling\nbenchloop run --model qwen3:8b --harness hermes   # \u003ctool_call\u003e{...}\u003c/tool_call\u003e\nbenchloop run --model qwen3:8b --harness qwen     # \u003cfunction_call\u003e{...}\u003c/function_call\u003e\nbenchloop run --model qwen3:8b --harness pi       # \u003cthink\u003e...\u003c/think\u003e + Hermes tags\n```\n\n### Stamp custom hardware (e.g. 
when benchmarking through a tunnel)\n\n```bash\nbenchloop run \\\n  --model qwen3:8b \\\n  --endpoint http://localhost:11435 \\\n  --hardware \"NVIDIA RTX 4090 24GB\" \\\n  --gpu \"NVIDIA RTX 4090\" \\\n  --gpu-memory-gb 24\n```\n\n### Benchmark cloud/remote APIs\n\nWorks with any OpenAI-compatible endpoint — DashScope, OpenRouter, Together, OpenAI, vLLM with auth, sglang, etc.\n\n```bash\n# Via environment variable\nexport OPENAI_API_KEY=\"sk-...\"\nbenchloop run \\\n  --model qwen3.7-max \\\n  --provider openai_compat \\\n  --endpoint https://dashscope-intl.aliyuncs.com/compatible-mode \\\n  --remote\n\n# Or inline\nbenchloop run \\\n  --model gpt-4o \\\n  --provider openai_compat \\\n  --endpoint https://api.openai.com/v1 \\\n  --api-key sk-... \\\n  --remote\n```\n\nThe `--remote` flag (auto-detected for non-localhost endpoints) switches to cloud-aware scoring:\n- **Speed** uses streaming TTFT (time-to-first-token) + effective content tok/s\n- **Overall** = 0.50·quality + 0.25·speed + 0.25·reliability (vs local's 0.55/0.20/0.25)\n- Reasoning models: content tok/s excludes internal thinking tokens\n\n### API key auth\n\nRequired for vLLM, sglang, and most cloud providers. Two ways to provide it:\n\n```bash\n# 1. Environment variable (recommended)\nexport OPENAI_API_KEY=\"your-key-here\"\nbenchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000\n\n# 2. CLI flag\nbenchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000 --api-key your-key-here\n```\n\nThe CLI flag takes precedence over the env var. For Ollama and local providers without auth, neither is needed.\n\n### Launch the local dashboard\n\nv0.2.0+ ships the full FastAPI + React dashboard inside the wheel. After `pipx install benchloop-cli`:\n\n```bash\nbenchloop dashboard\n# → open http://127.0.0.1:8877\n```\n\nNeed it to survive browser/terminal churn? 
Print a service template instead of keeping the dashboard tied to one shell:\n\n```bash\nbenchloop dashboard --service-template launchd\nbenchloop dashboard --service-template systemd\nbenchloop dashboard --service-template windows-task\n```\n\nThis serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX/Osaurus, vLLM, Jan).\n\nFor hot-reload development against a clone of [`bench-loop-web`](https://github.com/outsourc-e/bench-loop-web):\n\n```bash\nbenchloop dashboard --dev\n```\n\n## Suites\n\n| Suite | What it scores |\n|---|---|\n| `speed` | Latency, throughput, TTFT, generation tok/s across short/medium/long contexts |\n| `toolcall` | Structured tool-call correctness across realistic tasks (weather, stocks, email, search) |\n| `coding` | Executable Python tasks verified in a sandboxed subprocess (10s timeout) |\n| `dataextract` | JSON / structured extraction from messy natural language |\n| `instructfollow` | Constraint following, formatting, exactness |\n| `reasonmath` | Small reasoning + math tasks with deterministic checks |\n| `agent` | **Multi-turn agentic tool use.** BenchLoop drives a real loop: model emits a tool call, BenchLoop executes it locally, feeds the result back, model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage. 
|\n\n## Scoring\n\n```\nLocal:  Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability\nCloud:  Overall = 0.50 · quality + 0.25 · speed + 0.25 · reliability  (with streaming speed data)\n        Overall = 0.65 · quality + 0.35 · reliability                   (no speed data)\n```\n\n- **Quality** = mean of non-speed suite scores (size-fair).\n- **Speed (local)** = `12.54 · log2(tok/s) + 0.9`, clamped to 0–100.\n- **Speed (cloud)** = 0.60 · TTFT_score + 0.40 · tok/s_score, where TTFT uses exponential decay (200ms→100, 2000ms→40) and tok/s uses a log curve calibrated for 20-150 tok/s.\n- **Reliability** = pass rate across all tasks.\n- **Agent** = `correct_final + efficient + no_hallucinated_tools + all_required_called`, 25 pts each, averaged across tasks.\n\n## Local web app\n\nA FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:\n\n```bash\nbenchloop dashboard   # starts the local web app on :5180\n```\n\nTabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.\n\n## Publish a run\n\nEvery completed benchmark auto-publishes to \u003chttps://bench-loop.com/leaderboard\u003e via `https://api.bench-loop.com/submit`. 
Runs are deduped by `(machine_id, run_id)` so the same run from the same machine won't be double-counted.\n\nOpt out:\n\n```bash\nexport BENCHLOOP_NO_SUBMIT=1\n```\n\nYou can still manually export a snapshot for sharing / archiving:\n\n```bash\nbenchloop export --output my-runs.json\n```\n\n## Architecture\n\n```\nbench-loop/                    ← this repo, the CLI + suites + scorers\n  bench_loop/\n    cli.py                     ← `benchloop` entrypoint\n    suites/                    ← speed, toolcall, coding, agent, ...\n    harness.py                 ← raw / hermes / qwen / pi adapters\n    providers/                 ← ollama, openai_compat\n    runner/orchestrator.py     ← drives suites + harnesses\n    tasks/                     ← frozen task YAML fixtures\nbench-loop-web/                ← the web app (separate repo)\n  api/                         ← FastAPI wrapper around bench_loop\n  ui/                          ← local dashboard\n  site/                        ← public bench-loop.com static site\n```\n\n## Status\n\nBenchLoop is **v0.2 beta**. The benchmark surface, scoring, web app, agent loop, four harnesses, and cloud provider support all work end-to-end. Stuff still on the roadmap:\n\n- ~~Streaming TTFT for OpenAI-compatible providers~~ ✅ (v0.2.3+ with `--remote`)\n- Bigger task fixtures (each suite is intentionally small and frozen for v1)\n- Hosted submission flow for community runs\n- Cloud-specific leaderboard on bench-loop.com (filter by local vs remote)\n- More provider adapters (TGI, Bedrock, etc. if there's demand)\n\n## License\n\nMIT. See `LICENSE`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foutsourc-e%2Fbench-loop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foutsourc-e%2Fbench-loop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foutsourc-e%2Fbench-loop/lists"}