https://github.com/outsourc-e/bench-loop-web
BenchLoop web app + public site for bench-loop.com — FastAPI backend, local React dashboard, static marketing site.
benchmark fastapi leaderboard llm local-llm react
- Host: GitHub
- URL: https://github.com/outsourc-e/bench-loop-web
- Owner: outsourc-e
- Created: 2026-05-12T20:42:08.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-23T07:54:11.000Z (about 1 month ago)
- Last Synced: 2026-05-23T09:12:02.756Z (about 1 month ago)
- Topics: benchmark, fastapi, leaderboard, llm, local-llm, react
- Language: TypeScript
- Homepage: https://bench-loop.com
- Size: 1.76 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# BenchLoop Web
The web surface for [BenchLoop](https://bench-loop.com), a local-first benchmark suite for LLMs that scores **quality, speed, and reliability** across seven fixed task suites (`speed`, `toolcall`, `coding`, `dataextract`, `instructfollow`, `reasonmath`, `agent`).
Pick a model on any reachable endpoint (Ollama, LM Studio, Osaurus, vLLM, oMLX, Jan, or any OpenAI-compatible server), pick the suites, hit Run, watch live progress, then compare results in the leaderboard.
## Architecture
```
bench-loop-web/
  api/   FastAPI app (uvicorn) wrapping the bench-loop runner
  ui/    React + Vite frontend
```
The API delegates to `bench-loop/` (sibling repo) for the actual benchmark logic. Runs are persisted to `~/.bench-loop/runs/` so they survive restarts and show up in the leaderboard from disk.
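
Reading runs back from disk could look roughly like the sketch below. It assumes each persisted run is a single JSON file directly under `~/.bench-loop/runs/`; the on-disk layout is not documented here, and `load_runs` is a hypothetical helper, not the API's actual loader.

```python
import json
from pathlib import Path

# Assumption: one JSON file per run, stored directly under this directory.
RUNS_DIR = Path.home() / ".bench-loop" / "runs"


def load_runs(runs_dir: Path = RUNS_DIR) -> list[dict]:
    """Return every readable run found on disk, ordered by file name."""
    runs: list[dict] = []
    if not runs_dir.is_dir():
        return runs  # nothing persisted yet
    for path in sorted(runs_dir.glob("*.json")):
        try:
            runs.append(json.loads(path.read_text(encoding="utf-8")))
        except (OSError, json.JSONDecodeError):
            # Skip partial or corrupt files instead of failing the whole listing.
            continue
    return runs
```

Skipping unreadable files keeps a listing usable even if a run was interrupted while it was being written.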
## Quick start (dev)
Two long-running processes:
```bash
# 1. API (port 8877)
cd bench-loop-web/api
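# The absolute paths below come from the author's machine; point them at your
# own bench-loop checkout and its virtualenv.
PYTHONPATH=/Users/aurora/.ocplatform/workspace/bench-loop \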
BENCH_LOOP_DIR=/Users/aurora/.ocplatform/workspace/bench-loop \
/Users/aurora/.ocplatform/workspace/bench-loop/.venv/bin/python \
-m uvicorn main:app --host 127.0.0.1 --port 8877 --app-dir .
# 2. UI (port 5180)
cd bench-loop-web/ui
npm install
npx vite --host 127.0.0.1 --port 5180
```
Open the UI in a browser at the host and port passed to Vite above (`127.0.0.1:5180`).
## Pages
| Path | Purpose |
|---|---|
| `/` `/models` | Auto-detect local providers, browse model catalog, jump to benchmark |
| `/chat` | Quick chat against any reachable model |
| `/benchmark` | Pick model + suites + harness, run with live progress |
| `/leaderboard` | Best run per model+harness, ranked by overall/quality/speed/tok-s/efficiency. Click a row for detail; each row also has a Compare action |
| `/runs/:runId` | Full per-suite scores, speed metrics, machine info, raw JSON |
| `/compare?a=&b=` | Two runs side-by-side with deltas across every metric |
| `/stacks` | Stack-oriented context-window leaderboard |
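
The leaderboard and compare pages boil down to two small operations: keep the best run per model+harness, and subtract one run's metrics from another's. The sketch below illustrates both on plain dictionaries. The field names (`model`, `harness`, `overall`, `speed`) are assumptions made for illustration, not the persisted schema.

```python
from typing import Iterable


def best_runs(runs: Iterable[dict], metric: str = "overall") -> list[dict]:
    """Keep the highest-scoring run per (model, harness), ranked by `metric`."""
    best: dict[tuple, dict] = {}
    for run in runs:
        key = (run["model"], run["harness"])
        if key not in best or run[metric] > best[key][metric]:
            best[key] = run
    return sorted(best.values(), key=lambda r: r[metric], reverse=True)


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def metric_deltas(a: dict, b: dict) -> dict[str, float]:
    """Return b - a for every numeric metric present in both runs."""
    return {k: b[k] - a[k] for k in a if k in b and _is_number(a[k]) and _is_number(b[k])}
```

For example, `metric_deltas(run_a, run_b)` yields a positive value wherever run `b` scored higher than run `a`.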
## API endpoints
| Route | What |
|---|---|
| `GET /api/health` | Liveness |
| `GET /api/hardware` | Local machine info (CPU, GPU, memory) |
| `GET /api/models?endpoint=...` | List models. If `endpoint` is omitted, auto-probes localhost for Ollama (11434), LM Studio (1234), oMLX/Osaurus (8000), Jan (1337), vLLM (8080) |
| `GET /api/models/preflight?endpoint=...&model=...` | Verify a model can actually load |
| `GET /api/models/search-hf?q=&limit=` | Search Hugging Face |
| `GET /api/models/hf-details?repo=` | HF repo metadata |
| `POST /api/models/pull` | Trigger a model pull |
| `GET /api/models/pull/active` | List in-flight pulls |
| `GET /api/models/pull/{id}/stream` | SSE for pull progress |
| `POST /api/benchmark/run` | Start a benchmark. Body: `{model, endpoint, provider, suites[], harness}` |
| `GET /api/benchmark/runs` | List persisted runs with v2 speed-score recompute |
| `GET /api/benchmark/runs/{runId}` | Run detail (active or persisted) |
| `GET /api/benchmark/stream/{runId}` | SSE for live progress |
| `POST /api/chat/generate` | Passthrough chat completion |
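
As a rough client-side sketch of the benchmark routes above, the code below builds the `POST /api/benchmark/run` request from the documented body fields and parses `data:` lines from an SSE stream. The base URL reuses the quick-start host and port. The model name in the usage note is a placeholder, and the shape of the response and of each event payload is not documented here, so the parser returns raw strings.

```python
import json
from urllib.request import Request

API_BASE = "http://127.0.0.1:8877"  # host and port from the quick start


def build_run_request(model, endpoint, provider, suites, harness="raw", base=API_BASE):
    """Build (but do not send) the POST for /api/benchmark/run."""
    body = {
        "model": model,
        "endpoint": endpoint,
        "provider": provider,
        "suites": list(suites),
        "harness": harness,
    }
    return Request(
        f"{base}/api/benchmark/run",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buf = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format allows one optional space after the colon.
            buf.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buf:
            yield "\n".join(buf)  # a blank line terminates the event
            buf = []
```

A request built with `build_run_request("some-model", "http://localhost:11434", "ollama", ["speed"])` can be sent with `urllib.request.urlopen`, and the decoded lines of the `/api/benchmark/stream/{runId}` response can be fed to `iter_sse_data`.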
## Providers
Provider type is auto-detected per model and passed to the runner:
- `ollama` — Ollama's `/api/chat` (default for `http://localhost:11434` and any tunnelled Ollama)
- `openai_compat` — Any OpenAI-compatible `/v1/chat/completions`: LM Studio, vLLM, Osaurus/MLX, Jan, oMLX, hosted endpoints
The UI's BenchmarkTab picks the correct provider based on the chosen model's source — no manual selection needed.
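
To make the provider split concrete, here is a simplified sketch that guesses the provider from an endpoint's port, using the default ports listed in the API table. This is an assumption-laden stand-in: the UI decides from the model's source, and a port check alone would misclassify a tunnelled Ollama on a different port. `guess_provider` and `probe_candidates` are hypothetical names.

```python
from urllib.parse import urlsplit

# Default localhost ports, as listed for the auto-probe in the API table.
PROBE_PORTS = {
    "Ollama": 11434,
    "LM Studio": 1234,
    "oMLX/Osaurus": 8000,
    "Jan": 1337,
    "vLLM": 8080,
}


def guess_provider(endpoint: str) -> str:
    """Port-based guess only: Ollama's default port maps to `ollama`."""
    port = urlsplit(endpoint).port
    return "ollama" if port == PROBE_PORTS["Ollama"] else "openai_compat"


def probe_candidates(host: str = "localhost") -> dict[str, str]:
    """Base URLs an auto-probe could try, keyed by server name."""
    return {name: f"http://{host}:{port}" for name, port in PROBE_PORTS.items()}
```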
## Harnesses
Harnesses wrap the same task in different prompt/parse contracts, so you can A/B "this model with raw tools" vs "this model with Hermes tags":
- `raw` — vanilla OpenAI-style tools, no prompt rewriting
- `hermes` — NousResearch `<tool_call>{...}</tool_call>` XML tags
- `qwen` — Qwen3 `{...}` tags
- `pi` — OpenClaw/Pi-style `...` + Hermes tags
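
The "parse" half of a harness contract can be illustrated with a minimal extractor for Hermes-style tool calls. It assumes the `<tool_call>` tag name from the published Hermes function-calling format with a JSON object inside. It is a sketch, not the runner's parser, and `parse_tool_calls` is a hypothetical name.

```python
import json
import re

# Assumed tag name, taken from the Hermes function-calling format.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract every well-formed JSON tool call wrapped in <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # A malformed call is dropped here; a scorer might count it as a failure.
            continue
    return calls
```

With the `raw` harness no such parsing is needed, since tool calls arrive as structured fields in the OpenAI-style response rather than inside the text.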
## What ships in v1
- ✅ Seven fixed task suites, deterministic + reproducible
- ✅ Live SSE progress per task
- ✅ Provider auto-detect (Ollama + OpenAI-compatible)
- ✅ Run persistence + leaderboard from disk
- ✅ Per-run detail + side-by-side compare
- ✅ Speed-score v2 curve (anchored on real M-series/RTX reference points)
- ✅ Preflight model-load check with actionable diagnostics
- ⏳ True streaming TTFT (currently 0 for openai_compat; requires streaming pass)
- ⏳ Hosted leaderboard at bench-loop.com
- ⏳ Community submission flow
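
For the pending TTFT item, a streaming pass would need to time the arrival of the first non-empty chunk. The sketch below shows that measurement over any iterable of text chunks. It is an illustration under assumptions: how chunks are obtained from an OpenAI-compatible stream is left out, and `measure_ttft` is a hypothetical helper.

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[Optional[float], str]:
    """Consume a stream; return (seconds until first non-empty chunk, full text)."""
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = clock() - start  # first visible token has arrived
        parts.append(chunk)
    return ttft, "".join(parts)
```

The clock is injectable so the timing logic can be checked without a live model. A result of `None` means no non-empty chunk arrived.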
## License
TBD before the public launch.