https://github.com/outsourc-e/bench-loop-web

BenchLoop web app + public site for bench-loop.com — FastAPI backend, local React dashboard, static marketing site.
https://github.com/outsourc-e/bench-loop-web

benchmark fastapi leaderboard llm local-llm react

Last synced: about 2 months ago
JSON representation

BenchLoop web app + public site for bench-loop.com — FastAPI backend, local React dashboard, static marketing site.

Host: GitHub
URL: https://github.com/outsourc-e/bench-loop-web
Owner: outsourc-e
Created: 2026-05-12T20:42:08.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-05-23T07:54:11.000Z (2 months ago)
Last Synced: 2026-05-23T09:12:02.756Z (2 months ago)
Topics: benchmark, fastapi, leaderboard, llm, local-llm, react
Language: TypeScript
Homepage: https://bench-loop.com
Size: 1.76 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # BenchLoop Web

The web surface for [BenchLoop](https://bench-loop.com) — a local-first benchmark suite for LLM models that scores **quality, speed, and reliability** across seven fixed task suites (`speed`, `toolcall`, `coding`, `dataextract`, `instructfollow`, `reasonmath`, `agent`).

Pick a model on any reachable endpoint (Ollama, LM Studio, Osaurus, vLLM, oMLX, Jan, or any OpenAI-compatible server), pick the suites, hit Run, watch live progress, then compare results in the leaderboard.

## Architecture

```

bench-loop-web/

  api/    FastAPI app (uvicorn) wrapping the bench-loop runner

  ui/     React + Vite frontend

```

The API delegates to `bench-loop/` (sibling repo) for the actual benchmark logic. Runs are persisted to `~/.bench-loop/runs/` so they survive restarts and show up in the leaderboard from disk.

## Quick start (dev)

Two long-running processes:

```bash

# 1. API (port 8877)

cd bench-loop-web/api

PYTHONPATH=/Users/aurora/.ocplatform/workspace/bench-loop \

BENCH_LOOP_DIR=/Users/aurora/.ocplatform/workspace/bench-loop \

  /Users/aurora/.ocplatform/workspace/bench-loop/.venv/bin/python \

  -m uvicorn main:app --host 127.0.0.1 --port 8877 --app-dir .

# 2. UI (port 5180)

cd bench-loop-web/ui

npm install

npx vite --host 127.0.0.1 --port 5180

```

Open .

## Pages

| Path | Purpose |

|---|---|

| `/` `/models` | Auto-detect local providers, browse model catalog, jump to benchmark |

| `/chat` | Quick chat against any reachable model |

| `/benchmark` | Pick model + suites + harness, run with live progress |

| `/leaderboard` | Best run per model+harness, rank by overall/quality/speed/tok-s/efficiency. Click row for detail, hit Compare per row |

| `/runs/:runId` | Full per-suite scores, speed metrics, machine info, raw JSON |

| `/compare?a=&b=` | Two runs side-by-side with deltas across every metric |

| `/stacks` | Stack-oriented context-window leaderboard |

## API endpoints

| Route | What |

|---|---|

| `GET  /api/health` | Liveness |

| `GET  /api/hardware` | Local machine info (CPU, GPU, memory) |

| `GET  /api/models?endpoint=...` | List models. If endpoint omitted, auto-probe localhost for Ollama (11434), LM Studio (1234), oMLX/Osaurus (8000), Jan (1337), vLLM (8080) |

| `GET  /api/models/preflight?endpoint=...&model=...` | Verify a model can actually load |

| `GET  /api/models/search-hf?q=&limit=` | Search Hugging Face |

| `GET  /api/models/hf-details?repo=` | HF repo metadata |

| `POST /api/models/pull` | Trigger a model pull |

| `GET  /api/models/pull/active` | List in-flight pulls |

| `GET  /api/models/pull/{id}/stream` | SSE for pull progress |

| `POST /api/benchmark/run` | Start a benchmark. Body: `{model, endpoint, provider, suites[], harness}` |

| `GET  /api/benchmark/runs` | List persisted runs with v2 speed-score recompute |

| `GET  /api/benchmark/runs/{runId}` | Run detail (active or persisted) |

| `GET  /api/benchmark/stream/{runId}` | SSE for live progress |

| `POST /api/chat/generate` | Passthrough chat completion |

## Providers

Provider type is auto-detected per model and passed to the runner:

- `ollama` — Ollama's `/api/chat` (default for `http://localhost:11434` and any tunnelled Ollama)

- `openai_compat` — Any OpenAI-compatible `/v1/chat/completions`: LM Studio, vLLM, Osaurus/MLX, Jan, oMLX, hosted endpoints

The UI's BenchmarkTab picks the correct provider based on the chosen model's source — no manual selection needed.

## Harnesses

Wrap the same task in different prompt/parse contracts so you can A/B "this model with raw tools" vs "this model with Hermes tags":

- `raw` — vanilla OpenAI-style tools, no prompt rewriting

- `hermes` — NousResearch `{...}` XML tags

- `qwen` — Qwen3 `{...}` tags

- `pi` — OpenClaw/Pi-style `...` + Hermes tags

## What ships in v1

- ✅ Six fixed task suites, deterministic + reproducible

- ✅ Live SSE progress per task

- ✅ Provider auto-detect (Ollama + OpenAI-compatible)

- ✅ Run persistence + leaderboard from disk

- ✅ Per-run detail + side-by-side compare

- ✅ Speed-score v2 curve (anchored on real M-series/RTX reference points)

- ✅ Preflight model-load check with actionable diagnostics

- ⏳ True streaming TTFT (currently 0 for openai_compat; requires streaming pass)

- ⏳ Hosted leaderboard at bench-loop.com

- ⏳ Community submission flow

## License

TBD before the public launch.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/outsourc-e/bench-loop-web

Awesome Lists containing this project

README