https://github.com/drewmattie-code/pipelinescore
Measure LLM performance on your own equipment. Run a 25-task benchmark against any model with your own API key, get a deterministic score, tier badge, and a place on the public leaderboard.
- Host: GitHub
- URL: https://github.com/drewmattie-code/pipelinescore
- Owner: drewmattie-code
- License: apache-2.0
- Created: 2026-03-03T01:47:57.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-06-09T00:11:23.000Z (12 days ago)
- Last Synced: 2026-06-09T02:12:40.513Z (12 days ago)
- Topics: ai, ai-evaluation, apache-2, benchmark, claude, hardware-comparison, leaderboard, llama-cpp, llm, lm-studio, local-first, local-llm, local-models, mcp, mlx, ollama, openai, typescript, vllm
- Language: TypeScript
- Homepage: https://pipelinescore.ai
- Size: 79.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Agents: AGENTS.md
README
# PipelineScore
**Benchmark LLMs on YOUR hardware.** Same 34 deterministic tasks, scored entirely on your machine — no judge model, no API key — one 0–100 score. The only public LLM leaderboard that ranks where the model runs — not just which model it is.
[**🚀 Live at pipelinescore.ai**](https://pipelinescore.ai)
[Live leaderboard](https://pipelinescore.ai/leaderboard/users) · [Methodology](https://pipelinescore.ai/methodology) · [Privacy / BYOK posture](https://pipelinescore.ai/privacy) · [Run the CLI](https://pipelinescore.ai/run)
---
## What it looks like
```text
$ npx @pipelinescore/cli run \
--provider local --endpoint http://localhost:11434 \
--model llama-3.3-70b --hardware-tag m3-max-128gb \
--user your-handle
╭ PipelineScore v0.3.0 ──────────────────╮
│ Provider: local │
│ Model: llama-3.3-70b │
│ Hardware: m3-max-128gb │
│ Config tag: — (base model) │
│ User: your-handle │
│ Submit: yes │
╰────────────────────────────────────────╯
Fetched testpack 2026-06-10-v3 from backend.
Running 34 tasks ... ████████████████████ 34/34
╭──────────────────── PipelineScore ─────────────────────╮
│ │
│ 78.4 MAINLINE │
│ ──── │
│ │
│ code ████████░░ 79.1 tool_use ██████░░░░ 61.4 │
│ reason ███████░░ 75.8 rag ████████░░ 82.6 │
│ speed █████░░░░░ 52.3 │
│ │
│ Total tokens: 4,827 · Avg latency: 712ms │
│ See your run: pipelinescore.ai/users/your-handle │
╰────────────────────────────────────────────────────────╯
Opening your leaderboard page in your browser.
```
## Quickstart — local model (30 seconds)
If you have Ollama / LM Studio / MLX / llama.cpp running:
```bash
npx @pipelinescore/cli run \
--provider local \
--endpoint http://localhost:11434 \
--model llama-3.3-70b \
--hardware-tag m3-max-128gb \
--user your-handle
```
Swap the port for LM Studio (`1234`), llama.cpp (`8080`), MLX-Omni (`10240`), or a LiteLLM proxy (`8000`). Replace `m3-max-128gb` with a tag for your own rig, such as `rtx-4090-24gb`, `ryzen-7950x-cpu-only`, or `a100-80gb`; any combination of alphanumeric characters plus `.`, `_`, and `-` works.
The CLI runs locally, calls your model server, scores the output, and publishes the result to https://pipelinescore.ai/users/your-handle.
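The port and tag rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not code from `@pipelinescore/cli`: the names `DEFAULT_PORTS`, `isValidHardwareTag`, and `localEndpoint` are hypothetical, and the regular expression is only one reading of "alphanumeric plus `.`, `_`, `-`" (the real validator may add length limits or other rules).

```typescript
// Hypothetical helpers, not part of @pipelinescore/cli.
// Ports as listed in this README; 11434 is the endpoint used in the Ollama example.
const DEFAULT_PORTS = {
  ollama: 11434,
  "lm-studio": 1234,
  "llama.cpp": 8080,
  "mlx-omni": 10240,
  litellm: 8000,
} as const;

type LocalServer = keyof typeof DEFAULT_PORTS;

// Assumption: a tag is one or more letters, digits, ".", "_" or "-".
const HARDWARE_TAG = /^[A-Za-z0-9._-]+$/;

function isValidHardwareTag(tag: string): boolean {
  return HARDWARE_TAG.test(tag);
}

// Build the value you would pass to --endpoint for a server on localhost.
function localEndpoint(server: LocalServer): string {
  return `http://localhost:${DEFAULT_PORTS[server]}`;
}

console.log(localEndpoint("lm-studio")); // http://localhost:1234
console.log(isValidHardwareTag("rtx-4090-24gb")); // true
console.log(isValidHardwareTag("rtx 4090")); // false (contains a space)
```

Checking the tag before a run avoids completing a full benchmark only to have the submission rejected for a malformed tag.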
## Quickstart — frontier API (BYOK)
```bash
ANTHROPIC_API_KEY=sk-... npx @pipelinescore/cli run \
--provider anthropic --model claude-opus-4-7 \
--user your-handle
```
Or pass `--provider openai`. **Your key never reaches our backend**; the CLI sends it directly to the provider. See [Privacy](https://pipelinescore.ai/privacy) for the full data flow.
## Why this leaderboard exists
Every other ranked LLM list ignores the rig:
| | Hardware-aware? | You can run it yourself? | Local-model coverage | Reproducible | Open source |
|---|:---:|:---:|:---:|:---:|:---:|
| **PipelineScore** | ✅ | ✅ | ✅ | ✅ | ✅ Apache 2.0 |
| LMArena | ❌ | ❌ (preference votes only) | partial | ❌ | partial |
| Artificial Analysis | ❌ | ❌ (centrally run) | partial | ❌ | ❌ |
| lm-evaluation-harness | ❌ | ✅ | ✅ | ✅ | ✅ MIT |
| MMLU / SWE-Bench / TerminalBench | ❌ | ✅ | ✅ | ⚠️ test set leaks fast | ✅ |
| OpenLLM Leaderboard (HF) | ❌ | ❌ | ✅ | ✅ | ✅ |
**The missing axis is the hardware tag.** The same Llama 4 on an M3 Max, an RTX 4090, and an A100 produces three very different real-world experiences. The same RTX 4090 with three different models produces three apples-to-apples comparisons. The benchmark is reproducible, the hardware tag is preserved, and the score lands on a public, searchable leaderboard at https://pipelinescore.ai/leaderboard/users.
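To make the two comparisons concrete, here is a small sketch that groups runs either by hardware tag (to compare models on one rig) or by model (to compare rigs for one model). The `Run` shape and the `groupRuns` helper are assumptions for illustration; the real leaderboard schema may differ, and the scores below are invented placeholders, not benchmark results.

```typescript
// Hypothetical record shape; not the actual leaderboard schema.
interface Run {
  model: string;
  hardwareTag: string;
  score: number; // headline score, 0-100
}

// Group runs by the chosen axis and sort each group best-first.
function groupRuns(runs: Run[], key: "model" | "hardwareTag"): Record<string, Run[]> {
  const groups: Record<string, Run[]> = {};
  for (const run of runs) {
    const k = run[key];
    if (!groups[k]) groups[k] = [];
    groups[k].push(run);
  }
  for (const k of Object.keys(groups)) {
    groups[k].sort((a, b) => b.score - a.score);
  }
  return groups;
}

// Placeholder data, invented for this example only.
const runs: Run[] = [
  { model: "model-a", hardwareTag: "m3-max-128gb", score: 70 },
  { model: "model-a", hardwareTag: "rtx-4090-24gb", score: 80 },
  { model: "model-b", hardwareTag: "rtx-4090-24gb", score: 65 },
];

// Same rig, different models: model-a ranks ahead of model-b here.
console.log(groupRuns(runs, "hardwareTag")["rtx-4090-24gb"].map((r) => r.model));

// Same model, different rigs.
console.log(groupRuns(runs, "model")["model-a"].map((r) => r.hardwareTag));
```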
## Architecture
```mermaid
flowchart LR
A["Your CLI<br/>npx @pipelinescore/cli"] -->|"HTTPS<br/>OpenAI-compat"| B["Your model server<br/>Ollama / LM Studio /<br/>MLX / llama.cpp / vLLM"]
A -->|"HTTPS POST<br/>score + transcripts"| C["api.pipelinescore.ai<br/>Express + SQLite<br/>on Render"]
C -->|read| D["Cloudflare Worker<br/>Next.js via OpenNext"]
D -->|HTTPS GET| E["pipelinescore.ai<br/>public leaderboard"]
F[Claude Code skill] -->|invokes| A
G["pipelinescore-mcp<br/>MCP server"] -->|invokes| A
G -->|reads| C
style A fill:#0F766E,color:#fff
style E fill:#0F766E,color:#fff
```
**Three integration paths** to drive the CLI:
1. **Manual** — copy/paste the `npx` command into your terminal
2. **Skill** — drop [`SKILL.md`](dist/skills/pipelinescore/SKILL.md) into `~/.claude/skills/` and your AI runs it for you
3. **MCP** — install [`@pipelinescore/mcp`](mcp/) and any MCP-compatible client (Claude Code, Cursor, Codex, Continue, Cline) gets the benchmark as a tool
**The backend never sees your API key.** With `--provider anthropic` or `--provider openai`, the CLI calls the provider directly. Only the score and transcripts (with API keys stripped) reach our backend. See [SECURITY.md](SECURITY.md) for the full posture.
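The stripping step is not shown in this README, so the following is only a sketch of how a redaction pass over a transcript could look. `redactKeys` and `KEY_PATTERN` are hypothetical names, and the pattern assumes keys start with `sk-`, as in the quickstart example; the actual CLI may use a different approach.

```typescript
// Hypothetical redaction pass; the real CLI's logic is not documented here.
// Assumption: keys look like "sk-" followed by at least 8 key characters.
const KEY_PATTERN = /\bsk-[A-Za-z0-9_-]{8,}/g;

function redactKeys(text: string, knownKeys: string[] = []): string {
  let out = text;
  // First remove any exact key values the caller knows about (e.g. from the environment).
  for (const key of knownKeys) {
    if (key.length > 0) out = out.split(key).join("[REDACTED]");
  }
  // Then catch anything that merely looks like a key.
  return out.replace(KEY_PATTERN, "[REDACTED]");
}

console.log(redactKeys("Authorization: Bearer sk-abc12345XYZ done"));
// Authorization: Bearer [REDACTED] done
```

Combining an exact-value pass with a pattern pass covers both keys the process already holds and key-shaped strings a model might echo back in its output.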
## The score
Five deterministic categories: code (executed), reason (exact-match), tool use and RAG (both JSON-match), and speed (measured throughput), weighted to mirror real LLM usage. The result is one headline number (0–100) with a category breakdown underneath. The score maps to one of five tiers (TRUNK / MAINLINE / FEEDER / TAP / DRIP) for at-a-glance readability.
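Exact-match and JSON-match grading need no judge model, which is what makes them deterministic. A minimal sketch of the two checks follows; the function names and the normalization choices (trimming whitespace, ignoring object key order) are assumptions for illustration, and the authoritative rules are in the methodology.

```typescript
// Hypothetical graders; the real comparison rules may differ.
function exactMatch(output: string, expected: string): boolean {
  // Assumption: leading and trailing whitespace is ignored.
  return output.trim() === expected.trim();
}

// Serialize a JSON value with object keys sorted, so key order does not affect equality.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => canonical(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

function jsonMatch(output: string, expected: unknown): boolean {
  try {
    return canonical(JSON.parse(output)) === canonical(expected);
  } catch {
    return false; // output that is not valid JSON counts as a miss
  }
}

console.log(exactMatch(" 42\n", "42")); // true
console.log(jsonMatch('{"b": 2, "a": 1}', { a: 1, b: 2 })); // true
console.log(jsonMatch("not json", {})); // false
```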
Full methodology + weights + anti-cheat: [pipelinescore.ai/methodology](https://pipelinescore.ai/methodology)
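As a rough picture of how category scores could roll up into a headline number and a tier, here is a sketch using a weighted average. The weights and tier cut-offs below are placeholders invented for this example (equal weights, arbitrary floors), and the assumption that the tiers run from TRUNK (highest) down to DRIP (lowest) follows the order they are listed in; the real values are on the methodology page.

```typescript
type Category = "code" | "reason" | "tool_use" | "rag" | "speed";
type Tier = "TRUNK" | "MAINLINE" | "FEEDER" | "TAP" | "DRIP";

// PLACEHOLDER weights (equal), not the published ones.
const WEIGHTS: Record<Category, number> = {
  code: 1,
  reason: 1,
  tool_use: 1,
  rag: 1,
  speed: 1,
};

// PLACEHOLDER tier floors, highest first; real cut-offs are not stated in this README.
const TIER_FLOORS: Array<[number, Tier]> = [
  [90, "TRUNK"],
  [70, "MAINLINE"],
  [50, "FEEDER"],
  [30, "TAP"],
  [0, "DRIP"],
];

// Weighted average of category scores, rounded to one decimal place.
function headline(
  scores: Record<Category, number>,
  weights: Record<Category, number> = WEIGHTS,
): number {
  let total = 0;
  let weightSum = 0;
  for (const c of Object.keys(weights) as Category[]) {
    total += scores[c] * weights[c];
    weightSum += weights[c];
  }
  return Math.round((total / weightSum) * 10) / 10;
}

function tierFor(score: number): Tier {
  for (const [floor, tier] of TIER_FLOORS) {
    if (score >= floor) return tier;
  }
  return "DRIP";
}

// Category scores from the sample CLI output in this README.
const sample = { code: 79.1, reason: 75.8, tool_use: 61.4, rag: 82.6, speed: 52.3 };
const score = headline(sample);
console.log(score, tierFor(score));
```

With equal placeholder weights the sample comes out at 70.2, not the 78.4 shown in the sample output, which is a reminder that the published weights are not equal and should be taken from the methodology.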
## Deeper documentation
This README is the front door. For specifics:
| | Where |
|---|---|
| 🤖 LLM-first usage guide | [AGENTS.md](AGENTS.md) |
| 🛠️ Local dev setup (backend + web + CLI) | [DEVELOPMENT.md](DEVELOPMENT.md) |
| 🛡️ BYOK posture + retention policy | [SECURITY.md](SECURITY.md) + [pipelinescore.ai/privacy](https://pipelinescore.ai/privacy) |
| 🧮 How scores are computed + anti-cheat | [pipelinescore.ai/methodology](https://pipelinescore.ai/methodology) |
| 🤝 Contributing | [CONTRIBUTING.md](CONTRIBUTING.md) |
| 📜 Changelog | [CHANGELOG.md](CHANGELOG.md) |
| 🗣️ Code of conduct | [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) |
## Contributing
We need help with:
- **More benchmark tasks** — submit a PR with a task in `benchmarks/tasks-v1.json`
- **More local server endpoints** — vLLM, TGI, Ramalama, anything OpenAI-compatible
- **Hardware tag suggestions** — common rigs we're missing in [seed-local-models.ts](backend/src/seed-local-models.ts)
- **Bug reports** — file an issue with the failing nickname / model / hardware combo
See [CONTRIBUTING.md](CONTRIBUTING.md) for the workflow + [SECURITY.md](SECURITY.md) for the BYOK posture.
## Star History
If this repo is useful to you, a star is the easiest signal to send. It helps surface PipelineScore to other devs running local models.
## License
[Apache 2.0](LICENSE).
## Authors
Drew Mattie · SaaSquach AI Labs (a division of Charles & Roe Inc.)