https://github.com/mohitsoni48/turbollm

Run any local LLM engine, auto-tuned to your GPU — polished web UI + OpenAI/Anthropic-compatible API. Point Claude Code at your own machine in one command. No Electron, no Python, offline-first.

ai anthropic-api claude-code gguf gpu inference llama-cpp llama-server llm local-llm offline openai-api self-hosted

Last synced: 27 days ago

Host: GitHub
URL: https://github.com/mohitsoni48/turbollm
Owner: mohitsoni48
Created: 2026-06-12T14:41:29.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-06-24T04:34:36.000Z (about 1 month ago)
Last Synced: 2026-06-24T06:22:44.125Z (about 1 month ago)
Topics: ai, anthropic-api, claude-code, gguf, gpu, inference, llama-cpp, llama-server, llm, local-llm, offline, openai-api, self-hosted
Language: TypeScript
Homepage: https://www.npmjs.com/package/turbollm
Size: 5.6 MB
Stars: 55
Watchers: 0
Forks: 8
Open Issues: 1
Metadata Files:
- Readme: README.md
- Codeowners: .github/CODEOWNERS

# TurboLLM

Run any local LLM engine, auto-tuned to your GPU — with a polished web UI
and an OpenAI/Anthropic-compatible API.

Bring your own llama.cpp fork. No compiling. No Electron. No Python. Point Claude Code at
your own machine in one command — fully offline.

```bash
npx turbollm
```

That one command starts a local daemon, opens a browser UI, and serves your models over an
API any tool can talk to. TurboLLM is the **performance & bleeding-edge layer for local
LLMs** — built for people who today hand-compile forks and hunt forums for the right flags.

How TurboLLM works: clients -> one lightweight daemon -> any engine on your GPU

---

## Contents

- [Why TurboLLM](#why-turbollm)
- [Speed: TurboLLM vs LM Studio](#speed-turbollm-vs-lm-studio)
- [Features](#features)
- [Quick start](#quick-start)
- [⭐ Bring any engine — the headline feature](#-bring-any-engine--the-headline-feature)
- [Run Claude Code on your own GPU](#run-claude-code-on-your-own-gpu)
- [Use it from any device on your network](#use-it-from-any-device-on-your-network)
- [Command-line reference](#command-line-reference)
- [Configuration & data](#configuration--data)
- [Requirements](#requirements)
- [Privacy](#privacy)
- [How TurboLLM compares](#how-turbollm-compares)
- [Troubleshooting](#troubleshooting)
- [Develop from source](#develop-from-source)
- [Community](#community)
- [License](#license)

---

## Why TurboLLM

Local-LLM tools make two choices for you, and both cost you performance:

1. **They pick the engine.** LM Studio ships one blessed runtime; Ollama hides the engine
entirely. The fastest community innovations — new quant formats, speculative decoding,
low-bit KV cache — land in **forks** first, and you can't use them without compiling.
2. **They don't tell you what speed to expect**, and they don't tune the dozens of launch
flags (`-c`, `-ngl`, `--n-cpu-moe`, KV type, threads, flash-attn, draft models) that make
the difference between 20 and 80 tokens/sec on the *same* hardware.

TurboLLM does the opposite:

- **🔌 Any engine, including forks.** Point it at any `llama-server`-compatible binary — a
build you compiled, a community fork, or the one it auto-provisions for your GPU. It probes
the binary's real capabilities and adapts the UI to them. **This is the whole point.**
- **⚡ Auto-tuned to your hardware.** It benchmarks on load, derives fast defaults, and shows
a **VRAM-fit verdict before you load** — no more flag guessing.
- **📊 Real tokens/sec, never faked.** Speed in the model list is *measured on your machine*
from actual generation — live while you chat, and remembered per model.
- **🪶 Lightweight.** A ~0.3 MB npm package on Node — **no Electron, no bundled Chromium, no
Python**. It downloads only the engine your GPU actually needs (Vulkan ≈ 38 MB).
- **🔌 Drop-in APIs.** OpenAI **and** Anthropic-compatible — so Claude Code and every existing
tool work unchanged.
- **🔀 A gateway that loads models for you.** Name any model in your API request and TurboLLM
loads it on demand, keeping your favorites hot in a small pool — so an agent that hops between
models just works, with nothing to pre-wire.
- **🔒 Offline-first & private.** No account, no backend, no internet, **no telemetry.**

---

## Speed: TurboLLM vs LM Studio

Same GPU (RTX 5070 Ti 16 GB), same model, same 200K context — measured generation speed.
**TurboLLM is faster than LM Studio on the very same official llama.cpp, and faster still when you
run a community fork LM Studio can't.**

**① On official llama.cpp, TurboLLM is faster.** It auto-provisions a GPU-native engine build (CUDA
13 for Blackwell here) and tunes expert-offload to the layer, so at the *same* KV-cache quant it
beats LM Studio's bundled runtime:

| Qwen3.6-35B-A3B · 200K | TurboLLM | LM Studio | Speed-up |
|---|:---:|:---:|:---:|
| official llama.cpp — `q4_0` | **74.7 t/s** | 61.0 t/s | **1.2×** |
| official llama.cpp — `q8_0` | **72.3 t/s** | ~66 t/s\* | **1.1×** |

**② Run a faster engine and pull far ahead.** Because TurboLLM runs *any* engine, you can drop in
the **TurboQuant** fork — a llama.cpp fork with a low-bit `turbo4` KV cache that LM Studio simply
can't load — in one click. On a large-KV model it delivers `q8_0`-level quality at **more than
double the speed**:

| Qwen3.6-27B · 200K · matched quality | TurboLLM + TurboQuant | LM Studio | Speed-up |
|---|:---:|:---:|:---:|
| `turbo4` vs `q8_0` | **24.6 t/s** | 11.4 t/s | **2.2×** |

Same run, **1.7× faster prefill** too (1288 vs 757 tok/s).

\*LM Studio's `q8_0` mildly spilled VRAM at its best offload. A low-bit KV cache helps most
when the cache is large; TurboLLM's auto-tuner and on-screen measured t/s pick the fastest engine +
config for each model, so you don't have to.
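
The speed-up column is just the ratio of the two measured rates, rounded to one decimal place. The sketch below redoes that arithmetic with the figures from the tables; the `speedup` helper and the row labels are illustrative names, and the `66` is the approximate (`~66`) value from the table.

```typescript
// Ratio of two throughput figures, rounded to one decimal place,
// matching how the tables above report the speed-up.
function speedup(fasterTps: number, slowerTps: number): number {
  return Math.round((fasterTps / slowerTps) * 10) / 10;
}

// Figures copied from the tables and the prefill note above.
const rows = [
  { label: "35B-A3B q4_0", turbo: 74.7, lmStudio: 61.0 },
  { label: "35B-A3B q8_0", turbo: 72.3, lmStudio: 66 }, // LM Studio value is approximate
  { label: "27B turbo4 vs q8_0", turbo: 24.6, lmStudio: 11.4 },
  { label: "27B prefill", turbo: 1288, lmStudio: 757 },
];

const speedups = rows.map((r) => `${r.label}: ${speedup(r.turbo, r.lmStudio)}x`);
// speedups holds "35B-A3B q4_0: 1.2x", "35B-A3B q8_0: 1.1x",
// "27B turbo4 vs q8_0: 2.2x", "27B prefill: 1.7x"
```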

---

## Features

The headline, **[running any engine, including community forks](#-bring-any-engine--the-headline-feature)**,
has its own section below. Everything else is grouped here by area:

### 📦 Models: bring your own, or browse Hugging Face

- **Use the folders you already have.** Point TurboLLM at any directory of GGUFs — your
existing LM Studio / Ollama / manual downloads — **no re-downloading.** It parses GGUF
metadata (arch, params, quant, context, vision) for every file.
- **Browse & download from Hugging Face**, in-app: search, see the file tree, pick a quant,
and download with **resume + SHA-256 verification**. Gated models (Llama, Gemma) work via
your own HF token, which **never leaves your machine**.
- **Import from any URL** — not just Hugging Face. Paste a direct `.gguf` link (model-author
sites, mirrors, private servers); it disk-space-checks and downloads through the same manager.
- **Quant recommendation per GPU** and a **VRAM-fit verdict** so you pick a quant that
actually fits before you commit.
- **Primary download folder**, real-time **measured t/s per model**, and **delete-from-disk**.
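
To make the VRAM-fit idea concrete, here is a rough sketch of the kind of estimate such a verdict could rest on. It is not TurboLLM's actual algorithm: the `ModelShape` fields, the 10% headroom, and the example model dimensions are all assumptions for illustration. The KV-cache formula is the standard one for transformer attention (keys and values, per layer, per KV head, per token).

```typescript
// Hypothetical description of a model; the values would come from GGUF metadata.
interface ModelShape {
  fileSizeBytes: number; // GGUF size on disk, used here as the weight footprint
  layers: number;        // transformer blocks that keep a KV cache
  kvHeads: number;       // key/value heads per layer
  headDim: number;       // dimension of each head
}

// KV cache bytes = 2 (K and V) * layers * kvHeads * headDim * context * bytes per element.
function kvCacheBytes(m: ModelShape, contextTokens: number, bytesPerElement: number): number {
  return 2 * m.layers * m.kvHeads * m.headDim * contextTokens * bytesPerElement;
}

type Verdict = "fits" | "tight" | "too large";

// Assumption: keep 10% of VRAM free for compute buffers before calling it a clean fit.
function vramVerdict(
  m: ModelShape,
  contextTokens: number,
  bytesPerElement: number,
  vramBytes: number,
): Verdict {
  const needed = m.fileSizeBytes + kvCacheBytes(m, contextTokens, bytesPerElement);
  if (needed <= vramBytes * 0.9) return "fits";
  if (needed <= vramBytes) return "tight";
  return "too large";
}

const GiB = 1024 * 1024 * 1024;
// Made-up example: a 5 GiB file, 32 layers, 8 KV heads of dimension 128.
const example: ModelShape = { fileSizeBytes: 5 * GiB, layers: 32, kvHeads: 8, headDim: 128 };
const cacheAt8k = kvCacheBytes(example, 8192, 2); // f16 cache (2 bytes per element): exactly 1 GiB
const verdictOn8GiB = vramVerdict(example, 8192, 2, 8 * GiB);
```

Halving `bytesPerElement` (a lower-bit KV cache) halves the cache term, which is why the cache quant matters most at long context.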

### ⚡ Auto-tuning & performance

- **Auto-benchmark on load** derives fast defaults for your exact GPU.
- **Recommended sampling from the model card** — auto-tune reads the model's Hugging Face card
(falling back to the original model behind a requant) and prefills the author's recommended
`temperature / top_k / top_p / min_p`. No recommendation → your sampling is left untouched.
- **Real measured tokens/sec** in the model list — **live** while generating, **last-session**
when idle (never a synthetic estimate).
- **Full load-parameter UI**, a superset of what other tools expose: context length, GPU offload
(`-ngl`), **MoE CPU-offload (`--n-cpu-moe`)**, parallel slots, **KV-cache quant type** (incl.
low-bit on supporting forks), CPU threads, flash attention, and **speculative decoding (NextN /
MTP / draft)**.
- **Fast by default:** flash attention on, NextN self-speculative decoding on for models that
carry a draft head, threads auto — safely gated to what your engine actually accepts.
- **Multi-GPU, per model** — split a model across cards (layer/row split + main-GPU pick on
llama.cpp, tensor-parallel on vLLM). Defaults are no-ops, so single-GPU rigs are untouched.
- **Saved per-model profiles** — tune once, and it loads that way every time.
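
As an illustration of what a saved profile boils down to, the sketch below turns a few load parameters into a `llama-server` argument list. The `LoadProfile` shape and `buildServerArgs` are hypothetical names, not TurboLLM's internals. The flags used (`-m`, `-c`, `-ngl`, `--n-cpu-moe`, `--cache-type-k`, `--cache-type-v`) are upstream llama.cpp spellings; forks may differ, which is the reason for probing each binary.

```typescript
// Hypothetical per-model profile; field names are illustrative.
interface LoadProfile {
  modelPath: string;
  contextLength: number;
  gpuLayers: number;     // layers offloaded to the GPU (-ngl)
  cpuMoeLayers?: number; // MoE layers whose experts stay on the CPU (--n-cpu-moe)
  kvCacheType?: string;  // e.g. "q8_0"; applied to both K and V here
}

function buildServerArgs(p: LoadProfile): string[] {
  const args = ["-m", p.modelPath, "-c", String(p.contextLength), "-ngl", String(p.gpuLayers)];
  if (p.cpuMoeLayers !== undefined) {
    args.push("--n-cpu-moe", String(p.cpuMoeLayers));
  }
  if (p.kvCacheType !== undefined) {
    args.push("--cache-type-k", p.kvCacheType, "--cache-type-v", p.kvCacheType);
  }
  return args;
}

// Made-up profile for a model file at a placeholder path.
const serverArgs = buildServerArgs({
  modelPath: "models/example.gguf",
  contextLength: 32768,
  gpuLayers: 99,
  cpuMoeLayers: 12,
  kvCacheType: "q8_0",
});
```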

### 💬 Chat & agentic tools: a genuinely good UI, not an afterthought

- **Streaming** with a **stop** button, **live tokens/sec**, **prompt-processing %** and
**prefill t/s**, **time-to-first-token**, **total time**, exact **token counts**, and a
**context-usage meter** (filled / max) on every reply.
- **Thinking control** — toggle reasoning **off** for a direct answer, or leave it **on** with
collapsible, timed "thought for N s" blocks.
- **Markdown + syntax-highlighted code** with one-click copy — plus **inline Unicode charts**
the model draws when a comparison, trend, or hierarchy is genuinely worth a visual.
- **Live artifacts** — `html`, `svg`, and `mermaid` replies render as **sandboxed, offline
previews** shown as an image, with one-click export to **PNG / JPEG / SVG / animated GIF / HTML**.
- **Personas** — pick a style (Default · **Designer** · Concise · Detailed · Blunt · Formal · Tutor ·
Creative · Research) per conversation, no prompt-wrangling required. The **Designer** persona
produces polished, self-contained, previewable designs by default.
- **Edit, regenerate, delete, copy** any message; **persistent, searchable conversations**
with rename, delete, and **auto-generated titles**.
- **Per-chat system prompt** and **per-chat sampling** overrides — temperature, top-p/k, min-p,
repeat/presence/frequency penalties, and **stop strings**.
- **Image input** for vision models, and **TurboLLM Expert** — a built-in assistant that knows
the app and your hardware for onboarding and troubleshooting without leaving the UI.
- **Agentic tools** — built-in `web_search` (Tavily), `fetch_url`, and sandboxed `run_code`, plus
an **MCP marketplace** in Customize: one-click connect for hosted MCPs (GitHub, Linear, Stripe,
Atlassian, Neon, Supabase, Cloudflare, Zapier, Apify, Mixpanel) and open-source local MCPs
(filesystem, git, postgres, playwright, …), plus your own custom servers. Connected tools appear
in every chat with no restart. A **Research** persona forces multi-step web search and cites sources inline.

### 🤖 Background agents: long-running tasks that don't tie up your chat

- **Launch an agent and walk away.** The **Agents** screen runs tasks in the daemon, separate from
the chat tab — describe the task, pick which tools it may use (web search / fetch URL / run code),
and let it work.
- **Live, reconnectable progress.** Watch the run stream in real time; navigate away or reload and
the view **reconnects** to the in-progress output. Runs **queue** behind any active run and
**persist** across restarts.
- **Cancel anytime**, and review completed runs (messages + the tool calls they made) later.

### 🔌 APIs & integrations: OpenAI + Anthropic, plus a model-loading gateway

With a model loaded, TurboLLM serves two compatible APIs on the same port:

```bash
# OpenAI-compatible
curl http://127.0.0.1:6996/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"hello"}]}'
```

- **OpenAI-compatible** `/v1/chat/completions`, `/v1/embeddings`, … — point any OpenAI client
or tool at it. Embedding models are auto-detected and pooled separately, so a RAG pipeline and
a chat model can stay loaded side by side.
- **Anthropic-compatible** `/v1/messages`, including **tool use and streaming**, which powers
Claude Code below.
- **Structured output** — constrain any response to a **GBNF grammar** (or JSON shape).
- **API-key auth** you can require when sharing over a LAN (Settings → Network).
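
For the Anthropic-compatible side, a request might be assembled as in the sketch below. The body follows the public Anthropic Messages API shape (`model`, `max_tokens`, `messages`); the `anthropic-version` and `x-api-key` headers are what Anthropic clients normally send, and whether TurboLLM inspects them is an assumption. `buildMessagesRequest` and `HttpRequest` are illustrative names.

```typescript
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a request for the /v1/messages endpoint.
function buildMessagesRequest(
  baseUrl: string,
  model: string,
  prompt: string,
  apiKey?: string,
): HttpRequest {
  const headers: Record<string, string> = {
    "content-type": "application/json",
    "anthropic-version": "2023-06-01", // assumption: sent for client compatibility
  };
  if (apiKey !== undefined) {
    headers["x-api-key"] = apiKey; // assumption: only relevant when API-key auth is on
  }
  return {
    url: `${baseUrl}/v1/messages`,
    method: "POST",
    headers,
    body: JSON.stringify({
      model,
      max_tokens: 256,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

const messagesRequest = buildMessagesRequest("http://127.0.0.1:6996", "local", "hello");
// To send it from Node 22:
//   const res = await fetch(messagesRequest.url, messagesRequest);
```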

**The gateway loads models for you.** Most local hosts make you load a model first, then call it.
TurboLLM's gateway reads the `model` field of any incoming request, **fuzzy-matches it to your
library, and loads it on the fly** if it isn't already running — then keeps up to **four models
hot** in an LRU pool so the next switch is instant. An agent (or Claude Code) that hops between a
coding model, a vision model, and an embedder just names each one and it works — no pre-wiring.
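
The gateway behaviour described above can be pictured with two small pieces: a name matcher and a least-recently-used pool. This is a minimal sketch under stated assumptions, not TurboLLM's implementation: the matching rule (case-insensitive, punctuation-insensitive substring) is a stand-in for whatever fuzzy matching the daemon really does, and the model file names are made up.

```typescript
// Lowercase and drop everything except letters and digits.
function normalize(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Stand-in fuzzy match: first library entry whose normalized name contains the request.
function resolveModel(requested: string, library: string[]): string | undefined {
  const wanted = normalize(requested);
  return library.find((entry) => normalize(entry).includes(wanted));
}

// LRU pool: a Set keeps insertion order, so the first element is the least recently used.
class ModelPool {
  private readonly capacity: number;
  private readonly hot = new Set<string>();

  constructor(capacity = 4) {
    this.capacity = capacity;
  }

  // Marks a model as used; returns the model evicted to make room, if any.
  use(model: string): string | undefined {
    let evicted: string | undefined;
    if (this.hot.has(model)) {
      this.hot.delete(model); // re-added below as most recently used
    } else if (this.hot.size >= this.capacity) {
      evicted = Array.from(this.hot)[0];
      if (evicted !== undefined) this.hot.delete(evicted);
    }
    this.hot.add(model);
    return evicted;
  }

  loaded(): string[] {
    return Array.from(this.hot);
  }
}

const library = ["Llama-3-8B-Instruct.gguf", "Qwen3-8B-Q4_K_M.gguf"]; // hypothetical files
const matched = resolveModel("qwen3-8b", library);

const pool = new ModelPool(4);
["a", "b", "c", "d", "a"].forEach((m) => pool.use(m));
const evictedByFifth = pool.use("e"); // "b" is now the least recently used
```

With a capacity of four, the fifth distinct model pushes out whichever one was touched longest ago, so a model an agent keeps returning to stays loaded.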

### 🎨 Share the GPU with ComfyUI

If you run **ComfyUI** on the same GPU, an LLM holding VRAM while ComfyUI renders means both
fight for memory (and one usually OOMs). TurboLLM can hand the GPU over automatically:

- The instant ComfyUI starts a render, TurboLLM **unloads its model and pauses new loads**.
- When ComfyUI's queue drains, TurboLLM **reloads the exact model it unloaded**.

It's **push-based, not polling** — ComfyUI signals TurboLLM the moment a job starts/ends, so the
handoff is immediate and deterministic (the model is gone *before* ComfyUI executes).

**One-time setup** (Settings → ComfyUI): turn on **Pause for ComfyUI**, enter your ComfyUI folder
(the one containing `custom_nodes`), click **Install gate** (it writes a small custom node wired to
this daemon), then **restart ComfyUI** once. The panel shows a live indicator (rendering / idle /
connected); **Remove** undoes it.
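
The handoff logic is small enough to sketch as an event handler. The event names `render-start` and `queue-empty`, and the `GpuHandoff` class, are hypothetical; the real custom node and daemon may signal differently. What the sketch shows is the push-based shape: state changes only when an event arrives, and the unloaded model is remembered so the same one comes back.

```typescript
type GateEvent = "render-start" | "queue-empty";

class GpuHandoff {
  private loadedModel: string | null;
  private parkedModel: string | null = null;
  private paused = false;

  constructor(loadedModel: string | null) {
    this.loadedModel = loadedModel;
  }

  onEvent(event: GateEvent): void {
    if (event === "render-start" && !this.paused) {
      // Free the GPU and remember what was loaded.
      this.parkedModel = this.loadedModel;
      this.loadedModel = null;
      this.paused = true;
    } else if (event === "queue-empty" && this.paused) {
      // Restore exactly the model that was unloaded.
      this.loadedModel = this.parkedModel;
      this.parkedModel = null;
      this.paused = false;
    }
  }

  // New loads are refused while ComfyUI is rendering.
  canLoad(): boolean {
    return !this.paused;
  }

  current(): string | null {
    return this.loadedModel;
  }
}

const handoff = new GpuHandoff("model-a"); // placeholder model name
handoff.onEvent("render-start");
const duringRender = { model: handoff.current(), canLoad: handoff.canLoad() };
handoff.onEvent("queue-empty");
const afterRender = { model: handoff.current(), canLoad: handoff.canLoad() };
```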

### 🪶 Platform: tiny, offline, private

- A **~0.3 MB npm package** on Node — no Electron, no bundled Chromium, no Python.
- **Offline-first** — no account, no backend, no internet, no telemetry.
- **Windows · macOS · Linux**, with a CPU fallback when there's no GPU.

---

## Quick start

```bash
# run without installing (recommended for first try)
npx turbollm

# or install globally
npm install -g turbollm
turbollm
```

**On first run** the daemon:

1. Detects your GPU and **downloads a matching `llama-server` build** (CUDA for NVIDIA, ROCm
for AMD, Metal for Apple, SYCL for Intel, Vulkan otherwise — with a CPU fallback).
2. Starts on port `6996` by default and opens your browser.
3. Drops you on the **Chat** screen, ready to load a model.

Then open **Models**, download or pick a GGUF, click **Load**, and start chatting. Stop the
daemon any time with **Ctrl+C**.
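
The backend choice in step 1 amounts to a lookup from GPU vendor to engine build. A minimal sketch of that mapping follows; the `Vendor` labels and `pickBackend` are illustrative, and real detection is certainly more involved (driver versions, multiple GPUs, and so on).

```typescript
type Vendor = "nvidia" | "amd" | "apple" | "intel" | "other" | "none";
type Backend = "CUDA" | "ROCm" | "Metal" | "SYCL" | "Vulkan" | "CPU";

// Mirrors the rule stated above: vendor-specific build where one exists,
// Vulkan for any other GPU, CPU when there is no GPU at all.
function pickBackend(vendor: Vendor): Backend {
  switch (vendor) {
    case "nvidia":
      return "CUDA";
    case "amd":
      return "ROCm";
    case "apple":
      return "Metal";
    case "intel":
      return "SYCL";
    case "other":
      return "Vulkan";
    case "none":
      return "CPU";
  }
}

const backendForNvidia = pickBackend("nvidia");
const backendWithoutGpu = pickBackend("none");
```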

---

## ⭐ Bring any engine — the headline feature

No other local-LLM app lets you run **whatever inference engine you want**. TurboLLM treats
the engine as a swappable component.

**Add a custom engine** (Engines screen → **Add engine**):

1. Compile or download any `llama-server`-compatible binary — stock
[llama.cpp](https://github.com/ggml-org/llama.cpp), a community fork, or your own build.
2. Point TurboLLM at the **folder** — it scans for the `llama-server` binary, runs a
**capability probe**, and learns exactly which flags and features that build supports.
*(Optional: paste the source repo URL so TurboLLM flags when a newer build ships.)*
3. Activate it. The load-parameter UI **adapts to that engine** — features the build doesn't
support are hidden; ones it adds (e.g. low-bit KV cache, NextN) light up.
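
One plausible way to probe a binary is to read its `--help` output and check which flags it advertises. The sketch below assumes that approach; how TurboLLM actually probes is not documented here. The feature names are made up, the flags are upstream llama.cpp spellings, and the help text is an abbreviated, illustrative sample rather than real output.

```typescript
// Feature name (hypothetical) paired with the flag that signals support for it.
const FEATURE_FLAGS: [string, string][] = [
  ["moeCpuOffload", "--n-cpu-moe"],
  ["kvCacheType", "--cache-type-k"],
  ["draftModel", "--model-draft"],
];

// Splits help text into tokens and checks each flag for an exact match,
// so "--cache-type-k" is not confused with a longer flag that starts the same way.
function probeCapabilities(helpText: string): Record<string, boolean> {
  const tokens = new Set(helpText.split(/[\s,]+/));
  const result: Record<string, boolean> = {};
  for (const [feature, flag] of FEATURE_FLAGS) {
    result[feature] = tokens.has(flag);
  }
  return result;
}

// Illustrative sample only; a real binary prints far more.
const sampleHelp = [
  "-c, --ctx-size N          size of the prompt context",
  "-ngl, --n-gpu-layers N    number of layers to store in VRAM",
  "--n-cpu-moe N             keep MoE weights of the first N layers on the CPU",
].join("\n");

const capabilities = probeCapabilities(sampleHelp);
```

A UI can then hide any control whose feature maps to `false`, which matches the adaptive behaviour described in step 3.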

No prebuilt for your OS? The **build-from-source guide** checks your toolchain (git / CMake /
CUDA / MSVC), hands you the exact build commands, then drops you into the folder scan above.

**Auto-provisioned default.** Don't want to fetch anything? On first run TurboLLM downloads
the right upstream prebuilt for your GPU automatically — and a **backend picker** lets you
switch between CUDA / ROCm / Metal / SYCL / Vulkan / CPU at any time (it downloads the variant
you choose, LM Studio-style).

**Engine types.** **llama.cpp / GGUF**, **KoboldCpp** and **llamafile** (GGUF, every OS),
**MLX** (macOS), and **vLLM** (Linux + NVIDIA) are all first-class engine kinds — install from
the curated catalog, pick the right one per model, and switch from a single dropdown.

**Fully supervised.** Every engine runs under a real state machine: health-gated readiness,
graceful stop, an **idle auto-stop** watchdog, and **live logs + clear error surfacing** in
the UI when something fails to load.
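
A supervised engine lifecycle like this is often written as a transition table. The sketch below is one way to express it; the state and event names are assumptions chosen to match the description (health-gated readiness, graceful stop, idle auto-stop), not names taken from the TurboLLM source.

```typescript
type EngineState = "stopped" | "starting" | "ready" | "stopping" | "failed";
type EngineEvent = "start" | "health-ok" | "stop" | "idle-timeout" | "exited" | "error";

// Allowed transitions; anything not listed leaves the state unchanged.
const transitions: Record<EngineState, Partial<Record<EngineEvent, EngineState>>> = {
  stopped: { start: "starting" },
  // "ready" is only reachable through a passing health check.
  starting: { "health-ok": "ready", stop: "stopping", error: "failed" },
  // Both a user stop and the idle watchdog go through the same graceful path.
  ready: { stop: "stopping", "idle-timeout": "stopping", error: "failed" },
  stopping: { exited: "stopped" },
  failed: { start: "starting" },
};

function nextState(state: EngineState, event: EngineEvent): EngineState {
  return transitions[state][event] ?? state;
}

// Walk a typical session: start, become healthy, go idle, exit.
const session: EngineEvent[] = ["start", "health-ok", "idle-timeout", "exited"];
const finalState = session.reduce<EngineState>(nextState, "stopped");
```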

> Why it matters: fork-exclusive features — **speculative decoding (NextN / MTP / draft)**,
> low-bit KV cache, new quant formats — are usable on day 0, with **zero compiler knowledge**
> on your part beyond producing the binary (and often not even that).

---

## Run Claude Code on your own GPU

TurboLLM's Anthropic-compatible endpoint means [Claude
Code](https://www.npmjs.com/package/@anthropic-ai/claude-code) can run against whatever model
you've loaded — no cloud key, fully offline. One command wires it up:

```bash
turbollm launch claude # auto-loads a model if none is running, then opens Claude Code
turbollm launch claude --model qwen3-8b # load a specific model first, then launch
```

It points Claude Code's `ANTHROPIC_BASE_URL` / `ANTHROPIC_MODEL` at TurboLLM and execs `claude`;
extra args are forwarded. If no model is loaded, it auto-loads your last-used one (or the first
in your library); `--model` picks a specific one by key or name. If `claude` isn't installed,
it tells you how. The in-app **Developer** screen also shows copy-paste env snippets for any
OpenAI- or Anthropic-compatible tool (Open WebUI, Kilo Code, opencode, …).
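
In effect, the launcher prepares an environment for the child process and then runs `claude`. A minimal sketch of the environment step follows. The two variable names come from the paragraph above; the `claudeEnv` helper and the exact base URL value (with or without a path suffix) are assumptions.

```typescript
// Returns a copy of the parent environment with the two overrides applied.
function claudeEnv(
  baseUrl: string,
  model: string,
  parent: Record<string, string | undefined>,
): Record<string, string | undefined> {
  return { ...parent, ANTHROPIC_BASE_URL: baseUrl, ANTHROPIC_MODEL: model };
}

const childEnv = claudeEnv("http://127.0.0.1:6996", "qwen3-8b", { PATH: "/usr/bin" });
// A launcher could then start Claude Code with something like:
//   import { spawn } from "node:child_process";
//   spawn("claude", extraArgs, { env: childEnv, stdio: "inherit" });
```

Copying the parent environment first keeps `PATH` and everything else intact, so only the two Anthropic variables differ from a normal `claude` run.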

---

## Use it from any device on your network

The UI runs in the browser, so any phone, tablet, or laptop on your LAN can use the model on
your GPU box:

```bash
turbollm --addr 0.0.0.0:6996 # bind all interfaces, then open http://<host-ip>:6996 from another device
```

Turn on **Require API key** in Settings → Network when you expose it.
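
From another machine on the LAN, a client would call the same OpenAI-compatible endpoint and include the key. The sketch below assumes the key travels as an `Authorization: Bearer` header, which is what OpenAI clients send by default; check Settings → Network for the form TurboLLM actually expects. The IP address and key are placeholders.

```typescript
// Builds (but does not send) a chat request aimed at a TurboLLM box on the LAN.
function lanChatRequest(host: string, apiKey: string, prompt: string) {
  return {
    url: `http://${host}:6996/v1/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumption: standard OpenAI-style bearer auth
    },
    body: JSON.stringify({
      model: "local",
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

const lanRequest = lanChatRequest("192.168.1.50", "example-key", "hello");
// To send it from Node 22:
//   const res = await fetch(lanRequest.url, lanRequest);
```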

---

## Command-line reference

```bash
turbollm # start on :6996, open browser
turbollm --port 9000 # listen on a specific port
turbollm --no-open # start without opening a browser
turbollm --addr 0.0.0.0:6996 # bind all interfaces (LAN sharing)
turbollm --stop # stop a running daemon (any terminal)
turbollm launch claude # start Claude Code (auto-loads a model if none is running)
turbollm launch claude --model qwen3-8b # load a specific model, then launch
```

| Flag | Description |
|------|-------------|
| `--port <port>` | Listen on a specific port (default: `6996`) |
| `--addr <host:port>` | Full host:port override, e.g. `0.0.0.0:6996` for LAN sharing |
| `--no-open` | Start without opening a browser window |
| `--config <path>` | Path to a custom config file |
| `--stop` | Stop a running TurboLLM daemon (reads `~/.turbollm/daemon.pid`) and exit |
| `--help`, `-h` | Show usage and exit |

`turbollm launch claude` also accepts `--model <name>` to load a specific model before
launching; without it, an already-loaded model is used, or the last-used / first model is
auto-loaded.

---

## Configuration & data

Everything lives under **`~/.turbollm/`** on every OS — `config.json`, the SQLite chat
database, downloaded engines, models cache, and logs. Back it up or delete it to reset.
Use `--config <path>` to point at an alternate config (its directory becomes the data dir).

---

## Requirements

- **Node.js 22 or newer** — enforced at startup with a clear message.
- **Windows, macOS, or Linux.**
- A GPU is recommended but **not required** — a CPU build is provisioned as a fallback.
- On Windows, the first time the auto-downloaded `llama-server` runs, SmartScreen/Defender may
prompt (it's an upstream binary). Allow it once.
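
The startup check for the Node requirement can be as simple as comparing the major version. Here is a minimal sketch; `meetsNodeRequirement` is an illustrative helper, not the function TurboLLM ships.

```typescript
// Accepts strings like "22.3.0" or "v20.11.1" and compares the major version.
function meetsNodeRequirement(version: string, minMajor = 22): boolean {
  const major = parseInt(version.replace(/^v/, ""), 10);
  return !isNaN(major) && major >= minMajor;
}

const okOn22 = meetsNodeRequirement("22.3.0");
const okOn20 = meetsNodeRequirement("v20.11.1");
// In a CLI entry point this would typically guard startup:
//   if (!meetsNodeRequirement(process.versions.node)) {
//     console.error("TurboLLM requires Node.js 22 or newer");
//     process.exit(1);
//   }
```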

---

## Privacy

TurboLLM is **offline-first**: core local use needs no account, no backend, and no internet.
**No analytics or telemetry are collected.** Your prompts, chats, files, and keys never leave
your machine.

---

## How TurboLLM compares

Focused on the differences that matter — all four are good tools, and the others move fast.
Marks reflect mid-2026; verify the moving rows against each tool's current docs.

| | **TurboLLM** | LM Studio | Ollama | Open WebUI |
|---|:---:|:---:|:---:|:---:|
| Run **any engine / community forks** | ✅ | ❌ llama.cpp/MLX only | ❌ hidden | ❌ frontend |
| **Benchmark-based auto-tune** of launch flags | ✅ | ◐ basic offload | ◐ basic offload | ❌ |
| **Measured** t/s in the model list | ✅ | ◐ per-run | ◐ `--verbose` | ❌ |
| **Anthropic** API (`/v1/messages`) → Claude Code | ✅ | ✅ 0.4.1+ | ✅ v0.14+ | ❌ |
| OpenAI-compatible API | ✅ | ✅ | ✅ | ◐ proxy |
| Auto-load the requested model / multi-model pool | ✅ | ✅ JIT | ✅ | ❌ |
| Use existing model folders (no re-download) | ✅ | ◐ import | ◐ import | ❌ frontend |
| Speculative decoding (draft / MTP) | ✅ | ✅ | ◐ env flag | ❌ |
| Web UI from any LAN device | ✅ | ❌ | ❌ | ✅ |
| **Lightweight** (no Electron / no Python) | ✅ npm | ❌ Electron | ✅ Go | ❌ Python |
| Offline-first · **no telemetry** | ✅ | ◐ analytics on by default | ✅ | ✅ |

LM Studio and Ollama both added Anthropic `/v1/messages` endpoints in 2026, so the API rows are
now parity — Claude Code works against any of them. TurboLLM's durable edges are **any engine
including community forks**, **benchmark-based auto-tuning with a VRAM-fit verdict + measured t/s
before you commit**, and **zero telemetry**.

Prefer Open WebUI's chat breadth? It works great pointed at TurboLLM's OpenAI endpoint.

---

## Troubleshooting

- **`TurboLLM requires Node.js 22 or newer`**: upgrade Node to version 22 or later.
- **Model won't load / OOM** — pick a smaller quant (the VRAM verdict warns you), lower GPU
offload, or close other GPU apps. Failures surface in the Engines screen with the engine log.
- **Windows Defender / SmartScreen prompt** — that's the upstream `llama-server` binary on
first run; allow it once.
- **Port already in use** — `turbollm --port 9000`.
- **Slow generation** — open the model's load params; ensure GPU offload is high and flash
attention / NextN are on for supported models.

---

## Develop from source

```bash
npm install # daemon deps
cd web && npm install && cd ..

npm run build:web # build the React UI -> src/webdist
npm run start # run the daemon in dev (hot TS via tsx) -> :6996

npm run build # production bundle -> dist/cli.js (web assets included)
node dist/cli.js --port 6996
```

Frontend hot-reload: `cd web && npm run dev` (proxies `/api` and `/v1` to the daemon on
:6996).

**Stack:** Node ≥22 · TypeScript · Hono · `node:sqlite` · tsup — and a React 19 + Tailwind v4 +
shadcn/ui frontend. One TypeScript codebase, shipped as an npm package.

---

## Community

Questions, ideas, and show-and-tell — join the [Discord](https://discord.gg/v6kRbV7nC).

---

## License

Source-available under the **Functional Source License 1.1 (Apache-2.0 future grant)** — SPDX
**`FSL-1.1-ALv2`**. Free for personal use, internal business use, education, and research; the
only restriction is shipping a competing product. Each release converts to Apache-2.0 two
years after it's published. Full text: [LICENSE.md](https://github.com/mohitsoni48/TurboLLM/blob/main/turbollm/LICENSE.md).

_Built for people who refuse to wait for the mainstream to bless the fast path. ⚡_
