{"id":51250828,"url":"https://github.com/mohitsoni48/turbollm","last_synced_at":"2026-06-29T07:00:51.554Z","repository":{"id":364931831,"uuid":"1267439901","full_name":"mohitsoni48/TurboLLM","owner":"mohitsoni48","description":"Run any local LLM engine, auto-tuned to your GPU — polished web UI + OpenAI/Anthropic-compatible API. Point Claude Code at your own machine in one command. No Electron, no Python, offline-first.","archived":false,"fork":false,"pushed_at":"2026-06-24T04:34:36.000Z","size":5869,"stargazers_count":55,"open_issues_count":1,"forks_count":8,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-24T06:22:44.125Z","etag":null,"topics":["ai","anthropic-api","claude-code","gguf","gpu","inference","llama-cpp","llama-server","llm","local-llm","offline","openai-api","self-hosted"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/turbollm","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mohitsoni48.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-12T14:41:29.000Z","updated_at":"2026-06-24T06:18:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mohitsoni48/TurboLLM","commit_stats":null,"previous_names":["mohitsoni48/turbo-llm","mohitsoni48/turbollm"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/mohitsoni48/TurboLLM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/mohitsoni48%2FTurboLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohitsoni48%2FTurboLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohitsoni48%2FTurboLLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohitsoni48%2FTurboLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mohitsoni48","download_url":"https://codeload.github.com/mohitsoni48/TurboLLM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mohitsoni48%2FTurboLLM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34916411,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","anthropic-api","claude-code","gguf","gpu","inference","llama-cpp","llama-server","llm","local-llm","offline","openai-api","self-hosted"],"created_at":"2026-06-29T07:00:27.221Z","updated_at":"2026-06-29T07:00:51.533Z","avatar_url":"https://github.com/mohitsoni48.png","language":"TypeScript","funding_links":["https://ko-fi.com/mohitsoni","https://github.com/sponsors/mohitsoni48"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"https://raw.githubusercontent.com/mohitsoni48/TurboLLM/main/turbollm/web/public/brand/turbollm-icon-512.jpeg?v=2\" width=\"92\" height=\"92\" alt=\"TurboLLM\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eTurboLLM\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eRun \u003cem\u003eany\u003c/em\u003e local LLM engine, auto-tuned to your GPU — with a polished web UI\n  and an OpenAI/Anthropic-compatible API.\u003c/strong\u003e\u003cbr/\u003e\n  Bring your own llama.cpp fork. No compiling. No Electron. No Python. Point Claude Code at\n  your own machine in one command — fully offline.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.npmjs.com/package/turbollm\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/turbollm.svg?color=e2552e\" alt=\"npm version\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/turbollm\"\u003e\u003cimg src=\"https://img.shields.io/npm/dm/turbollm.svg?color=e2552e\" alt=\"npm downloads\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/node-%E2%89%A522-3c873a.svg\" alt=\"node \u003e= 22\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-FSL--1.1--ALv2-blue.svg\" alt=\"license\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/platform-Windows%20%C2%B7%20macOS%20%C2%B7%20Linux-555.svg\" alt=\"platforms\" /\u003e\n  \u003ca href=\"https://ko-fi.com/mohitsoni\"\u003e\u003cimg src=\"https://img.shields.io/badge/Ko--fi-support%20us-FF5E5B?logo=kofi\u0026logoColor=white\" alt=\"Ko-fi\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/sponsors/mohitsoni48\"\u003e\u003cimg src=\"https://img.shields.io/badge/GitHub%20Sponsors-support%20us-EA4AAA?logo=githubsponsors\u0026logoColor=white\" alt=\"GitHub Sponsors\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://discord.gg/v6kRbV7nC\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-join%20chat-5865F2?logo=discord\u0026logoColor=white\" alt=\"Discord\" 
/\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003c!-- Brand: shipped app icon web/public/brand/turbollm-icon-512.jpeg · high-res masters web/brand-assets/ (unshipped) · in-app mark web/src/components/Logo.tsx · favicon web/public/favicon.svg --\u003e\n\n```bash\nnpx turbollm\n```\n\nThat one command starts a local daemon, opens a browser UI, and serves your models over an\nAPI any tool can talk to. TurboLLM is the **performance \u0026 bleeding-edge layer for local\nLLMs** — built for people who today hand-compile forks and hunt forums for the right flags.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/mohitsoni48/TurboLLM/main/assets/how-it-works.svg?v=2\" width=\"860\" alt=\"How TurboLLM works: clients -\u003e one lightweight daemon -\u003e any engine on your GPU\" /\u003e\n\u003c/p\u003e\n\n---\n\n## Contents\n\n- [Why TurboLLM](#why-turbollm)\n- [Speed: TurboLLM vs LM Studio](#speed-turbollm-vs-lm-studio)\n- [Features](#features)\n- [Quick start](#quick-start)\n- [⭐ Bring any engine — the headline feature](#-bring-any-engine--the-headline-feature)\n- [Run Claude Code on your own GPU](#run-claude-code-on-your-own-gpu)\n- [Use it from any device on your network](#use-it-from-any-device-on-your-network)\n- [Command-line reference](#command-line-reference)\n- [Configuration \u0026 data](#configuration--data)\n- [Requirements](#requirements)\n- [Privacy](#privacy)\n- [How TurboLLM compares](#how-turbollm-compares)\n- [Troubleshooting](#troubleshooting)\n- [Develop from source](#develop-from-source)\n- [Community](#community)\n- [License](#license)\n\n---\n\n## Why TurboLLM\n\nLocal-LLM tools make two choices for you, and both cost you performance:\n\n1. **They pick the engine.** LM Studio ships one blessed runtime; Ollama hides the engine\n   entirely. The fastest community innovations — new quant formats, speculative decoding,\n   low-bit KV cache — land in **forks** first, and you can't use them without compiling.\n2. 
**They don't tell you what speed to expect**, and they don't tune the dozens of launch\n   flags (`-c`, `-ngl`, `--n-cpu-moe`, KV type, threads, flash-attn, draft models) that make\n   the difference between 20 and 80 tokens/sec on the *same* hardware.\n\nTurboLLM does the opposite:\n\n- **🔌 Any engine, including forks.** Point it at any `llama-server`-compatible binary — a\n  build you compiled, a community fork, or the one it auto-provisions for your GPU. It probes\n  the binary's real capabilities and adapts the UI to them. **This is the whole point.**\n- **⚡ Auto-tuned to your hardware.** It benchmarks on load, derives fast defaults, and shows\n  a **VRAM-fit verdict before you load** — no more flag guessing.\n- **📊 Real tokens/sec, never faked.** Speed in the model list is *measured on your machine*\n  from actual generation — live while you chat, and remembered per model.\n- **🪶 Lightweight.** A ~0.3 MB npm package on Node — **no Electron, no bundled Chromium, no\n  Python**. It downloads only the engine your GPU actually needs (Vulkan ≈ 38 MB).\n- **🔌 Drop-in APIs.** OpenAI **and** Anthropic-compatible — so Claude Code and every existing\n  tool work unchanged.\n- **🔀 A gateway that loads models for you.** Name any model in your API request and TurboLLM\n  loads it on demand, keeping your favorites hot in a small pool — so an agent that hops between\n  models just works, with nothing to pre-wire.\n- **🔒 Offline-first \u0026 private.** No account, no backend, no internet, **no telemetry.**\n\n---\n\n## Speed: TurboLLM vs LM Studio\n\nSame GPU (RTX 5070 Ti 16 GB), same model, same 200K context — measured generation speed.\n**TurboLLM is faster than LM Studio on the very same official llama.cpp, and faster still when you\nrun a community fork LM Studio can't.**\n\n**① On official llama.cpp, TurboLLM is faster.** It auto-provisions a GPU-native engine build (CUDA\n13 for Blackwell here) and tunes expert-offload to the layer, so at the *same* KV-cache quant 
it\nbeats LM Studio's bundled runtime:\n\n| Qwen3.6-35B-A3B · 200K | TurboLLM | LM Studio | Speed-up |\n|---|:---:|:---:|:---:|\n| official llama.cpp — `q4_0` | **74.7 t/s** | 61.0 t/s | **1.2×** |\n| official llama.cpp — `q8_0` | **72.3 t/s** | ~66 t/s\\* | **1.1×** |\n\n**② Run a faster engine and pull far ahead.** Because TurboLLM runs *any* engine, you can drop in\nthe **TurboQuant** fork — a llama.cpp fork with a low-bit `turbo4` KV cache that LM Studio simply\ncan't load — in one click. On a large-KV model it delivers `q8_0`-level quality at **more than\ndouble the speed**:\n\n| Qwen3.6-27B · 200K · matched quality | TurboLLM\u0026nbsp;+\u0026nbsp;TurboQuant | LM Studio | Speed-up |\n|---|:---:|:---:|:---:|\n| `turbo4` vs `q8_0` | **24.6 t/s** | 11.4 t/s | **2.2×** |\n\nSame run, **1.7× faster prefill** too (1288 vs 757 tok/s).\n\n\u003csub\u003e\\*LM Studio's `q8_0` mildly spilled VRAM at its best offload. A low-bit KV cache helps most\nwhen the cache is large; TurboLLM's auto-tuner and on-screen measured t/s pick the fastest engine +\nconfig for each model, so you don't have to.\u003c/sub\u003e\n\n---\n\n## Features\n\nThe headline — **[running any engine, including community forks](#-bring-any-engine--the-headline-feature)** —\nhas its own section below. Everything else is grouped here; each summary is the gist, expand for\nthe detail:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e📦 Models — bring your own, or browse Hugging Face\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\n- **Use the folders you already have.** Point TurboLLM at any directory of GGUFs — your\n  existing LM Studio / Ollama / manual downloads — **no re-downloading.** It parses GGUF\n  metadata (arch, params, quant, context, vision) for every file.\n- **Browse \u0026 download from Hugging Face**, in-app: search, see the file tree, pick a quant,\n  and download with **resume + SHA-256 verification**. 
Gated models (Llama, Gemma) work via\n  your own HF token, which **never leaves your machine**.\n- **Import from any URL** — not just Hugging Face. Paste a direct `.gguf` link (model-author\n  sites, mirrors, private servers); it disk-space-checks and downloads through the same manager.\n- **Quant recommendation per GPU** and a **VRAM-fit verdict** so you pick a quant that\n  actually fits before you commit.\n- **Primary download folder**, real-time **measured t/s per model**, and **delete-from-disk**.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e⚡ Auto-tuning \u0026amp; performance\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\n- **Auto-benchmark on load** derives fast defaults for your exact GPU.\n- **Recommended sampling from the model card** — auto-tune reads the model's Hugging Face card\n  (falling back to the original model behind a requant) and prefills the author's recommended\n  `temperature / top_k / top_p / min_p`. No recommendation → your sampling is left untouched.\n- **Real measured tokens/sec** in the model list — **live** while generating, **last-session**\n  when idle (never a synthetic estimate).\n- **Full load-parameter UI**, a superset of what other tools expose: context length, GPU offload\n  (`-ngl`), **MoE CPU-offload (`--n-cpu-moe`)**, parallel slots, **KV-cache quant type** (incl.\n  low-bit on supporting forks), CPU threads, flash attention, and **speculative decoding (NextN /\n  MTP / draft)**.\n- **Fast by default:** flash attention on, NextN self-speculative decoding on for models that\n  carry a draft head, threads auto — safely gated to what your engine actually accepts.\n- **Multi-GPU, per model** — split a model across cards (layer/row split + main-GPU pick on\n  llama.cpp, tensor-parallel on vLLM). 
Defaults are no-ops, so single-GPU rigs are untouched.\n- **Saved per-model profiles** — tune once, and it loads that way every time.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e💬 Chat \u0026amp; agentic tools — a genuinely good UI, not an afterthought\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\n- **Streaming** with a **stop** button, **live tokens/sec**, **prompt-processing %** and\n  **prefill t/s**, **time-to-first-token**, **total time**, exact **token counts**, and a\n  **context-usage meter** (filled / max) on every reply.\n- **Thinking control** — toggle reasoning **off** for a direct answer, or leave it **on** with\n  collapsible, timed \"thought for N s\" blocks.\n- **Markdown + syntax-highlighted code** with one-click copy — plus **inline Unicode charts**\n  the model draws when a comparison, trend, or hierarchy is genuinely worth a visual.\n- **Live artifacts** — `html`, `svg`, and `mermaid` replies render as **sandboxed, offline\n  previews** shown as an image, with one-click export to **PNG / JPEG / SVG / animated GIF / HTML**.\n- **Personas** — pick a style (Default · **Designer** · Concise · Detailed · Blunt · Formal · Tutor ·\n  Creative · Research) per conversation, no prompt-wrangling required. 
The **Designer** persona\n  produces polished, self-contained, previewable designs by default.\n- **Edit, regenerate, delete, copy** any message; **persistent, searchable conversations**\n  with rename, delete, and **auto-generated titles**.\n- **Per-chat system prompt** and **per-chat sampling** overrides — temperature, top-p/k, min-p,\n  repeat/presence/frequency penalties, and **stop strings**.\n- **Image input** for vision models, and **TurboLLM Expert** — a built-in assistant that knows\n  the app and your hardware for onboarding and troubleshooting without leaving the UI.\n- **Agentic tools** — built-in `web_search` (Tavily), `fetch_url`, and sandboxed `run_code`, plus\n  an **MCP marketplace** in Customize: one-click connect for hosted MCPs (GitHub, Linear, Stripe,\n  Atlassian, Neon, Supabase, Cloudflare, Zapier, Apify, Mixpanel) and open-source local MCPs\n  (filesystem, git, postgres, playwright, …), plus your own custom servers. Connected tools appear\n  in every chat with no restart. A **Research** persona forces multi-step web search and cites sources inline.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e🤖 Background agents — long-running tasks that don't tie up your chat\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\n- **Launch an agent and walk away.** The **Agents** screen runs tasks in the daemon, separate from\n  the chat tab — describe the task, pick which tools it may use (web search / fetch URL / run code),\n  and let it work.\n- **Live, reconnectable progress.** Watch the run stream in real time; navigate away or reload and\n  the view **reconnects** to the in-progress output. 
Runs **queue** behind any active run and\n  **persist** across restarts.\n- **Cancel anytime**, and review completed runs (messages + the tool calls they made) later.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e🔌 APIs \u0026amp; integrations — OpenAI + Anthropic, plus a model-loading gateway\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\nWith a model loaded, TurboLLM serves two compatible APIs on the same port:\n\n```bash\n# OpenAI-compatible\ncurl http://127.0.0.1:6996/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"local\",\"messages\":[{\"role\":\"user\",\"content\":\"hello\"}]}'\n```\n\n- **OpenAI-compatible** `/v1/chat/completions`, `/v1/embeddings`, … — point any OpenAI client\n  or tool at it. Embedding models are auto-detected and pooled separately, so a RAG pipeline and\n  a chat model can stay loaded side by side.\n- **Anthropic-compatible** `/v1/messages` — including **tool use and streaming** — which powers\n  Claude Code below.\n- **Structured output** — constrain any response to a **GBNF grammar** (or JSON shape).\n- **API-key auth** you can require when sharing over a LAN (Settings → Network).\n\n**The gateway loads models for you.** Most local hosts make you load a model first, then call it.\nTurboLLM's gateway reads the `model` field of any incoming request, **fuzzy-matches it to your\nlibrary, and loads it on the fly** if it isn't already running — then keeps up to **four models\nhot** in an LRU pool so the next switch is instant. 
An agent (or Claude Code) that hops between a\ncoding model, a vision model, and an embedder just names each one and it works — no pre-wiring.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e🎨 Share the GPU with ComfyUI\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\nIf you run **ComfyUI** on the same GPU, an LLM holding VRAM while ComfyUI renders means both\nfight for memory (and one usually OOMs). TurboLLM can hand the GPU over automatically:\n\n- The instant ComfyUI starts a render, TurboLLM **unloads its model and pauses new loads**.\n- When ComfyUI's queue drains, TurboLLM **reloads the exact model it unloaded**.\n\nIt's **push-based, not polling** — ComfyUI signals TurboLLM the moment a job starts/ends, so the\nhandoff is immediate and deterministic (the model is gone *before* ComfyUI executes).\n\n**One-time setup** (Settings → ComfyUI): turn on **Pause for ComfyUI**, enter your ComfyUI folder\n(the one containing `custom_nodes`), click **Install gate** (it writes a small custom node wired to\nthis daemon), then **restart ComfyUI** once. The panel shows a live indicator (rendering / idle /\nconnected); **Remove** undoes it.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cstrong\u003e🪶 Platform — tiny, offline, private\u003c/strong\u003e\u003c/summary\u003e\n\n\u003cbr/\u003e\n\n- A **~0.3 MB npm package** on Node — no Electron, no bundled Chromium, no Python.\n- **Offline-first** — no account, no backend, no internet, no telemetry.\n- **Windows · macOS · Linux**, with a CPU fallback when there's no GPU.\n\n\u003c/details\u003e\n\n---\n\n## Quick start\n\n```bash\n# run without installing (recommended for first try)\nnpx turbollm\n\n# or install globally\nnpm install -g turbollm\nturbollm\n```\n\n**On first run** the daemon:\n\n1. 
Detects your GPU and **downloads a matching `llama-server` build** (CUDA for NVIDIA, ROCm\n   for AMD, Metal for Apple, SYCL for Intel, Vulkan otherwise — with a CPU fallback).\n2. Starts on \u003chttp://127.0.0.1:6996\u003e and opens your browser.\n3. Drops you on the **Chat** screen, ready to load a model.\n\nThen open **Models**, download or pick a GGUF, click **Load**, and start chatting. Stop the\ndaemon any time with **Ctrl+C**.\n\n---\n\n## ⭐ Bring any engine — the headline feature\n\nNo other local-LLM app lets you run **whatever inference engine you want**. TurboLLM treats\nthe engine as a swappable component.\n\n**Add a custom engine** (Engines screen → **Add engine**):\n\n1. Compile or download any `llama-server`-compatible binary — stock\n   [llama.cpp](https://github.com/ggml-org/llama.cpp), a community fork, or your own build.\n2. Point TurboLLM at the **folder** — it scans for the `llama-server` binary, runs a\n   **capability probe**, and learns exactly which flags and features that build supports.\n   *(Optional: paste the source repo URL so TurboLLM flags when a newer build ships.)*\n3. Activate it. The load-parameter UI **adapts to that engine** — features the build doesn't\n   support are hidden; ones it adds (e.g. low-bit KV cache, NextN) light up.\n\nNo prebuilt for your OS? The **build-from-source guide** checks your toolchain (git / CMake /\nCUDA / MSVC), hands you the exact build commands, then drops you into the folder scan above.\n\n**Auto-provisioned default.** Don't want to fetch anything? 
On first run TurboLLM downloads\nthe right upstream prebuilt for your GPU automatically — and a **backend picker** lets you\nswitch between CUDA / ROCm / Metal / SYCL / Vulkan / CPU at any time (it downloads the variant\nyou choose, LM Studio-style).\n\n**Engine types.** **llama.cpp / GGUF**, **KoboldCpp** and **llamafile** (GGUF, every OS),\n**MLX** (macOS), and **vLLM** (Linux + NVIDIA) are all first-class engine kinds — install from\nthe curated catalog, pick the right one per model, and switch from a single dropdown.\n\n**Fully supervised.** Every engine runs under a real state machine: health-gated readiness,\ngraceful stop, an **idle auto-stop** watchdog, and **live logs + clear error surfacing** in\nthe UI when something fails to load.\n\n\u003e Why it matters: fork-exclusive features — **speculative decoding (NextN / MTP / draft)**,\n\u003e low-bit KV cache, new quant formats — are usable on day 0, with **zero compiler knowledge**\n\u003e on your part beyond producing the binary (and often not even that).\n\n---\n\n## Run Claude Code on your own GPU\n\nTurboLLM's Anthropic-compatible endpoint means [Claude\nCode](https://www.npmjs.com/package/@anthropic-ai/claude-code) can run against whatever model\nyou've loaded — no cloud key, fully offline. One command wires it up:\n\n```bash\nturbollm launch claude               # auto-loads a model if none is running, then opens Claude Code\nturbollm launch claude --model qwen3-8b   # load a specific model first, then launch\n```\n\nIt sets Claude Code's `ANTHROPIC_BASE_URL` / `ANTHROPIC_MODEL` at TurboLLM and execs `claude`;\nextra args are forwarded. If no model is loaded it auto-loads your last-used one (or the first\nin your library); `--model` picks a specific one by key or name. If `claude` isn't installed,\nit tells you how. 
The in-app\n**Developer** screen also shows copy-paste env snippets for any OpenAI- or Anthropic-compatible\ntool (Open WebUI, Kilo Code, opencode, …).\n\n---\n\n## Use it from any device on your network\n\nThe UI runs in the browser, so any phone, tablet, or laptop on your LAN can use the model on\nyour GPU box:\n\n```bash\nturbollm --addr 0.0.0.0:6996    # bind all interfaces, then open http://\u003cyour-ip\u003e:6996\n```\n\nTurn on **Require API key** in Settings → Network when you expose it.\n\n---\n\n## Command-line reference\n\n```bash\nturbollm                        # start on :6996, open browser\nturbollm --port 9000            # listen on a specific port\nturbollm --no-open              # start without opening a browser\nturbollm --addr 0.0.0.0:6996    # bind all interfaces (LAN sharing)\nturbollm --stop                 # stop a running daemon (any terminal)\nturbollm launch claude          # start Claude Code (auto-loads a model if none is running)\nturbollm launch claude --model qwen3-8b   # load a specific model, then launch\n```\n\n| Flag | Description |\n|------|-------------|\n| `--port \u003cn\u003e` | Listen on a specific port (default: `6996`) |\n| `--addr \u003chost:port\u003e` | Full host:port override, e.g. `0.0.0.0:6996` for LAN sharing |\n| `--no-open` | Start without opening a browser window |\n| `--config \u003cfile\u003e` | Path to a custom config file |\n| `--stop` | Stop a running TurboLLM daemon (reads `~/.turbollm/daemon.pid`) and exit |\n| `--help`, `-h` | Show usage and exit |\n\n`turbollm launch claude` also accepts `--model \u003ckey|name\u003e` to load a specific model before\nlaunching; without it, an already-loaded model is used, or the last-used / first model is\nauto-loaded.\n\n---\n\n## Configuration \u0026 data\n\nEverything lives under **`~/.turbollm/`** on every OS — `config.json`, the SQLite chat\ndatabase, downloaded engines, models cache, and logs. 
Back it up or delete it to reset.\nUse `--config \u003cfile\u003e` to point at an alternate config (its directory becomes the data dir).\n\n---\n\n## Requirements\n\n- **Node.js 22 or newer** — enforced at startup with a clear message. \u003chttps://nodejs.org\u003e\n- **Windows, macOS, or Linux.**\n- A GPU is recommended but **not required** — a CPU build is provisioned as a fallback.\n- On Windows, the first time the auto-downloaded `llama-server` runs, SmartScreen/Defender may\n  prompt (it's an upstream binary). Allow it once.\n\n---\n\n## Privacy\n\nTurboLLM is **offline-first**: core local use needs no account, no backend, and no internet.\n**No analytics or telemetry are collected.** Your prompts, chats, files, and keys never leave\nyour machine.\n\n---\n\n## How TurboLLM compares\n\nFocused on the differences that matter — all four are good tools, and the others move fast.\nMarks reflect mid-2026; verify the moving rows against each tool's current docs.\n\n| | **TurboLLM** | LM Studio | Ollama | Open WebUI |\n|---|:---:|:---:|:---:|:---:|\n| Run **any engine / community forks** | ✅ | ❌ llama.cpp/MLX only | ❌ hidden | ❌ frontend |\n| **Benchmark-based auto-tune** of launch flags | ✅ | ◐ basic offload | ◐ basic offload | ❌ |\n| **Measured** t/s in the model list | ✅ | ◐ per-run | ◐ `--verbose` | ❌ |\n| **Anthropic** API (`/v1/messages`) → Claude Code | ✅ | ✅ 0.4.1+ | ✅ v0.14+ | ❌ |\n| OpenAI-compatible API | ✅ | ✅ | ✅ | ◐ proxy |\n| Auto-load the requested model / multi-model pool | ✅ | ✅ JIT | ✅ | ❌ |\n| Use existing model folders (no re-download) | ✅ | ◐ import | ◐ import | ❌ frontend |\n| Speculative decoding (draft / MTP) | ✅ | ✅ | ◐ env flag | ❌ |\n| Web UI from any LAN device | ✅ | ❌ | ❌ | ✅ |\n| **Lightweight** (no Electron / no Python) | ✅ npm | ❌ Electron | ✅ Go | ❌ Python |\n| Offline-first · **no telemetry** | ✅ | ◐ analytics on by default | ✅ | ✅ |\n\nLM Studio and Ollama both added Anthropic `/v1/messages` endpoints in 2026, so the API rows 
are\nnow parity — Claude Code works against any of them. TurboLLM's durable edges are **any engine\nincluding community forks**, **benchmark-based auto-tuning with a VRAM-fit verdict + measured t/s\nbefore you commit**, and **zero telemetry**.\n\nPrefer Open WebUI's chat breadth? It works great pointed at TurboLLM's OpenAI endpoint.\n\n---\n\n## Troubleshooting\n\n- **`TurboLLM requires Node.js 22 or newer`** — upgrade Node: \u003chttps://nodejs.org\u003e.\n- **Model won't load / OOM** — pick a smaller quant (the VRAM verdict warns you), lower GPU\n  offload, or close other GPU apps. Failures surface in the Engines screen with the engine log.\n- **Windows Defender / SmartScreen prompt** — that's the upstream `llama-server` binary on\n  first run; allow it once.\n- **Port already in use** — `turbollm --port 9000`.\n- **Slow generation** — open the model's load params; ensure GPU offload is high and flash\n  attention / NextN are on for supported models.\n\n---\n\n## Develop from source\n\n```bash\nnpm install                  # daemon deps\ncd web \u0026\u0026 npm install \u0026\u0026 cd ..\n\nnpm run build:web            # build the React UI -\u003e src/webdist\nnpm run start                # run the daemon in dev (hot TS via tsx) -\u003e :6996\n\nnpm run build                # production bundle -\u003e dist/cli.js (web assets included)\nnode dist/cli.js --port 6996\n```\n\nFrontend hot-reload: `cd web \u0026\u0026 npm run dev` (proxies `/api` and `/v1` to the daemon on\n:6996).\n\n**Stack:** Node ≥22 · TypeScript · Hono · `node:sqlite` · tsup — and a React 19 + Tailwind v4 +\nshadcn/ui frontend. One TypeScript codebase, shipped as an npm package.\n\n---\n\n## Community\n\nQuestions, ideas, and show-and-tell — join the [Discord](https://discord.gg/v6kRbV7nC).\n\n---\n\n## License\n\nSource-available under the **Functional Source License 1.1 (Apache-2.0 future grant)** — SPDX\n**`FSL-1.1-ALv2`**. 
Free for personal use, internal business use, education, and research; the\nonly restriction is shipping a competing product. Each release converts to Apache-2.0 two\nyears after it's published. Full text: [LICENSE.md](https://github.com/mohitsoni48/TurboLLM/blob/main/turbollm/LICENSE.md).\n\n\u003cp align=\"center\"\u003e\u003csub\u003eBuilt for people who refuse to wait for the mainstream to bless the fast path. ⚡\u003c/sub\u003e\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohitsoni48%2Fturbollm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohitsoni48%2Fturbollm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohitsoni48%2Fturbollm/lists"}