{"id":49505199,"url":"https://github.com/devnen/qwen3.6-windows-server","last_synced_at":"2026-05-06T10:01:42.926Z","repository":{"id":354680082,"uuid":"1224703318","full_name":"devnen/qwen3.6-windows-server","owner":"devnen","description":"One-click Qwen3.6-27B inference on Windows. 64.5 tok/s on a single RTX 3090. Native, no WSL, no Docker, no telemetry.","archived":false,"fork":false,"pushed_at":"2026-05-01T15:19:29.000Z","size":11223,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-01T15:33:05.922Z","etag":null,"topics":["llm-inference","local-llm","offline-ai","privacy","qwen","qwen3","rtx-3090","textual-tui","vllm","windows"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devnen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-29T14:41:37.000Z","updated_at":"2026-05-01T15:19:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/devnen/qwen3.6-windows-server","commit_stats":null,"previous_names":["devnen/qwen3.6-windows-server"],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/devnen/qwen3.6-windows-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2Fqwen3.6-windows-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2Fqwen3.6-window
s-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2Fqwen3.6-windows-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2Fqwen3.6-windows-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devnen","download_url":"https://codeload.github.com/devnen/qwen3.6-windows-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devnen%2Fqwen3.6-windows-server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32688333,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-06T08:33:17.875Z","status":"ssl_error","status_checked_at":"2026-05-06T08:33:17.221Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-inference","local-llm","offline-ai","privacy","qwen","qwen3","rtx-3090","textual-tui","vllm","windows"],"created_at":"2026-05-01T15:01:15.291Z","updated_at":"2026-05-06T10:01:42.913Z","avatar_url":"https://github.com/devnen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# qwen3.6-windows-server\n\n\u003e **One-click [Qwen3.6-27B](https://huggingface.co/Qwen) inference on Windows.**\n\u003e Unzip, double-click, you're serving on 
`http://127.0.0.1:5001/v1`.\n\u003e No WSL, no Docker, no conda, no pip, no admin. **Everything runs on\n\u003e your machine. No telemetry. No analytics. No phone-home.**\n\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Made for Windows](https://img.shields.io/badge/OS-Windows%2010%2F11-0078d6.svg)](https://www.microsoft.com/windows)\n[![GPU](https://img.shields.io/badge/tested-RTX%203090-76b900.svg)](https://www.nvidia.com/)\n[![Local AI](https://img.shields.io/badge/100%25-local%20%2F%20offline-success.svg)](#the-local-ai-ethos)\n\n---\n\n\u003e ## Important fix in v1.2.2 — please upgrade\n\u003e\n\u003e Earlier releases (v1.2.1 and prior) shipped with `--enable-prefix-caching`\n\u003e turned on in every snapshot. Qwen3.6-27B is a hybrid Mamba/SSM model, and\n\u003e per [vLLM issue #17140](https://github.com/vllm-project/vllm/issues/17140)\n\u003e prefix caching is **incompatible** with SSM state management. The result\n\u003e was a stepwise decode-tps regression after long-context requests:\n\u003e\n\u003e - Fresh server, short prompts: full speed (~130 tok/s on 5090, ~72 on 3090).\n\u003e - After one ~24 k-token request: dropped ~30 %, never recovered.\n\u003e - After a second long request: dropped to ~30 % of original, never recovered.\n\u003e - Workaround pre-fix was to restart the server between long-context turns.\n\u003e\n\u003e **v1.2.2 disables prefix caching in all 12 snapshots.** Decode stays at\n\u003e documented speed across mixed long+short workloads. The only thing lost\n\u003e is the warm-prefix TTFT speedup on identical repeat prompts — irrelevant\n\u003e for single-user serving (which every snapshot uses, `--max-num-seqs=1`).\n\u003e\n\u003e **To get the fix:** `update.bat` (or download the new release zip and\n\u003e re-extract). 
See [`docs/UPGRADING.md`](docs/UPGRADING.md) for details\n\u003e and [`docs/TUNING.md`](docs/TUNING.md) for the full bug write-up.\n\n---\n\n## What this is\n\nA small portable Windows app that gives you an OpenAI-compatible API\nserving Qwen3.6-27B locally, with config presets that I actually\nmeasured myself. The launcher is a Textual TUI: arrow keys, Enter\nto start a snapshot, Esc to stop. Press `e` to add, edit, duplicate, or\ndelete your own snapshot configs from inside the TUI, no hand-editing\nfiles. That's the whole UX.\n\nIt is the matching launcher for the [`devnen/vllm-windows`](https://github.com/devnen/vllm-windows)\npatched wheel, but you don't need to know or care about that. The wheel\nships inside the launcher zip.\n\n## What you get\n\nOn a single RTX 3090 (24 GB), running [Lorbus AutoRound INT4](https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound):\n\nEvery snapshot below has the tool-calling fix baked in (PR #35687 + #40861 + `qwen3.5-enhanced.jinja` + `preserve_thinking=false`), so any one of them works with any OpenAI-compatible client, Claude Code, Cline, Cursor, Codex, OpenCode, KiloCode, LM Studio, etc. Just point it at the listed port.\n\n| Snapshot              | Decode tok/s | Prompt class      | Context | Use it when |\n|-----------------------|--------------|-------------------|---------|-------------|\n| `start_72tps`         | **~72**      | short (~200 tok)  | 32 k    | Short-prompt / chat baseline. MTP n=3. |\n| `start_speed`         | **64.5**     | long (100 KB)     | 90 k    | Default for long prompts. MTP n=6, see note below. |\n| `start_127k`          | 53.4         | long (100 KB)     | 127 k   | Maximum context on a single 3090. |\n| `start_mtp4`          | 58.3         | long (100 KB)     | 120 k   | Mid-balance speed vs context. |\n| `start_pp2_160k` (2 GPU) | 43.5      | long (100 KB)     | 160 k   | Pipeline-parallel for the largest contexts. 
|\n| `start_gpu0_50k`      | 56.9         | mixed             | 9–50 k  | Single-GPU + display, fallback when you can't boot-quiet. |\n\n\u003e **GPU index note.** `start_72tps`, `start_speed`, `start_127k`, and\n\u003e `start_mtp4` pin to **GPU 1** so GPU 0 stays free for the desktop\n\u003e compositor and other apps on a 2× 3090 box. On a single-GPU host the\n\u003e snapshot detects that via `nvidia-smi` and falls back to GPU 0 with a\n\u003e warning. `start_pp2_160k` requires two GPUs.\n\n\u003e **Single 3090 with display attached.** You can run the full\n\u003e `start_speed` snapshot at 90 k context if you close heavy GPU apps\n\u003e (Chrome, Discord, Slack, video playback) **during boot**. Once vLLM\n\u003e has reserved its KV pool, the driver schedules everything else around\n\u003e what vLLM already owns, so you can reopen those apps and they'll\n\u003e behave normally. The danger is reopening them before boot finishes,\n\u003e mid-allocation OOM is what kills runs. If you can't or won't\n\u003e boot-quiet, `start_gpu0_50k` is the conservative fallback (`mem_util\n\u003e 0.92`, ~50 k ctx, same decode tok/s).\n\nLong-prompt rows were measured on a ~100 KB / ~24 k-token Python\nsource-summary prompt (a real Windows-service module fed to\n`windows_tools\\bench_summarize.py`). The short-prompt row was measured\non a ~200-token chat turn via `windows_tools\\bench.py`. All numbers\n[coherence-validated](docs/COHERENCE.md), TPS without coherence is a\nlie.\n\n\u003e **Why MTP n=6 on `start_speed`?** n=3 is the universal *short-prompt*\n\u003e sweet spot and ships as `start_72tps`. On long, dense Python source\n\u003e the acceptance curve shifts later, n=6 won my coherence sweep\n\u003e (n=3 / 4 / 5 / 6 / 7 / 8 → 53.4 / 58.3 / 62.8 / 64.5 / 61.5 / 58.0\n\u003e tok/s; full sweep in [`docs/TUNING.md`](docs/TUNING.md)). Always\n\u003e re-sweep on a representative prompt for your workload.\n\n\u003e **Honest framing:** these are not r/LocalLLaMA records. 
Community has\n\u003e hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a\n\u003e 5090. The unique angle here is **native Windows, no WSL**. Same\n\u003e recipe, no virtualization tax, one community member measured the\n\u003e same hardware going from **85 tok/s in WSL to 160 tok/s in native\n\u003e Ubuntu** ([reported here](https://www.reddit.com/r/LocalLLaMA/comments/1sw21op/comment/oid8d9n/)).\n\u003e This launcher closes that gap on Windows.\n\n## Why this exists\n\nMost fast Qwen3.6-27B recipes on r/LocalLLaMA assume Linux + Docker, or\nLinux-in-WSL. Windows users either pay the WSL tax, dual-boot, or skip\ninference entirely. None of those is great if your daily driver is\nWindows.\n\nThis launcher is the third option:\n\n- **Native Windows.** Runs as a normal Windows process. No virtualization layer.\n- **Portable.** Unzip the launcher, drop your model into a folder, double-click. That's it.\n- **Validated.** Every config in here was measured against a coherence battery before being checked in. No copy-pasted Reddit recipes that look fast but emit `* * * *`.\n- **Local-only.** No outbound calls except when you explicitly ask the launcher to download a model from HuggingFace. No telemetry of any kind, ever.\n\n## Install\n\n**TL;DR for CI / agents / scripted installs**, one line, no TUI:\n\n```powershell\nstart.bat --auto-download --snapshot start_72tps\n```\n\nThat installs the runtime, downloads the model if missing, and starts\nserving on `http://127.0.0.1:5001/v1`. See\n[Headless / scripted install](#headless--scripted-install) below for\nall the flags.\n\n**Hand the install to a coding agent**, copy/paste prompt at\n[`docs/AGENT_INSTALL_PROMPT.md`](docs/AGENT_INSTALL_PROMPT.md). 
Edit the\none `INSTALL_DIR` line, paste into Claude Code / Cursor / Codex CLI /\nany agent with shell access, and it does the download + extract +\nruntime install + model fetch + smoke test end-to-end while you do\nsomething else.\n\n**Interactive path:**\n\n1. Download [`qwen3.6-windows-server-portable-x64.zip`](../../releases/latest)\n   from the latest Release. Extract anywhere (no admin needed).\n2. Double-click `start.bat`. The first run does two one-time steps,\n   then drops you in the TUI:\n   - **Runtime install** (~5–15 min, several GB). The bundled vLLM\n     wheel + ~150 transitive deps (torch, CUDA wheels, transformers,\n     etc.) install into the embedded Python's `site-packages`. A\n     marker file is written so subsequent launches skip this entirely.\n   - **Model setup.** Looks for `Qwen3.6-27B-int4-AutoRound` weights\n     on your fixed drives (scans `\u003cdrive\u003e:\\`, `_models\\`, `models\\`,\n     `AI\\`, `AI\\models\\`, `huggingface\\`, `huggingface\\hub\\`,\n     `models\\Lorbus\\`). If it doesn't find them, offers to\n     **auto-download from Hugging Face** (~16 GB, public, no token)\n     or accepts a path to weights you already have. If your weights\n     live somewhere else, pass `--model-dir \u003cpath\u003e` to skip the scan.\n3. Pick a snapshot, press Enter, you're serving on\n   `http://127.0.0.1:5001/v1`.\n\nThe portable zip ships with an embedded Python 3.12 runtime, the\npatched vLLM wheel, the launcher TUI, a portable Windows Terminal,\nand a vendored `get-pip.py`. No conda, no system-Python install, no\nregistry changes, no admin prompts. The runtime install on first run\nis the only network-dependent step besides the model download.\n\nDon't have the model yet? 
See [`docs/MTP_HEAD.md`](docs/MTP_HEAD.md), and\n**use the Lorbus AutoRound quant**; the others won't draft.\n\nDetailed install (including the wheel-only path for users who already\nhave their own venv): [`docs/INSTALL.md`](docs/INSTALL.md).\n\n## Optional: install MSVC 2022 for the small decode boost\n\nThe launcher works on a vanilla Windows install, no MSVC required.\nBut if you install **Visual Studio 2022 Build Tools** (free, no full\nIDE) with the **\"Desktop development with C++\"** workload, the\nsnapshots auto-detect it and turn on vLLM's flashinfer sampler path,\nwhich JIT-compiles a faster top-k / top-p kernel on first launch.\n\nWhat it costs:\n- ~7 GB download, one-time install.\n- Extra 30 to 60 s on the first `profile_run` of each new snapshot\n  while the kernel compiles. Subsequent boots reuse the compiled\n  cache.\n\nWhat you get:\n- A small but measurable decode boost on the sampler path.\n\nWithout MSVC, the snapshots transparently fall back to the PyTorch\nsampler, which never JIT-compiles anything. Boot is faster and the\nserver is reliable; you just leave a few percent of decode tok/s on\nthe table. The launcher prints a one-line `[info]` at startup telling\nyou which path it picked.\n\nGet the Build Tools installer here (official Microsoft `aka.ms`\nshortlink, pinned to VS 2022 / 17.x so it stays on the right product\neven after VS 2026 ships):\nhttps://aka.ms/vs/17/release/vs_buildtools.exe\n\nninja (the other half of the JIT toolchain) ships inside the\nlauncher zip, so you don't need to install it separately.\n\n## Test it\n\nOnce the server is up:\n\n```bat\ncurl http://127.0.0.1:5001/v1/chat/completions ^\n  -H \"Content-Type: application/json\" ^\n  -d \"{\\\"model\\\":\\\"any\\\",\\\"messages\\\":[{\\\"role\\\":\\\"user\\\",\\\"content\\\":\\\"Capital of France?\\\"}],\\\"max_tokens\\\":2000}\"\n```\n\nNote the `\"model\": \"any\"`: the patched wheel accepts any value. 
You\ndon't have to know what the model is called.\n\n\u003e **Why `max_tokens: 2000`?** Qwen3.6 is a thinking model: it spends the\n\u003e first chunk of its budget reasoning inside `\u003cthink\u003e...\u003c/think\u003e` and\n\u003e only then writes the answer to `content`. With `max_tokens: 50` the\n\u003e entire budget gets eaten by the thinking phase and you'll see\n\u003e `content: null` plus `finish_reason: \"length\"`, the server is fine,\n\u003e the budget was just too small. 1500–2000 is a safe floor for short Q\u0026A.\n\n\u003e **Where's the answer in the response?** The final answer lands in\n\u003e `choices[0].message.content`. The chain-of-thought lands in a separate\n\u003e `choices[0].message.reasoning` field, that's the `--reasoning-parser=qwen3`\n\u003e wheel patch doing its job, not a bug. Most chat clients show\n\u003e `content` and ignore `reasoning`; if yours doesn't, point it at\n\u003e `content`.\n\n\u003e **If the request hangs**, tail `logs\\vllm_server.\u003cport\u003e.log` for vLLM's\n\u003e own stdout, the parent launcher logs only the boot banner; the\n\u003e serving process tees its progress to that file.\n\n## Headless / scripted install\n\nEnd-to-end automated install (no TUI, no prompts), useful for CI,\nremote machines, agent installers, or just keeping a repeatable\nrecipe:\n\n```powershell\nstart.bat --auto-download --snapshot start_72tps\n```\n\nThe launcher runs the first-run setup (vLLM wheel + ~150 deps),\nauto-downloads the Lorbus quant from Hugging Face if it's missing,\napplies the tokenizer patch automatically, and execs the chosen\nsnapshot, all without opening the TUI. 
Other useful flags:\n\n```powershell\nstart.bat --model-dir G:\\_models\\Qwen3.6-27B-int4-AutoRound --snapshot start_speed\nstart.bat --headless     :: skip TUI, run the default snapshot (start_72tps)\nstart.bat --setup-only   :: install runtime + model, then exit (no serving)\n```\n\n`--headless` without `--snapshot` now runs the default snapshot\n(`start_72tps`) instead of exiting after setup checks. To run only\nthe setup checks (the old `--headless` behavior), pass `--setup-only`.\n\nThe launcher also stays in the parent terminal, instead of detaching\ninto a new Windows Terminal window, when it sees any of `WT_SESSION`,\n`VLLM_NO_WT`, `CI`, `GITHUB_ACTIONS`, `MSYSTEM`, or `TERM` in the\nenvironment. That covers GitHub Actions, git-bash, MSYS, agent\nrunners, and anything that exports `TERM`. So your captured stdout\nwon't go missing.\n\nFor benchmark numbers like the table above, use the bundled tools:\n\n```powershell\nwindows_tools\\bench.bat              :: short prompt, decode-only TPS\nwindows_tools\\bench_summarize.bat    :: ~100 KB / ~24 k-token prompt, prefill + decode + KV\nwindows_tools\\check_coherence.bat    :: 3-tier coherence validator\n```\n\n## Hardware reality\n\nTuned and measured on:\n\n- Windows 10 Enterprise 22H2\n- 2× NVIDIA RTX 3090 (Ampere `sm_86`), no NVLink, PCIe Gen 4\n- 350 W power cap (250 W also benchmarked, see [`docs/TUNING.md`](docs/TUNING.md))\n\nShould also work on any Ampere or Ada NVIDIA GPU running Windows 10/11,\n3090, 4090, A6000, etc. **Will not work** on Pascal, Turing, Intel Arc,\nor any AMD card. **Single GPU with the display attached** loses 1–3 GiB\nof VRAM to the desktop compositor and another 2–5 GiB to running apps,\nbut you can still run the full `start_speed` snapshot at 90 k context\nby closing heavy GPU apps (Chrome, Discord, Slack, video playback)\nduring boot, then reopening them after vLLM finishes booting. If you\ncan't boot-quiet, fall back to `start_gpu0_50k`. 
Either path is\ncovered in [`docs/WINDOWS_VRAM_HEADLESS.md`](docs/WINDOWS_VRAM_HEADLESS.md).\n\n\u003e **RTX 50-series (Blackwell, 5060 / 5070 / 5080 / 5090): supported via\n\u003e the Blackwell zip.** Download\n\u003e `qwen3.6-windows-server-portable-x64-blackwell.zip` instead of the\n\u003e default zip. It bundles `vllm-0.20.0+cu132.devnen.1` against CUDA 13.2\n\u003e / PyTorch cu130 with `sm_120` kernels. Verified end-to-end on a single\n\u003e RTX 5090: **130.9 decode tok/s** on `rtx5090_speed` (ctx 120k, MTP\n\u003e n=6) and **138.0 tok/s** on `rtx5090_max` (ctx 280k, MTP n=3) — both\n\u003e verified single-card. Three 5090 snapshots ship: `rtx5090_speed` (120k),\n\u003e `rtx5090` (240k), `rtx5090_max` (280k). NVIDIA driver 596+ required.\n\u003e See [`docs/BLACKWELL.md`](docs/BLACKWELL.md) for the full story.\n\nIf you're on a 4090, expect slightly higher numbers than mine. If\nyou're on something more exotic, nothing here is going to work without\nyour own tuning, that's fine, please share what you find.\n\n\u003e **Scope.** This launcher serves Qwen3.6-27B specifically through a\n\u003e fixed set of validated snapshots. It is not a general vLLM server you\n\u003e can point at any model. Adding configs for smaller Qwen variants is\n\u003e straightforward (see [`docs/SNAPSHOTS.md`](docs/SNAPSHOTS.md));\n\u003e running unrelated models like ACE-Step, Stable Diffusion, or other\n\u003e diffusion / multimodal stacks is out of scope.\n\n## The local-AI ethos\n\nEverything runs on your machine. No telemetry. No analytics. No\nphone-home. No cloud inference. No model weights downloaded behind your\nback. The launcher never opens an outbound connection except when you\nexplicitly ask it to (downloading a model from HuggingFace via your own\nbrowser/`huggingface-cli`). This is in the spirit of\n[r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/): your hardware,\nyour weights, your prompts, your business.\n\nThe launcher and every script are Apache-2.0. 
The bundled wheel inherits\nupstream vLLM's Apache-2.0 license. SHA256 of every release asset is\npublished next to the release, verify before extracting.\n\n## What's under the hood\n\nThe wheel that powers this launcher is\n[`devnen/vllm-windows`](https://github.com/devnen/vllm-windows): a\npatched native-Windows build of [vLLM](https://github.com/vllm-project/vllm),\nwith three Windows-specific fixes (CPU-relay for Gloo collectives, Qwen3\nreasoning-parser fix mirrored from PR #35687, hardwired wildcard model\nname). The full diff is at\n[`CHANGES_VS_SYSTEMPANIC.md`](https://github.com/devnen/vllm-windows/blob/main/CHANGES_VS_SYSTEMPANIC.md)\nin that repo. You don't have to download it separately, it's bundled\ninside this launcher's portable zip.\n\n## Documentation\n\n- [`docs/INSTALL.md`](docs/INSTALL.md), full install + the bring-your-own-venv path.\n- [`docs/UPGRADING.md`](docs/UPGRADING.md), in-place updater (`update.bat`), preserve list, variant switching.\n- [`docs/BLACKWELL.md`](docs/BLACKWELL.md), single landing page for RTX 50-series users.\n- [`docs/MODELS.md`](docs/MODELS.md), swapping in other quants / model sizes / Qwen variants.\n- [`docs/CLAUDE_CODE.md`](docs/CLAUDE_CODE.md), point Claude Code at the local server (native `/v1/messages`, no proxy).\n- [`docs/CODEX.md`](docs/CODEX.md), use OpenAI Codex CLI with this server (Responses API, `developer`-role template fix).\n- [`docs/OPENCODE.md`](docs/OPENCODE.md), point OpenCode at the local server (custom OpenAI-compatible provider, AGENTS.md path-handling rule).\n- [`docs/QWEN_CLI.md`](docs/QWEN_CLI.md), use Alibaba's official Qwen Code agent against this server.\n- [`docs/PI.md`](docs/PI.md), use the Pi coding agent with this server (custom provider extension, auto-compaction notes).\n- [`docs/UNINSTALL.md`](docs/UNINSTALL.md), clean removal (it's portable, so just delete folders).\n- [`docs/HARDWARE.md`](docs/HARDWARE.md), what works, what doesn't, and why.\n- 
[`docs/COMPARISON.md`](docs/COMPARISON.md), how this stacks up against Ollama, LM Studio, llama.cpp, Docker, and WSL2.\n- [`docs/COHERENCE.md`](docs/COHERENCE.md), degenerate-output guide and the 3-tier validator.\n- [`docs/TROUBLESHOOTING.md`](docs/TROUBLESHOOTING.md), every failure mode I've hit.\n- [`docs/TUNING.md`](docs/TUNING.md), the lever set, anti-levers, how to sweep your own configs.\n- [`docs/MTP_HEAD.md`](docs/MTP_HEAD.md), why Lorbus AutoRound is the only INT4 quant that works.\n- [`docs/SPEC_DECODE_MATRIX.md`](docs/SPEC_DECODE_MATRIX.md), what spec-decode + parallelism combos work.\n- [`docs/SNAPSHOTS.md`](docs/SNAPSHOTS.md), managing snapshots from inside the TUI: keyboard shortcuts, the CRUD editor, flag invariants, hand-edit fallback.\n- [`docs/WINDOWS_VRAM_HEADLESS.md`](docs/WINDOWS_VRAM_HEADLESS.md), free VRAM on Windows for single-GPU.\n- [`docs/HALLUCINATED_FLAGS.md`](docs/HALLUCINATED_FLAGS.md), flags from web search results that don't exist on this wheel.\n- [`docs/CREDITS.md`](docs/CREDITS.md), vLLM team, SystemPanic, Lorbus, the community.\n\n## Contributing\n\nBug reports welcome, please include GPU model, driver version, Windows\nbuild, and the relevant slice of `logs\\vllm_server.\u003cport\u003e.log`. The\n[issue template](.github/ISSUE_TEMPLATE/bug_report.md) walks you\nthrough it.\n\n**Share your configs.** Each snapshot in `snapshots/` is just a\nvalidated set of vLLM flags for one hardware/model combo, plus a card\nin `launcher/configs.yaml` so the launcher can list it. If you've got\na config that runs coherent and faster (or with more context) than\nwhat's in here, please send a PR. 
The bar is the\n[3-tier coherence check](docs/COHERENCE.md), TPS without coherence\nwon't be merged.\n\nConfigs I'd love to see:\n\n- Other Qwen3.6-27B quants (FP8, NVFP4, smaller AutoRound variants)\n- Smaller Qwen models (14B, 8B, 4B) for 16 GB cards\n- 4090 / 5090 / 5060 Ti / A6000 tunings\n- New parallelism or KV-cache combos as vLLM adds them\n\nHow to add a snapshot: [`docs/SNAPSHOTS.md`](docs/SNAPSHOTS.md) (in-TUI editor and hand-edit fallback).\n\nThis project is intentionally narrow scope: **Windows + Ampere/Ada/Blackwell\nNVIDIA**. PRs for other operating systems or GPU vendors are politely\nout of scope, please go upstream.\n\n## Credits\n\n- [vLLM](https://github.com/vllm-project/vllm), the engine.\n- [SystemPanic/vllm-windows](https://github.com/SystemPanic/vllm-windows), the upstream Windows wheel build infrastructure.\n- [Lorbus](https://huggingface.co/Lorbus), the AutoRound INT4 quant of Qwen3.6-27B that makes any of this fast.\n- [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/), the configs in here started from recipes posted on the subreddit, and got refined by the honest feedback in the comments.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevnen%2Fqwen3.6-windows-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevnen%2Fqwen3.6-windows-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevnen%2Fqwen3.6-windows-server/lists"}