{"id":50729420,"url":"https://github.com/kibotu/llm-windows-server","last_synced_at":"2026-06-10T07:03:37.602Z","repository":{"id":349958318,"uuid":"1199297591","full_name":"kibotu/llm-windows-server","owner":"kibotu","description":"Turn your Windows GPU into a private, low-latency LLM server. Docker-based, OpenAI-compatible API.","archived":false,"fork":false,"pushed_at":"2026-04-08T08:39:33.000Z","size":34,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-08T10:27:51.815Z","etag":null,"topics":["agentic","cuda","docker","gguf","llma-cpp","local-llm","nvidia-gpu","openai-api","opencode","qwen","self-hosted","windows"],"latest_commit_sha":null,"homepage":"","language":"PowerShell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kibotu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"kibotu","buy_me_a_coffee":"kibotu","custom":"https://paypal.me/janrabe/5"}},"created_at":"2026-04-02T08:06:33.000Z","updated_at":"2026-04-08T08:39:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kibotu/llm-windows-server","commit_stats":null,"previous_names":["kibotu/llm-windows-server"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kibotu/llm-windows-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kibotu%2Fllm-windows-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kibotu%2Fllm-windows-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kibotu%2Fllm-windows-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kibotu%2Fllm-windows-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kibotu","download_url":"https://codeload.github.com/kibotu/llm-windows-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kibotu%2Fllm-windows-server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34140776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic","cuda","docker","gguf","llma-cpp","local-llm","nvidia-gpu","openai-api","opencode","qwen","self-hosted","windows"],"created_at":"2026-06-10T07:03:33.340Z","updated_at":"2026-06-10T07:03:37.584Z","avatar_url":"https://github.com/kibotu.png","language":"PowerShell","funding_links":["https://github.com/sponsors/kibotu","https://buymeacoffee.com/kibotu","https://paypal.me/janrabe/5"],"categories":[],"sub_categories":[],"readme":"# LLM Server\n\n[![Medium](https://img.shields.io/badge/Medium-@kibotu-000000?style=flat-square\u0026logo=medium\u0026logoColor=white)](https://medium.com/@kibotu/two-paths-to-local-llm-servers-windows-nvidia-vs-mac-apple-silicon-1e28d606f600?sk=a5d9989d124d7f9b844927f0f545ed09)\n\nTurn your idle Windows machine with an NVIDIA GPU into a low-latency, private LLM inference server. Docker-based OpenAI-compatible API with usage tracking and optional high-load benchmarking.\n\n## Quick Start\n\n```powershell\n.\\setup.ps1                    # pulls image, downloads 9B model, configures firewall\n.\\run.ps1                      # starts server on port 8899\n.\\test.ps1                     # verify it works\n```\n\nAPI endpoint: `http://localhost:8899/v1`\n\n---\n\n## Models\n\n| Alias | Model | Size | VRAM (MoE offload) | Notes |\n|-------|-------|------|-------------------|-------|\n| `9b` | Qwen3.5-9B Q4_K_M | ~5 GB | ~8 GB | Default. Fast, good for agentic loops. |\n| `35b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB | ~6-8 GB | Auto-picks Qwen3.6 if present. |\n| `qwen3635ba3b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB | ~6-8 GB | Explicit Qwen3.6 selection. |\n| `qwen3635ba3b2bit` | Qwen3.6-35B-A3B Q2_K_XL | ~13 GB | ~4-5 GB | Ultra-low VRAM. Setup: `-IncludeQwen36Q2` |\n| `qwen3635ba3b4bit` | Qwen3.6-35B-A3B IQ4_XS | ~14 GB | ~5-6 GB | imatrix quant. Setup: `-IncludeQwen36IQ4` |\n| `qwen36heretic` | Qwen3.6-35B-A3B Uncensored Heretic MTP | ~22 GB | ~6-8 GB | Uncensored + MTP. Setup: `-IncludeQwen36Heretic` |\n| `qwen36opus47` | Qwen3.6-35B-A3B Claude Opus Distill MTP | ~20 GB | ~6-8 GB | Claude 4.7 Opus reasoning distill + MTP. Best for code. Setup: `-IncludeQwen36Opus47` |\n| `qwen3uncensored8b` | Qwen3-8B-Uncensor-v2 Q4_K_M | ~5 GB | ~8 GB | Uncensored 8B variant. |\n| `gemma312` | Gemma 3 12B IT Q4_K_M | ~7 GB | ~9 GB | Setup: `-IncludeGemma312` |\n| `gemma426ba4b` | Gemma 4 26B-A4B IT Q4_K_M | ~16 GB | ~8-10 GB | MoE. Setup: `-IncludeGemma426BA4B` |\n\nSwitch models: `.\\run.ps1 -Model \u003calias\u003e -Restart`\n\nModels are downloaded to the HuggingFace cache (`HF_HOME` env var, or `~/.cache/huggingface/`) and linked to `models/` for Docker.\n\n---\n\n## run.ps1\n\n```powershell\n.\\run.ps1 [options]\n```\n\n### Model \u0026 Server\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `-Model` | `9b` | Model alias (see table above) |\n| `-Restart` | - | Force restart container |\n| `-Stop` | - | Stop server |\n| `-Context` | `262144` | Context window (tokens) |\n\n### Inference Features\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `-Thinking` | off | Extended reasoning mode |\n| `-Mtp` | off | Multi-Token Prediction (requires MTP model like `qwen36opus47`) |\n| `-MtpTokens` | `4` | Extra tokens to predict with MTP (1-16) |\n\n### MoE Offloading\n\nFor MoE models (`35b`, `qwen3635ba3b*`, `qwen36opus47`, `gemma426ba4b`):\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `-MoeOffload` | `auto` | `auto`/`all` = experts to CPU, `off` = full GPU, `N` = first N layers |\n\n### Performance Tuning\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `-Threads` | `0` (auto) | CPU threads. Auto: 20 on 32-core, 12 on 16-core. |\n| `-Batch` | `2048` | Physical batch size. Auto-bumped to 4096 for MoE. |\n| `-UBatch` | `1024` | Micro-batch size (≤ Batch). Auto-bumped to 4096 for MoE. |\n| `-KvCache` | `q8_0` | KV cache type: `q4_0` (smallest), `q8_0` (balanced), `f16` (best) |\n\n### Speculative Decoding\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `-DraftModelFile` | - | Draft GGUF filename in `models/` |\n| `-DraftMin` | `5` | Min draft tokens before verification |\n| `-DraftMax` | `16` | Max draft tokens per step |\n| `-DraftGpuLayers` | `99` | GPU layers for draft model |\n\n### Advanced\n\n| Parameter | Description |\n|-----------|-------------|\n| `-ExtraFlags` | Pass-through to `llama-server` |\n\n### Examples\n\n```powershell\n.\\run.ps1 -Model qwen36heretic -MoeOffload auto -Thinking -Mtp  # uncensored + MTP\n.\\run.ps1 -Model qwen36opus47 -MoeOffload auto -Thinking -Mtp   # Opus distill + MTP (best for code)\n.\\run.ps1 -Model qwen3635ba3b2bit -MoeOffload auto              # 2-bit, minimal VRAM\n.\\run.ps1 -Model gemma426ba4b -Thinking                         # Gemma 4 MoE\n.\\run.ps1 -Batch 4096 -UBatch 4096 -Threads 24                  # perf tuning\n```\n\n---\n\n## setup.ps1\n\n```powershell\n.\\setup.ps1 [options]\n```\n\nDownloads Qwen3.5-9B by default. All other models require explicit flags.\n\n### Additional Models\n\n| Parameter | Model | Size |\n|-----------|-------|------|\n| `-IncludeQwen3635ba3b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB |\n| `-IncludeQwen3Uncensored8b` | Qwen3-8B-Uncensor-v2 Q4_K_M | ~5 GB |\n| `-IncludeQwen36Q2` | Qwen3.6-35B-A3B Q2_K_XL (2-bit) | ~13 GB |\n| `-IncludeQwen36IQ4` | Qwen3.6-35B-A3B IQ4_XS | ~14 GB |\n| `-IncludeQwen36Heretic` | Qwen3.6-35B-A3B Uncensored Heretic MTP | ~22 GB |\n| `-IncludeQwen36Opus47` | Qwen3.6-35B-A3B Claude Opus Distill MTP Q4_K_M | ~20 GB |\n| `-IncludeGemma312` | Gemma 3 12B IT Q4_K_M | ~7 GB |\n| `-IncludeGemma426BA4B` | Gemma 4 26B-A4B IT Q4_K_M | ~16 GB |\n\n```powershell\n.\\setup.ps1 -IncludeQwen36Opus47 -IncludeQwen36Heretic  # download both MTP models\n.\\setup.ps1 -Model 35b                                  # shorthand for -IncludeQwen3635ba3b\n```\n\n---\n\n## Client Configuration\n\nBase URL: `http://\u003chost-ip\u003e:8899/v1`  \nAPI Key: any string (not validated)\n\n**Cursor / Continue / OpenAI SDK**: Set base URL, any API key.\n\n**OpenCode** (`~/.config/opencode/opencode.json`):\n\n```json\n{\n  \"provider\": {\n    \"llama-at-home\": {\n      \"name\": \"Local LLM\",\n      \"npm\": \"@ai-sdk/openai-compatible\",\n      \"options\": { \"baseURL\": \"http://\u003chost-ip\u003e:8899/v1\" },\n      \"models\": {\n        \"Qwen3.5-9B\": { \"name\": \"Qwen3.5-9B\" },\n        \"Qwen3.6-35B-A3B\": { \"name\": \"Qwen3.6-35B-A3B\" }\n      }\n    }\n  }\n}\n```\n\n---\n\n## MoE Expert Offloading\n\nMoE models (Qwen 35B, Gemma 4 26B-A4B) have many \"expert\" sub-networks but only activate a few per token. CPU expert offloading keeps attention on GPU while experts run on CPU/RAM.\n\n**Result**: Run 35B models on 8-12 GB VRAM instead of 20+ GB.\n\n```powershell\n.\\run.ps1 -Model qwen3635ba3b -MoeOffload auto   # default: experts to CPU\n.\\run.ps1 -Model qwen3635ba3b -MoeOffload off    # full GPU (needs 20+ GB)\n.\\run.ps1 -Model qwen3635ba3b -MoeOffload 30     # first 30 layers' experts to CPU\n```\n\n---\n\n## Multi-Token Prediction (MTP)\n\nMTP predicts multiple tokens in parallel for ~25% speedup on code workloads. Requires a model with native MTP support.\n\n**Note:** MTP models require a custom llama.cpp build. The first run with `-Mtp` will build the image (10-20 minutes). Subsequent runs use the cached image.\n\n| Model | Best `-MtpTokens` | Notes |\n|-------|-------------------|-------|\n| `qwen36heretic` | 4 (default) | Uncensored, good general MTP |\n| `qwen36opus47` | 2 (auto-set) | Claude Opus distill. Best for code. |\n\n```powershell\n.\\run.ps1 -Model qwen36heretic -Mtp                 # uncensored + MTP\n.\\run.ps1 -Model qwen36opus47 -Mtp                  # Opus distill + MTP (auto n=2)\n.\\run.ps1 -Model qwen36opus47 -Mtp -MtpTokens 1     # Opus distill, better for prose\n```\n\n### Standalone MTP Runner\n\nFor direct control over MTP settings, use `run-mtp.ps1`:\n\n```powershell\n.\\run-mtp.ps1                                        # Code-optimized defaults (n=2)\n.\\run-mtp.ps1 -MtpTokens 1                           # Prose/creative mode\n.\\run-mtp.ps1 -Context 65536 -KvCache q4_0           # Longer context\n.\\run-mtp.ps1 -ReasoningBudget 16384                 # Hard math/logic problems\n```\n\n### Model Card Recommendations\n\nBased on [Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF](https://huggingface.co/Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF):\n\n| Setting | Code Workloads | Prose/Creative |\n|---------|---------------|----------------|\n| `--spec-draft-n-max` | **2** (89-91% accept, ~25% speedup) | **1** (more reliable) |\n| `--cache-type-k/v` | `q8_0` (quality) or `q4_0` (longer ctx) | `q8_0` |\n| `--parallel` | **1** (required for MTP) | **1** |\n| `--reasoning-budget` | 4096 (default) | 16384 (hard problems) |\n\n**Architecture:** `qwen35moe_mtp` — this is MoE + MTP working together.\n\n### MTP + TBQ4 (Optional, Experimental)\n\nThe [Indra's Mirror fork](https://indrasmirror.au/blog-mtp-shared-tensors-200k) adds TBQ4 fused flash attention for massive context on 27B dense models (80-87 tok/s at 262K).\n\n**TBQ4 is separate from MTP** — you don't need it to use MTP. The lordx64 model card recommends standard `q8_0` or `q4_0` KV cache.\n\n| KV Cache | Use Case | Notes |\n|----------|----------|-------|\n| `q8_0` | Default, quality | Model card recommendation |\n| `q4_0` | Longer context (65k+) | Good for 16GB VRAM |\n| `tbq4_0` | Experimental on MoE | May have alignment issues |\n\n**Recommended Settings for RTX 4080 + 35B MoE:**\n\n```powershell\n# Code-optimized (model card defaults)\n.\\run-mtp.ps1 -Model \"lordx64-distill-MTP-Q4_K_M.gguf\" `\n              -Context 32768 `\n              -KvCache q8_0 `\n              -MtpTokens 2\n\n# Prose/creative\n.\\run-mtp.ps1 -MtpTokens 1 -ReasoningBudget 8192\n```\n\n---\n\n## Benchmarking\n\n```powershell\n.\\benchmark.ps1                  # quick tok/s check\n.\\benchmark.ps1 -Runs 5          # stable average\n\n# Python suite\npip install -r requirements-benchmark.txt\npython benchmark.py --test standard\npython analyze-benchmark.py benchmark_results.json\n```\n\n### A/B Testing\n\n```powershell\n.\\run.ps1 -Model 9b -Batch 2048 -UBatch 512 -Restart \u0026\u0026 .\\benchmark.ps1 -Runs 5\n.\\run.ps1 -Model 9b -Batch 4096 -UBatch 1024 -Restart \u0026\u0026 .\\benchmark.ps1 -Runs 5\n```\n\n---\n\n## Host Control API\n\nRemote model switching via HTTP.\n\n```powershell\n.\\start-control.ps1   # starts on port 8898\n```\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `/models` | GET | Available models + last switch state |\n| `/status` | GET | Current status + health |\n| `/switch` | POST | Switch model, streams SSE progress |\n\n```powershell\nInvoke-WebRequest -Uri \"http://localhost:8898/switch\" -Method POST `\n  -ContentType \"application/json\" -Body '{\"model\":\"35b\",\"thinking\":true,\"restart\":true}'\n```\n\n---\n\n## Tailscale\n\nFor secure remote access without port forwarding:\n\n1. Install [Tailscale](https://tailscale.com) on host + clients\n2. Use Tailscale IP: `tailscale ip -4`\n\n---\n\n## Requirements\n\n| Component | Minimum |\n|-----------|---------|\n| GPU | NVIDIA RTX 3060 12 GB |\n| RAM | 32 GB |\n| OS | Windows 10/11 |\n| Docker | Desktop with WSL2 backend |\n\n---\n\n## Docker Commands\n\n```powershell\ndocker compose logs -f    # stream logs\ndocker compose down       # stop everything\ndocker ps                 # running containers\n```\n\n---\n\n## Files\n\n```\nsetup.ps1             Setup (Docker, models, firewall)\nrun.ps1               Start/stop/configure server\nstart-control.ps1     Host control API (port 8898)\ntest.ps1              Connectivity test\nbenchmark.ps1         Quick benchmark\nbenchmark.py          Python benchmark suite\nanalyze-benchmark.py  Analyze results\ndocker-compose.yml    Service definitions\nmodels/               Downloaded GGUFs (gitignored)\n```\n\n---\n\n## License\n\nScripts: public domain. Models: see respective Hugging Face model cards.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkibotu%2Fllm-windows-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkibotu%2Fllm-windows-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkibotu%2Fllm-windows-server/lists"}