https://github.com/kibotu/llm-windows-server

Turn your Windows GPU into a private, low-latency LLM server. Docker-based, OpenAI-compatible API.
https://github.com/kibotu/llm-windows-server
agentic cuda docker gguf llma-cpp local-llm nvidia-gpu openai-api opencode qwen self-hosted windows
Last synced: 12 days ago
JSON representation
Turn your Windows GPU into a private, low-latency LLM server. Docker-based, OpenAI-compatible API.
Host: GitHub
URL: https://github.com/kibotu/llm-windows-server
Owner: kibotu
Created: 2026-04-02T08:06:33.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-08T08:39:33.000Z (3 months ago)
Last Synced: 2026-04-08T10:27:51.815Z (3 months ago)
Topics: agentic, cuda, docker, gguf, llma-cpp, local-llm, nvidia-gpu, openai-api, opencode, qwen, self-hosted, windows
Language: PowerShell
Homepage:
Size: 33.2 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
Awesome Lists containing this project

README

          # LLM Server

[![Medium](https://img.shields.io/badge/Medium-@kibotu-000000?style=flat-square&logo=medium&logoColor=white)](https://medium.com/@kibotu/two-paths-to-local-llm-servers-windows-nvidia-vs-mac-apple-silicon-1e28d606f600?sk=a5d9989d124d7f9b844927f0f545ed09)

Turn your idle Windows machine with an NVIDIA GPU into a low-latency, private LLM inference server. Docker-based OpenAI-compatible API with usage tracking and optional high-load benchmarking.

## Quick Start

```powershell

.\setup.ps1                    # pulls image, downloads 9B model, configures firewall

.\run.ps1                      # starts server on port 8899

.\test.ps1                     # verify it works

```

API endpoint: `http://localhost:8899/v1`

---

## Models

| Alias | Model | Size | VRAM (MoE offload) | Notes |

|-------|-------|------|-------------------|-------|

| `9b` | Qwen3.5-9B Q4_K_M | ~5 GB | ~8 GB | Default. Fast, good for agentic loops. |

| `35b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB | ~6-8 GB | Auto-picks Qwen3.6 if present. |

| `qwen3635ba3b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB | ~6-8 GB | Explicit Qwen3.6 selection. |

| `qwen3635ba3b2bit` | Qwen3.6-35B-A3B Q2_K_XL | ~13 GB | ~4-5 GB | Ultra-low VRAM. Setup: `-IncludeQwen36Q2` |

| `qwen3635ba3b4bit` | Qwen3.6-35B-A3B IQ4_XS | ~14 GB | ~5-6 GB | imatrix quant. Setup: `-IncludeQwen36IQ4` |

| `qwen36heretic` | Qwen3.6-35B-A3B Uncensored Heretic MTP | ~22 GB | ~6-8 GB | Uncensored + MTP. Setup: `-IncludeQwen36Heretic` |

| `qwen36opus47` | Qwen3.6-35B-A3B Claude Opus Distill MTP | ~20 GB | ~6-8 GB | Claude 4.7 Opus reasoning distill + MTP. Best for code. Setup: `-IncludeQwen36Opus47` |

| `qwen3uncensored8b` | Qwen3-8B-Uncensor-v2 Q4_K_M | ~5 GB | ~8 GB | Uncensored 8B variant. |

| `gemma312` | Gemma 3 12B IT Q4_K_M | ~7 GB | ~9 GB | Setup: `-IncludeGemma312` |

| `gemma426ba4b` | Gemma 4 26B-A4B IT Q4_K_M | ~16 GB | ~8-10 GB | MoE. Setup: `-IncludeGemma426BA4B` |

Switch models: `.\run.ps1 -Model  -Restart`

Models are downloaded to the HuggingFace cache (`HF_HOME` env var, or `~/.cache/huggingface/`) and linked to `models/` for Docker.

---

## run.ps1

```powershell

.\run.ps1 [options]

```

### Model & Server

| Parameter | Default | Description |

|-----------|---------|-------------|

| `-Model` | `9b` | Model alias (see table above) |

| `-Restart` | - | Force restart container |

| `-Stop` | - | Stop server |

| `-Context` | `262144` | Context window (tokens) |

### Inference Features

| Parameter | Default | Description |

|-----------|---------|-------------|

| `-Thinking` | off | Extended reasoning mode |

| `-Mtp` | off | Multi-Token Prediction (requires MTP model like `qwen36opus47`) |

| `-MtpTokens` | `4` | Extra tokens to predict with MTP (1-16) |

### MoE Offloading

For MoE models (`35b`, `qwen3635ba3b*`, `qwen36opus47`, `gemma426ba4b`):

| Parameter | Default | Description |

|-----------|---------|-------------|

| `-MoeOffload` | `auto` | `auto`/`all` = experts to CPU, `off` = full GPU, `N` = first N layers |

### Performance Tuning

| Parameter | Default | Description |

|-----------|---------|-------------|

| `-Threads` | `0` (auto) | CPU threads. Auto: 20 on 32-core, 12 on 16-core. |

| `-Batch` | `2048` | Physical batch size. Auto-bumped to 4096 for MoE. |

| `-UBatch` | `1024` | Micro-batch size (≤ Batch). Auto-bumped to 4096 for MoE. |

| `-KvCache` | `q8_0` | KV cache type: `q4_0` (smallest), `q8_0` (balanced), `f16` (best) |

### Speculative Decoding

| Parameter | Default | Description |

|-----------|---------|-------------|

| `-DraftModelFile` | - | Draft GGUF filename in `models/` |

| `-DraftMin` | `5` | Min draft tokens before verification |

| `-DraftMax` | `16` | Max draft tokens per step |

| `-DraftGpuLayers` | `99` | GPU layers for draft model |

### Advanced

| Parameter | Description |

|-----------|-------------|

| `-ExtraFlags` | Pass-through to `llama-server` |

### Examples

```powershell

.\run.ps1 -Model qwen36heretic -MoeOffload auto -Thinking -Mtp  # uncensored + MTP

.\run.ps1 -Model qwen36opus47 -MoeOffload auto -Thinking -Mtp   # Opus distill + MTP (best for code)

.\run.ps1 -Model qwen3635ba3b2bit -MoeOffload auto              # 2-bit, minimal VRAM

.\run.ps1 -Model gemma426ba4b -Thinking                         # Gemma 4 MoE

.\run.ps1 -Batch 4096 -UBatch 4096 -Threads 24                  # perf tuning

```

---

## setup.ps1

```powershell

.\setup.ps1 [options]

```

Downloads Qwen3.5-9B by default. All other models require explicit flags.

### Additional Models

| Parameter | Model | Size |

|-----------|-------|------|

| `-IncludeQwen3635ba3b` | Qwen3.6-35B-A3B Q4_K_S | ~21 GB |

| `-IncludeQwen3Uncensored8b` | Qwen3-8B-Uncensor-v2 Q4_K_M | ~5 GB |

| `-IncludeQwen36Q2` | Qwen3.6-35B-A3B Q2_K_XL (2-bit) | ~13 GB |

| `-IncludeQwen36IQ4` | Qwen3.6-35B-A3B IQ4_XS | ~14 GB |

| `-IncludeQwen36Heretic` | Qwen3.6-35B-A3B Uncensored Heretic MTP | ~22 GB |

| `-IncludeQwen36Opus47` | Qwen3.6-35B-A3B Claude Opus Distill MTP Q4_K_M | ~20 GB |

| `-IncludeGemma312` | Gemma 3 12B IT Q4_K_M | ~7 GB |

| `-IncludeGemma426BA4B` | Gemma 4 26B-A4B IT Q4_K_M | ~16 GB |

```powershell

.\setup.ps1 -IncludeQwen36Opus47 -IncludeQwen36Heretic  # download both MTP models

.\setup.ps1 -Model 35b                                  # shorthand for -IncludeQwen3635ba3b

```

---

## Client Configuration

Base URL: `http://:8899/v1`  

API Key: any string (not validated)

**Cursor / Continue / OpenAI SDK**: Set base URL, any API key.

**OpenCode** (`~/.config/opencode/opencode.json`):

```json

{

  "provider": {

    "llama-at-home": {

      "name": "Local LLM",

      "npm": "@ai-sdk/openai-compatible",

      "options": { "baseURL": "http://:8899/v1" },

      "models": {

        "Qwen3.5-9B": { "name": "Qwen3.5-9B" },

        "Qwen3.6-35B-A3B": { "name": "Qwen3.6-35B-A3B" }

      }

    }

  }

}

```

---

## MoE Expert Offloading

MoE models (Qwen 35B, Gemma 4 26B-A4B) have many "expert" sub-networks but only activate a few per token. CPU expert offloading keeps attention on GPU while experts run on CPU/RAM.

**Result**: Run 35B models on 8-12 GB VRAM instead of 20+ GB.

```powershell

.\run.ps1 -Model qwen3635ba3b -MoeOffload auto   # default: experts to CPU

.\run.ps1 -Model qwen3635ba3b -MoeOffload off    # full GPU (needs 20+ GB)

.\run.ps1 -Model qwen3635ba3b -MoeOffload 30     # first 30 layers' experts to CPU

```

---

## Multi-Token Prediction (MTP)

MTP predicts multiple tokens in parallel for ~25% speedup on code workloads. Requires a model with native MTP support.

**Note:** MTP models require a custom llama.cpp build. The first run with `-Mtp` will build the image (10-20 minutes). Subsequent runs use the cached image.

| Model | Best `-MtpTokens` | Notes |

|-------|-------------------|-------|

| `qwen36heretic` | 4 (default) | Uncensored, good general MTP |

| `qwen36opus47` | 2 (auto-set) | Claude Opus distill. Best for code. |

```powershell

.\run.ps1 -Model qwen36heretic -Mtp                 # uncensored + MTP

.\run.ps1 -Model qwen36opus47 -Mtp                  # Opus distill + MTP (auto n=2)

.\run.ps1 -Model qwen36opus47 -Mtp -MtpTokens 1     # Opus distill, better for prose

```

### Standalone MTP Runner

For direct control over MTP settings, use `run-mtp.ps1`:

```powershell

.\run-mtp.ps1                                        # Code-optimized defaults (n=2)

.\run-mtp.ps1 -MtpTokens 1                           # Prose/creative mode

.\run-mtp.ps1 -Context 65536 -KvCache q4_0           # Longer context

.\run-mtp.ps1 -ReasoningBudget 16384                 # Hard math/logic problems

```

### Model Card Recommendations

Based on [Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF](https://huggingface.co/Dyluhn/lordx64-Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-MTP-GGUF):

| Setting | Code Workloads | Prose/Creative |

|---------|---------------|----------------|

| `--spec-draft-n-max` | **2** (89-91% accept, ~25% speedup) | **1** (more reliable) |

| `--cache-type-k/v` | `q8_0` (quality) or `q4_0` (longer ctx) | `q8_0` |

| `--parallel` | **1** (required for MTP) | **1** |

| `--reasoning-budget` | 4096 (default) | 16384 (hard problems) |

**Architecture:** `qwen35moe_mtp` — this is MoE + MTP working together.

### MTP + TBQ4 (Optional, Experimental)

The [Indra's Mirror fork](https://indrasmirror.au/blog-mtp-shared-tensors-200k) adds TBQ4 fused flash attention for massive context on 27B dense models (80-87 tok/s at 262K).

**TBQ4 is separate from MTP** — you don't need it to use MTP. The lordx64 model card recommends standard `q8_0` or `q4_0` KV cache.

| KV Cache | Use Case | Notes |

|----------|----------|-------|

| `q8_0` | Default, quality | Model card recommendation |

| `q4_0` | Longer context (65k+) | Good for 16GB VRAM |

| `tbq4_0` | Experimental on MoE | May have alignment issues |

**Recommended Settings for RTX 4080 + 35B MoE:**

```powershell

# Code-optimized (model card defaults)

.\run-mtp.ps1 -Model "lordx64-distill-MTP-Q4_K_M.gguf" `

              -Context 32768 `

              -KvCache q8_0 `

              -MtpTokens 2

# Prose/creative

.\run-mtp.ps1 -MtpTokens 1 -ReasoningBudget 8192

```

---

## Benchmarking

```powershell

.\benchmark.ps1                  # quick tok/s check

.\benchmark.ps1 -Runs 5          # stable average

# Python suite

pip install -r requirements-benchmark.txt

python benchmark.py --test standard

python analyze-benchmark.py benchmark_results.json

```

### A/B Testing

```powershell

.\run.ps1 -Model 9b -Batch 2048 -UBatch 512 -Restart && .\benchmark.ps1 -Runs 5

.\run.ps1 -Model 9b -Batch 4096 -UBatch 1024 -Restart && .\benchmark.ps1 -Runs 5

```

---

## Host Control API

Remote model switching via HTTP.

```powershell

.\start-control.ps1   # starts on port 8898

```

| Endpoint | Method | Description |

|----------|--------|-------------|

| `/models` | GET | Available models + last switch state |

| `/status` | GET | Current status + health |

| `/switch` | POST | Switch model, streams SSE progress |

```powershell

Invoke-WebRequest -Uri "http://localhost:8898/switch" -Method POST `

  -ContentType "application/json" -Body '{"model":"35b","thinking":true,"restart":true}'

```

---

## Tailscale

For secure remote access without port forwarding:

1. Install [Tailscale](https://tailscale.com) on host + clients

2. Use Tailscale IP: `tailscale ip -4`

---

## Requirements

| Component | Minimum |

|-----------|---------|

| GPU | NVIDIA RTX 3060 12 GB |

| RAM | 32 GB |

| OS | Windows 10/11 |

| Docker | Desktop with WSL2 backend |

---

## Docker Commands

```powershell

docker compose logs -f    # stream logs

docker compose down       # stop everything

docker ps                 # running containers

```

---

## Files

```

setup.ps1             Setup (Docker, models, firewall)

run.ps1               Start/stop/configure server

start-control.ps1     Host control API (port 8898)

test.ps1              Connectivity test

benchmark.ps1         Quick benchmark

benchmark.py          Python benchmark suite

analyze-benchmark.py  Analyze results

docker-compose.yml    Service definitions

models/               Downloaded GGUFs (gitignored)

```

---

## License

Scripts: public domain. Models: see respective Hugging Face model cards.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/kibotu/llm-windows-server

Awesome Lists containing this project

README