{"id":47615945,"url":"https://github.com/sharpai/swiftlm","last_synced_at":"2026-04-23T05:04:13.475Z","repository":{"id":345901623,"uuid":"1187786518","full_name":"SharpAI/SwiftLM","owner":"SharpAI","description":"⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, MACOS + iOS iPhone app.","archived":false,"fork":false,"pushed_at":"2026-04-12T20:43:16.000Z","size":33154,"stargazers_count":303,"open_issues_count":3,"forks_count":15,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-04-12T21:25:54.487Z","etag":null,"topics":["apple-sili","inference","ios","llm","metal","mlx","moe","on-device-ai","openai-api","swift"],"latest_commit_sha":null,"homepage":"","language":"Swift","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SharpAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-21T06:48:49.000Z","updated_at":"2026-04-12T20:12:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/SharpAI/SwiftLM","commit_stats":null,"previous_names":["sharpai/mlx-server","sharpai/swiftlm"],"tags_count":26,"template":false,"template_full_name":null,"purl":"pkg:github/SharpAI/SwiftLM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SharpAI%2FSwiftLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SharpAI%2FSwiftLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SharpAI%2FSwiftLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SharpAI%2FSwiftLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SharpAI","download_url":"https://codeload.github.com/SharpAI/SwiftLM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SharpAI%2FSwiftLM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31873607,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-sili","inference","ios","llm","metal","mlx","moe","on-device-ai","openai-api","swift"],"created_at":"2026-04-01T21:26:34.128Z","updated_at":"2026-04-23T05:04:13.451Z","avatar_url":"https://github.com/SharpAI.png","language":"Swift","readme":"# ⚡️ 
SwiftLM\n\nA blazingly fast, native Swift inference server that serves [MLX](https://github.com/ml-explore/mlx) models with a strict **OpenAI-compatible API**. \n\nNo Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://youtu.be/E9vR5FREhMg\"\u003e\u003cimg src=\"docs/mac_demo.gif\" width=\"720\" alt=\"SwiftLM Mac macOS demo\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/demo.gif\" width=\"320\" alt=\"SwiftBuddy iOS demo\" /\u003e\n\u003c/p\u003e\n\n---\n\n## 🏁 Getting Started\n\n### Fastest: Download Pre-built Binary\n\nDownload the latest release tarball from the [Releases page](https://github.com/SharpAI/SwiftLM/releases).\nThe archive is **self-contained** — `mlx.metallib` is bundled alongside the binary.\n\n```bash\ntar -xzf SwiftLM-\u003cversion\u003e-macos-arm64.tar.gz\n./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413\n```\n\n### Build from Source\n\nThe build script handles everything: submodules, cmake, Metal kernel compilation, and the Swift build.\n\n```bash\ngit clone --recursive https://github.com/SharpAI/SwiftLM\ncd SwiftLM\n./build.sh\n```\n\nThis will:\n1. Initialize git submodules\n2. Install `cmake` via Homebrew (if not already installed)\n3. Compile `mlx.metallib` from the Metal kernel sources\n4. Build the `SwiftLM` binary in release mode\n\nThen start the server (models download automatically if not cached):\n```bash\n.build/release/SwiftLM \\\n  --model mlx-community/gemma-4-26b-a4b-it-4bit \\\n  --port 5413\n```\n\n*(Add `--stream-experts` when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)*\n\n## 📊 Performance: Gemma 4-26B on Apple Silicon\n\nBenchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB.\n\n### Headline Numbers\n\n| Configuration | 512 ctx | 40K ctx | 100K ctx |\n|---|---|---|---|\n| **Dense/Vanilla** | 33.0 tok/s · 23.4 GB | 20.2 tok/s · 57.0 GB | 15.7 tok/s · 56.7 GB |\n| **SSD Stream** | 10.8 tok/s · **22.2 GB** | 10.4 tok/s · **24.2 GB** | 9.0 tok/s · **27.6 GB** |\n| **TurboQuant** | 29.0 tok/s · 23.7 GB | 3.9 tok/s · 39.4 GB | 3.9 tok/s · 57.3 GB |\n| **SSD + TurboQuant** | 11.4 tok/s · **22.0 GB** | 2.5 tok/s · **22.5 GB** | 1.6 tok/s · **22.3 GB** |\n\n\u003e Values shown as `generation speed · GPU memory allocated`\n\n**Key takeaways:**\n- 🚀 **Speed Doubled**: The newer MLX backend modifications have more than doubled raw `SSD Stream` inference speed (from 4.5 -\u003e **10.8 tok/s**) while maintaining streaming stability.\n- 📄 **40K context on 24 GB MacBook Pro**: SSD + TurboQuant effortlessly fits a 26B model in **22.5 GB** of memory footprint.\n- 📚 **100K context on 24 GB MacBook Pro**: Due to hyper-efficient 3-bit KV compression paired with SSD weight streaming, you can process 100,000 tokens of context on a 24 GB machine — only utilizing **22.3 GB** total. (Previously required a 64 GB Mac Studio).\n\n\u003e Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks \u0026 Testing** below).\n\n---\n\n## 🚀 Features\n\n- 🍎 **100% Native Apple Silicon**: Powered natively by Metal and Swift. 
\n- 🔌 **OpenAI-compatible**: Drop-in replacement for OpenAI SDKs (`/v1/chat/completions`, streaming, etc.).\n- 🧠 **Smart Model Routing**: Loads HuggingFace format models directly, with native Safetensors parsing.\n- 👁️ **Vision-Language Models (VLM)**: Native multimodal vision processing on Metal via the `--vision` flag, supporting real-time base64 image parsing (e.g., Qwen2-VL, PaliGemma).\n- 🎧 **Audio-Language Models (ALM)**: High-performance audio ingestion via the `--audio` flag, decoding OpenAI-spec `input_audio` payloads with AVFoundation WAV extraction.\n- ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.\n- 💾 **SSD Expert Streaming (10x)**: High-performance NVMe streaming that loads Mixture of Experts (MoE) layers directly from SSD to GPU — engineered by [@ericjlake](https://github.com/ericjlake), achieving **10x speedup** (0.58 → 5.91 tok/s) on 122B+ models with only ~10 GB resident memory. Uses cross-projection batching, concurrent pread (QD=24), asyncEval pipeline, and runtime top-k expert selection.\n- 🔮 **Speculative Decoding**: Load a small draft model (e.g. 9B) alongside a large main model to generate candidate tokens and verify in bulk — accelerating in-RAM inference.\n- 🎛️ **Granular Memory Control**: Integrated Layer Partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.\n\n---\n\n## 📡 Supported Models \u0026 Methodologies\n\n`SwiftLM` dynamically maps Apple MLX primitives to standard HuggingFace architectures, enabling native Metal inference across the latest frontier open-weights models.\n\n### 💬 Text (LLMs)\n\n| Family | Models | Notes |\n|---|---|---|\n| **Gemma 4** | `gemma-4-e2b`, `gemma-4-e4b` (dense) · `gemma-4-26b-a4b`, `gemma-4-31b` (MoE) | Interleaved local + global attention; KV sharing; native quantized KV cache (issue #71 fix) |\n| **Gemma 3 / 3n** | `gemma-3-*`, `gemma-3n-*` | Google Gemma 3 and nano variants |\n| **Gemma / Gemma 2** | `gemma-*`, `gemma-2-*` | Original Gemma family |\n| **Qwen 3.5** | `Qwen3.5-7B`, `Qwen3.5-27B`, `Qwen3.5-122B-A10B`, `Qwen3.5-397B-A22B` | Dense + MoE; SSD streaming at 10× for 122B/397B |\n| **Qwen 3** | `Qwen3-*` (dense + MoE) | Sliding window + hybrid attention |\n| **Qwen 2.5** | `Qwen2.5-7B`, `Qwen2.5-14B`, `Qwen2.5-72B` | Robust RoPE scaling |\n| **Qwen 2** | `Qwen2-*` | Linear RoPE variants |\n| **Phi 4 / PhiMoE** | `phi-4-mlx`, `Phi-3.5-MoE` | Microsoft Phi family incl. 
MoE |\n| **Phi 3 / Phi** | `Phi-3`, `Phi-3.5-mini` | 128k context via chunked prefill |\n| **Mistral / Mixtral** | `Mistral-7B`, `Mistral-4`, `Mixtral-*` | GQA + sliding window variants |\n| **Llama / Llama 3** | `Llama-3.1-*`, `Llama-3.2-*`, `Llama-3.3-*` | YaRN + dynamic NTK RoPE scaling |\n| **GLM 4** | `GLM-4-*` | THUDM GLM-4 dense + MoE-Lite variants |\n| **DeepSeek V3** | `DeepSeek-V3-*` | MLA attention architecture |\n| **Falcon H1** | `Falcon-H1-*` | Falcon hybrid SSM+attention |\n| **LFM 2** | `LFM2-*`, `LFM2-MoE-*` | Liquid AI dense + MoE |\n| **OLMo 2 / OLMo 3 / OLMoE** | `OLMo-2-*`, `OLMo-3-*` | AllenAI open language models |\n| **Granite / GraniteMoE** | `Granite-*`, `GraniteMoE-Hybrid-*` | IBM Granite hybrid Mamba+attention |\n| **SmolLM 3** | `SmolLM3-*` | HuggingFace compact LM |\n| **MiniCPM** | `MiniCPM-*` | Lightweight efficient LM |\n| **InternLM 2** | `InternLM2-*` | Shanghai AI Lab series |\n| **Cohere / Command-R** | `Command-R-*`, `c4ai-*` | Cohere retrieval-tuned models |\n| **Jamba** | `Jamba-v0.1` | AI21 hybrid Mamba+attention |\n| **Exaone 4** | `EXAONE-4.0-*` | LG AI Research |\n| **MiMo / MiMo V2** | `MiMo-7B-*` | Xiaomi reasoning model |\n| **Ernie 4.5** | `ERNIE-4.5-*` | Baidu ERNIE series |\n| **Baichuan M1** | `Baichuan-M1-*` | Baichuan multimodal base |\n| **Bailing MoE** | `Ling-*` | Bailing/Ling MoE family |\n| **NemotronH** | `Nemotron-H-*` | NVIDIA Nemotron hybrid |\n| **Starcoder 2** | `starcoder2-*` | Code generation |\n| **OpenELM** | `OpenELM-*` | Apple on-device efficient LM |\n| **Apertus / AfMoE** | `Apertus-*` | Sparse MoE research models |\n| **BitNet** | `bitnet-*` | 1-bit weight quantization |\n| **MiniMax** | `MiniMax-Text-*` | Lightning attention architecture |\n| **Olmo3** | `Olmo3-*` | AllenAI Olmo3 series |\n\n### 👁️ Vision (VLMs)\n*Run with `--vision` flag.*\n\n| Family | Models | Notes |\n|---|---|---|\n| **Gemma 4** | `gemma-4-*` (VLM mode) | Native image tower via MLXVLM |\n| **Gemma 3** | `gemma-3-*` (VLM mode) | PaLiGemma-style image projection |\n| **Qwen3-VL / Qwen3.5-VL** | `Qwen3-VL-*`, `Qwen3.5-VL-*` | Dynamic resolution with native RoPE |\n| **Qwen2-VL / Qwen2.5-VL** | `Qwen2-VL-2B/7B`, `Qwen2.5-VL-*` | Real-time positional bounding + Metal image scaling |\n| **LFM2-VL** | `LFM2-VL-1.6B` | Liquid AI multimodal |\n| **Pixtral** | `pixtral-12b` | Mistral vision model |\n| **PaliGemma** | `paligemma-*` | Google vision-language |\n| **Idefics 3** | `Idefics3-*` | HuggingFace multimodal |\n| **Mistral 3** | `Mistral-Small-3.1-*` | Mistral vision variant |\n| **FastVLM** | `FastVLM-*` | Apple on-device VLM |\n| **SmolVLM 2** | `SmolVLM2-*` | HuggingFace compact VLM |\n| **GLM OCR** | `glm-4v-*` | THUDM vision+OCR |\n| **QwenVL** | `Qwen-VL-*` | Original Qwen VL |\n\n### 🎧 Audio (ALMs)\n*Run with `--audio` flag. 
Only `gemma-4-e4b` variants include an audio tower.*\n\n| Family | Models | Notes |\n|---|---|---|\n| **Gemma 4 Omni** | `gemma-4-e4b-it-4bit`, `gemma-4-e4b-it-8bit` | Audio-in via vDSP STFT → Mel spectrogram (16kHz, 128 bins); text-out |\n\n\n\n---\n\n## 📱 SwiftBuddy — iOS App\n\nA native iPhone \u0026 iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.\n\n### Features\n- **Tab UI**: Chat · Models · Settings\n- **Live download progress** with speed indicator and circular progress ring\n- **Model catalog**: Qwen3, Phi-3.5, Mistral, Llama — with on-device RAM fit indicators\n- **HuggingFace search** — find any `mlx-community` model by name\n- **Context-aware empty states** — downloading ring, loading spinner, idle prompt\n- **iOS lifecycle hardened** — model unload only fires on true background (not notification banners); 30-second grace period on app-switch\n\n\u003e 📱 **Running live on iPhone 13 Pro (6 GB)** — no Python, no server, no GIL. Pure on-device MLX inference via Metal GPU.\n\n### Build \u0026 Run (iOS)\n\n```bash\ncd SwiftBuddy\npython3 generate_xcodeproj.py       # Generates SwiftBuddy.xcodeproj\nopen SwiftBuddy.xcodeproj\n```\n\nThen in Xcode:\n1. Select the **SwiftBuddy** target → **Signing \u0026 Capabilities**\n2. Set your **Team** (your Apple Developer account)\n3. Select your iPhone as the run destination\n4. ⌘R to build and run\n\n\u003e **Note for contributors**: The `.xcodeproj` is git-ignored (it contains your personal Team ID). Run `generate_xcodeproj.py` after cloning to regenerate it locally. Your Team ID is never committed.\n\n---\n\n## ⚡️ TurboQuantization: KV Cache Compression\n\n`SwiftLM` implements a **hybrid V2+V3 TurboQuant architecture** for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.\n\n### By combining V2 Speed with V3 Quality:\nRecent reproductions of the TurboQuant algorithm (e.g., `turboquant-mlx`) revealed two distinct paths:\n1. **V2 (Hardware-Accelerated)**: Fast, but uses linear affine quantization which degrades quality at 3-bit.\n2. **V3 (Paper-Correct)**: Excellent quality using non-linear Lloyd-Max codebooks, but painfully slow due to software dequantization.\n\n**We built the \"Holy Grail\" hybrid:** We ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and process the dequantization natively in fused Metal (`bggml-metal`) shaders. This achieves **V3 quality at V2 speeds**, completely detached from Python overhead.\n\n### The Algorithm:\n\n**K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim**\n1. Extract L2 norm and normalize: `x̂ = x / ‖x‖`\n2. Apply Fast Walsh-Hadamard Transform (WHT) rotation to distribute outliers evenly.\n3. Quantize each coordinate using **3-bit non-linear Lloyd-Max centroids**.\n4. Compute the residual error between the original vector and the quantized approximation.\n5. Project the residual via a random Johnson-Lindenstrauss (QJL) matrix and store the 1-bit signs.\n*(Why QJL? QJL acts as an additional regularizer that prevents centroid resolution loss from degrading the attention dot-product.)*\n\n**V-Cache (3-bit PolarQuant) = 3.125 bits/dim**\nBecause the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. 
We cleanly disable QJL for the V-cache, extracting an additional 25% memory savings without sacrificing quality.\n\nReference implementations: [`turboquant-mlx`](https://github.com/sharpner/turboquant-mlx) | [`turboquant_plus`](https://github.com/TheTom/turboquant_plus) | Paper: [TurboQuant, Google 2504.19874](https://arxiv.org/abs/2504.19874)\n\n---\n\n## 💾 SSD Expert Streaming: 10x MoE Speedup\n\nSwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by [Eric Lake](https://github.com/ericjlake)) that achieves 10x generation speedup for massive Mixture of Experts (MoE) models running on memory-constrained Apple Silicon. This enables running models like **Qwen3.5-122B** (69.6 GB) and **Qwen3.5-397B** (209 GB) on a **64 GB Mac** by streaming expert weights from NVMe SSD.\n\n### Benchmark Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)\n\n| Configuration | tok/s | vs. Original | Notes |\n|---|---|---|---|\n| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |\n| **This PR (top-k=8, full quality)** | **4.95** | **8.5×** | All 8 experts evaluated |\n| **This PR (top-k=6, default)** | **5.20** | **9.0×** | Recommended default |\n| **This PR (top-k=4, speed mode)** | **5.91** | **10.2×** | Best quality/speed tradeoff |\n| **This PR (top-k=2, turbo mode)** | **6.52** | **11.2×** | Still coherent output |\n\n\u003e Memory stable at **~10.6 GB resident**, no swap activity. Tested over 200-token generation runs.\n\n### The Approach: Small Model Helps Large Model\n\nA novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.\n\n\u003e **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore routed exclusively to **in-RAM models**.\n\n### Optimization Techniques\n\n1. **Cross-Projection Batching**: Collapses ~1,400 per-expert `eval()` calls down to ~48 per token by orchestrating gate/up/down projections together in `SwitchGLU`.\n2. **Concurrent NVMe pread (QD=24)**: Replaces sequential pread with `DispatchQueue.concurrentPerform`, saturating the NVMe controller's queue depth (8 experts × 3 projections = 24 parallel reads).\n3. **AsyncEval Pipeline with Speculative Pread**: Overlaps GPU compute with SSD I/O — uses previous-token routing to speculatively pre-load experts for the next token during the GPU async window (~70% hit rate). Only missed experts (~30%) require on-demand pread after routing sync.\n4. **Persistent Metal Buffers**: Expert weight buffers are allocated once per `SwitchGLU` layer and reused across tokens, eliminating per-token allocation overhead.\n5. **Runtime Top-K Expert Selection**: The `SWIFTLM_TOP_K` environment variable reduces the number of active experts per token at runtime without model recompilation — trading marginal quality for significant speed gains.\n\n### Key Engineering Findings\n\n| Finding | Detail |\n|---|---|\n| **GPU compute is the bottleneck** | At steady state, GPU compute is ~190ms of ~200ms per-token time. 
The OS page cache serves ~90% of expert reads from RAM. |\n| **Don't cache experts in application memory** | An LRU expert cache *stole* from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it. |\n| **MambaCache requires checkpoint rollback** | Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone. We implemented `checkpoint()`/`restore()` for speculative decoding on hybrid Attention+Mamba architectures (Qwen3.5). |\n\n### Usage\n\n```bash\n# Standard SSD streaming (recommended, top-k=6):\nSWIFTLM_TOP_K=6 SwiftLM --port 8002 \\\n  --model \u003cpath\u003e/Qwen3.5-122B-A10B-4bit --stream-experts\n\n# Speed mode (top-k=4):\nSWIFTLM_TOP_K=4 SwiftLM --port 8002 \\\n  --model \u003cpath\u003e/Qwen3.5-122B-A10B-4bit --stream-experts\n\n# With speculative decoding (in-RAM models only):\nSwiftLM --port 8002 \\\n  --model \u003cpath\u003e/Qwen3.5-27B-4bit \\\n  --draft-model \u003cpath\u003e/Qwen3.5-9B-4bit \\\n  --num-draft-tokens 4\n```\n\n---\n\n## 🔀 Why We Forked Apple MLX\n\nTo achieve the extreme memory efficiency and speeds seen in **SSD Expert Streaming** and **Speculative Decoding**, `SwiftLM` relies on custom C++ primitives that bypass standard unified memory limits.\n\n\u003e [!NOTE]\n\u003e We maintain custom forks (`SharpAI/mlx` and `SharpAI/mlx-c`) to support **out-of-core memory-mapped execution**, streaming tensor blocks directly from the SSD (NVMe) to the GPU via custom Metal kernels (`ssd_streamer.mm` and `fence.air`). Official `ml-explore` repositories do not yet support this out-of-the-box.\n\nFor a detailed breakdown on repository architecture, upstream synchronization, our specific custom patches, and the specific indications for when we can safely revert to Apple's native upstream, read the full documentation: \n👉 **[Upstream MLX Synchronization \u0026 SSD Streaming Maintenance](.agents/workflows/mlx-upstream-sync.md)**\n\n---\n\n## 💻 Benchmarks \u0026 Testing\n\nRun our automated benchmark suites via the interactive script:\n```bash\n./run_benchmark.sh\n```\n\nThe script provides an interactive menu to select any model and run one of two automated testing suites:\n\n### Test 1: Automated Context \u0026 Memory Profile (TPS \u0026 RAM matrix)\nTests generation speed (TPS) and granular Apple Metal GPU memory allocation across extreme context lengths (e.g., `512, 40000, 100000` tokens).\n- Iterates over 4 configurations: Vanilla, SSD Streaming, TurboQuant, and SSD + TurboQuant.\n- Generates a rich ANSI console visualization with bar charts and a configuration scoreboard.\n- Saves the complete results matrix to `docs/profiling/profiling_results_\u003chostname\u003e.md`.\n\n### Test 2: Prompt Cache \u0026 Sliding Window Regression Test\nVerifies the stability of the engine's KV prompt cache when interleaving long contexts with sliding window attention bounds.\n- Automatically spins up an isolated background inference server instance.\n- Generates a 5,000+ token mock JSON payload.\n- Fires an extreme alternating sequence of 4 concurrent requests (`5537t` → `18t` → `5537t` → `Big Full Cache Hit`).\n- Confirms the memory bounds remain stable without throwing $O(N^2)$ OS memory warnings, $OOM$ exceptions, or `SIGTRAP` errors.\n\n### Throughput \u0026 Inference Memory Profile\nTested by rendering exactly 20 tokens under standard conversational evaluation (`--prefill-size 512`) to capture precise Token Generation (TPS) and Apple Metal memory footprint limits:\n\n| Model | 
Time To First Token (s) | Generation Speed (tok/s) | Peak GPU Memory (GB) |\n|---|---|---|---|\n| `gemma-4-e2b-it-4bit` | 0.08s | 116.27 tok/s | 1.37 GB |\n| `gemma-4-e4b-it-8bit` | 0.33s | 48.21 tok/s | 7.64 GB |\n| `gemma-4-26b-a4b-it-4bit` | 0.14s | 85.49 tok/s | 13.46 GB |\n| `gemma-4-31b-it-4bit` | 0.55s | 14.82 tok/s | 16.83 GB |\n\nTo run the automated suite on your machine for these models, execute:\n```bash\npython3 tests/run_4models_benchmark.py\n```\n\n\u003e **🧠 How it works:** SwiftLM implements **Chunked Prefill** (controlled via `--prefill-size`, defaulting to 512). This is functionally equivalent to `llama.cpp`'s `--batch-size` parameter and mirrors the [`mlx-lm` Python library](https://github.com/ml-explore/mlx/tree/main/mlx_lm)'s reference implementation approach to preventing $O(N^2)$ Unified Memory over-allocation during massive sequence parsing.\n\n\u003e **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\\name\\` instead of `\"name\"`—which systematically breaks OpenAI-compatible tool calling.\n\n---\n\n## 📡 API Endpoints\n\n| Endpoint | Method | Description |\n|---|---|---|\n| `/health` | GET | Server health + loaded model capabilities |\n| `/v1/models` | GET | List available models |\n| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |\n\n## 💻 Usage Examples\n\n### Chat Completion (Streaming)\nDrop-in compatible with standard OpenAI HTTP consumers:\n```bash\ncurl http://localhost:5413/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"gemma-4-26b-a4b-it-4bit\",\n    \"stream\": true,\n    \"messages\": [\n      {\"role\": \"system\", \"content\": \"You are Aegis-AI, a local home security agent. Output strictly in JSON format.\"},\n      {\"role\": \"user\", \"content\": \"Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string.\"}\n    ]\n  }'\n```\n---\n\n### Vision-Language Models (VLM)\nTo run a vision model (e.g., `mlx-community/Qwen2-VL-2B-Instruct-4bit`), launch SwiftLM with the `--vision` flag:\n```bash\n./.build/release/SwiftLM --model mlx-community/Qwen2-VL-2B-Instruct-4bit --vision\n```\n\nYou can then pass standard OpenAI base64 encoded images directly. 
SwiftLM handles hardware spatial-mapping natively via Metal:\n```bash\ncurl http://localhost:5413/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"qwen2-vl\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"text\", \"text\": \"Describe the contents of this image.\"},\n          {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ...\"}}\n        ]\n      }\n    ]\n  }'\n```\n---\n\n\n## ⚙️ CLI Options\n\n| Option | Default | Description |\n|---|---|---|\n| `--model` | (required) | HuggingFace model ID or local path |\n| `--port` | `5413` | Port to listen on |\n| `--host` | `127.0.0.1` | Host to bind |\n| `--vision` | `false` | Enable VLM (vision-language model) mode for image inputs |\n| `--audio` | `false` | Enable ALM (audio-language model) mode for audio inputs |\n| `--max-tokens` | `2048` | Max tokens limit per generation |\n| `--prefill-size`| `512`  | Prompt prefill chunk size (micro-batching for long contexts) |\n| `--top-p` | `1.0` | Default top-p nucleus sampling (overridable per-request) |\n| `--top-k` | `50` | Default top-k sampling (0 disables, overridable per-request) |\n| `--min-p` | `0.0` | Default min-p sampling threshold relative to the highest probability token (0 disables) |\n| `--gpu-layers` | `model_default`| Restrict the amount of layers allocated to GPU hardware |\n| `--stream-experts` | `false` | Enable SSD expert streaming for MoE models (10x speedup) |\n| `--turbo-kv` | `false` | Enable TurboQuant 3-bit KV cache compression (activates after 2048 tokens, server-wide) |\n| `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |\n| `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |\n\n## 🔧 Per-Request API Parameters\n\nIn addition to the standard OpenAI fields (`temperature`, `top_p`, `max_tokens`, etc.), SwiftLM accepts the following **SwiftLM-specific** fields on `POST /v1/chat/completions`:\n\n| Field | Type | Description |\n|---|---|---|\n| `kv_bits` | `int` (4 or 8) | Enable **MLX-native quantized KV cache** for this request. Uses `QuantizedKVCache` (standard group quantization) instead of `KVCacheSimple`. Separate from `--turbo-kv`. Reduces KV memory ~2–4× at mild quality cost. |\n| `enable_thinking` | `bool` | Force-enable or disable chain-of-thought thinking blocks for Gemma-4 / Qwen3. |\n| `kv_group_size` | `int` | Group size for `kv_bits` quantization (default: `64`). |\n| `top_k` | `int` | Per-request top-k sampling override (0 = disabled). |\n| `min_p` | `float` | Per-request min-p sampling threshold (0 = disabled). |\n| `repetition_penalty` | `float` | Token repetition penalty (e.g. `1.15`). 
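|\n\nAs a minimal sketch of combining these fields in a single request (the endpoint, model name, and field names come from the tables above; the specific values are only illustrative):\n\n```bash\ncurl http://localhost:5413/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"gemma-4-26b-a4b-it-4bit\",\n    \"top_k\": 40,\n    \"min_p\": 0.05,\n    \"repetition_penalty\": 1.15,\n    \"enable_thinking\": false,\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Explain KV cache quantization in two sentences.\"}\n    ]\n  }'\n```\n\nStandard OpenAI fields such as `temperature` and `max_tokens` can be sent alongside these overrides.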
\n### `kv_bits` vs `--turbo-kv` — What's the difference?\n\n| | `kv_bits` (per-request) | `--turbo-kv` (server flag) |\n|---|---|---|\n| **Scope** | Per-request, sent in JSON body | Server-wide, set at startup |\n| **Algorithm** | MLX-native group quantization (4-bit / 8-bit) | Custom 3-bit PolarQuant + QJL Walsh-Hadamard |\n| **Activation** | From token 0 | After 2048 tokens |\n| **Memory savings** | ~2–4× vs FP16 | ~3.5× vs FP16 |\n| **Use case** | Targeted memory reduction per conversation | Extreme long-context (100K+) compression |\n\n### Example: Enable 4-bit KV cache per request\n```bash\ncurl http://localhost:5413/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"gemma-4-26b-a4b-it-4bit\",\n    \"kv_bits\": 4,\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Summarize the history of computing in 3 sentences.\"}\n    ]\n  }'\n```\n\n## 📦 Requirements\n\n- macOS 14.0+\n- Apple Silicon (M1/M2/M3/M4/M5)\n- Xcode Command Line Tools\n- Metal Toolchain (`xcodebuild -downloadComponent MetalToolchain`)\n\n## 📖 The \"Aha!\" Moment\n\n**The \"2+2=4\" Aha Moment**: During development, we encountered a severe \"silent failure\" where the model would successfully load and evaluate all 32 layers at high speed, but generate nothing but infinite whitespace. The model logits showed the correct *shape* but the wrong *magnitudes*. \n\nThe breakthrough arrived when we realized the **embedding scale** was missing. The Gemma architecture requires scaling embedding outputs by `sqrt(hidden_size)`. For a hidden size of 2816, missing this meant every activation in the network was ~53x too small! By adding a single math operation:\n`h = h * MLXArray(Float(config.hiddenSize).squareRoot())`\n\nThe model instantly woke up from \"whispering\" whitespace and successfully responded to `\"What is 2+2?\"` with a perfect `\"2 + 2 equals 4.\"` — proving that the entire massive structural pipeline from Swift to Metal was working.\n\n## 🙏 Acknowledgments \u0026 Credits\n\n[![Awesome MLX](https://img.shields.io/badge/Awesome-MLX-blue?style=flat-square)](https://github.com/raullenchai/awesome-mlx)\n\n`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects and contributors for their indispensable reference materials and underlying protocols:\n\n### Contributors\n\n- **[Eric Lake](https://github.com/ericjlake)** — Engineered the **SSD Expert Streaming 10x rewrite** ([PR #26](https://github.com/SharpAI/SwiftLM/pull/26)), achieving 10× generation speedup on 122B+ MoE models via cross-projection batching, concurrent NVMe pread (QD=24), asyncEval pipeline with speculative pread, and runtime top-k expert selection. 
Also implemented the **speculative decoding infrastructure** with `DraftModelRef`, dual-model loading, and **MambaCache checkpoint/restore** for hybrid Attention+Mamba architectures.\n\n### Projects \u0026 References\n\n- **[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.\n- **[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.\n- **[flash-moe](https://github.com/danveloper/flash-moe)** — Inspired the memory-mapped out-of-core SSD Expert Streaming mechanics that we implemented natively in SwiftLM.\n- **[Hummingbird](https://github.com/hummingbird-project/hummingbird)** — The incredible event-driven Swift HTTP engine powering the OpenAI-compatible REST API.\n- **[TurboQuant Paper](https://arxiv.org/abs/2504.19874)** — *\"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate\"* (Zandieh et al., AISTATS 2026). Provided the initial algorithmic framework for the dual-stage PolarQuant + QJL engine.\n- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — Served as an invaluable reference architecture for the C and GPU quantization tables, guiding the development of our native `turbo-wht` Walsh-Hadamard kernels and custom Metal wrapper layers.\n- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Essential Python validation logic used to certify the correctness of our manually constructed Lloyd-Max codebook generation math.\n- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — The original 1-bit residual correction engine backing the paper, which informed our QJL error recovery in dot-product regimes.\n\n---\n**License**: MIT\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsharpai%2Fswiftlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsharpai%2Fswiftlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsharpai%2Fswiftlm/lists"}