https://github.com/sharpai/swiftlm
⚡ Native MLX Swift LLM inference server for Apple Silicon. OpenAI-compatible API, SSD streaming for 100B+ MoE models, TurboQuant KV cache compression, macOS server + iOS app.
- Host: GitHub
- URL: https://github.com/sharpai/swiftlm
- Owner: SharpAI
- License: mit
- Created: 2026-03-21T06:48:49.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-04-12T20:43:16.000Z (15 days ago)
- Last Synced: 2026-04-12T21:25:54.487Z (15 days ago)
- Topics: apple-silicon, inference, ios, llm, metal, mlx, moe, on-device-ai, openai-api, swift
- Language: Swift
- Homepage:
- Size: 31.6 MB
- Stars: 303
- Watchers: 2
- Forks: 15
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# ⚡️ SwiftLM
A blazingly fast, native Swift inference server that serves [MLX](https://github.com/ml-explore/mlx) models with a strict **OpenAI-compatible API**.
No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
---
## 🏁 Getting Started
### Fastest: Download Pre-built Binary
Download the latest release tarball from the [Releases page](https://github.com/SharpAI/SwiftLM/releases).
The archive is **self-contained** — `mlx.metallib` is bundled alongside the binary.
```bash
tar -xzf SwiftLM--macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413
```
### Build from Source
The build script handles everything: submodules, cmake, Metal kernel compilation, and the Swift build.
```bash
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
./build.sh
```
This will:
1. Initialize git submodules
2. Install `cmake` via Homebrew (if not already installed)
3. Compile `mlx.metallib` from the Metal kernel sources
4. Build the `SwiftLM` binary in release mode
Then start the server (models download automatically if not cached):
```bash
.build/release/SwiftLM \
--model mlx-community/gemma-4-26b-a4b-it-4bit \
--port 5413
```
*(Add `--stream-experts` when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)*
## 📊 Performance: Gemma 4-26B on Apple Silicon
Benchmark results for `gemma-4-26b-a4b-it-4bit` (26B MoE, 4-bit) on M5 Pro 64 GB.
### Headline Numbers
| Configuration | 512 ctx | 40K ctx | 100K ctx |
|---|---|---|---|
| **Dense/Vanilla** | 33.0 tok/s · 23.4 GB | 20.2 tok/s · 57.0 GB | 15.7 tok/s · 56.7 GB |
| **SSD Stream** | 10.8 tok/s · **22.2 GB** | 10.4 tok/s · **24.2 GB** | 9.0 tok/s · **27.6 GB** |
| **TurboQuant** | 29.0 tok/s · 23.7 GB | 3.9 tok/s · 39.4 GB | 3.9 tok/s · 57.3 GB |
| **SSD + TurboQuant** | 11.4 tok/s · **22.0 GB** | 2.5 tok/s · **22.5 GB** | 1.6 tok/s · **22.3 GB** |
> Values shown as `generation speed · GPU memory allocated`
**Key takeaways:**
- 🚀 **Speed more than doubled**: The newer MLX backend modifications lifted raw `SSD Stream` inference speed from 4.5 to **10.8 tok/s** while maintaining streaming stability.
- 📄 **40K context on a 24 GB MacBook Pro**: SSD + TurboQuant fits the 26B model with 40K tokens of context in a **22.5 GB** memory footprint.
- 📚 **100K context on a 24 GB MacBook Pro**: With 3-bit KV compression paired with SSD weight streaming, you can process 100,000 tokens of context on a 24 GB machine while using only **22.3 GB** total. (This previously required a 64 GB Mac Studio.)
> Run `./run_benchmark.sh` to generate these metrics on your own device. (See **Benchmarks & Testing** below).
---
## 🚀 Features
- 🍎 **100% Native Apple Silicon**: Built directly on Metal and Swift.
- 🔌 **OpenAI-compatible**: Drop-in replacement for OpenAI SDKs (`/v1/chat/completions`, streaming, etc.); a client sketch follows this list.
- 🧠 **Smart Model Routing**: Loads HuggingFace format models directly, with native Safetensors parsing.
- 👁️ **Vision-Language Models (VLM)**: Multimodal vision processing runs natively on Metal via the `--vision` flag, with real-time base64 image parsing (e.g., Qwen2-VL, PaliGemma).
- 🎧 **Audio-Language Models (ALM)**: High-performance audio ingestion via the `--audio` flag, decoding OpenAI-spec `input_audio` payloads with AVFoundation WAV extraction.
- ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
- 💾 **SSD Expert Streaming (10x)**: High-performance NVMe streaming that loads Mixture of Experts (MoE) layers directly from SSD to GPU — engineered by [@ericjlake](https://github.com/ericjlake), achieving **10x speedup** (0.58 → 5.91 tok/s) on 122B+ models with only ~10 GB resident memory. Uses cross-projection batching, concurrent pread (QD=24), asyncEval pipeline, and runtime top-k expert selection.
- 🔮 **Speculative Decoding**: Load a small draft model (e.g. 9B) alongside a large main model to generate candidate tokens and verify in bulk — accelerating in-RAM inference.
- 🎛️ **Granular Memory Control**: Integrated Layer Partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.
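To show how drop-in the API is, here is a minimal Swift client sketch that streams a chat completion from a local instance. The port and model name are assumptions carried over from the examples elsewhere in this README; the SSE framing follows the standard OpenAI streaming format.

```swift
import Foundation

// Minimal streaming client for the OpenAI-compatible endpoint. Assumes a
// server started with `--port 5413`, as in the examples in this README.
// Call from an async context, e.g. `try await streamChat(prompt: "What is 2+2?")`.
func streamChat(prompt: String) async throws {
    var request = URLRequest(url: URL(string: "http://localhost:5413/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "gemma-4-26b-a4b-it-4bit",
        "stream": true,
        "messages": [["role": "user", "content": prompt]],
    ])

    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {
        // OpenAI-style SSE: each event line is "data: {json}" until "data: [DONE]".
        guard line.hasPrefix("data: "), line != "data: [DONE]" else { continue }
        if let chunk = try? JSONSerialization.jsonObject(with: Data(line.dropFirst(6).utf8)) as? [String: Any],
           let delta = ((chunk["choices"] as? [[String: Any]])?.first?["delta"]) as? [String: Any],
           let token = delta["content"] as? String {
            print(token, terminator: "")
        }
    }
}
```

Any OpenAI SDK pointed at `http://localhost:5413/v1` should behave the same way.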
---
## 📡 Supported Models & Methodologies
`SwiftLM` dynamically maps Apple MLX primitives to standard HuggingFace architectures, enabling native Metal inference across the latest frontier open-weights models.
### 💬 Text (LLMs)
| Family | Models | Notes |
|---|---|---|
| **Gemma 4** | `gemma-4-e2b`, `gemma-4-e4b` (dense) · `gemma-4-26b-a4b`, `gemma-4-31b` (MoE) | Interleaved local + global attention; KV sharing; native quantized KV cache (issue #71 fix) |
| **Gemma 3 / 3n** | `gemma-3-*`, `gemma-3n-*` | Google Gemma 3 and nano variants |
| **Gemma / Gemma 2** | `gemma-*`, `gemma-2-*` | Original Gemma family |
| **Qwen 3.5** | `Qwen3.5-7B`, `Qwen3.5-27B`, `Qwen3.5-122B-A10B`, `Qwen3.5-397B-A22B` | Dense + MoE; SSD streaming at 10× for 122B/397B |
| **Qwen 3** | `Qwen3-*` (dense + MoE) | Sliding window + hybrid attention |
| **Qwen 2.5** | `Qwen2.5-7B`, `Qwen2.5-14B`, `Qwen2.5-72B` | Robust RoPE scaling |
| **Qwen 2** | `Qwen2-*` | Linear RoPE variants |
| **Phi 4 / PhiMoE** | `phi-4-mlx`, `Phi-3.5-MoE` | Microsoft Phi family incl. MoE |
| **Phi 3 / Phi** | `Phi-3`, `Phi-3.5-mini` | 128k context via chunked prefill |
| **Mistral / Mixtral** | `Mistral-7B`, `Mistral-4`, `Mixtral-*` | GQA + sliding window variants |
| **Llama / Llama 3** | `Llama-3.1-*`, `Llama-3.2-*`, `Llama-3.3-*` | YaRN + dynamic NTK RoPE scaling |
| **GLM 4** | `GLM-4-*` | THUDM GLM-4 dense + MoE-Lite variants |
| **DeepSeek V3** | `DeepSeek-V3-*` | MLA attention architecture |
| **Falcon H1** | `Falcon-H1-*` | Falcon hybrid SSM+attention |
| **LFM 2** | `LFM2-*`, `LFM2-MoE-*` | Liquid AI dense + MoE |
| **OLMo 2 / OLMo 3 / OLMoE** | `OLMo-2-*`, `OLMo-3-*` | AllenAI open language models |
| **Granite / GraniteMoE** | `Granite-*`, `GraniteMoE-Hybrid-*` | IBM Granite hybrid Mamba+attention |
| **SmolLM 3** | `SmolLM3-*` | HuggingFace compact LM |
| **MiniCPM** | `MiniCPM-*` | Lightweight efficient LM |
| **InternLM 2** | `InternLM2-*` | Shanghai AI Lab series |
| **Cohere / Command-R** | `Command-R-*`, `c4ai-*` | Cohere retrieval-tuned models |
| **Jamba** | `Jamba-v0.1` | AI21 hybrid Mamba+attention |
| **Exaone 4** | `EXAONE-4.0-*` | LG AI Research |
| **MiMo / MiMo V2** | `MiMo-7B-*` | Xiaomi reasoning model |
| **Ernie 4.5** | `ERNIE-4.5-*` | Baidu ERNIE series |
| **Baichuan M1** | `Baichuan-M1-*` | Baichuan multimodal base |
| **Bailing MoE** | `Ling-*` | Bailing/Ling MoE family |
| **NemotronH** | `Nemotron-H-*` | NVIDIA Nemotron hybrid |
| **Starcoder 2** | `starcoder2-*` | Code generation |
| **OpenELM** | `OpenELM-*` | Apple on-device efficient LM |
| **Apertus / AfMoE** | `Apertus-*` | Sparse MoE research models |
| **BitNet** | `bitnet-*` | 1-bit weight quantization |
| **MiniMax** | `MiniMax-Text-*` | Lightning attention architecture |
### 👁️ Vision (VLMs)
*Run with `--vision` flag.*
| Family | Models | Notes |
|---|---|---|
| **Gemma 4** | `gemma-4-*` (VLM mode) | Native image tower via MLXVLM |
| **Gemma 3** | `gemma-3-*` (VLM mode) | PaLiGemma-style image projection |
| **Qwen3-VL / Qwen3.5-VL** | `Qwen3-VL-*`, `Qwen3.5-VL-*` | Dynamic resolution with native RoPE |
| **Qwen2-VL / Qwen2.5-VL** | `Qwen2-VL-2B/7B`, `Qwen2.5-VL-*` | Real-time positional bounding + Metal image scaling |
| **LFM2-VL** | `LFM2-VL-1.6B` | Liquid AI multimodal |
| **Pixtral** | `pixtral-12b` | Mistral vision model |
| **PaliGemma** | `paligemma-*` | Google vision-language |
| **Idefics 3** | `Idefics3-*` | HuggingFace multimodal |
| **Mistral 3** | `Mistral-Small-3.1-*` | Mistral vision variant |
| **FastVLM** | `FastVLM-*` | Apple on-device VLM |
| **SmolVLM 2** | `SmolVLM2-*` | HuggingFace compact VLM |
| **GLM OCR** | `glm-4v-*` | THUDM vision+OCR |
| **QwenVL** | `Qwen-VL-*` | Original Qwen VL |
### 🎧 Audio (ALMs)
*Run with `--audio` flag. Only `gemma-4-e4b` variants include an audio tower.*
| Family | Models | Notes |
|---|---|---|
| **Gemma 4 Omni** | `gemma-4-e4b-it-4bit`, `gemma-4-e4b-it-8bit` | Audio-in via vDSP STFT → Mel spectrogram (16kHz, 128 bins); text-out |
---
## 📱 SwiftBuddy — iOS App
A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.
### Features
- **Tab UI**: Chat · Models · Settings
- **Live download progress** with speed indicator and circular progress ring
- **Model catalog**: Qwen3, Phi-3.5, Mistral, Llama — with on-device RAM fit indicators
- **HuggingFace search** — find any `mlx-community` model by name
- **Context-aware empty states** — downloading ring, loading spinner, idle prompt
- **iOS lifecycle hardened** — model unload only fires on true background (not notification banners); 30-second grace period on app-switch (sketched below)
> 📱 **Running live on iPhone 13 Pro (6 GB)** — no Python, no server, no GIL. Pure on-device MLX inference via Metal GPU.
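A rough UIKit sketch of that lifecycle rule, for orientation: the notification choice and the 30-second window mirror the bullet above, while `unloadModel()` and the background-task plumbing are hypothetical placeholders rather than SwiftBuddy's actual code.

```swift
import UIKit

// Illustrative lifecycle guard: unload only on true backgrounding, after a
// 30 s grace period that survives quick app-switches.
final class ModelLifecycleGuard {
    private var graceTimer: Timer?
    private var bgTask: UIBackgroundTaskIdentifier = .invalid

    init() {
        let nc = NotificationCenter.default
        // didEnterBackground fires on a real app-switch or Home press,
        // NOT on notification banners (those only send willResignActive).
        nc.addObserver(forName: UIApplication.didEnterBackgroundNotification,
                       object: nil, queue: .main) { [weak self] _ in self?.scheduleUnload() }
        nc.addObserver(forName: UIApplication.willEnterForegroundNotification,
                       object: nil, queue: .main) { [weak self] _ in self?.cancelUnload() }
    }

    private func scheduleUnload() {
        // Keep the process alive long enough for the grace window to elapse.
        bgTask = UIApplication.shared.beginBackgroundTask { [weak self] in self?.cancelUnload() }
        graceTimer = Timer.scheduledTimer(withTimeInterval: 30, repeats: false) { [weak self] _ in
            unloadModel()
            self?.endBackgroundTask()
        }
    }

    private func cancelUnload() {
        graceTimer?.invalidate()
        graceTimer = nil
        endBackgroundTask()
    }

    private func endBackgroundTask() {
        guard bgTask != .invalid else { return }
        UIApplication.shared.endBackgroundTask(bgTask)
        bgTask = .invalid
    }
}

func unloadModel() { /* hypothetical: release MLX model buffers here */ }
```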
### Build & Run (iOS)
```bash
cd SwiftBuddy
python3 generate_xcodeproj.py # Generates SwiftBuddy.xcodeproj
open SwiftBuddy.xcodeproj
```
Then in Xcode:
1. Select the **SwiftBuddy** target → **Signing & Capabilities**
2. Set your **Team** (your Apple Developer account)
3. Select your iPhone as the run destination
4. ⌘R to build and run
> **Note for contributors**: The `.xcodeproj` is git-ignored (it contains your personal Team ID). Run `generate_xcodeproj.py` after cloning to regenerate it locally. Your Team ID is never committed.
---
## ⚡️ TurboQuantization: KV Cache Compression
`SwiftLM` implements a **hybrid V2+V3 TurboQuant architecture** for on-the-fly KV cache compression. At roughly 3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.
### Combining V2 Speed with V3 Quality
Recent reproductions of the TurboQuant algorithm (e.g., `turboquant-mlx`) revealed two distinct paths:
1. **V2 (Hardware-Accelerated)**: Fast, but uses linear affine quantization which degrades quality at 3-bit.
2. **V3 (Paper-Correct)**: Excellent quality using non-linear Lloyd-Max codebooks, but painfully slow due to software dequantization.
**We built the "Holy Grail" hybrid:** We ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and process the dequantization natively in fused Metal (`bggml-metal`) shaders. This achieves **V3 quality at V2 speeds**, completely detached from Python overhead.
### The Algorithm:
**K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim**
1. Extract L2 norm and normalize: `x̂ = x / ‖x‖`
2. Apply Fast Walsh-Hadamard Transform (WHT) rotation to distribute outliers evenly.
3. Quantize each coordinate using **3-bit non-linear Lloyd-Max centroids**.
4. Compute the residual error between the original vector and the quantized approximation.
5. Project the residual via a random Johnson-Lindenstrauss (QJL) matrix and store the 1-bit signs.
*(Why QJL? QJL acts as an additional regularizer that prevents centroid resolution loss from degrading the attention dot-product.)*
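The five steps map to a compact encode path. Here is a plain-Swift CPU sketch for illustration; the 8-level codebook values and the caller-supplied QJL matrix are placeholders, not the shipped Metal kernels.

```swift
import Foundation

// Hypothetical 3-bit (8-level) Lloyd-Max codebook; real centroid values
// depend on the trained distribution and are illustrative here.
let lloydMax3Bit: [Float] = [-1.75, -1.22, -0.76, -0.25, 0.25, 0.76, 1.22, 1.75]

// In-place fast Walsh-Hadamard transform; `x.count` must be a power of two.
func fwht(_ x: inout [Float]) {
    var h = 1
    while h < x.count {
        for i in stride(from: 0, to: x.count, by: h * 2) {
            for j in i..<(i + h) {
                let (a, b) = (x[j], x[j + h])
                x[j] = a + b
                x[j + h] = a - b
            }
        }
        h *= 2
    }
    let scale = 1 / Float(x.count).squareRoot()
    for i in x.indices { x[i] *= scale }
}

func encodeKey(_ key: [Float], qjl: [[Float]]) -> (norm: Float, codes: [UInt8], signs: [Bool]) {
    // 1. Extract the L2 norm and normalize.
    let norm = key.reduce(0) { $0 + $1 * $1 }.squareRoot()
    var x = key.map { $0 / norm }
    // 2. Rotate with the Walsh-Hadamard transform to spread outliers evenly.
    fwht(&x)
    // 3. Snap each coordinate to its nearest 3-bit Lloyd-Max centroid.
    let codes = x.map { v in
        UInt8(lloydMax3Bit.indices.min { abs(lloydMax3Bit[$0] - v) < abs(lloydMax3Bit[$1] - v) }!)
    }
    // 4. Residual between the rotated vector and its quantized approximation.
    let residual = zip(x, codes).map { (v, c) in v - lloydMax3Bit[Int(c)] }
    // 5. Project the residual through the random JL matrix; keep only the signs.
    let signs = qjl.map { row in zip(row, residual).reduce(0) { $0 + $1.0 * $1.1 } >= 0 }
    return (norm, codes, signs)
}
```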
**V-Cache (3-bit PolarQuant) = 3.125 bits/dim**
Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional 25% memory savings without sacrificing quality.
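For concreteness, one accounting that reproduces both headline figures, assuming head dimension $d = 128$, a $d$-bit QJL sign vector on the K side, and a 32-bit (K) / 16-bit (V) stored norm; the storage widths are our assumption, not a documented detail:

$$
\text{K-cache: } \frac{3d + d + 32}{d} = \frac{384 + 128 + 32}{128} = 4.25\ \text{bits/dim}
\qquad
\text{V-cache: } \frac{3d + 16}{d} = \frac{384 + 16}{128} = 3.125\ \text{bits/dim}
$$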
Reference implementations: [`turboquant-mlx`](https://github.com/sharpner/turboquant-mlx) | [`turboquant_plus`](https://github.com/TheTom/turboquant_plus) | Paper: [TurboQuant, Google 2504.19874](https://arxiv.org/abs/2504.19874)
---
## 💾 SSD Expert Streaming: 10x MoE Speedup
SwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by [Eric Lake](https://github.com/ericjlake)) that achieves 10x generation speedup for massive Mixture of Experts (MoE) models running on memory-constrained Apple Silicon. This enables running models like **Qwen3.5-122B** (69.6 GB) and **Qwen3.5-397B** (209 GB) on a **64 GB Mac** by streaming expert weights from NVMe SSD.
### Benchmark Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)
| Configuration | tok/s | vs. Original | Notes |
|---|---|---|---|
| Original `--stream-experts` | 0.58 | baseline | Sequential pread, 1 NVMe queue |
| **Rewrite (top-k=8, full quality)** | **4.95** | **8.5×** | All 8 experts evaluated |
| **Rewrite (top-k=6, default)** | **5.20** | **9.0×** | Recommended default |
| **Rewrite (top-k=4, speed mode)** | **5.91** | **10.2×** | Best quality/speed tradeoff |
| **Rewrite (top-k=2, turbo mode)** | **6.52** | **11.2×** | Still coherent output |
> Memory stable at **~10.6 GB resident**, no swap activity. Tested over 200-token generation runs.
### The Approach: Small Model Helps Large Model
A novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.
> **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore routed exclusively to **in-RAM models**.
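In schematic form, one round of the draft-and-verify loop looks roughly like this. `Model` is a hypothetical stand-in for the engine's real interfaces, and greedy acceptance is shown for simplicity:

```swift
// Hypothetical minimal interface: greedy next-token prediction per position.
protocol Model {
    func sampleNextToken(_ context: [Int]) -> Int
    /// For each input position i, the model's greedy prediction of token i+1.
    func greedyPredictions(for tokens: [Int]) -> [Int]
}

// One speculation round: the draft proposes, the main model verifies in bulk.
func speculativeStep(main: Model, draft: Model, prompt: [Int], numDraft: Int = 4) -> [Int] {
    // 1. The small in-RAM draft model proposes candidates cheaply.
    var context = prompt
    var candidates: [Int] = []
    for _ in 0..<numDraft {
        let t = draft.sampleNextToken(context)
        candidates.append(t)
        context.append(t)
    }
    // 2. One forward pass of the big model scores prompt + candidates at once.
    let preds = main.greedyPredictions(for: prompt + candidates)
    // 3. Accept the longest prefix the main model agrees with...
    var accepted: [Int] = []
    for (i, t) in candidates.enumerated() {
        guard preds[prompt.count + i - 1] == t else { break }
        accepted.append(t)
    }
    // 4. ...plus one "free" token from the main model at the divergence point.
    accepted.append(preds[prompt.count + accepted.count - 1])
    return accepted
}
```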
### Optimization Techniques
1. **Cross-Projection Batching**: Collapses ~1,400 per-expert `eval()` calls down to ~48 per token by orchestrating gate/up/down projections together in `SwitchGLU`.
2. **Concurrent NVMe pread (QD=24)**: Replaces sequential pread with `DispatchQueue.concurrentPerform`, saturating the NVMe controller's queue depth (8 experts × 3 projections = 24 parallel reads); see the sketch after this list.
3. **AsyncEval Pipeline with Speculative Pread**: Overlaps GPU compute with SSD I/O — uses previous-token routing to speculatively pre-load experts for the next token during the GPU async window (~70% hit rate). Only missed experts (~30%) require on-demand pread after routing sync.
4. **Persistent Metal Buffers**: Expert weight buffers are allocated once per `SwitchGLU` layer and reused across tokens, eliminating per-token allocation overhead.
5. **Runtime Top-K Expert Selection**: The `SWIFTLM_TOP_K` environment variable reduces the number of active experts per token at runtime without model recompilation — trading marginal quality for significant speed gains.
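A simplified sketch of technique 2 above, the concurrent positional reads. The file layout and the plain-`Data` destination are illustrative; the real pipeline reads into persistent Metal buffers (technique 4):

```swift
import Foundation

// Simplified model of concurrent expert reads: issue every read at once so
// the NVMe queue stays full (8 experts x 3 projections = 24 parallel preads).
struct ExpertRead {
    let offset: off_t   // byte offset of the expert slice in the weights file
    let length: Int     // bytes to read
}

func readExperts(fd: Int32, reads: [ExpertRead]) -> [Data] {
    var results = [Data](repeating: Data(), count: reads.count)
    let lock = NSLock() // serializes writes into `results`
    DispatchQueue.concurrentPerform(iterations: reads.count) { i in
        var buffer = Data(count: reads[i].length)
        buffer.withUnsafeMutableBytes { raw in
            // pread(2) is positional, so concurrent calls on one fd are safe.
            _ = pread(fd, raw.baseAddress, reads[i].length, reads[i].offset)
        }
        lock.lock(); results[i] = buffer; lock.unlock()
    }
    return results
}
```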
### Key Engineering Findings
| Finding | Detail |
|---|---|
| **GPU compute is the bottleneck** | At steady state, GPU compute is ~190ms of ~200ms per-token time. The OS page cache serves ~90% of expert reads from RAM. |
| **Don't cache experts in application memory** | An LRU expert cache *stole* from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it. |
| **MambaCache requires checkpoint rollback** | Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone. We implemented `checkpoint()`/`restore()` for speculative decoding on hybrid Attention+Mamba architectures (Qwen3.5). |
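The contrast in the last row is easy to see in miniature: an attention cache rolls back by arithmetic, while a recurrent state needs a snapshot. Types and fields here are illustrative, not the engine's actual classes.

```swift
// Attention KV cache: "trim" is just moving the logical length back.
final class KVCacheSketch {
    var offset = 0                      // tokens currently in the cache
    func trim(to length: Int) { offset = min(offset, length) }
}

// Mamba-style recurrent cache: the state folds in ALL history, so speculative
// decoding needs an explicit snapshot before verification.
final class MambaCacheSketch {
    var state: [Float]                  // SSM state, overwritten every token
    private var saved: [Float]?

    init(stateSize: Int) { state = [Float](repeating: 0, count: stateSize) }

    func checkpoint() { saved = state } // before running draft tokens
    func restore() {                    // draft rejected: roll back wholesale
        if let s = saved { state = s }
        saved = nil
    }
}
```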
### Usage
```bash
# Standard SSD streaming (recommended, top-k=6):
SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
--model /Qwen3.5-122B-A10B-4bit --stream-experts
# Speed mode (top-k=4):
SWIFTLM_TOP_K=4 SwiftLM --port 8002 \
--model /Qwen3.5-122B-A10B-4bit --stream-experts
# With speculative decoding (in-RAM models only):
SwiftLM --port 8002 \
--model /Qwen3.5-27B-4bit \
--draft-model /Qwen3.5-9B-4bit \
--num-draft-tokens 4
```
---
## 🔀 Why We Forked Apple MLX
To achieve the extreme memory efficiency and speeds seen in **SSD Expert Streaming** and **Speculative Decoding**, `SwiftLM` relies on custom C++ primitives that bypass standard unified memory limits.
> [!NOTE]
> We maintain custom forks (`SharpAI/mlx` and `SharpAI/mlx-c`) to support **out-of-core memory-mapped execution**, streaming tensor blocks directly from the SSD (NVMe) to the GPU via custom Metal kernels (`ssd_streamer.mm` and `fence.air`). Official `ml-explore` repositories do not yet support this out-of-the-box.
For a detailed breakdown of the repository architecture, upstream synchronization, our custom patches, and when we can safely revert to Apple's native upstream, read the full documentation:
👉 **[Upstream MLX Synchronization & SSD Streaming Maintenance](.agents/workflows/mlx-upstream-sync.md)**
---
## 💻 Benchmarks & Testing
Run our automated benchmark suites via the interactive script:
```bash
./run_benchmark.sh
```
The script provides an interactive menu to select any model and run one of two automated testing suites:
### Test 1: Automated Context & Memory Profile (TPS & RAM matrix)
Tests generation speed (TPS) and granular Apple Metal GPU memory allocation across extreme context lengths (e.g., `512, 40000, 100000` tokens).
- Iterates over 4 configurations: Vanilla, SSD Streaming, TurboQuant, and SSD + TurboQuant.
- Generates a rich ANSI console visualization with bar charts and a configuration scoreboard.
- Saves the complete results matrix to `docs/profiling/profiling_results_.md`.
### Test 2: Prompt Cache & Sliding Window Regression Test
Verifies the stability of the engine's KV prompt cache when interleaving long contexts with sliding window attention bounds.
- Automatically spins up an isolated background inference server instance.
- Generates a 5,000+ token mock JSON payload.
- Fires an extreme alternating sequence of 4 concurrent requests (`5537t` → `18t` → `5537t` → `Big Full Cache Hit`).
- Confirms the memory bounds remain stable, with no $O(N^2)$ memory blow-ups, OOM exceptions, or `SIGTRAP` errors.
### Throughput & Inference Memory Profile
Measured by generating exactly 20 tokens under a standard conversational evaluation (`--prefill-size 512`) to capture time to first token, generation speed (TPS), and the Apple Metal memory footprint:
| Model | Time To First Token (s) | Generation Speed (tok/s) | Peak GPU Memory (GB) |
|---|---|---|---|
| `gemma-4-e2b-it-4bit` | 0.08s | 116.27 tok/s | 1.37 GB |
| `gemma-4-e4b-it-8bit` | 0.33s | 48.21 tok/s | 7.64 GB |
| `gemma-4-26b-a4b-it-4bit` | 0.14s | 85.49 tok/s | 13.46 GB |
| `gemma-4-31b-it-4bit` | 0.55s | 14.82 tok/s | 16.83 GB |
To run the automated suite on your machine for these models, execute:
```bash
python3 tests/run_4models_benchmark.py
```
> **🧠 How it works:** SwiftLM implements **Chunked Prefill** (controlled via `--prefill-size`, defaulting to 512). This is functionally equivalent to `llama.cpp`'s `--batch-size` parameter and mirrors the [`mlx-lm` Python library](https://github.com/ml-explore/mlx/tree/main/mlx_lm)'s reference implementation approach to preventing $O(N^2)$ Unified Memory over-allocation during massive sequence parsing.
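The mechanism reduces to a short loop. A sketch under the assumption of a hypothetical `prefill` hook that appends processed tokens to the persistent KV cache:

```swift
// Hypothetical engine hook: processes tokens and appends their keys/values
// to the persistent KV cache.
protocol PrefillEngine {
    func prefill(_ tokens: ArraySlice<Int>)
}

// Feed the prompt in fixed-size slices so peak attention memory tracks the
// chunk size, not the full prompt length.
func chunkedPrefill(engine: PrefillEngine, prompt: [Int], chunkSize: Int = 512) {
    var start = 0
    while start < prompt.count {
        let end = min(start + chunkSize, prompt.count)
        engine.prefill(prompt[start..<end]) // attends over cache + at most `chunkSize` new tokens
        start = end
    }
}
```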
> **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
---
## 📡 API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Server health + loaded model capabilities |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |
## 💻 Usage Examples
### Chat Completion (Streaming)
Drop-in compatible with standard OpenAI HTTP consumers:
```bash
curl http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26b-a4b-it-4bit",
"stream": true,
"messages": [
{"role": "system", "content": "You are Aegis-AI, a local home security agent. Output strictly in JSON format."},
{"role": "user", "content": "Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string."}
]
}'
```
---
### Vision-Language Models (VLM)
To run a vision model (e.g., `mlx-community/Qwen2-VL-2B-Instruct-4bit`), launch SwiftLM with the `--vision` flag:
```bash
./.build/release/SwiftLM --model mlx-community/Qwen2-VL-2B-Instruct-4bit --vision
```
You can then pass standard OpenAI base64-encoded images directly; SwiftLM handles image scaling and spatial mapping natively on Metal:
```bash
curl http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2-vl",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the contents of this image."},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."}}
]
}
]
}'
```
---
## ⚙️ CLI Options
| Option | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or local path |
| `--port` | `5413` | Port to listen on |
| `--host` | `127.0.0.1` | Host to bind |
| `--vision` | `false` | Enable VLM (vision-language model) mode for image inputs |
| `--audio` | `false` | Enable ALM (audio-language model) mode for audio inputs |
| `--max-tokens` | `2048` | Maximum tokens per generation |
| `--prefill-size`| `512` | Prompt prefill chunk size (micro-batching for long contexts) |
| `--top-p` | `1.0` | Default top-p nucleus sampling (overridable per-request) |
| `--top-k` | `50` | Default top-k sampling (0 disables, overridable per-request) |
| `--min-p` | `0.0` | Default min-p sampling threshold relative to the highest probability token (0 disables) |
| `--gpu-layers` | `model_default`| Restrict the number of layers allocated to the GPU |
| `--stream-experts` | `false` | Enable SSD expert streaming for MoE models (10x speedup) |
| `--turbo-kv` | `false` | Enable TurboQuant 3-bit KV cache compression (activates after 2048 tokens, server-wide) |
| `--draft-model` | (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
| `--num-draft-tokens` | `4` | Number of draft tokens per speculation round |
## 🔧 Per-Request API Parameters
In addition to the standard OpenAI fields (`temperature`, `top_p`, `max_tokens`, etc.), SwiftLM accepts the following **SwiftLM-specific** fields on `POST /v1/chat/completions`:
| Field | Type | Description |
|---|---|---|
| `kv_bits` | `int` (4 or 8) | Enable **MLX-native quantized KV cache** for this request. Uses `QuantizedKVCache` (standard group quantization) instead of `KVCacheSimple`. Separate from `--turbo-kv`. Reduces KV memory ~2–4× at mild quality cost. |
| `enable_thinking` | `bool` | Force-enable or disable chain-of-thought thinking blocks for Gemma-4 / Qwen3. |
| `kv_group_size` | `int` | Group size for `kv_bits` quantization (default: `64`). |
| `top_k` | `int` | Per-request top-k sampling override (0 = disabled). |
| `min_p` | `float` | Per-request min-p sampling threshold (0 = disabled). |
| `repetition_penalty` | `float` | Token repetition penalty (e.g. `1.15`). |
### `kv_bits` vs `--turbo-kv` — What's the difference?
| | `kv_bits` (per-request) | `--turbo-kv` (server flag) |
|---|---|---|
| **Scope** | Per-request, sent in JSON body | Server-wide, set at startup |
| **Algorithm** | MLX-native group quantization (4-bit / 8-bit) | Custom 3-bit PolarQuant + QJL Walsh-Hadamard |
| **Activation** | From token 0 | After 2048 tokens |
| **Memory savings** | ~2–4× vs FP16 | ~3.5× vs FP16 |
| **Use case** | Targeted memory reduction per conversation | Extreme long-context (100K+) compression |
### Example: Enable 4-bit KV cache per request
```bash
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26b-a4b-it-4bit",
"kv_bits": 4,
"messages": [
{"role": "user", "content": "Summarize the history of computing in 3 sentences."}
]
}'
```
## 📦 Requirements
- macOS 14.0+
- Apple Silicon (M1/M2/M3/M4/M5)
- Xcode Command Line Tools
- Metal Toolchain (`xcodebuild -downloadComponent MetalToolchain`)
## 📖 The "Aha!" Moment
**The "2+2=4" Aha Moment**: During development, we encountered a severe "silent failure" where the model would successfully load and evaluate all 32 layers at high speed, but generate nothing but infinite whitespace. The model logits showed the correct *shape* but the wrong *magnitudes*.
The breakthrough arrived when we realized the **embedding scale** was missing. The Gemma architecture requires scaling embedding outputs by `sqrt(hidden_size)`. For a hidden size of 2816, missing this meant every activation in the network was ~53x too small! By adding one single math operation:
`h = h * MLXArray(Float(config.hiddenSize).squareRoot())`
The model instantly woke up from "whispering" whitespace and successfully responded to `"What is 2+2?"` with a perfect `"2 + 2 equals 4."` — proving that the entire massive structural pipeline from Swift to Metal was working.
## 🙏 Acknowledgments & Credits
[](https://github.com/raullenchai/awesome-mlx)
`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects and contributors for their indispensable reference materials and underlying protocols:
### Contributors
- **[Eric Lake](https://github.com/ericjlake)** — Engineered the **SSD Expert Streaming 10x rewrite** ([PR #26](https://github.com/SharpAI/SwiftLM/pull/26)), achieving 10× generation speedup on 122B+ MoE models via cross-projection batching, concurrent NVMe pread (QD=24), asyncEval pipeline with speculative pread, and runtime top-k expert selection. Also implemented the **speculative decoding infrastructure** with `DraftModelRef`, dual-model loading, and **MambaCache checkpoint/restore** for hybrid Attention+Mamba architectures.
### Projects & References
- **[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
- **[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
- **[flash-moe](https://github.com/danveloper/flash-moe)** — Inspired the memory-mapped out-of-core SSD Expert Streaming mechanics that we implemented natively in SwiftLM.
- **[Hummingbird](https://github.com/hummingbird-project/hummingbird)** — The incredible event-driven Swift HTTP engine powering the OpenAI-compatible REST API.
- **[TurboQuant Paper](https://arxiv.org/abs/2504.19874)** — *"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate"* (Zandieh et al., AISTATS 2026). Provided the initial algorithmic framework for the dual-stage PolarQuant + QJL engine.
- **[TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)** — Served as an invaluable reference architecture for the C and GPU quantization tables, guiding the development of our native `turbo-wht` Walsh-Hadamard kernels and custom Metal wrapper layers.
- **[TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)** — Essential Python validation logic used to certify the correctness of our manually constructed Lloyd-Max codebook generation math.
- **[amirzandieh/QJL](https://github.com/amirzandieh/QJL)** — The original 1-bit residual correction engine backing the paper, which informed our QJL error recovery in dot-product regimes.
---
**License**: MIT
