{"id":44841667,"url":"https://github.com/ddalcu/mlx-serve","last_synced_at":"2026-06-29T02:00:34.925Z","repository":{"id":338906993,"uuid":"1159656889","full_name":"ddalcu/mlx-serve","owner":"ddalcu","description":"Native LLM inference server for Apple Silicon. OpenAI + Anthropic API compatible. No Python. Includes MLX Core macOS app with chat, agent mode, and tool calling.","archived":false,"fork":false,"pushed_at":"2026-06-29T00:24:17.000Z","size":13864,"stargazers_count":192,"open_issues_count":1,"forks_count":9,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-06-29T01:20:03.551Z","etag":null,"topics":["agent","anthropic-api","apple-silicon","claude-code","deepseek-v4","diffusion","gguf","image-generation","inference","llm","local-llm","macos","macos-app","mlx","openai-api","tool-calling","video-generation","voice-agent","voice-cloning","zig"],"latest_commit_sha":null,"homepage":"http://mlxserve.com/","language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ddalcu.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"buy_me_a_coffee":"ddalcu"}},"created_at":"2026-02-17T01:55:20.000Z","updated_at":"2026-06-29T00:24:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"7a5229b7-be98-4443-9d17-a4e72fa19c38","html_url":"https://github.com/ddalcu/mlx-serve","commit_stats":null,"previous_names":["ddalcu/mlx-serve"],"tags_count":47,"template":false,"template_full_n
ame":null,"purl":"pkg:github/ddalcu/mlx-serve","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddalcu%2Fmlx-serve","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddalcu%2Fmlx-serve/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddalcu%2Fmlx-serve/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddalcu%2Fmlx-serve/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ddalcu","download_url":"https://codeload.github.com/ddalcu/mlx-serve/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddalcu%2Fmlx-serve/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34910177,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","anthropic-api","apple-silicon","claude-code","deepseek-v4","diffusion","gguf","image-generation","inference","llm","local-llm","macos","macos-app","mlx","openai-api","tool-calling","video-generation","voice-agent","voice-cloning","zig"],"created_at":"2026-02-17T03:43:49.868Z","updated_at":"2026-06-29T02:00:34.914Z","avatar_url":"https://github.com/ddalcu.png","language":"Zig","funding_links":["https://buymeacoffee.com/ddalcu"],"c
ategories":["Data \u0026 Science"],"sub_categories":["Large Language Model"],"readme":"# mlx-serve — run any LLM on your Mac\n\n**OpenAI- and Anthropic-compatible local inference for Apple Silicon — MLX *and* GGUF — faster than LM Studio on the same file. No Python. No cloud. No Electron.**\n\n[![Release](https://img.shields.io/github/v/release/ddalcu/mlx-serve?style=flat-square\u0026color=0071e3)](https://github.com/ddalcu/mlx-serve/releases/latest)\n[![Stars](https://img.shields.io/github/stars/ddalcu/mlx-serve?style=flat-square\u0026color=f7a41d)](https://github.com/ddalcu/mlx-serve/stargazers)\n[![Downloads](https://img.shields.io/github/downloads/ddalcu/mlx-serve/total?style=flat-square\u0026color=30d158)](https://github.com/ddalcu/mlx-serve/releases)\n[![Last commit](https://img.shields.io/github/last-commit/ddalcu/mlx-serve?style=flat-square)](https://github.com/ddalcu/mlx-serve/commits/main)\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue?style=flat-square)](LICENSE)\n[![macOS](https://img.shields.io/badge/macOS-Apple%20Silicon-black?style=flat-square\u0026logo=apple)](https://github.com/ddalcu/mlx-serve/releases/latest)\n[![Zig](https://img.shields.io/badge/built%20with-Zig-f7a41d?style=flat-square\u0026logo=zig)](https://ziglang.org)\n\n**[ddalcu.github.io/mlx-serve](https://ddalcu.github.io/mlx-serve/)** · [Download MLX Core.app](https://github.com/ddalcu/mlx-serve/releases/latest) · [Changelog](CHANGELOG.md)\n\n\u003e ★ **If mlx-serve saves you from spinning up another Electron app, [star the repo](https://github.com/ddalcu/mlx-serve/stargazers) — it genuinely helps people find this.**\n\nmlx-serve is a native Zig server that runs **any LLM on Apple Silicon** — MLX-format models *and* every GGUF on HuggingFace (Qwen, Llama, Mistral, Gemma, DeepSeek V4 Flash, thousands more). 
It exposes **OpenAI-compatible** *and* **Anthropic-compatible** HTTP APIs out of the box, so the same `http://localhost:11234` works with Claude Code, the OpenAI SDK, Continue, Cursor, Open WebUI, and anything else that speaks one of those wires. Ships with **MLX Core**, a macOS menu-bar app with chat, agent mode, MCP tool calling, and model management.\n\n![MLX Core](docs/demo-diffusion.gif)\n\n[\u003cimg src=\"docs/appiconb.png\" width=\"48\" align=\"center\"\u003e](https://github.com/ddalcu/mlx-serve/releases/latest) **[Download MLX Core.app](https://github.com/ddalcu/mlx-serve/releases/latest)** — latest release for macOS (Apple Silicon)\n\n### Install via Homebrew\n\n```bash\nbrew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve\nbrew install --cask mlx-core   # GUI menu bar app\nbrew install mlx-serve          # CLI server only\n```\n\n## Why mlx-serve\n\nIf you're already on LM Studio, Ollama, or `mlx-lm` and wondering whether to switch — here's the short version, head-to-head:\n\n| | mlx-serve | LM Studio | Ollama | mlx-lm |\n|---|:---:|:---:|:---:|:---:|\n| MLX models (native Apple) | ✅ | ✅ | ❌ | ✅ |\n| GGUF models (llama.cpp) | ✅ **embedded** | ✅ | ✅ | ❌ |\n| OpenAI-compatible API | ✅ | ✅ | partial | ❌ |\n| Anthropic Messages API | ✅ | ❌ | ❌ | ❌ |\n| OpenAI Responses API + WebSockets | ✅ | ❌ | ❌ | ❌ |\n| DeepSeek V4 Flash (284B) | ✅ via ds4 | ❌ | ❌ | ❌ |\n| Speculative decoding (PLD + drafter) | ✅ | ❌ | partial | drafter only |\n| Decode speed (geomean vs LM Studio, identical weights) | **+35%** (MLX) | baseline | ~−15% (GGUF, est.¹) | +11% (MLX) |\n| KV-cache quantization (4/8-bit + TurboQuant) | ✅ | ❌ | partial | ✅ |\n| Continuous batching | ✅ | ❌ | ✅ | ❌ |\n| Built-in agent loop + MCP client | ✅ 10 tools | ❌ | ❌ | ❌ |\n| One-click launchers (Claude Code, OpenCode, Pi) | ✅ | ❌ | ❌ | ❌ |\n| Python required at runtime | ❌ | ❌ | ❌ | ✅ |\n| Native menu-bar app (no Electron) | ✅ | ❌ Electron | ❌ | ❌ |\n| **Image Generation** | ✅ | ❌ | ❌ | ❌ |\n| 
**Video Generation** | ✅ | ❌ | ❌ | ❌ |\n| **Audio Generation** | ✅ | ❌ | ❌ | ❌ |\n| License | MIT | proprietary | MIT | MIT |\n\n¹ Ollama can't run MLX, so the comparison is GGUF-vs-GGUF. \n\n### Benchmarks (Apple M4, 16 GB · identical weights · ctx=4096 · temp=0)\n\n**Same `.gguf` file, both engines:** mlx-serve's embedded llama.cpp beats LM Studio's wrapper on `gemma-4-E4B-it-Q4_K_M.gguf`:\n\n| Workload | LM Studio (GGUF) | mlx-serve (GGUF) | Δ |\n|---|---:|---:|---:|\n| Free-form decode | 24.6 tok/s | **28.2 tok/s** | **+15%** |\n| Echo | 22.3 | **25.1** | **+13%** |\n| Code completion | 23.0 | **25.7** | **+12%** |\n| Prefill | 349 | **367** | **+5%** |\n\n**Same 4-bit MLX weights**, plus mlx-serve's optional speculative-decode wins:\n\n| Model | Workload | LM Studio | mlx-serve | mlx-serve + PLD | mlx-serve + Drafter |\n|---|---|---:|---:|---:|---:|\n| Gemma 4 E2B | Echo | 125 tok/s | 164 (**+31%**) | **269 (+115%)** | 192 (+54%) |\n| Gemma 4 E4B | Code | 89.2 | 101 (+13%) | 100 | **131 (+47%)** |\n| Gemma 4 26B-A4B MoE | Echo | 72.6 | 91.1 (+25%) | **125 (+72%)** | — |\n| Qwen 3.6 35B-A3B MoE | Echo | 83.0 | 101 (+22%) | **140 (+69%)** | — |\n\nAcross 18 cells (best mlx-serve vs best LM Studio, geomean): **+35%**. Reproduce with [`tests/bench.sh --family gemma --lmstudio --omlx`](tests/bench.sh).\n\n![mlx-serve vs LM Studio — Gemma 4 (M4 Max)](docs/perf-vs-lmstudio-gemma-26.5.6.png)\n![mlx-serve GGUF vs LM Studio GGUF — same file, Apple M4](docs/perf-vs-lmstudio-omlx-gemma-20260526-121327.png)\n\n## Features\n\n- **Run any LLM** — every supported MLX architecture *and* the entire GGUF universe via embedded llama.cpp. 
DeepSeek V4 Flash runs through the dedicated [antirez/ds4](https://github.com/antirez/ds4) engine.\n- **OpenAI-compatible API** — `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`, streaming SSE, tools, JSON-schema constrained decoding, logprobs.\n- **OpenAI Responses API** — `/v1/responses` with `previous_response_id` chains, per-event `sequence_number`, the `/v1/responses/compact` opaque history blob, and a WebSocket transport on the same endpoint.\n- **Anthropic Messages API** — `/v1/messages` works with Claude Code (`ANTHROPIC_BASE_URL=http://localhost:11234`) and the Anthropic SDK.\n- **Speculative decoding** — PLD (model-agnostic n-gram lookup, on by default) + the Gemma 4 cross-attention drafter. Adaptive prompt-time and runtime gates keep novel-content workloads at parity; agentic code loops see up to 1.6×.\n- **KV-cache quantization** — 4-bit / 8-bit / TurboQuant variants shrink KV memory ~4× / ~2× / further still, so 16K contexts fit on hardware that couldn't hold them dense.\n- **Continuous batching** — `--max-concurrent N` batches decode requests through one forward pass for ~1.6× throughput at 4-way parallel.\n- **Prefix cache** — shared system-prompt KV reuse across turns and across conversations. v26.5.7 adds an LRU of llama.cpp KV sessions so multi-doc agent loops stay warm.\n- **Tokenize cache** — chat-template render + tokenize cached per request; the second hit on a long conversation is a memcpy. Warm TTFT 7.7× faster on 1.8K-token prompts.\n- **Vision** — Gemma 4 SigLIP encoder; send images via `image_url` content blocks.\n- **Reasoning / thinking** — full streaming of thinking tokens as `reasoning_content`.\n- **No Python** — single Zig binary, no `pip`, no venv. 
The MLX Core app ships everything signed and notarized.\n\n## MLX Core (macOS App)\n\nMenu-bar app that wraps the server with a full UI:\n\n- **Model browser** — download from HuggingFace with resumable transfers, auto-discovers LM Studio's existing model folder (`~/.lmstudio/settings.json`) so you don't re-download what's on disk, GGUF rows show a min–max RAM-estimate range.\n- **Chat interface** — multi-session chat with markdown rendering. Drop in PDFs (PDFKit-extracted) or images alongside text.\n- **Agent mode** — 10 built-in tools (shell, cwd, readFile, writeFile, editFile, searchFiles, listFiles, browse, webSearch, saveMemory) with automatic tool calling loop and a per-tool approval dialog (**Allow** / **Deny** / **Always allow this session**).\n- **MCP client** — curated marketplace of stdio + HTTP MCP servers (GitHub, Azure DevOps, DBHub, Docker, Kubernetes, Playwright, Slack, Notion, Filesystem, Shell) plus your own from `~/.mlx-serve/mcp.json`.\n- **Editable system prompt + persistent memory** — `~/.mlx-serve/system-prompt.md` and `~/.mlx-serve/memory.md`.\n- **Prompt-based skills** — drop `.md` files into `~/.mlx-serve/skills/` with YAML frontmatter to teach the agent custom capabilities triggered by keywords.\n- **Engine-aware Settings window** (Cmd+,) — every server-launch flag and per-request default, with sections that show only the knobs relevant to the engine you've loaded (MLX vs GGUF vs ds4).\n- **Server management** — start/stop, live log buffer, restart-on-flag-change banner.\n- **Image / Video Generation** — Krea-2, FLUX.2 and LTX-Video 2.3 native via mlx-serve zig server.\n\n### Image / Video Generation \n\nThe tray has **ImageGen**, **VideoGen** and **AudioGen** buttons that run [FLUX.2](https://huggingface.co/black-forest-labs), [LTX-Video 2.3](https://github.com/dgrauet/ltx-2-mlx) and [Qwen3-TTS](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) through our zig server. All three run natively on MLX. 
\n\nLaunch MLX Core, click the ImageGen, VideoGen or AudioGen tray icon, and hit **Download**. Each panel remembers your last-used model, quality, resolution, steps and seed between sessions, so you don't re-pick them every time.\n\nYou can also **generate images straight from chat**: in Agent mode, ask for an image and it renders inline in the conversation using your saved Image settings — double-click any chat image to open it full-size in Preview. (Audio and video generation live in their tray windows for now.)\n\n**Models:**\n\n| Feature | Default | Other options | Approx. RAM |\n|---|---|---|---|\n| Image | FLUX.2-klein 4B 4-bit (mflux, ~5 GB pre-quantized) | Krea-2-Turbo-MLX-Serve-mixed-4-8 | 8 / 12 / 16 GB |\n| Video | LTX-Video 2.3 Q4 | — | 24 GB RAM, ~50 GB first-run download (LTX 41 GB + Gemma 8 GB) |\n| Audio | Qwen3-TTS 1.7b | — | 8 GB RAM, ~3.5 GB first-run download |\n\n\u003e The 41 GB LTX snapshot ships **both** transformer variants (1-stage distilled + 2-stage dev, ~11 GB each) plus a 7.6 GB distillation LoRA, so you can switch between Fast/Good/Quality/Super offline without re-downloading.\n\nOutputs go to `~/.mlx-serve/generations/images/YYYY-MM-DD/` and `.../videos/YYYY-MM-DD/`.\n\n\u003e The app won't let you start a generation if there isn't enough free RAM. 
If the mlx-serve server is running and competing for memory, you'll be prompted to stop it first.\n\n## Supported Models\n\n| Architecture | `model_type` | Examples | Chat Format | Vision |\n|---|---|---|---|---|\n| **Gemma 4** | `gemma4` | `gemma-4-e2b-it-4bit`, `gemma-4-e4b-it-8bit`, `gemma-4-26b-a4b-it-4bit` | Gemma turns | SigLIP |\n| **Gemma 3** | `gemma3` | `gemma-3-12b-it-qat-4bit` | Gemma turns | -- |\n| **Qwen 3 / 3.5 / 3.6** | `qwen3`, `qwen3_5`, `qwen3_5_moe`, `qwen3_next` | `Qwen3-4B`, `Qwen3.5-4B`, `Qwen3.6-35B-A3B` | ChatML | -- |\n| **Nemotron-H** | `nemotron_h` | Nemotron-3-Nano-4B | ChatML | -- |\n| **LFM2** | `lfm2` | LFM2.5-350M | ChatML | -- |\n| **Llama** | `llama` | Llama 3, Llama 3.1, Llama 3.2 | Llama-3 | -- |\n| **Mistral** | `mistral` | Mistral 7B | ChatML | -- |\n| **DeepSeek V4 Flash** | `deepseek_v4` (GGUF) | DeepSeek-V4-Flash | DSV4 | -- |\n| **Anything else as GGUF** | via embedded llama.cpp | any `.gguf` on HuggingFace | per-template | -- |\n\nAny quantized MLX model using one of the above architectures works natively. Anything else can be served as GGUF through the embedded llama.cpp engine — just pick the `.gguf` file in the Model Browser and the server auto-routes by format. 
Models with unsupported architectures are flagged in the Model Browser but can still be downloaded.\n\n## Prerequisites\n\n- macOS 26+ with Apple Silicon (M1/M2/M3/M4) — the released app bundles MLX dylibs built for macOS 26; older macOS needs a from-source build against a local mlx\n- [Zig 0.16+](https://ziglang.org/download/) *(only if building from source)*\n- mlx-c and libwebp *(only if building from source)*:\n\n```bash\nbrew install mlx-c webp\n```\n\n## Quick Start\n\n### Download a model\n\nThe MLX Core app can download models directly, or use the CLI:\n\n```bash\npip install huggingface-hub\nhuggingface-cli download mlx-community/gemma-4-e4b-it-4bit --local-dir ~/.mlx-serve/models/gemma-4-e4b-it-4bit\n```\n\n### Build and run\n\n```bash\n./scripts/fetch-llama.sh  # only once\nzig build -Doptimize=ReleaseFast\n./zig-out/bin/mlx-serve --model ~/.mlx-serve/models/gemma-4-e4b-it-4bit --serve --port 8080\n```\n\n### Build the app\n\n```bash\n./scripts/fetch-llama.sh  # only once\ncd app \u0026\u0026 SKIP_NOTARIZE=1 bash build.sh\nopen \"MLX Core.app\"\n```\n\nRequires `APPLE_DEVELOPER_ID` and `APPLE_TEAM_ID` environment variables for code signing.\n\n## Usage\n\n### Interactive mode\n\n```bash\n./zig-out/bin/mlx-serve --model /path/to/model --prompt \"What is 2+2?\"\n```\n\n### HTTP server\n\n```bash\n./zig-out/bin/mlx-serve --model /path/to/model --serve --port 8080\n```\n\n### Run any GGUF\n\n```bash\n./zig-out/bin/mlx-serve --model ~/models/Qwen3.5-4B-Q4_K_M.gguf --serve --port 8080\n# Same flags work — server auto-detects GGUF and routes to embedded llama.cpp\n```\n\n### CLI options\n\n| Flag | Default | Description |\n|---|---|---|\n| `--model PATH` | required | Path to the model directory or a `.gguf` file |\n| `--serve` | off | Start the HTTP server |\n| `--host ADDR` | `127.0.0.1` | Host address to bind |\n| `--port N` | `11234` | Port for the HTTP server |\n| `--prompt TEXT` | `\"Hello\"` | Prompt for interactive mode |\n| `--max-tokens N` | `100` | 
Maximum tokens to generate |\n| `--temp F` | `0.0` | Sampling temperature (0 = greedy) |\n| `--ctx-size N` | auto | Context window size (auto = computed from GPU memory) |\n| `--timeout N` | `300` | Request timeout in seconds |\n| `--reasoning-budget N` | `-1` | Thinking token budget (`-1` = unlimited, `0` = no thinking) |\n| `--no-vision` | off | Disable vision encoder even if model supports it |\n| `--pld` / `--no-pld` | on | Prompt Lookup Decoding (model-agnostic spec-decode) |\n| `--pld-draft-len N` | `5` | Max draft tokens per PLD step |\n| `--pld-key-len N` | `3` | N-gram match key length for PLD |\n| `--drafter DIR` | none | Gemma 4 assistant drafter checkpoint (e.g. `gemma-4-E4B-it-assistant-bf16`) |\n| `--draft-block-size N` | `4` | Drafts per round for the Gemma 4 drafter |\n| `--kv-quant {off,4,8,turbo2,turbo4}` | off | KV-cache quantization scheme (MLX path) |\n| `--llama-kv-quant {off,q8,q4}` | off | KV-cache quantization for GGUF (llama.cpp path) |\n| `--llama-cache-entries N` | `1` | Multi-session LRU for llama.cpp (warm multi-doc agents) |\n| `--tokenize-cache-entries N` | `4` | Chat-template + tokenize cache size |\n| `--max-concurrent N` | `1` | Continuous-batch decode parallelism |\n| `--prefix-cache-entries N` | auto | Shared-prefix KV cache entry cap |\n| `--prefix-cache-mem N{KB,MB,GB}` | `2 GB` | Shared-prefix KV cache memory cap |\n| `--model-dir PATH` | none | Discover and serve every model in a folder (LRU resident set) |\n| `--log-level` | `info` | Log level (error, warn, info, debug) |\n\n## API\n\n### POST /v1/chat/completions\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"messages\": [{\"role\": \"user\", \"content\": \"Write a haiku about programming.\"}],\n    \"max_tokens\": 256,\n    \"stream\": true\n  }'\n```\n\nSupports `messages`, `max_tokens`, `temperature`, `top_p`, `top_k`, `stream`, `tools`, `repetition_penalty`, `presence_penalty`, `logprobs`, plus 
a per-request `kv_quant` override. Messages can include `image_url` content blocks (base64 or URL) for vision-capable models.\n\n### POST /v1/messages (Anthropic)\n\n```bash\ncurl http://localhost:8080/v1/messages \\\n  -H \"Content-Type: application/json\" \\\n  -H \"anthropic-version: 2023-06-01\" \\\n  -d '{\n    \"model\": \"mlx-serve\",\n    \"max_tokens\": 256,\n    \"messages\": [{\"role\": \"user\", \"content\": \"Write a haiku about programming.\"}]\n  }'\n```\n\nCompatible with Claude Code (`ANTHROPIC_BASE_URL=http://localhost:8080 claude`) and Anthropic SDKs. Supports streaming, tool calling, and extended thinking.\n\n### POST /v1/responses (OpenAI Responses API)\n\n```bash\ncurl http://localhost:8080/v1/responses \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-serve\",\n    \"input\": \"Write a haiku about programming.\",\n    \"stream\": true\n  }'\n```\n\nStateful chains via `previous_response_id`, full streaming SSE with per-event `sequence_number`, schema-conformant envelope with `tools` / `tool_choice` / `text` / `reasoning` / `usage` echo. `POST /v1/responses/compact` returns an opaque base64 history blob that round-trips back as a `compaction` input item without any LLM call. 
Same endpoint also accepts an `Upgrade: websocket` handshake — each text frame is a `response.create` JSON message, and each SSE event becomes one outbound text frame.\n\n### Other endpoints\n\n- `GET /health` — health check\n- `GET /v1/models` — list loaded models with capabilities + engine info\n- `POST /v1/completions` — text completions\n- `POST /v1/embeddings` — text embeddings (BERT and encoder-only models)\n- `GET /v1/responses/{id}`, `DELETE /v1/responses/{id}` — fetch / delete stored responses\n\n## Performance\n\nBenchmarked on Apple M4 (16 GB unified memory):\n\n| Model | Prefill | Decode | Memory |\n|---|---|---|---|\n| Gemma-4 E4B (4-bit) | ~390 tok/s | ~33 tok/s | 4.3 GB |\n| Qwen3.5-4B (4-bit) | ~380 tok/s | ~38 tok/s | 2.3 GB |\n| LFM2.5-350M (8-bit) | ~3800 tok/s | ~210 tok/s | 0.4 GB |\n| Nemotron-3-Nano-4B (8-bit) | -- | ~22 tok/s | 4.3 GB |\n\nMatches mlx-lm (Python) generation speed while using less memory and starting 3× faster. Key optimizations: fully-lazy async pipeline with reordered eval (submit-first pattern), JIT-compiled activations (GELU, GeGLU, softcap via `mlx_compile`), GPU memory wiring, chat-template + tokenize caching, and a per-engine prefix cache.\n\n\u003cdetails\u003e\n\u003csummary\u003eBenchmark reproduction\u003c/summary\u003e\n\n```bash\n# Prefill (~840 token prompt):\n./zig-out/bin/mlx-serve --model ~/.mlx-serve/models/gemma-4-e4b-it-4bit \\\n  --prompt \"$(python3 -c \"print('Explain the following topics in extreme detail: ' + ', '.join([f'topic {i} about science and technology and its impact on human civilization throughout history' for i in range(1,50)]))\")\" \\\n  --max-tokens 1\n\n# Decode (256 tokens, temp=0):\n./zig-out/bin/mlx-serve --model ~/.mlx-serve/models/gemma-4-e4b-it-4bit \\\n  --prompt \"Write a detailed essay about quantum computing\" \\\n  --max-tokens 256\n```\n\nRun 3 times and take the average of runs 2-3 (run 1 includes model loading from disk).\n\u003c/details\u003e\n\n## Speculative 
Decoding\n\nTwo flavors, both greedy-equivalent (byte-identical at temp=0 within the first 30 tokens; mathematically exact at temp \u003e 0 via the Leviathan probability-ratio sampler):\n\n- **PLD** (Prompt Lookup Decoding) — model-agnostic n-gram match in `prompt + generated_tokens`. Default-on (`--pld`); zero per-model setup. Wins on agentic loops, RAG, code editing, anywhere the answer echoes prompt content.\n- **Gemma 4 assistant drafter** — Google's small 4-layer cross-attention drafters (`gemma-4-{E2B,E4B,26B-A4B,31B}-it-assistant-bf16`). Opt-in via `--drafter \u003cdir\u003e`. The drafter cross-attends into the target's KV cache — no separate weights duplicated.\n\nBoth share an **adaptive prompt-time gate**: a 3-gram repetition score on the prompt (`spec_gate_threshold = 0.01`) auto-disables speculation on novel content, so creative writing and one-shot Q\u0026A run at parity with `--no-pld` instead of paying per-step verify overhead. A **runtime acceptance gate** further disables speculation mid-decode if per-draft acceptance falls below break-even (0.50 after 5 attempts). Sticky for the rest of the request. Both modes apply uniformly across all four API surfaces (chat completions, Anthropic messages, OpenAI responses, legacy completions), streaming and non-streaming, including requests with tools — agentic tool loops are speculative decoding's best workload (~2× on file-edit tool calls).\n\n### Speedup on the realistic agentic code-edit workload\n\nApple M-series, MLX 4-bit weights, temp=0, function in prompt + small modification requested (the canonical mlx-serve workload). 
`nospec` = same binary with `--no-pld`:\n\n| Model | nospec | PLD | Drafter |\n|---|---:|---:|---:|\n| Gemma 4 E4B (4-bit) | 28.0 tok/s | **45.0 tok/s · 1.61×** | **44.6 tok/s · 1.59×** |\n| Qwen 3.5 4B (4-bit) | 28.1 tok/s | **40.5 tok/s · 1.44×** | — |\n| LFM2.5 350M (8-bit) | 162 tok/s | 160 tok/s · 0.99× | — |\n\nOn creative / novel-content prompts both features stay at parity (≈1.0×) thanks to the gate — **no regression**. The 350M LFM2.5 is roughly neutral on spec-decode — its forward is small enough that the verify pass costs about the same as plain autoregressive (AR) decoding.\n\nReproduce with **`./tests/bench.sh --family gemma`** (mlx-serve only — emits per-spec `none`/`pld`/`drafter` rows across the prefill/decode/echo/code prompts).\n\n### vs. LM Studio (HTTP-vs-HTTP)\n\n**+35% faster overall** (geomean across 18 cells, best mlx-serve vs best LM Studio, identical 4-bit weights, ctx=4096, temp=0).\n\n| Model | Echo | Code | Free-form |\n|---|---:|---:|---:|\n| Gemma 4 E2B | **+122%** | **+47%** | +20% |\n| Gemma 4 E4B | **+97%** | **+53%** | **+35%** |\n| Gemma 4 31B | +20% | +4% | -1% |\n| Gemma 4 26B-A4B-MoE | **+66%** | +23% | +31% |\n| Qwen 3.6 27B | **+60%** | +24% | +32% |\n| Qwen 3.6 35B-A3B-MoE | **+88%** | +20% | +25% |\n\n![Gemma 4](docs/perf-vs-lmstudio-gemma-26.5.6.png)\n![Qwen 3.6](docs/perf-vs-lmstudio-qwen36-26.5.6.png)\n\nReproduce: `./tests/bench.sh --family gemma --lmstudio --omlx` (or `qwen36`). Requires `lms`, `jq`, `python3`, `matplotlib`; `--omlx` requires `omlx` on PATH.\n\n## FAQ\n\n### Is mlx-serve faster than LM Studio?\nYes, on every model we've benchmarked and in 17 of the 18 cells in the table above; the one exception is Gemma 4 31B free-form decode, at -1%. On identical 4-bit MLX weights mlx-serve wins by **+35% geomean across 18 workloads** (Gemma 4 E2B/E4B/31B/26B-A4B-MoE and Qwen 3.6 27B/35B-A3B-MoE). On the **same `.gguf` file** as LM Studio (`gemma-4-E4B-it-Q4_K_M.gguf`), mlx-serve's embedded llama.cpp wrapper still wins **+12-15% on decode** and **+5% on prefill**. 
Speculative decoding pushes the lead further on echo-heavy and code-completion workloads — up to 2.65× on Gemma 4 E4B echo.\n\n### Does mlx-serve replace LM Studio?\nFor most use cases, yes. mlx-serve runs the same MLX and GGUF models, exposes an OpenAI-compatible API on the same kind of port, and ships a native menu-bar app instead of an Electron one. It also adds things LM Studio doesn't have: a real Anthropic Messages API (works with Claude Code), the OpenAI Responses API + WebSockets, MCP tool calling, agent mode with 10 built-in tools, KV-cache quantization, continuous batching, and the [antirez/ds4](https://github.com/antirez/ds4) engine for DeepSeek V4 Flash.\n\n### Does mlx-serve replace Ollama?\nOn Apple Silicon, yes. Ollama is cross-platform and uses llama.cpp; mlx-serve runs llama.cpp **and** native MLX with the Mac-specific optimizations Ollama doesn't ship (Metal kernels through mlx-c, JIT-compiled activations, shared-prefix KV cache, the Gemma 4 cross-attention drafter). If you're on a Mac and only need the model APIs, you can drop in `http://localhost:11234` wherever you had `http://localhost:11434` — both wires are OpenAI-compatible.\n\n### Can I run GGUF models on my Mac without Python?\nYes. mlx-serve embeds llama.cpp's inference library (`libllama`) inside the same signed, notarized binary. Point `--model` at any `.gguf` and the server auto-detects the format and routes to the right engine — no `pip`, no venv, no `llama-server` to install separately. DeepSeek V4 Flash GGUFs go through the dedicated [antirez/ds4](https://github.com/antirez/ds4) engine instead, also embedded.\n\n### Does mlx-serve work with Claude Code?\nYes — natively. mlx-serve implements Anthropic's `/v1/messages` endpoint including streaming, tool calling, and extended thinking. Point Claude Code at it with `ANTHROPIC_BASE_URL=http://localhost:11234`. 
The MLX Core app ships a one-click \"Launch Claude Code\" button that wires up the env vars for you.\n\n### What about the OpenAI SDK, Continue, Cursor, Open WebUI?\nAll work — anything that talks the OpenAI chat-completions or Anthropic Messages wire protocol does. mlx-serve also implements the newer OpenAI Responses API (`/v1/responses`) for clients that want stateful chains via `previous_response_id`, plus a WebSocket transport on the same endpoint.\n\n### Can mlx-serve run DeepSeek V4 Flash locally?\nYes, on 96 GB+ Apple Silicon Macs. Open the MLX Core Model Browser, pick DeepSeek-V4-Flash, hit Download — the server routes the GGUF through the embedded ds4 engine (native Metal kernels, byte-validated against the reference forward). Agent mode and MCP tools work on DSV4 too.\n\n### What models are supported?\nNative MLX dispatch for Gemma 3/4, Qwen 3 / 3.5 / 3.6 / 3-Next, Llama 3.x, Mistral, Nemotron-H, and LFM2.5. DeepSeek V4 Flash runs as GGUF through the embedded ds4 engine. Anything else as GGUF via embedded llama.cpp — Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, Yi, and thousands more available on HuggingFace.\n\n### Does it support tools / function calling?\nYes, on both the OpenAI-compatible and Anthropic-compatible API surfaces. The server detects tool-call patterns across architectures (Hermes XML, Gemma 4 `\u003c|tool_call\u003e`, raw JSON, ChatML), repairs common Qwen 3.5/3.6 escape quirks, and emits OpenAI-style `tool_calls` deltas in the SSE stream. The MLX Core app ships 10 built-in tools (shell, file I/O, search, browse, web search, memory) and connects to MCP servers from a curated marketplace.\n\n### How does it stay this small / fast?\nZig with direct `mlx-c` FFI — no Python runtime, no Electron, no IPC bridge. The release binary is ~4.5 MB. Eager warmup at boot page-faults weights and pre-compiles decode kernels (first request 3.5× faster). Multi-turn agent loops reuse KV across turns and skip re-prefilling system prompts via a shared-prefix cache. 
Tokenize caching turns the second hit on a long conversation into a memcpy.\n\n### Is the inference exact, or quantized output drift?\nFor greedy decoding (temp=0), mlx-serve is byte-identical to the reference for the first ~30-80 generated tokens, with the long-tail divergence inherent to INT4 float-reduction order (documented in `CLAUDE.md`). For temp \u003e 0, the Leviathan probability-ratio sampler keeps speculative decoding mathematically exact in distribution. Equivalence is pinned by `tests/test_pld_equivalence.sh`, `test_drafter_equivalence.sh`, and `test_kv_quant_equivalence.sh`.\n\n### Where does my data go?\nNowhere. Everything runs locally on your Mac — no analytics, no telemetry, no cloud calls. The HTTP server binds to `127.0.0.1` by default. Open source under MIT.\n\n### How do I update?\nThe MLX Core app self-updates by checking the GitHub releases feed. CLI: `brew upgrade --cask mlx-core` or `brew upgrade mlx-serve`.\n\n## Acknowledgements\n\nmlx-serve stands on a lot of open-source shoulders. Huge thanks to all of these projects.\n\n### Inference + math\n\n- [**MLX**](https://github.com/ml-explore/mlx) (Apple) — the C++/Metal tensor framework that does the actual GPU work. We link against it via [`mlx-c`](https://github.com/ml-explore/mlx-c), Apple's stable C API, so a Zig binary can drive it without a Python runtime.\n- [**mlx-lm**](https://github.com/ml-explore/mlx-lm) (Apple) — the reference Python implementation we cross-check against on every release. Many architecture quirks were nailed down by reading mlx-lm side-by-side.\n- [**llama.cpp**](https://github.com/ggerganov/llama.cpp) — embedded as `libllama` for the GGUF inference path. Also vendored under `lib/jinja_cpp/` for the C++17 Jinja2 chat-template engine plus the bundled [**nlohmann/json**](https://github.com/nlohmann/json) header.\n- [**antirez/ds4**](https://github.com/antirez/ds4) — the embedded engine that serves DeepSeek-V4-Flash via GGUF. 
Vendored under `lib/ds4/` pinned at commit `477c0e8`; native Metal kernels, official-logits-validated. Salvatore did the hard part.\n\n### Model architectures + tokenizers\n\n- [**Google Gemma**](https://ai.google.dev/gemma), [**Qwen team**](https://huggingface.co/Qwen), [**Meta Llama**](https://www.llama.com/), [**Mistral AI**](https://mistral.ai/), [**NVIDIA Nemotron-H**](https://huggingface.co/nvidia), [**Liquid LFM2.5**](https://www.liquid.ai/), [**DeepSeek**](https://www.deepseek.com/) — the model families this server runs. The Zig forward paths were written against each project's official reference implementations.\n- The [**HuggingFace `tokenizers`**](https://github.com/huggingface/tokenizers) library — the byte-level BPE reference our Zig tokenizer matches against.\n\n### Image + video\n\n- [**stb_image**](https://github.com/nothings/stb) — single-header JPEG/PNG decode for vision input.\n- [**libwebp**](https://chromium.googlesource.com/webm/libwebp) — WebP decode.\n- [**Black Forest Labs FLUX.2**](https://huggingface.co/black-forest-labs) and [**LTX-Video 2.3 (dgrauet/ltx-2-mlx)**](https://github.com/dgrauet/ltx-2-mlx) — the optional MLX-native image / video generators MLX Core can drive.\n\n### MLX Core (Swift app) integrations\n\n- [**Anthropic swift-sdk**](https://github.com/anthropics/swift-sdk) — the Claude API client the agent loop uses.\n- [**Model Context Protocol (Swift SDK)**](https://github.com/modelcontextprotocol/swift-sdk) — powers the MCP marketplace + tool routing.\n- Apple frameworks (PDFKit, WKWebView, AVFoundation, AppKit, SwiftUI) — the menu-bar app, browser tool, video player, and PDF attachment pipeline all ride on these.\n\n### Build + ship\n\n- [**Zig**](https://ziglang.org) — the systems language the server is written in. 
The 0.16 migration was painless thanks to the team's documentation.\n- [**Homebrew**](https://brew.sh/) — distribution channel for both the server (`brew install mlx-serve`) and the GUI (`brew install --cask mlx-core`).\n\nIf we missed you, please open a PR — happy to add anyone who landed code, fixtures, or a fix here.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\n---\n\n★ **Found this useful? [Star the repo](https://github.com/ddalcu/mlx-serve/stargazers) — it really does help others discover it.**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fddalcu%2Fmlx-serve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fddalcu%2Fmlx-serve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fddalcu%2Fmlx-serve/lists"}