{"id":50376914,"url":"https://github.com/zouyee/dmlx","last_synced_at":"2026-05-30T10:01:37.130Z","repository":{"id":356009199,"uuid":"1223255499","full_name":"zouyee/dmlx","owner":"zouyee","description":"Big models. Small Macs. Zero excuses.","archived":false,"fork":false,"pushed_at":"2026-05-25T03:59:36.000Z","size":8211,"stargazers_count":54,"open_issues_count":3,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-25T04:29:42.902Z","etag":null,"topics":["apple","apple-silicon","inference-engine","llm","llms","mlx"],"latest_commit_sha":null,"homepage":"","language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zouyee.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T06:38:23.000Z","updated_at":"2026-05-25T03:59:40.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zouyee/dmlx","commit_stats":null,"previous_names":["zouyee/dmlx"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/zouyee/dmlx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouyee%2Fdmlx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouyee%2Fdmlx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouyee%2Fdmlx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouyee%2Fdmlx/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zouyee","download_url":"https://codeload.github.com/zouyee/dmlx/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zouyee%2Fdmlx/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33687722,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple","apple-silicon","inference-engine","llm","llms","mlx"],"created_at":"2026-05-30T10:01:36.299Z","updated_at":"2026-05-30T10:01:37.124Z","avatar_url":"https://github.com/zouyee.png","language":"Zig","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dmlx — Run Frontier LLMs on Your Mac\n\n[![CI](https://github.com/zouyee/dmlx/actions/workflows/ci.yml/badge.svg)](https://github.com/zouyee/dmlx/actions/workflows/ci.yml)\n\n\u003e **284B-parameter DeepSeek V4 Flash on a 48GB MacBook Pro. No cloud. No GPU cluster. Just your laptop.**\n\ndmlx is a high-performance LLM inference engine for Apple Silicon, built in Zig 0.16 on Apple's MLX Metal backend. 
It delivers **18-26 tok/s** on hardware that can't even load the model with other tools.\n\n```\n284B params → 4-bit quantized (40GB) → SMELT 20% (8GB) → runs on 48GB Mac\n```\n\n---\n\n## Quick Start\n\n```bash\n# Install dependencies\nbrew install zig mlx-c\n\n# Build\ngit clone https://github.com/zouyee/dmlx.git \u0026\u0026 cd dmlx\nmake\n\n# Chat (single prompt)\n./zig-out/bin/dmlx chat --model ~/models/DeepSeek-V4-Flash-4bit \\\n  --prompt \"Explain quantum computing in one sentence\" \\\n  --smelt --smelt-experts 0.2\n\n# Serve (OpenAI-compatible API, Trust OS mode — no custom cache)\n./zig-out/bin/dmlx serve --model ~/models/DeepSeek-V4-Flash-4bit \\\n  --port 8080 --smelt --smelt-experts 0.2\n\n# Query the server\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"default\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}'\n```\n\n---\n\n## Performance\n\n**Hardware**: M4 Pro 48GB | **Model**: DeepSeek-V4-Flash 4-bit (284B, 33 shards) | **Mode**: SMELT 20%, Trust OS, packed experts\n\n| Metric | Value |\n|--------|-------|\n| Throughput (steady-state) | **~1.1 tok/s** (99.2s/100tok) |\n| 30-token latency (warm) | **~23.2s** |\n| Steady-state ITL | ~887 ms |\n| Memory usage | ~2.5 GB (Trust OS) |\n| 7-prompt correctness | **7/7 PASS** |\n\n\u003e **Trust OS (cache=0) with packed experts is the recommended default.** Score-free DyMoE skips 1 expert per layer (17% I/O reduction) without breaking MLX lazy fusion. 
See [analysis](docs/analysis/flash-moe-alignment-plan.md).\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eOptimization history\u003c/b\u003e\u003c/summary\u003e\n\n| Metric | Initial (cached) | Current (Trust OS + packed) | Notes |\n|--------|-----------------|---------------------------|-------|\n| Client 100-token | 193s | **99.2s** | +94% faster |\n| First-request TTFR | 113s | **23.2s** | +388% faster |\n| Server RSS | 4.7GB | **2.5GB** | -47% memory |\n| E2E correctness | 7/7 | **7/7** | maintained |\n\n\u003c/details\u003e\n\n---\n\n## Why dmlx?\n\n| | mlx-lm (Python) | dmlx (Zig) |\n|---|---|---|\n| DeepSeek V4 on 48GB | OOM | **8GB via SMELT** |\n| KV cache strategies | 1 fixed | 6 runtime-switchable |\n| Max context (48GB) | RAM-limited | **128K+ (RAM+SSD)** |\n| Binary size | ~500MB+ (Python stack) | ~10MB static binary |\n| Latency jitter | 10-100ms GC pauses | Zero GC, sub-ms |\n| iOS embedding | No | Yes (C ABI) |\n\n**The core advantage is not raw speed — it's that the model runs at all on consumer Macs.**\n\n---\n\n## How It Works: 5-Layer Memory Optimization\n\n```\nFull model (568GB BF16)\n  │\n  ├─ Layer 1: MoE Expert Streaming    138GB → 10GB   (load only active 6/256 experts)\n  ├─ Layer 2: 4-bit Quant + SMELT      40GB →  8GB   (partial expert preloading)\n  ├─ Layer 3: CSA/HCA Attention         KV cache 9.5× smaller than V3\n  ├─ Layer 4: 6 KV Cache Strategies     Standard|Rotating|Quantized|Paged|PagedQuantized|Tiered\n  └─ Layer 5: Zero-Copy Loading         137s → 46s startup (mmap + batched I/O)\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLayer 1: MoE Expert Streaming (138GB → 10GB)\u003c/b\u003e\u003c/summary\u003e\n\nDeepSeek V4 activates top-6 of 256 experts per token. dmlx loads only active experts on-demand from SSD via pread (Trust OS mode — no custom cache, relying on macOS page cache). 
Expert caching is available via `--smelt-cache \u003cMB\u003e` for larger RAM configurations.\n\n```\nSource: src/models/expert_stream.zig | src/models/expert_cache.zig\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLayer 2: SMELT — Selective Model Expert Loading (40GB → 8GB)\u003c/b\u003e\u003c/summary\u003e\n\n| Mode | Experts Preloaded | Memory | Use Case |\n|------|-------------------|--------|----------|\n| Full 4-bit | 256 (100%) | ~40 GB | 96GB+ Mac |\n| SMELT 30% | ~77 | ~10 GB | 64GB Mac |\n| **SMELT 20%** | **~51** | **~8 GB** | **48GB Mac** |\n| SMELT 15% | ~38 | ~6 GB | Memory-tight |\n\nSMELT preloads frequently-used experts and applies routing bias to avoid loading unused ones. Cache misses fall back to on-demand streaming.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLayer 3: CSA + HCA Hybrid Attention\u003c/b\u003e\u003c/summary\u003e\n\nDeepSeek V4's native architecture — Compressed Sparse Attention (m=4) interleaved with Heavily Compressed Attention (m'=128), plus FP8 KV storage. 
Result: **9.5x smaller KV cache** than V3.\n\n```\nSource: src/models/deepseek_v4.zig (3,091 lines)\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLayer 4: Six KV Cache Strategies\u003c/b\u003e\u003c/summary\u003e\n\n| Strategy | Best For |\n|----------|----------|\n| Standard | Short sequences |\n| Rotating | Ultra-long sequences (ring buffer) |\n| Quantized | Memory-constrained (4/8/16-bit KV) |\n| **Paged** (default) | Production (32-token pages + CoW) |\n| PagedQuantized | Extreme memory optimization |\n| Tiered | 128K+ context (RAM hot + SSD cold) |\n\nPlus **Prefix Cache** with LRU eviction — skip prefill entirely on repeated prompts.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLayer 5: Zero-Copy Model Loading\u003c/b\u003e\u003c/summary\u003e\n\n| Optimization | Before | After |\n|-------------|--------|-------|\n| Binary index cache | 67s parsing | ~1s mmap |\n| Weight loading | 7GB memcpy | Direct mmap |\n| Shard I/O | Random reads | Sequential readahead |\n\nCombined: **137s → 46s** (66% reduction).\n\u003c/details\u003e\n\n---\n\n## Features\n\n**Inference Engine**\n- 9 architectures: DeepSeek V4, LLaMA, Mistral, Qwen2/3, Gemma, GLM-4, Phi, Phi-3\n- OpenAI + Anthropic API compatible HTTP server (SSE streaming)\n- Speculative decoding (PLD + EAGLE), guided decoding (JSON Schema / Regex FSM)\n- Prefix cache with LRU eviction (25-48% TTFR reduction on repeated prompts)\n- Quantization: INT4/INT8, MXFP4, FP8, TurboQuant\n- QLoRA fine-tuning, tool calling, vision (LLaVA)\n\n**MLX Bindings**\n- 200+ operations, autograd, NN layers, compiled op fusion\n- Type-safe Zig API over mlx-c (Metal GPU + CPU Accelerate)\n\n---\n\n## Installation\n\n**Requirements**: macOS Apple Silicon (M1-M4) + Zig 0.16.0 + mlx-c\n\n```bash\nbrew install zig mlx-c\n```\n\n**Build from source:**\n\n```bash\ngit clone https://github.com/zouyee/dmlx.git\ncd dmlx\nmake              # Build (ReleaseFast)\nmake test        
 # Unit tests (400+)\nmake benchmark    # Full benchmark (build + unit tests + perf + 7-prompt e2e + report)\nmake check        # Everything: build + test + verify + benchmark\n```\n\nThe benchmark script is the single source of truth for performance and correctness:\n\n```bash\nbash scripts/run_benchmark.sh                              # defaults: 0.20 experts, Trust OS (cache=0)\nbash scripts/run_benchmark.sh ~/models/DeepSeek-V4-Flash-4bit 0.15 0  # custom smelt fraction\n```\nPacked experts are auto-detected from `\u003cmodel_path\u003e/packed_experts/`. Generate with `scripts/repack_experts.py`.\n\n**As a Zig dependency** (`build.zig.zon`):\n\n```zig\n.dependencies = .{\n    .dmlx = .{\n        .url = \"https://github.com/zouyee/dmlx/archive/refs/tags/v0.3.0.tar.gz\",\n        .hash = \"...\",\n    },\n},\n```\n\n---\n\n## Architecture\n\n```\nsrc/\n├── main.zig                # CLI (chat, serve, benchmark)\n├── models/                 # DeepSeek V4, LLaMA, Qwen, Gemma, ...\n│   ├── deepseek_v4.zig    # MLA + CSA/HCA + MoE (3K lines)\n│   ├── expert_stream.zig  # On-demand expert loading from SSD\n│   └── expert_cache.zig   # Expert cache (disabled by default — Trust OS)\n├── engine/                 # Inference engine\n│   ├── engine_loop.zig    # Main loop + prefix cache integration\n│   └── prefix_cache.zig   # LRU prefix cache\n├── server/                 # HTTP server (OpenAI + Anthropic API)\n├── kvcache/                # 6 strategies (standard → tiered)\n├── tokenizer/              # BPE tokenizer\n├── speculative.zig         # PLD + EAGLE\n├── guided.zig              # JSON/Regex constrained decoding\n└── trainer.zig             # QLoRA + AdamW\n```\n\n---\n\n## Use Cases\n\n- **Local inference** — GPT-4-class intelligence on-device, zero API costs\n- **Privacy/compliance** — HIPAA/GDPR: all data stays on your hardware\n- **Edge server** — Mac mini as a team inference server (OpenAI API drop-in)\n- **Offline** — Download once, run anywhere without 
internet\n- **Research** — Modify MoE routing, swap KV strategies, test quantization formats\n\n---\n\n## Documentation\n\n| | |\n|---|---|\n| [Technical Deep Dive](docs/en/deepseek-moe/README.md) | How 284B runs on 48GB |\n| [Performance Benchmark](docs/en/analysis/performance-benchmark.md) | Latest numbers |\n| [Optimization Roadmap](docs/analysis/optimization-roadmap.md) | What's next |\n| [Contributing](CONTRIBUTING.md) | Developer guidelines |\n\n---\n\n## Acknowledgments\n\nBuilt on [Apple MLX](https://github.com/ml-explore/mlx) and `mlx-c`. Custom Metal kernels adapted from [DeepSeek TileKernels](https://github.com/deepseek-ai/tilekernels).\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzouyee%2Fdmlx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzouyee%2Fdmlx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzouyee%2Fdmlx/lists"}