{"id":47877901,"url":"https://github.com/szibis/mlx-flash","last_synced_at":"2026-04-06T03:01:01.895Z","repository":{"id":348377433,"uuid":"1196418538","full_name":"szibis/MLX-Flash","owner":"szibis","description":"Run AI models too large for your Mac's memory — at near-full speed. Intelligent expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon.","archived":false,"fork":false,"pushed_at":"2026-04-01T17:54:50.000Z","size":533,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-04T01:34:47.303Z","etag":null,"topics":["apple-silicon","expert-caching","inference","llm","machine-learning","macos","memory-optimization","metal-gpu","mixture-of-experts","mlx","moe","python","quantization","rust","speculative-execution","ssd-streaming"],"latest_commit_sha":null,"homepage":"https://github.com/szibis/MLX-Flash-compress","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szibis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-30T17:20:47.000Z","updated_at":"2026-04-01T17:54:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/szibis/MLX-Flash","commit_stats":null,"previous_names":["szibis/mlx-flash"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/szibis/MLX-Flash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/szibis%2FMLX-Flash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szibis%2FMLX-Flash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szibis%2FMLX-Flash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szibis%2FMLX-Flash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/szibis","download_url":"https://codeload.github.com/szibis/MLX-Flash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szibis%2FMLX-Flash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31421869,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T00:25:07.052Z","status":"online","status_checked_at":"2026-04-05T02:00:05.211Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","expert-caching","inference","llm","machine-learning","macos","memory-optimization","metal-gpu","mixture-of-experts","mlx","moe","python","quantization","rust","speculative-execution","ssd-streaming"],"created_at":"2026-04-04T01:34:45.880Z","updated_at":"2026-04-05T02:00:40.359Z","avatar_url":"https://github.com/szibis.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/logo.svg\" width=\"200\" alt=\"MLX-Flash Logo\" 
/\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eMLX-Flash\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\u003cstrong\u003eRun AI models too large for your Mac's memory — at near-full speed.\u003c/strong\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/mlx-flash/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/mlx-flash?color=blue\" alt=\"PyPI\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/szibis/MLX-Flash/actions\"\u003e\u003cimg src=\"https://github.com/szibis/MLX-Flash/actions/workflows/test.yml/badge.svg\" alt=\"Tests\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/szibis/MLX-Flash/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-green\" alt=\"License\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/szibis/MLX-Flash\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/szibis/MLX-Flash?style=social\" alt=\"Stars\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nYour MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD — so you don't have to choose between quality and what fits in memory.\n\n## How It Works (Simple Version)\n\nThink of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. 
MLX-Flash does this for AI model weights:\n\n```mermaid\nflowchart TB\n    subgraph RAM[\"Your Mac's RAM (fast)\"]\n        HC[Hot Cache — 85%+ of active experts]\n        MP[Mixed Precision — hot 4-bit, cold 2-bit]\n        KV[KV Cache — optional 8-bit quantization]\n    end\n    subgraph CACHE[\"Smart Cache Layer\"]\n        LCP[LCP Eviction — layer-depth biased]\n        PF[Speculative Prefetch — 97% accuracy]\n        MM[Memory Monitor — never harms your apps]\n        SPEC[Speculative Execution — predict → execute → verify]\n    end\n    subgraph SSD[\"Your Mac's SSD (big)\"]\n        FULL[Full model weights — even 200GB+]\n        ENT[Entropy-coded storage — 65% smaller]\n    end\n\n    SSD --\u003e|stream on demand| CACHE\n    CACHE --\u003e|cache hit: 0.08ms| RAM\n    CACHE --\u003e|cache miss: 0.6ms| SSD\n    RAM --\u003e|feed to GPU| GPU[MLX GPU Inference]\n```\n\n**Result:** A 200GB AI model runs on your 48GB Mac **2-3x faster** than naive SSD streaming.\n\n## Quick Start\n\n```bash\n# Install from PyPI\npip install mlx-flash\n\n# Or Homebrew (includes Rust sidecar)\nbrew tap szibis/mlx-flash \u0026\u0026 brew install mlx-flash\n\n# Or from source\ngit clone https://github.com/szibis/MLX-Flash.git\ncd MLX-Flash \u0026\u0026 pip install -e \".[all]\"\n```\n\n```bash\n# Interactive chat (simplest way to use it)\nmlx-flash-chat\n\n# Start the API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)\nmlx-flash --port 8080\n\n# With KV cache quantization (45% less KV memory)\nmlx-flash --port 8080 --kv-bits 8\n\n# See what models fit your hardware\nmlx-flash-browse\n```\n\n## Performance\n\n### Measured Results\n\n| Technique | Speedup | How It Works |\n|-----------|---------|-------------|\n| **LCP Smart Cache** | **2.80x** | Keeps frequently-used model parts in RAM, predicts what's needed next |\n| **+ Async Prefetch** | **2.93x** | Loads next part from SSD while GPU computes current part |\n| **Mixed Precision** | **1.80x size 
reduction** | Rarely-used parts stored at lower quality (saves space, barely affects output) |\n| **Skip Fallback** | **2.67x** | When something isn't cached, gracefully skip it instead of waiting |\n| **Speculative Execution** | **14-42% TPOT** | Execute predicted experts before the router confirms, verify after |\n| **Adaptive Top-K** | **10-30% compute** | Skip low-confidence secondary experts automatically |\n\n### Real Hardware Numbers (Measured on M3 Max 36GB)\n\n**Memory pressure recovery** (the key result):\n\n```\nModel at 0.9x RAM (barely fits):\n  Without optimization:    43.5 tok/s  ########\n  With mixed precision:   104.5 tok/s  ####################  2.4x faster\n```\n\nThe memory pressure cliff is razor-sharp: going 10% over the limit causes a 59% slowdown. Our 20% footprint reduction shifts the model back to full speed.\n\n**Cache warm-up** (ISP-like progressive acceleration):\n\n```\nToken  0:  83.3ms (cold start, loading experts from SSD)\nToken  8:   5.7ms (warming up, 62% cache hit)\nToken 24:   0.5ms (full speed, 85%+ cache hit)\n         -\u003e 41x speedup from warm-up\n```\n\n**Topic switching:**\n```\ncoding -\u003e writing:  62ms first token (re-warming)  -\u003e 8 tokens to recover\nwriting -\u003e coding:  0.6ms first token (still cached!) -\u003e instant fast\n```\n\n### Expert Streaming Performance\n\nExpert streaming replaces MLX's `QuantizedSwitchLinear` with a GPU lookup table + pre-stacked tensors. 
The `capacity_per_layer` parameter controls how many experts stay in GPU memory:\n\n| Model | Total Experts | Capacity | Coverage | Throughput | Notes |\n|-------|--------------|----------|----------|------------|-------|\n| Qwen3-30B-A3B | 128 per layer | 128 (100%) | 100% | ~35 tok/s | Full speed, no streaming needed |\n| Qwen3-30B-A3B | 128 per layer | 64 (50%) | 85%+ hit rate | ~15 tok/s | After warm-up with LCP |\n| Mixtral-8x7B | 8 per layer | 8 (100%) | 100% | ~20 tok/s | All experts fit |\n| Mixtral-8x7B | 8 per layer | 4 (50%) | ~95% hit rate | ~12 tok/s | Most active cached |\n\n**Tuning tips:**\n- Start with `capacity_per_layer = total_experts` if RAM allows (no streaming overhead)\n- Use the `--task coding` warmup profile for programming tasks (pre-loads code-relevant experts)\n- Enable skip-fallback with an adaptive threshold to skip low-confidence secondary experts\n- After ~25 tokens, LCP learns your workload and the hit rate climbs to 85-95%\n- Run `optimize_wired_memory_limit()` before loading to avoid the Metal memory-pressure cliff\n\n```python\nfrom mlx_lm import load  # assumption: the model is loaded with mlx-lm\nfrom mlx_flash_compress.expert_streaming import (\n    enable_expert_streaming, enable_skip_fallback, get_warmup_experts\n)\n\n# Load the model (example model ID), then enable streaming with 50% capacity\nmodel, tokenizer = load(\"mlx-community/Qwen3-30B-A3B-4bit\")\nstreaming = enable_expert_streaming(model, capacity_per_layer=64)\nenable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)\nstreaming.warmup()\n```\n\n### Find Your Optimal Configuration\n\nThe Tier Optimizer tells you exactly how to allocate your Mac's memory:\n\n```bash\n# For a 200GB model on a 48GB Mac\npython -m mlx_flash_compress.tier_optimizer --total-ram 48 --model-gb 209\n\n# Output: \"Best: 41.5GB RAM cache, 82% of requests served from RAM → 6.4 tok/s\"\n```\n\nIt shows you the sweet spot — even dedicating just 10GB to caching gives you 54% of requests served instantly from RAM.\n\n## What's Inside\n\n### Architecture\n\n```mermaid\nflowchart TB\n    subgraph Prediction[\"Expert Prediction (97%+ accuracy)\"]\n        
RP[Residual-Stream Predictor\u003cbr/\u003eLinear projection of hidden state]\n        SM[Shadow MLP Predictor\u003cbr/\u003eOnline-trained routing MLP]\n        CL[Cross-Layer Prefetch\u003cbr/\u003e3-hop transitive co-occurrence]\n    end\n    subgraph CacheLayer[\"Smart Cache Layer\"]\n        LCP[LCP Eviction\u003cbr/\u003eLayer-depth biased]\n        FLE[Forward-Looking Eviction\u003cbr/\u003eBelady-optimal approximation]\n        VS[Vertical Split\u003cbr/\u003e2x coverage in same RAM]\n        EM[Expert Merging\u003cbr/\u003eCosine similarity clustering]\n    end\n    subgraph Execution[\"Inference Engine\"]\n        ES[Expert Streaming\u003cbr/\u003eGPU lookup + pre-stacked tensors]\n        SE[Speculative Execution\u003cbr/\u003ePredict → Execute → Verify]\n        SF[Skip Fallback\u003cbr/\u003eAdaptive top-k]\n        MP[Mixed Precision\u003cbr/\u003eHot 4-bit / Cold 2-bit]\n    end\n    subgraph Storage[\"Compressed Storage\"]\n        EC[Entropy Coding\u003cbr/\u003eHuffman for uint4]\n        ST[Safetensors mmap\u003cbr/\u003eZero-copy SSD reads]\n    end\n\n    Prediction --\u003e CacheLayer\n    CacheLayer --\u003e Execution\n    Storage --\u003e CacheLayer\n```\n\n### Core Modules (35 Python files)\n\n| Module | What It Does |\n|--------|-------------|\n| **Expert Streaming** | |\n| `expert_streaming.py` | GPU lookup table + pre-stacked weights, skip-fallback, adaptive top-k, Mixtral/Qwen support |\n| `speculative_experts.py` | Residual-stream predictor (97%+), Belady-optimal eviction, speculative execution |\n| `advanced_prefetch.py` | Cross-layer N-hop predictor + shadow MLP for \u003e90% prefetch accuracy |\n| **Cache Management** | |\n| `lcp_cache.py` | Smart cache with layer-depth biased LCP eviction + `mx.clear_cache()` |\n| `smart_eviction.py` | SpecMD-inspired least-stale eviction + routing predictor |\n| `vertical_split.py` | Cache partial expert rows for 2x coverage in same RAM (MoEpic) |\n| `expert_merging.py` | Offline expert clustering 
— merge similar experts for 15-30% fewer params |\n| **Compression** | |\n| `entropy_coding.py` | Huffman coding for uint4 weights — 65% smaller at near-zero quality loss |\n| `mixed_precision.py` | Hot experts at 4-bit, cold at 2-bit — 1.8x smaller, barely noticeable |\n| `compression.py` | LZ4/ZSTD compression + Apple's native LZFSE |\n| **Memory \u0026 Hardware** | |\n| `memory_manager.py` | Real-time pressure monitoring, wired memory limit, auto-release |\n| `hardware.py` | Apple Silicon detection (M1-M5), RAM, GPU cores |\n| `tier_optimizer.py` | Finds the perfect RAM/SSD balance for your Mac + model combo |\n| `ssd_protection.py` | Thermal cutoff, sequential hints, zero writes |\n| **Inference \u0026 Serving** | |\n| `serve.py` | OpenAI-compatible server with KV cache quantization, memory-aware hints |\n| `chat.py` | Colorful chat CLI with web search, memory, model switching |\n| `web_search.py` | DuckDuckGo search + persistent memory store (Perplexity-style) |\n| `hf_calculator.py` | Model size/memory estimator for any MoE or dense model |\n| `task_profiler.py` | Per-task expert profiles (coding/writing/math/chat) for fast warmup |\n| **Distributed** | |\n| `distributed_experts.py` | Multi-Mac expert parallelism over Thunderbolt 5 RDMA |\n| `kv_cache_sharing.py` | PT-MoE KV-cache sharing between blocks (37.5% memory savings) |\n| `cached_inference.py` | Expert routing capture + cache simulation |\n| `rust_bridge.py` | Python ↔ Rust Unix socket bridge |\n| **Rust Sidecar** | |\n| `mlx-flash-server/` | axum HTTP/SSE proxy, mach2 memory (0.1ms), DashMap LCP, Unix socket |\n\n### Client Integration\n\n```mermaid\ngraph LR\n    subgraph Clients\n        LS[LM Studio]\n        CU[Cursor]\n        CC[Claude Code]\n        SDK[OpenAI SDK]\n        CD[continue.dev]\n        OW[Open WebUI]\n    end\n    subgraph Rust[\"Rust Sidecar :8080\"]\n        AX[axum HTTP/SSE]\n        MEM[Memory Monitor\u003cbr/\u003emach2 0.1ms]\n        LCPC[LCP Cache\u003cbr/\u003eDashMap 
lock-free]\n    end\n    subgraph Python[\"Python Worker :8081\"]\n        MLX[MLX Inference\u003cbr/\u003e95% of work]\n        GEN[generate\u0026#40;\u0026#41;]\n    end\n\n    Clients --\u003e|OpenAI API| Rust\n    Rust --\u003e|proxy| Python\n    Rust -.-\u003e|Unix socket| LCPC\n    LCPC -.-\u003e|expert weights| Python\n```\n\n### Using It\n\n| How | Command | Best For |\n|-----|---------|----------|\n| **Interactive chat** | `mlx-flash-chat` | Chat with web search, memory, model switching |\n| **API server** | `mlx-flash --port 8080` | LM Studio, Cursor, Claude Code, OpenAI SDK |\n| **API + KV quant** | `mlx-flash --port 8080 --kv-bits 8` | 45% less KV memory |\n| **Model calculator** | `python -m mlx_flash_compress.hf_calculator` | Estimate size/memory for any model |\n| **Model browser** | `mlx-flash-browse` | See what fits your hardware |\n| **Warm-up demo** | `python -m mlx_flash_compress.demo_warmup` | Watch cache fill in real-time |\n| **Pressure test** | `python -m mlx_flash_compress.bench_memory_pressure` | Measure memory impact |\n\n**Chat commands:** `/models` browse catalog, `/model N` switch live, `/search` web search, `/ask` search+answer, `/remember` save facts, `/memories` list, `/status` memory info\n\n### Integrations\n\nAll integrations start with running the server:\n\n```bash\n# Install\npip install mlx-flash\n\n# Start the server\nmlx-flash --port 8080 --preload\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eLM Studio\u003c/b\u003e\u003c/summary\u003e\n\n1. Start MLX-Flash: `mlx-flash --port 8080 --preload`\n2. In LM Studio: **Settings** → **Server** → Add custom endpoint: `http://localhost:8080/v1`\n3. Select model: `local`\n4. Chat normally — LM Studio treats MLX-Flash as its backend\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eCursor\u003c/b\u003e\u003c/summary\u003e\n\n1. Start MLX-Flash: `mlx-flash --port 8080 --preload`\n2. 
In Cursor: **Settings** → **Models** → **Add Model**\n   - Provider: `OpenAI Compatible`\n   - API Base: `http://localhost:8080/v1`\n   - API Key: `not-needed`\n   - Model: `local`\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eClaude Code\u003c/b\u003e\u003c/summary\u003e\n\n```bash\n# Terminal 1: Start server\nmlx-flash --port 8080 --preload\n\n# Terminal 2: Use with Claude Code\nexport OPENAI_API_BASE=http://localhost:8080/v1\nexport OPENAI_API_KEY=not-needed\n```\n\nOr add to `~/.claude/.mcp.json`:\n```json\n{\n  \"mlx-flash\": {\n    \"command\": \"mlx-flash\",\n    \"args\": [\"--model\", \"mlx-community/Qwen3-30B-A3B-4bit\", \"--port\", \"8080\"]\n  }\n}\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eCodex CLI\u003c/b\u003e\u003c/summary\u003e\n\n```bash\n# Start server\nmlx-flash --port 8080 --preload\n\n# Use with Codex\nexport OPENAI_API_BASE=http://localhost:8080/v1\nexport OPENAI_API_KEY=not-needed\ncodex \"refactor this function\"\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eOllama (side-by-side)\u003c/b\u003e\u003c/summary\u003e\n\n```bash\n# Ollama on default port (11434) for dense models\nollama serve\n\n# MLX-Flash on 8080 for MoE models (better expert caching)\nmlx-flash --port 8080 --preload\n\n# Use Ollama for dense models, MLX-Flash for MoE models\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003econtinue.dev (VS Code / JetBrains)\u003c/b\u003e\u003c/summary\u003e\n\nAdd to `~/.continue/config.json`:\n```json\n{\n  \"models\": [{\n    \"title\": \"Local MoE (MLX)\",\n    \"provider\": \"openai\",\n    \"model\": \"local\",\n    \"apiBase\": \"http://localhost:8080/v1\",\n    \"apiKey\": \"not-needed\"\n  }]\n}\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eOpen WebUI\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nmlx-flash --port 8080 --preload\n# In Open WebUI 
settings: Add connection → http://localhost:8080/v1\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003ePython / OpenAI SDK\u003c/b\u003e\u003c/summary\u003e\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8080/v1\", api_key=\"not-needed\")\nresponse = client.chat.completions.create(\n    model=\"local\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n)\nprint(response.choices[0].message.content)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eAider (AI pair programming)\u003c/b\u003e\u003c/summary\u003e\n\n```bash\nmlx-flash --port 8080 --preload\naider --openai-api-base http://localhost:8080/v1 --openai-api-key not-needed --model local\n```\n\n\u003c/details\u003e\n\n\u003e See [`docs/integrations.md`](docs/integrations.md) for 18+ detailed integration guides with streaming examples, health checks, and memory monitoring.\n\n### Benchmark Suite\n\n```bash\npython -m mlx_flash_compress.bench_memory_pressure       # Memory pressure analysis (key demo)\npython -m mlx_flash_compress.demo_warmup                   # ISP-like warm-up visualization\npython -m mlx_flash_compress.cached_inference --multi-topic # Real routing capture\npython -m mlx_flash_compress.bench --synthetic              # Quick test (no model needed)\npython -m mlx_flash_compress.bench_real                     # Real Qwen MoE model test\npython -m mlx_flash_compress.bench_final                    # Final comprehensive benchmark\n```\n\n## Key Discoveries\n\n### 1. Standard Compression Doesn't Work on AI Weights\n\nWe tested 6 different compression strategies on real AI model weights. Result: **1.0x compression** (zero savings). To these general-purpose compressors, 4-bit quantized weights already look maximally dense. Instead, we use entropy coding (Huffman), which exploits the non-uniform distribution of the quantized values for 65% savings.\n\n### 2. 
Smart Caching Is the #1 Win\n\nInstead of trying to compress, we **predict what's needed and pre-load it**. Our prediction and caching stack reaches 97%+ prediction accuracy by combining:\n- Residual-stream predictor (linear projection of hidden states)\n- Cross-layer 3-hop lookahead (transitive co-occurrence)\n- Forward-looking Belady-optimal eviction (never evict what you'll need)\n- Layer-depth bias (early layers are more valuable to cache)\n\n### 3. The Brain Already Solved This Problem\n\nMoE models work like the brain — only 0.78% of \"neurons\" (experts) activate per input. The brain handles this with predictive coding (pre-activating expected pathways). We implement the same principle: predict which experts are needed, speculatively execute them, and verify after the router confirms.\n\n### 4. Speculate, Don't Wait\n\nSpeculative expert execution (from the MoE-SpAc paper) runs predicted experts *before* the router confirms them. With 97% prediction accuracy, this means 97% of expert computations start immediately with zero load latency. 
The 3% misses are discarded and recomputed — on unified memory, this costs only ~0.1ms per wasted computation.\n\n## Requirements\n\n- **macOS** with Apple Silicon (M1/M2/M3/M4/M5)\n- **Python 3.10+**\n- 16GB+ RAM (more = better caching = faster)\n- For real model tests: `mlx` and `mlx-lm` packages\n\n## Project Stats\n\n- **15,000+ lines of code** (Python + Rust)\n- **254 tests** (222 Python + 32 Rust)\n- **8 benchmark suites** + interactive demos\n- **10 research documents** (15+ papers implemented, 60+ surveyed)\n- **40 Python modules** covering prediction, caching, compression, distributed, serving\n- **OpenAI-compatible API server** with KV cache quantization\n- **Memory-aware** inference with wired memory optimization\n- **Rust sidecar** with 0.1ms memory checks (210x faster than Python)\n- **Lock-free LCP expert cache** (DashMap) with layer-depth bias\n- **Unix socket bridge** for Python ↔ Rust expert weight streaming\n- **15+ research techniques** implemented from papers 2024-2026\n\n## Research \u0026 Techniques Implemented\n\n```mermaid\ngraph TB\n    subgraph DONE[\"Implemented (15+ techniques)\"]\n        ES[Expert Streaming\u003cbr/\u003eGPU lookup tables]\n        LCP[Layer-biased LCP\u003cbr/\u003eFATE paper]\n        RP[Residual Predictor\u003cbr/\u003e97%+ accuracy]\n        SE[Speculative Execution\u003cbr/\u003eMoE-SpAc]\n        FE[Forward Eviction\u003cbr/\u003eMoE-SpeQ Belady]\n        CL[Cross-Layer Prefetch\u003cbr/\u003e3-hop lookahead]\n        SP[Shadow MLP Predictor\u003cbr/\u003emlx-od-moe]\n        VS[Vertical Splitting\u003cbr/\u003eMoEpic 2x coverage]\n        EM[Expert Merging\u003cbr/\u003eDEK/EEP]\n        EC[Entropy Coding\u003cbr/\u003eEntroLLM Huffman]\n        AT[Adaptive Top-K\u003cbr/\u003eLExI paper]\n        MP[Mixed Precision\u003cbr/\u003eHOBBIT]\n        KV[KV Cache 8-bit\u003cbr/\u003emlx-moe]\n        WM[Wired Memory Limit\u003cbr/\u003emacOS sysctl]\n        MC[mx.clear_cache\u003cbr/\u003eMLX v0.31]\n    end\n    
subgraph BLOCKED[\"Blocked\"]\n        AMX[AMX Pipeline\u003cbr/\u003eundocumented HW]\n        MLXrs[mlx-rs\u003cbr/\u003emacOS 26 Metal]\n    end\n```\n\n| Technique | Paper | Status |\n|-----------|-------|--------|\n| Expert streaming (GPU lookup) | HOBBIT arXiv:2411.01433 | **Implemented** |\n| Residual-stream predictor | Speculating Experts arXiv:2603.19289 | **Implemented** |\n| Speculative expert execution | MoE-SpAc arXiv:2603.09983 | **Implemented** |\n| Forward-looking Belady eviction | MoE-SpeQ arXiv:2511.14102 | **Implemented** |\n| Cross-layer 3-hop prefetch | FATE arXiv:2502.12224 / tinyserve | **Implemented** |\n| Layer-depth cache bias | FATE arXiv:2502.12224 | **Implemented** |\n| Shadow model predictor | mlx-od-moe | **Implemented** |\n| Vertical expert splitting | MoEpic paper | **Implemented** |\n| Expert merging (offline) | DEK/EEP arXiv:2509.19781 | **Implemented** |\n| Entropy coding (Huffman uint4) | EntroLLM arXiv:2505.02380 | **Implemented** |\n| Adaptive top-k skipping | LExI arXiv:2509.02753 | **Implemented** |\n| Mixed precision per-expert | HOBBIT arXiv:2411.01433 | **Implemented** |\n| KV cache 8-bit quantization | mlx-moe / mlx-lm v0.31 | **Implemented** |\n| Wired memory optimization | macOS sysctl / mlx-moe | **Implemented** |\n| `mx.clear_cache()` integration | MLX v0.31.0 | **Implemented** |\n| AMX dequant pipeline | amx-rs Rust crate | Blocked (undocumented HW) |\n| mlx-rs native inference | mlx-rs v0.25.3 | Blocked (macOS 26 Metal) |\n\n### Competition\n\n10+ OSS projects and 15+ papers attack the same problem. Our unique differentiators:\n1. **Only** project with Rust sidecar + Mach syscall memory monitoring\n2. **Only** Apple Silicon project with mixed precision per-expert (hot 4-bit / cold 2-bit)\n3. **Most techniques implemented**: 15+ from research frontier, more than any competitor\n4. 
**Only** project combining speculative execution + Belady eviction + residual predictor + expert merging\n\n| Competitor | Key Feature | Our Advantage |\n|-----------|------------|---------------|\n| mu-hashmi/mlx-moe | Expert profiles, 10+ model families | Speculative execution, residual predictor, Rust sidecar |\n| kqb/mlx-od-moe | Shadow model, memory-mapped experts | Cross-layer prefetch, entropy coding, expert merging |\n| jundot/omlx | Hybrid mxfp4/mxfp8 quantization | Belady eviction, adaptive top-k, vertical splitting |\n| HOBBIT (paper) | Nearly identical architecture | Apple Silicon native, open source |\n\nSee `docs/competitive-analysis.md` for the full landscape.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszibis%2Fmlx-flash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fszibis%2Fmlx-flash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszibis%2Fmlx-flash/lists"}