{"id":47614599,"url":"https://github.com/matt-k-wong/mlx-flash","last_synced_at":"2026-06-13T04:01:25.622Z","repository":{"id":345803746,"uuid":"1187386524","full_name":"matt-k-wong/mlx-flash","owner":"matt-k-wong","description":"Flash weight streaming for MLX: run massive models larger than your RAM on Apple Silicon.","archived":false,"fork":false,"pushed_at":"2026-04-01T08:14:49.000Z","size":521,"stargazers_count":73,"open_issues_count":0,"forks_count":6,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-02T04:39:55.860Z","etag":null,"topics":["apple-silicon","large-language-models","llm","llm-inference","lm-studio","machine-learning","macos","memory-optimization","metal","mlx","optimization","weight-streaming"],"latest_commit_sha":null,"homepage":"https://github.com/matt-k-wong/mlx-flash","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/matt-k-wong.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-20T17:04:42.000Z","updated_at":"2026-04-01T18:50:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/matt-k-wong/mlx-flash","commit_stats":null,"previous_names":["matt-k-wong/mlx-flash"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/matt-k-wong/mlx-flash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matt-k-wong%2Fmlx-flash","tags_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matt-k-wong%2Fmlx-flash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matt-k-wong%2Fmlx-flash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matt-k-wong%2Fmlx-flash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/matt-k-wong","download_url":"https://codeload.github.com/matt-k-wong/mlx-flash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matt-k-wong%2Fmlx-flash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34271500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-13T02:00:06.617Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","large-language-models","llm","llm-inference","lm-studio","machine-learning","macos","memory-optimization","metal","mlx","optimization","weight-streaming"],"created_at":"2026-04-01T21:09:33.263Z","updated_at":"2026-06-13T04:01:25.617Z","avatar_url":"https://github.com/matt-k-wong.png","language":"Python","funding_links":[],"categories":["Rising projects"],"sub_categories":[],"readme":"# mlx-flash ⚡\n\n\u003e **Flash Weight Streaming for MLX** — run models larger than your RAM on Apple Silicon.\n\u003e 30B on 16 GB, 70B+ on 32 GB+. 
**No additional quantisation — uses the model's native precision.**\n\n\u003e **Project Lineage:** This implementation is inspired by Apple Research's paper [*LLM in a Flash* (arXiv 2312.11514)](https://arxiv.org/abs/2312.11514). `mlx-flash` provides a high-quality, production-grade integration layer for the MLX ecosystem, featuring bit-perfect parity and predictive bandwidth scheduling.\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Python 3.11+](https://img.shields.io/badge/Python-3.11%2B-blue.svg)](https://python.org)\n[![MLX](https://img.shields.io/badge/MLX-latest-green.svg)](https://github.com/ml-explore/mlx)\n[![macOS 13+](https://img.shields.io/badge/macOS-13%2B-lightgrey.svg)](https://apple.com)\n[![Tests](https://github.com/matt-k-wong/mlx-flash/actions/workflows/tests.yml/badge.svg)](https://github.com/matt-k-wong/mlx-flash/actions/workflows/tests.yml)\n\n---\n\n## Why Flash Mode?\n\n| Model | Hardware | Mode | Load Time | Result |\n|-------|----------|------|-----------|--------|\n| **Nemotron-30B (17.8 GB)** | 16GB MacBook Air | Normal | 4.1s | ❌ OOM / Laggy |\n| **Nemotron-30B (17.8 GB)** | 16GB MacBook Air | **Flash** | **0.8s** | ✅ Bit-Perfect \u0026 Smooth |\n\n`mlx-flash` allows you to run models of **any size** (30B, 70B, even 400B+) on base-spec Macs by streaming weights directly from your SSD.\n\n---\n\n## 🏗️ Architecture: The Holistic Patch\n\nUnlike previous iterations that attempted to re-implement the transformer loop manually, `mlx-flash` now uses a **Holistic Model Patching** architecture. \n\n1. **Deep Tissue Patching**: We wrap the original model's layers in a `StreamingProxy`.\n2. **Native Logic Retention**: Because we use the model's own `__call__` method, every nuance of the architecture (RoPE scaling, residual streams, causal masks) is handled natively by the model code.\n3. 
**Execution Interception**: Our proxies intercept the layer execution to force synchronous `mx.eval()` and trigger the **Predictive I/O Scheduler**.\n\n### The Control Loop (MPC-Lite)\nWe use a **Model Predictive Controller** to maximize tokens/second:\n- **Baseline Establishment**: On the first token (\"Cold Start\"), we establish a pristine compute baseline.\n- **Predictive Prefetch**: We predict the bandwidth demand of Layer N+1 while the GPU is still busy with Layer N.\n- **Token Bucket Actuator**: A continuous token bucket smoothly paces SSD reads using micro-sleeps, keeping GPU degradation below 5%.\n\n---\n\n## 🏆 Quality \u0026 Bit-Parity\n\n`mlx-flash` is a **zero-compromise** engine. We have proven quality through:\n\n1.  **Bit-Perfect Operators**: `TiledLinear` executes identically to `nn.Linear` (fused `mx.addmm`), so the loss delta vs. standard MLX is **exactly 0**. Note: on MLX ≥ 0.31, Metal kernel selection makes block-wise tiled accumulation diverge from native fp16 matmul, so bit-exact mode executes layers whole; sub-layer tiling will return as an opt-in memory mode.\n2.  **Hybrid KV Cache**: Keeps the most recent **128 tokens in full FP16 precision**, while offloading older context to properly scaled 8-bit quantized disk storage.\n3.  **Passkey Retrieval**: Verified 100% accuracy on context retrieval tests hidden 1,000+ tokens deep in quantized disk storage.\n\nSee [QUALITY.md](docs/QUALITY.md) for the full proof suite.\n\n---\n\n## 🚀 Quick Start\n\n### 1. Install\n```bash\npip install git+https://github.com/matt-k-wong/mlx-flash.git\n```\n\n\u003e ⚠️ **Do not `pip install mlx-flash`**: the PyPI package by that name is an **unrelated project**. This project is installed from GitHub. Tested against `mlx\u003e=0.31` / `mlx-lm\u003e=0.31`.\n\n### 2. Unified CLI\n```bash\n# Run any model with 2GB weight residence budget\nmlx-flash --model mlx-community/Llama-3.2-1B-Instruct-4bit --ram 2.0 --kv-quant 8\n```\n\n### 3. 
Python Usage\n```python\nfrom mlx_flash import FlashConfig, FlashManager\n\n# 1. Load and Patch\nmanager = FlashManager(FlashConfig(ram_budget_gb=2.0))\nmodel, tokenizer = manager.load(\"mlx-community/Meta-Llama-3-70B-Instruct-4bit\")\n\n# 2. Generate\nfor segment in model.stream_generate(\"Tell me a story\", max_tokens=100):\n    print(segment, end=\"\", flush=True)\n```\n\n---\n\n## How It Works\n\n```mermaid\ngraph TD\n    A[SSD: .safetensors] --\"mmap(lazy=True)\"--\u003e B[MLX Lazy Arrays]\n    A --\"Predictive Worker\"--\u003e P[Token Bucket]\n    P --\"os.pread\"--\u003e B\n    \n    subgraph Model[\"Native Model Logic\"]\n        Embed --\u003e Proxy1\n        subgraph Proxy1[\"StreamingProxy (Layer 1)\"]\n            StartHook --\u003e Dispatch[strategy.execute]\n            Dispatch --\u003e Eval[mx.eval]\n            Eval --\u003e EndHook\n        end\n        Proxy1 --\u003e Proxy2[...]\n        Proxy2 --\u003e Norm\n        Norm --\u003e Head\n    end\n```\n\n---\n\n## Roadmap\n- [x] **v0.4.0**: Holistic Model Patching (Bit-Perfect Parity), MPC-Lite Bandwidth Controller, Unified `mlx-flash` CLI, `mlx`/`mlx-lm` 0.31+ compatibility.\n- [ ] **v0.5.0**: Asynchronous DAG Scheduler (Zero-latency Python glue).\n- [ ] **v0.6.0**: MoE Lookahead Routing for Mixtral/DeepSeek.\n\n---\n\n*Brought to you by ⚡ Flash-Mode Contributors. MIT licensed.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatt-k-wong%2Fmlx-flash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatt-k-wong%2Fmlx-flash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatt-k-wong%2Fmlx-flash/lists"}