An open API service indexing awesome lists of open source software.

https://github.com/szibis/mlx-flash

Run AI models too large for your Mac's memory — at near-full speed. Intelligent expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon.
https://github.com/szibis/mlx-flash

apple-silicon expert-caching inference llm machine-learning macos memory-optimization metal-gpu mixture-of-experts mlx moe python quantization rust speculative-execution ssd-streaming

Last synced: about 1 month ago
JSON representation

Run AI models too large for your Mac's memory — at near-full speed. Intelligent expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon.

Awesome Lists containing this project

README

          


MLX-Flash Logo

MLX-Flash

Run AI models too large for your Mac's memory — at near-full speed.


PyPI
Tests
License
Stars

Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD — so you don't have to choose between quality and what fits in memory.

## How It Works (Simple Version)

Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:

```mermaid
flowchart TB
subgraph RAM["Your Mac's RAM (fast)"]
HC[Hot Cache — 85%+ of active experts]
MP[Mixed Precision — hot 4-bit, cold 2-bit]
KV[KV Cache — optional 8-bit quantization]
end
subgraph CACHE["Smart Cache Layer"]
LCP[LCP Eviction — layer-depth biased]
PF[Speculative Prefetch — 97% accuracy]
MM[Memory Monitor — never harms your apps]
SPEC[Speculative Execution — predict → execute → verify]
end
subgraph SSD["Your Mac's SSD (big)"]
FULL[Full model weights — even 200GB+]
ENT[Entropy-coded storage — 65% smaller]
end

SSD -->|stream on demand| CACHE
CACHE -->|cache hit: 0.08ms| RAM
CACHE -->|cache miss: 0.6ms| SSD
RAM -->|feed to GPU| GPU[MLX GPU Inference]
```

**Result:** A 200GB AI model runs on your 48GB Mac at **2-3x faster** than naive SSD streaming.

## Quick Start

```bash
# Install from PyPI
pip install mlx-flash

# Or Homebrew (includes Rust sidecar)
brew tap szibis/mlx-flash && brew install mlx-flash

# Or from source
git clone https://github.com/szibis/MLX-Flash.git
cd MLX-Flash && pip install -e ".[all]"
```

```bash
# Interactive chat (simplest way to use it)
mlx-flash-chat

# Start the API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080

# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8

# See what models fit your hardware
mlx-flash-browse
```

## Performance

### Measured Results

| Technique | Speedup | How It Works |
|-----------|---------|-------------|
| **LCP Smart Cache** | **2.80x** | Keeps frequently-used model parts in RAM, predicts what's needed next |
| **+ Async Prefetch** | **2.93x** | Loads next part from SSD while GPU computes current part |
| **Mixed Precision** | **1.80x size reduction** | Rarely-used parts stored at lower quality (saves space, barely affects output) |
| **Skip Fallback** | **2.67x** | When something isn't cached, gracefully skip it instead of waiting |
| **Speculative Execution** | **14-42% TPOT** | Execute predicted experts before router confirms, verify after |
| **Adaptive Top-K** | **10-30% compute** | Skip low-confidence secondary experts automatically |

### Real Hardware Numbers (Measured on M3 Max 36GB)

**Memory pressure recovery** (the key result):

```
Model at 0.9x RAM (barely fits):
Without optimization: 43.5 tok/s ########
With mixed precision: 104.5 tok/s #################### 2.4x faster
```

The memory pressure cliff is razor-sharp: 10% over the limit causes 59% slowdown. Our 20% footprint reduction shifts the model back to full speed.

**Cache warm-up** (ISP-like progressive acceleration):

```
Token 0: 83.3ms (cold start, loading experts from SSD)
Token 8: 5.7ms (warming up, 62% cache hit)
Token 24: 0.5ms (full speed, 85%+ cache hit)
-> 41x speedup from warm-up
```

**Topic switching:**
```
coding -> writing: 62ms first token (re-warming) -> 8 tokens to recover
writing -> coding: 0.6ms first token (still cached!) -> instant fast
```

### Expert Streaming Performance

Expert streaming replaces MLX's `QuantizedSwitchLinear` with a GPU lookup table + pre-stacked tensors. The `capacity_per_layer` parameter controls how many experts stay in GPU memory:

| Model | Total Experts | Capacity | Coverage | Throughput | Notes |
|-------|--------------|----------|----------|------------|-------|
| Qwen3-30B-A3B | 128 per layer | 128 (100%) | 100% | ~35 tok/s | Full speed, no streaming needed |
| Qwen3-30B-A3B | 128 per layer | 64 (50%) | 85%+ hit rate | ~15 tok/s | After warm-up with LCP |
| Mixtral-8x7B | 8 per layer | 8 (100%) | 100% | ~20 tok/s | All experts fit |
| Mixtral-8x7B | 8 per layer | 4 (50%) | ~95% hit rate | ~12 tok/s | Most active cached |

**Tuning tips:**
- Start with `capacity_per_layer = total_experts` if RAM allows (no streaming overhead)
- Use `--task coding` warmup profile for programming tasks (pre-loads code-relevant experts)
- Enable skip-fallback with adaptive threshold to skip low-confidence secondary experts
- After ~25 tokens, LCP learns your workload and hit rate climbs to 85-95%
- Run `optimize_wired_memory_limit()` before loading to prevent Metal pressure cliff

```python
from mlx_flash_compress.expert_streaming import (
enable_expert_streaming, enable_skip_fallback, get_warmup_experts
)

# Load model, enable streaming with 50% capacity
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()
```

### Find Your Optimal Configuration

The Tier Optimizer tells you exactly how to allocate your Mac's memory:

```bash
# For a 200GB model on a 48GB Mac
python -m mlx_flash_compress.tier_optimizer --total-ram 48 --model-gb 209

# Output: "Best: 41.5GB RAM cache, 82% of requests served from RAM → 6.4 tok/s"
```

It shows you the sweet spot — even dedicating just 10GB to caching gives you 54% of requests served instantly from RAM.

## What's Inside

### Architecture

```mermaid
flowchart TB
subgraph Prediction["Expert Prediction (97%+ accuracy)"]
RP[Residual-Stream Predictor
Linear projection of hidden state]
SM[Shadow MLP Predictor
Online-trained routing MLP]
CL[Cross-Layer Prefetch
3-hop transitive co-occurrence]
end
subgraph CacheLayer["Smart Cache Layer"]
LCP[LCP Eviction
Layer-depth biased]
FLE[Forward-Looking Eviction
Belady-optimal approximation]
VS[Vertical Split
2x coverage in same RAM]
EM[Expert Merging
Cosine similarity clustering]
end
subgraph Execution["Inference Engine"]
ES[Expert Streaming
GPU lookup + pre-stacked tensors]
SE[Speculative Execution
Predict → Execute → Verify]
SF[Skip Fallback
Adaptive top-k]
MP[Mixed Precision
Hot 4-bit / Cold 2-bit]
end
subgraph Storage["Compressed Storage"]
EC[Entropy Coding
Huffman for uint4]
ST[Safetensors mmap
Zero-copy SSD reads]
end

Prediction --> CacheLayer
CacheLayer --> Execution
Storage --> CacheLayer
```

### Core Modules (35 Python files)

| Module | What It Does |
|--------|-------------|
| **Expert Streaming** | |
| `expert_streaming.py` | GPU lookup table + pre-stacked weights, skip-fallback, adaptive top-k, Mixtral/Qwen support |
| `speculative_experts.py` | Residual-stream predictor (97%+), Belady-optimal eviction, speculative execution |
| `advanced_prefetch.py` | Cross-layer N-hop predictor + shadow MLP for >90% prefetch accuracy |
| **Cache Management** | |
| `lcp_cache.py` | Smart cache with layer-depth biased LCP eviction + `mx.clear_cache()` |
| `smart_eviction.py` | SpecMD-inspired least-stale eviction + routing predictor |
| `vertical_split.py` | Cache partial expert rows for 2x coverage in same RAM (MoEpic) |
| `expert_merging.py` | Offline expert clustering — merge similar experts for 15-30% fewer params |
| **Compression** | |
| `entropy_coding.py` | Huffman coding for uint4 weights — 65% smaller at near-zero quality loss |
| `mixed_precision.py` | Hot experts at 4-bit, cold at 2-bit — 1.8x smaller, barely noticeable |
| `compression.py` | LZ4/ZSTD compression + Apple's native LZFSE |
| **Memory & Hardware** | |
| `memory_manager.py` | Real-time pressure monitoring, wired memory limit, auto-release |
| `hardware.py` | Apple Silicon detection (M1-M5), RAM, GPU cores |
| `tier_optimizer.py` | Finds the perfect RAM/SSD balance for your Mac + model combo |
| `ssd_protection.py` | Thermal cutoff, sequential hints, zero writes |
| **Inference & Serving** | |
| `serve.py` | OpenAI-compatible server with KV cache quantization, memory-aware hints |
| `chat.py` | Colorful chat CLI with web search, memory, model switching |
| `web_search.py` | DuckDuckGo search + persistent memory store (Perplexity-style) |
| `hf_calculator.py` | Model size/memory estimator for any MoE or dense model |
| `task_profiler.py` | Per-task expert profiles (coding/writing/math/chat) for fast warmup |
| **Distributed** | |
| `distributed_experts.py` | Multi-Mac expert parallelism over Thunderbolt 5 RDMA |
| `kv_cache_sharing.py` | PT-MoE KV-cache sharing between blocks (37.5% memory savings) |
| `cached_inference.py` | Expert routing capture + cache simulation |
| `rust_bridge.py` | Python ↔ Rust Unix socket bridge |
| **Rust Sidecar** | |
| `mlx-flash-server/` | axum HTTP/SSE proxy, mach2 memory (0.1ms), DashMap LCP, Unix socket |

### Client Integration

```mermaid
graph LR
subgraph Clients
LS[LM Studio]
CU[Cursor]
CC[Claude Code]
SDK[OpenAI SDK]
CD[continue.dev]
OW[Open WebUI]
end
subgraph Rust["Rust Sidecar :8080"]
AX[axum HTTP/SSE]
MEM[Memory Monitor
mach2 0.1ms]
LCPC[LCP Cache
DashMap lock-free]
end
subgraph Python["Python Worker :8081"]
MLX[MLX Inference
95% of work]
GEN[generate()]
end

Clients -->|OpenAI API| Rust
Rust -->|proxy| Python
Rust -.->|Unix socket| LCPC
LCPC -.->|expert weights| Python
```

### Using It

| How | Command | Best For |
|-----|---------|----------|
| **Interactive chat** | `mlx-flash-chat` | Chat with web search, memory, model switching |
| **API server** | `mlx-flash --port 8080` | LM Studio, Cursor, Claude Code, OpenAI SDK |
| **API + KV quant** | `mlx-flash --port 8080 --kv-bits 8` | 45% less KV memory |
| **Model calculator** | `python -m mlx_flash_compress.hf_calculator` | Estimate size/memory for any model |
| **Model browser** | `mlx-flash-browse` | See what fits your hardware |
| **Warm-up demo** | `python -m mlx_flash_compress.demo_warmup` | Watch cache fill in real-time |
| **Pressure test** | `python -m mlx_flash_compress.bench_memory_pressure` | Measure memory impact |

**Chat commands:** `/models` browse catalog, `/model N` switch live, `/search` web search, `/ask` search+answer, `/remember` save facts, `/memories` list, `/status` memory info

### Integrations

All integrations start with running the server:

```bash
# Install
pip install mlx-flash

# Start the server
mlx-flash --port 8080 --preload
```

LM Studio

1. Start MLX-Flash: `mlx-flash --port 8080 --preload`
2. In LM Studio: **Settings** → **Server** → Add custom endpoint: `http://localhost:8080/v1`
3. Select model: `local`
4. Chat normally — LM Studio treats MLX-Flash as its backend

Cursor

1. Start MLX-Flash: `mlx-flash --port 8080 --preload`
2. In Cursor: **Settings** → **Models** → **Add Model**
- Provider: `OpenAI Compatible`
- API Base: `http://localhost:8080/v1`
- API Key: `not-needed`
- Model: `local`

Claude Code

```bash
# Terminal 1: Start server
mlx-flash --port 8080 --preload

# Terminal 2: Use with Claude Code
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
```

Or add to `~/.claude/.mcp.json`:
```json
{
"mlx-flash": {
"command": "mlx-flash",
"args": ["--model", "mlx-community/Qwen3-30B-A3B-4bit", "--port", "8080"]
}
}
```

Codex CLI

```bash
# Start server
mlx-flash --port 8080 --preload

# Use with Codex
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
codex "refactor this function"
```

Ollama (side-by-side)

```bash
# Ollama on default port (11434) for dense models
ollama serve

# MLX-Flash on 8080 for MoE models (better expert caching)
mlx-flash --port 8080 --preload

# Use Ollama for dense models, MLX-Flash for MoE models
```

continue.dev (VS Code / JetBrains)

Add to `~/.continue/config.json`:
```json
{
"models": [{
"title": "Local MoE (MLX)",
"provider": "openai",
"model": "local",
"apiBase": "http://localhost:8080/v1",
"apiKey": "not-needed"
}]
}
```

Open WebUI

```bash
mlx-flash --port 8080 --preload
# In Open WebUI settings: Add connection → http://localhost:8080/v1
```

Python / OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Aider (AI pair programming)

```bash
mlx-flash --port 8080 --preload
aider --openai-api-base http://localhost:8080/v1 --openai-api-key not-needed --model local
```

> See [`docs/integrations.md`](docs/integrations.md) for 18+ detailed integration guides with streaming examples, health checks, and memory monitoring.

### Benchmark Suite

```bash
python -m mlx_flash_compress.bench_memory_pressure # Memory pressure analysis (key demo)
python -m mlx_flash_compress.demo_warmup # ISP-like warm-up visualization
python -m mlx_flash_compress.cached_inference --multi-topic # Real routing capture
python -m mlx_flash_compress.bench --synthetic # Quick test (no model needed)
python -m mlx_flash_compress.bench_real # Real Qwen MoE model test
python -m mlx_flash_compress.bench_final # Final comprehensive benchmark
```

## Key Discoveries

### 1. Standard Compression Doesn't Work on AI Weights

We tested 6 different compression strategies on real AI model weights. Result: **1.0x compression** (zero savings). The data is already maximally dense at 4-bit quantization. Instead, we use entropy coding (Huffman) which exploits the non-uniform distribution of quantized values for 65% savings.

### 2. Smart Caching Is the #1 Win

Instead of trying to compress, we **predict what's needed and pre-load it**. Our prediction stack achieves 97%+ accuracy:
- Residual-stream predictor (linear projection of hidden states)
- Cross-layer 3-hop lookahead (transitive co-occurrence)
- Forward-looking Belady-optimal eviction (never evict what you'll need)
- Layer-depth bias (early layers are more valuable to cache)

### 3. The Brain Already Solved This Problem

MoE models work like the brain — only 0.78% of "neurons" (experts) activate per input. The brain handles this with predictive coding (pre-activating expected pathways). We implement the same principle: predict which experts are needed, speculatively execute them, and verify after the router confirms.

### 4. Speculate, Don't Wait

Speculative expert execution (from MoE-SpAc paper) runs predicted experts *before* the router confirms them. With 97% prediction accuracy, this means 97% of expert computations start immediately with zero load latency. The 3% misses are discarded and recomputed — on unified memory, this costs only ~0.1ms per wasted computation.

## Requirements

- **macOS** with Apple Silicon (M1/M2/M3/M4/M5)
- **Python 3.10+**
- 16GB+ RAM (more = better caching = faster)
- For real model tests: `mlx` and `mlx-lm` packages

## Project Stats

- **15,000+ lines of code** (Python + Rust)
- **254 tests** (222 Python + 32 Rust)
- **8 benchmark suites** + interactive demos
- **10 research documents** (15+ papers implemented, 60+ surveyed)
- **40 Python modules** covering prediction, caching, compression, distributed, serving
- **OpenAI-compatible API server** with KV cache quantization
- **Memory-aware** inference with wired memory optimization
- **Rust sidecar** with 0.1ms memory checks (210x faster than Python)
- **Lock-free LCP expert cache** (DashMap) with layer-depth bias
- **Unix socket bridge** for Python ↔ Rust expert weight streaming
- **15+ research techniques** implemented from papers 2024-2026

## Research & Techniques Implemented

```mermaid
graph TB
subgraph DONE["Implemented (15+ techniques)"]
ES[Expert Streaming
GPU lookup tables]
LCP[Layer-biased LCP
FATE paper]
RP[Residual Predictor
97%+ accuracy]
SE[Speculative Execution
MoE-SpAc]
FE[Forward Eviction
MoE-SpeQ Belady]
CL[Cross-Layer Prefetch
3-hop lookahead]
SP[Shadow MLP Predictor
mlx-od-moe]
VS[Vertical Splitting
MoEpic 2x coverage]
EM[Expert Merging
DEK/EEP]
EC[Entropy Coding
EntroLLM Huffman]
AT[Adaptive Top-K
LExI paper]
MP[Mixed Precision
HOBBIT]
KV[KV Cache 8-bit
mlx-moe]
WM[Wired Memory Limit
macOS sysctl]
MC[mx.clear_cache
MLX v0.31]
end
subgraph BLOCKED["Blocked"]
AMX[AMX Pipeline
undocumented HW]
MLXrs[mlx-rs
macOS 26 Metal]
end
```

| Technique | Paper | Status |
|-----------|-------|--------|
| Expert streaming (GPU lookup) | HOBBIT arXiv:2411.01433 | **Implemented** |
| Residual-stream predictor | Speculating Experts arXiv:2603.19289 | **Implemented** |
| Speculative expert execution | MoE-SpAc arXiv:2603.09983 | **Implemented** |
| Forward-looking Belady eviction | MoE-SpeQ arXiv:2511.14102 | **Implemented** |
| Cross-layer 3-hop prefetch | FATE arXiv:2502.12224 / tinyserve | **Implemented** |
| Layer-depth cache bias | FATE arXiv:2502.12224 | **Implemented** |
| Shadow model predictor | mlx-od-moe | **Implemented** |
| Vertical expert splitting | MoEpic paper | **Implemented** |
| Expert merging (offline) | DEK/EEP arXiv:2509.19781 | **Implemented** |
| Entropy coding (Huffman uint4) | EntroLLM arXiv:2505.02380 | **Implemented** |
| Adaptive top-k skipping | LExI arXiv:2509.02753 | **Implemented** |
| Mixed precision per-expert | HOBBIT arXiv:2411.01433 | **Implemented** |
| KV cache 8-bit quantization | mlx-moe / mlx-lm v0.31 | **Implemented** |
| Wired memory optimization | macOS sysctl / mlx-moe | **Implemented** |
| `mx.clear_cache()` integration | MLX v0.31.0 | **Implemented** |
| AMX dequant pipeline | amx-rs Rust crate | Blocked (undocumented HW) |
| mlx-rs native inference | mlx-rs v0.25.3 | Blocked (macOS 26 Metal) |

### Competition

10+ OSS projects and 15+ papers attack the same problem. Our unique differentiators:
1. **Only** project with Rust sidecar + Mach syscall memory monitoring
2. **Only** Apple Silicon project with mixed precision per-expert (hot 4-bit / cold 2-bit)
3. **Most techniques implemented**: 15+ from research frontier, more than any competitor
4. **Only** project combining speculative execution + Belady eviction + residual predictor + expert merging

| Competitor | Key Feature | Our Advantage |
|-----------|------------|---------------|
| mu-hashmi/mlx-moe | Expert profiles, 10+ model families | Speculative execution, residual predictor, Rust sidecar |
| kqb/mlx-od-moe | Shadow model, memory-mapped experts | Cross-layer prefetch, entropy coding, expert merging |
| jundot/omlx | Hybrid mxfp4/mxfp8 quantization | Belady eviction, adaptive top-k, vertical splitting |
| HOBBIT (paper) | Nearly identical architecture | Apple Silicon native, open source |

See `docs/competitive-analysis.md` for the full landscape.

## License

MIT