https://github.com/jjang-ai/mlxstudio

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)
https://github.com/jjang-ai/mlxstudio

ai ai-agents anthropic anthropic-api apple-silicon inference inference-engine llm lmstudio macbook macstudio mlx mlxllm mlxstudio omlx omlx-alternative openai-api

Last synced: about 2 months ago
JSON representation

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)

Host: GitHub
URL: https://github.com/jjang-ai/mlxstudio
Owner: jjang-ai
Created: 2026-03-05T17:17:53.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-05-15T20:08:44.000Z (about 2 months ago)
Last Synced: 2026-05-15T23:31:37.070Z (about 2 months ago)
Topics: ai, ai-agents, anthropic, anthropic-api, apple-silicon, inference, inference-engine, llm, lmstudio, macbook, macstudio, mlx, mlxllm, mlxstudio, omlx, omlx-alternative, openai-api
Homepage: https://mlx.studio
Size: 2.07 MB
Stars: 720
Watchers: 7
Forks: 47
Open Issues: 9
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-mlx - mlxstudio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API) (LLM & Inference)

README

The native macOS desktop app for local AI on Apple Silicon

vMLX v2 — native Swift + Metal, 50–95 t/s on M-series.

Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.

The Python panel above remains available for legacy support.

Features •
Screenshots •
API Server •
Image Generation •
JANG Quantization •
Requirements •
Build •
한국어

---

MLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on [mlx-community](https://huggingface.co/mlx-community) -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. Built on [vMLX Engine](https://github.com/jjang-ai/vmlx) and Apple's [MLX](https://github.com/ml-explore/mlx) framework.

> **JANG 2-bit destroys MLX 4-bit on [MiniMax M2.5](https://huggingface.co/JANGQ-AI/MiniMax-M2.5-JANG_2L):**
>
> | Quantization | MMLU (200q) | Size |
> |---|---|---|
> | **JANG\_2L (2-bit)** | **74%** | **89 GB** |
> | MLX 4-bit | 26.5% | 120 GB |
> | MLX 3-bit | 24.5% | 93 GB |
> | MLX 2-bit | 25% | 68 GB |
>
> Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at [jangq.ai](https://jangq.ai). Models at [JANGQ-AI](https://huggingface.co/JANGQ-AI).

---

## Install

### Option 1: Download the App (Recommended)

> **[Download the latest DMG](https://github.com/jjang-ai/mlxstudio/releases/latest)** -- one file, ready to go.

1. Download `vMLX-X.Y.Z-arm64.dmg`
2. Open the DMG and drag to Applications
3. Launch -- that's it

All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.

### Option 2: CLI via pip (Engine Only)

The vMLX inference engine is published on [PyPI as `vmlx`](https://pypi.org/project/vmlx/) -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.

```bash
# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```

> **Note:** On macOS 14+, `pip install vmlx` without a venv will fail with "externally-managed-environment". Use `uv`, `pipx`, or create a venv first.

Once running, your local OpenAI-compatible API server is live at `http://localhost:8000`. Point any OpenAI or Anthropic SDK client at it.

---

## Quick Start

1. **Launch** MLX Studio from Applications
2. **Pick a model** -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., `mlx-community/Qwen3-8B-4bit`)
3. **Start the session** -- the model downloads automatically and the server starts
4. **Chat** -- switch to the Chat tab and start talking

That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.

---

## Screenshots

Chat Interface
Streaming conversations with thinking mode, code highlighting, and markdown

Agentic Coding
Full tool calling with file I/O, shell execution, and web search

Image Generation & Editing
Flux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit

Anthropic API Compatible
Drop-in /v1/messages endpoint for Anthropic SDK clients

Developer Tools
Convert, inspect, and diagnose models

Model Conversion
GGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision

HuggingFace Browser
Search and download models directly in-app

Menu Bar
Running models, GPU memory, and quick controls

---

## Features

### Model Support (65+ Model Families)

Run any MLX model from HuggingFace -- thousands of models, zero configuration:

- **Text LLMs** -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, **Mistral-Medium-3.5** (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, **Laguna** (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
- **Vision LLMs (VL)** -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat
- **Multimodal Omni** -- **Nemotron-3-Nano-Omni** (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across `/v1/chat/completions`, `/v1/messages`, `/v1/responses`, and `/api/chat`
- **Mixture-of-Experts** -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)
- **Hybrid SSM Models** -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)
- **Image Generation** -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
- **Image Editing** -- Qwen Image Edit (instruction-based editing, full precision)
- **Audio** -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- **JANG Models** -- Adaptive mixed-precision quantized models from [JANGQ-AI](https://huggingface.co/JANGQ-AI), stay quantized in GPU memory via native `QuantizedLinear`
- **GGUF Import** -- Convert GGUF models to MLX format directly in-app

### OpenAI-Compatible API Server

Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:

- `POST /v1/chat/completions` -- Chat Completions API with streaming, tool calling, vision, structured output
- `POST /v1/responses` -- OpenAI Responses API (agentic format) with streaming
- `POST /v1/completions` -- Text completions
- `POST /v1/images/generations` -- Image generation (Flux/Z-Image models, OpenAI format with `usage` field)
- `POST /v1/images/edits` -- Image editing (Qwen Image Edit, instruction-based)
- `POST /v1/embeddings` -- Text embeddings with dimension control and batch processing
- `POST /v1/rerank` -- Document reranking
- `POST /v1/audio/speech` -- Text-to-speech (Kokoro TTS)
- `POST /v1/audio/transcriptions` -- Speech-to-text (Whisper)
- `GET /v1/models` -- List loaded models
- `GET /health` -- Server health with VRAM usage, queue length, load times

### Anthropic API Compatibility

Drop-in replacement for the Anthropic Claude API:

- `POST /v1/messages` -- Anthropic Messages API format
- Anthropic SDK tool calling format (auto-translated to internal format)
- Vision/multimodal support via Anthropic content blocks
- Use the Anthropic Python/TypeScript SDK -- just change the `base_url` to your local server
- Copy-paste code snippets in the API tab for curl, Python, and JavaScript

### Tool Calling & Agentic Workflows (14 Parsers)

Auto-detected tool call parsers for every major model family:

- **Qwen** (qwen3, qwen2.5) -- `` XML format
- **Llama 3** -- `` format
- **Mistral** -- `[TOOL_CALLS]` format
- **Hermes** -- `` JSON format
- **DeepSeek** -- function call blocks
- **GLM-4.7** -- GLM tool format
- **MiniMax** -- MiniMax function calling
- **Nemotron** -- NVIDIA Nemotron tool format
- **Granite** -- IBM Granite format
- **Functionary** -- Functionary v3 format
- **XLAM** -- Salesforce xLAM format
- **Kimi** -- Moonshot Kimi format
- **Step-3.5** -- StepFun format
- Auto-detection from `model_type` in config.json with regex name fallback

**26+ Built-in Tools:**
- **File I/O** -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
- **Search** -- ripgrep file search with regex and glob, glob file finder, unified diff
- **Execution** -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
- **Web** -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
- **Developer** -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
- **Interactive** -- `ask_user` tool for human-in-the-loop interrupts
- Per-category toggles: enable/disable file, search, shell, web tools independently
- Auto-continue agent loops (up to 10 tool iterations per request)
- **MCP (Model Context Protocol)** -- connect external tool servers, merge tool definitions, execute MCP tools via API

### Reasoning Model Support (4 Parsers)

Collapsible thinking blocks with dedicated parsing for reasoning models:

- **Qwen3 / Qwen3.5** -- `...` blocks
- **DeepSeek-R1** -- DeepSeek reasoning format
- **OpenAI GPT-OSS / GLM-4.7** -- GPT-OSS thinking format
- **Phi-4-reasoning** -- reasoning content extraction
- Enable/disable thinking per request
- Reasoning effort control (low/medium/high)
- Streaming reasoning content with proper tokenization

### Vision & Multimodal (VLM)

Full multimodal input support for vision-language models:

- **Images** -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
- **Video** -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
- **Audio** -- Base64 or URL audio input (Qwen3-Audio)
- Image detail levels: auto, low, high
- Dedicated MLLM cache for image/video embeddings (separate from KV cache)
- Send images directly in chat to any VL model

### Continuous Batching & Concurrency

Production-grade multi-user serving:

- **Continuous batching** -- handle 32+ concurrent requests with dynamic slot allocation
- **Prefill batching** -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
- **Completion batching** -- batch token generation across sequences
- **Stream interval control** -- configure streaming frequency
- **Request pooling** -- efficiently share GPU memory across concurrent sequences
- **Rate limiting** -- optional per-client request limits
- **API key authentication** -- optional `--api-key` flag for secured access

### 5-Layer Cache Stack

Multi-tier caching for maximum throughput and memory efficiency:

- **L1: Memory-Aware Prefix Cache** -- token-level semantic caching with LRU eviction, configurable memory allocation
- **L1 alt: Paged KV Cache** -- block-aware cache with reduced fragmentation for long contexts
- **L2: Disk Cache** -- persistent spillover to disk for large context windows
- **L2 alt: Block Disk Store** -- block-level disk persistence
- **KV Quantization** -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)
- **Hybrid SSM Cache** -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
- Automatic cache type selection based on model architecture
- Cache warming API (`POST /v1/cache/warm`) for pre-loading common prompts
- Cache stats API (`GET /v1/cache/stats`) for monitoring hit rates and memory usage

### Sampling & Generation Control

Full control over text generation:

- **Temperature** (0.0 - 2.0) -- creativity control
- **Top-P** (0.0 - 1.0) -- nucleus sampling
- **Top-K** (integer) -- top-K token filtering
- **Min-P** (0.0 - 1.0) -- minimum probability threshold
- **Repetition Penalty** -- penalize repeated tokens
- **Stop Sequences** -- custom stopping strings
- **Max Tokens** -- output length limit (up to 131072)
- **Request Timeout** -- per-request timeout override
- **Structured Output** -- `response_format` with `json_object` or `json_schema` modes for guaranteed valid JSON
- **Streaming** with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
- **Usage stats** in streaming responses (`stream_options.include_usage`)

### Model Conversion & Quantization

Convert models directly in-app via the Tools tab:

- **16-bit to MLX** -- convert HuggingFace safetensors to MLX format
- **16-bit to quantized** -- quantize to 2-bit, 4-bit, or 8-bit MLX
- **GGUF to MLX** -- import GGUF models into MLX safetensors format
- **MLX to JANG** -- adaptive mixed-precision quantization (different bits per layer type)
- **Model Inspector** -- view config.json, architecture, layer structure
- **Model Doctor** -- diagnostic checks (load test, token count, memory estimation)
- Progress tracking with real-time status

### Image Generation

Generate images locally with Flux and Z-Image models:

- **Flux Schnell** -- 4-step fast generation
- **Flux Dev** -- 20-step high-quality generation
- **Z-Image Turbo** -- fast turbo generation (4-bit and 8-bit)
- **Flux Klein** -- lightweight 4B parameter model
- **Flux Kontext** -- subject-consistent editing
- **Flux Krea** -- aesthetic fine-tuned generation
- Configurable steps, guidance scale, height, width, seed, sampler
- Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
- Quantized model support (2-bit to 8-bit)
- Image gallery with generation history, save, and settings persistence
- OpenAI-compatible `/v1/images/generations` endpoint with `usage` field

### Chat Interface

Full-featured conversation UI:

- **Persistent history** -- SQLite (WAL mode) with full message, metrics, and tool call history
- **Markdown rendering** -- GitHub-flavored markdown with syntax highlighting
- **Reasoning display** -- collapsible thinking sections for reasoning models
- **Tool call display** -- inline tool execution with status and results
- **Streaming metrics** -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
- **System prompts** -- per-chat custom system message
- **Chat settings** -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
- **Chat folders** -- hierarchical organization
- **Message search** -- full-text search across chat history
- **Export/Import** -- ShareGPT format
- **Voice chat** -- STT + TTS integration

### Model Management

- **HuggingFace browser** -- search, filter by text/image, and download models directly in-app
- **Download queue** -- multiple concurrent downloads with real-time progress bars and cancel support
- **Model size display** -- file sizes from safetensors metadata before downloading
- **Local model discovery** -- auto-scan `~/.mlxstudio/models`, `~/.cache/huggingface/hub`, `~/.exo/models`, and custom directories
- **Deduplication** -- strict format detection prevents false positive model matches
- **Zero-config detection** -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
- **65+ model families** in the auto-detection registry with two-tier detection (config.json `model_type` primary, name regex fallback)

### Desktop Experience

- **5 app modes** -- Chat, Server, Image, Tools, API
- **Menu bar tray** -- live server status, GPU memory, running models, quick controls
- **Multi-session** -- run multiple models simultaneously on different ports
- **Dock icon** -- restore on click, close-to-tray support
- **Dark and light themes** -- system-respecting
- **Keyboard shortcuts** -- common actions
- **Toast notifications** -- user feedback
- **Update banner** -- new version detection

---

## Advanced Quantization

MLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as **JANG adaptive mixed-precision** -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.

- Convert in-app via the Tools tab, or via CLI: `vmlx convert model --jang-profile JANG_3M`
- Pre-quantized models available at [JANGQ-AI on HuggingFace](https://huggingface.co/JANGQ-AI)
- Stays quantized in GPU memory -- native MLX `QuantizedLinear` + `quantized_matmul`
- Compatible with all caching layers (prefix, paged, disk, KV quant)

See the [vMLX source repo](https://github.com/jjang-ai/vmlx#advanced-quantization) for profiles and conversion details.

### Smelt Mode (Partial Expert Loading)

For MoE models that don't fit in RAM, **Smelt** loads only a subset of experts per layer from SSD and keeps the backbone resident. Response quality stays coherent while RAM usage drops; throughput scales inversely with expert % loaded because expert swaps hit SSD on the hot path.

**Benchmarks on `Nemotron-Cascade-2-30B-A3B-JANG_4M`** (23 MoE layers × 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):

| `--smelt-experts` | Active RAM | Decode tok/s | RAM saving | Coherent |
|---|---:|---:|---|---|
| _off (baseline)_ | **17,408 MB** | **89.9** | — | ✓ |
| `50` | 9,529 MB | **66.5** | **−45%** | ✓ |
| `25` | 5,590 MB | * | **−68%** | ✓ |

\* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.

All three configurations produced coherent, non-looping output. No quality degradation observed.

> **Credit**: Smelt mode is inspired by [Anemll's **flash-moe**](https://github.com/Anemll/flash-moe) — a pure C / Objective‑C / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with `pread()` on demand. vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight — thanks to the flash-moe team for validating the approach.

**Smelt is mutually exclusive with VLM mode.** MLX Studio / vMLX v1.3.33+ automatically disables `--is-mllm` when smelt is active (with a warning) because the vision tower is not wired through the partial-expert loader — image input on a smelt-loaded VLM would produce garbage logits. Use a text-only model when running smelt, or disable smelt when running a VLM.

Requires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).

---

## System Requirements

| Requirement | Minimum |
|---|---|
| **macOS** | 14.0 Sonoma or later |
| **Chip** | Apple Silicon (M1 / M2 / M3 / M4) |
| **RAM** | 8 GB (16 GB+ recommended for larger models) |
| **Disk** | ~500 MB for app; models range from 1-50 GB each |

---

## Build from Source

```bash
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx

# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir # .app bundle
npx electron-builder --mac dmg # DMG installer
```

---

## Links

| Resource | Link |
|---|---|
| **Source Code** | [github.com/jjang-ai/vmlx](https://github.com/jjang-ai/vmlx) |
| **PyPI** | [pypi.org/project/vmlx](https://pypi.org/project/vmlx/) |
| **MLX Models** | [huggingface.co/mlx-community](https://huggingface.co/mlx-community) |
| **JANG Models** | [huggingface.co/JANGQ-AI](https://huggingface.co/JANGQ-AI) |
| **Website** | [vmlx.net](https://vmlx.net) |

---

## License

Apache License 2.0

---

Built by Jinho Jang • eric@jangq.ai • JANGQ AI • Support on Ko-fi

---

## 한국어 (Korean)

### MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

Mac에서 LLM, VLM, 이미지 생성 및 편집 모델을 완전히 로컬로 실행하세요.

> **JANG 2비트가 MLX 4/3/2비트보다 높은 성능** — 적응형 혼합 정밀도 양자화(JANG\_2S, JANG\_2.6)가 MiniMax M2.5, Qwen3 등에서 표준 MLX 양자화를 능가합니다. [jangq.ai](https://jangq.ai)에서 벤치마크 확인. [JANGQ-AI](https://huggingface.co/JANGQ-AI)에서 사전 양자화 모델 다운로드.

**설치:** [최신 DMG 다운로드](https://github.com/jjang-ai/mlxstudio/releases/latest) — 드래그 앤 드롭으로 설치.

### 주요 기능

| 기능 | 설명 |
|------|------|
| **채팅** | 대화 인터페이스, 도구 호출, 에이전트 코딩 |
| **이미지 생성** | Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein |
| **이미지 편집** | Qwen Image Edit (텍스트 지시 기반 편집) |
| **5단계 캐싱** | 프리픽스, 페이지드, KV 양자화, 디스크 캐시 |
| **API 서버** | OpenAI + Anthropic 호환 API |
| **30개 도구** | 파일, 웹 검색, Git, 터미널 내장 도구 |

개발자: 장진호 (eric@jangq.ai)

JANGQ AI •
Ko-fi로 후원하기

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jjang-ai/mlxstudio

Awesome Lists containing this project

README

The native macOS desktop app for local AI on Apple Silicon