An open API service indexing awesome lists of open source software.

https://github.com/jjang-ai/mlxstudio

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)
https://github.com/jjang-ai/mlxstudio

ai ai-agents anthropic anthropic-api apple-silicon inference inference-engine llm lmstudio macbook macstudio mlx mlxllm mlxstudio omlx omlx-alternative openai-api

Last synced: 10 days ago
JSON representation

MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)

Awesome Lists containing this project

README

          





MLX Studio

The native macOS desktop app for local AI on Apple Silicon


Latest Release
Downloads
Platform
PyPI
License
Ko-fi



Get vMLX v2 Swift

  

Legacy Python DMG


vMLX v2 — native Swift + Metal, 50–95 t/s on M-series.

Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.

The Python panel above remains available for legacy support.


Features
Screenshots
API Server
Image Generation
JANG Quantization
Requirements
Build
한국어

---

MLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on [mlx-community](https://huggingface.co/mlx-community) -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. Built on [vMLX Engine](https://github.com/jjang-ai/vmlx) and Apple's [MLX](https://github.com/ml-explore/mlx) framework.

> **JANG 2-bit destroys MLX 4-bit on [MiniMax M2.5](https://huggingface.co/JANGQ-AI/MiniMax-M2.5-JANG_2L):**
>
> | Quantization | MMLU (200q) | Size |
> |---|---|---|
> | **JANG\_2L (2-bit)** | **74%** | **89 GB** |
> | MLX 4-bit | 26.5% | 120 GB |
> | MLX 3-bit | 24.5% | 93 GB |
> | MLX 2-bit | 25% | 68 GB |
>
> Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at [jangq.ai](https://jangq.ai). Models at [JANGQ-AI](https://huggingface.co/JANGQ-AI).

---

## Install

### Option 1: Download the App (Recommended)

> **[Download the latest DMG](https://github.com/jjang-ai/mlxstudio/releases/latest)** -- one file, ready to go.

1. Download `vMLX-X.Y.Z-arm64.dmg`
2. Open the DMG and drag to Applications
3. Launch -- that's it

All releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.

### Option 2: CLI via pip (Engine Only)

The vMLX inference engine is published on [PyPI as `vmlx`](https://pypi.org/project/vmlx/) -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.

```bash
# Recommended: use uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or with pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit
```

> **Note:** On macOS 14+, `pip install vmlx` without a venv will fail with "externally-managed-environment". Use `uv`, `pipx`, or create a venv first.

Once running, your local OpenAI-compatible API server is live at `http://localhost:8000`. Point any OpenAI or Anthropic SDK client at it.

---

## Quick Start

1. **Launch** MLX Studio from Applications
2. **Pick a model** -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., `mlx-community/Qwen3-8B-4bit`)
3. **Start the session** -- the model downloads automatically and the server starts
4. **Chat** -- switch to the Chat tab and start talking

That's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.

---

## Screenshots



Chat Interface
Streaming conversations with thinking mode, code highlighting, and markdown

Agentic Coding
Full tool calling with file I/O, shell execution, and web search



Image Generation & Editing
Flux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit

Anthropic API Compatible
Drop-in /v1/messages endpoint for Anthropic SDK clients



Developer Tools
Convert, inspect, and diagnose models

Model Conversion
GGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision



HuggingFace Browser
Search and download models directly in-app

Menu Bar
Running models, GPU memory, and quick controls

---

## Features

### Model Support (65+ Model Families)

Run any MLX model from HuggingFace -- thousands of models, zero configuration:

- **Text LLMs** -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, **Mistral-Medium-3.5** (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, **Laguna** (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model
- **Vision LLMs (VL)** -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat
- **Multimodal Omni** -- **Nemotron-3-Nano-Omni** (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across `/v1/chat/completions`, `/v1/messages`, `/v1/responses`, and `/api/chat`
- **Mixture-of-Experts** -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)
- **Hybrid SSM Models** -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)
- **Image Generation** -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)
- **Image Editing** -- Qwen Image Edit (instruction-based editing, full precision)
- **Audio** -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)
- **JANG Models** -- Adaptive mixed-precision quantized models from [JANGQ-AI](https://huggingface.co/JANGQ-AI), stay quantized in GPU memory via native `QuantizedLinear`
- **GGUF Import** -- Convert GGUF models to MLX format directly in-app

### OpenAI-Compatible API Server

Every session launches a full API server. Point any OpenAI SDK client at your local endpoint:

- `POST /v1/chat/completions` -- Chat Completions API with streaming, tool calling, vision, structured output
- `POST /v1/responses` -- OpenAI Responses API (agentic format) with streaming
- `POST /v1/completions` -- Text completions
- `POST /v1/images/generations` -- Image generation (Flux/Z-Image models, OpenAI format with `usage` field)
- `POST /v1/images/edits` -- Image editing (Qwen Image Edit, instruction-based)
- `POST /v1/embeddings` -- Text embeddings with dimension control and batch processing
- `POST /v1/rerank` -- Document reranking
- `POST /v1/audio/speech` -- Text-to-speech (Kokoro TTS)
- `POST /v1/audio/transcriptions` -- Speech-to-text (Whisper)
- `GET /v1/models` -- List loaded models
- `GET /health` -- Server health with VRAM usage, queue length, load times

### Anthropic API Compatibility

Drop-in replacement for the Anthropic Claude API:

- `POST /v1/messages` -- Anthropic Messages API format
- Anthropic SDK tool calling format (auto-translated to internal format)
- Vision/multimodal support via Anthropic content blocks
- Use the Anthropic Python/TypeScript SDK -- just change the `base_url` to your local server
- Copy-paste code snippets in the API tab for curl, Python, and JavaScript

### Tool Calling & Agentic Workflows (14 Parsers)

Auto-detected tool call parsers for every major model family:

- **Qwen** (qwen3, qwen2.5) -- `` XML format
- **Llama 3** -- `` format
- **Mistral** -- `[TOOL_CALLS]` format
- **Hermes** -- `` JSON format
- **DeepSeek** -- function call blocks
- **GLM-4.7** -- GLM tool format
- **MiniMax** -- MiniMax function calling
- **Nemotron** -- NVIDIA Nemotron tool format
- **Granite** -- IBM Granite format
- **Functionary** -- Functionary v3 format
- **XLAM** -- Salesforce xLAM format
- **Kimi** -- Moonshot Kimi format
- **Step-3.5** -- StepFun format
- Auto-detection from `model_type` in config.json with regex name fallback

**26+ Built-in Tools:**
- **File I/O** -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree
- **Search** -- ripgrep file search with regex and glob, glob file finder, unified diff
- **Execution** -- shell commands (60s timeout), background processes (5m auto-kill), process output polling
- **Web** -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text
- **Developer** -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)
- **Interactive** -- `ask_user` tool for human-in-the-loop interrupts
- Per-category toggles: enable/disable file, search, shell, web tools independently
- Auto-continue agent loops (up to 10 tool iterations per request)
- **MCP (Model Context Protocol)** -- connect external tool servers, merge tool definitions, execute MCP tools via API

### Reasoning Model Support (4 Parsers)

Collapsible thinking blocks with dedicated parsing for reasoning models:

- **Qwen3 / Qwen3.5** -- `...` blocks
- **DeepSeek-R1** -- DeepSeek reasoning format
- **OpenAI GPT-OSS / GLM-4.7** -- GPT-OSS thinking format
- **Phi-4-reasoning** -- reasoning content extraction
- Enable/disable thinking per request
- Reasoning effort control (low/medium/high)
- Streaming reasoning content with proper tokenization

### Vision & Multimodal (VLM)

Full multimodal input support for vision-language models:

- **Images** -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)
- **Video** -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS
- **Audio** -- Base64 or URL audio input (Qwen3-Audio)
- Image detail levels: auto, low, high
- Dedicated MLLM cache for image/video embeddings (separate from KV cache)
- Send images directly in chat to any VL model

### Continuous Batching & Concurrency

Production-grade multi-user serving:

- **Continuous batching** -- handle 32+ concurrent requests with dynamic slot allocation
- **Prefill batching** -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)
- **Completion batching** -- batch token generation across sequences
- **Stream interval control** -- configure streaming frequency
- **Request pooling** -- efficiently share GPU memory across concurrent sequences
- **Rate limiting** -- optional per-client request limits
- **API key authentication** -- optional `--api-key` flag for secured access

### 5-Layer Cache Stack

Multi-tier caching for maximum throughput and memory efficiency:

- **L1: Memory-Aware Prefix Cache** -- token-level semantic caching with LRU eviction, configurable memory allocation
- **L1 alt: Paged KV Cache** -- block-aware cache with reduced fragmentation for long contexts
- **L2: Disk Cache** -- persistent spillover to disk for large context windows
- **L2 alt: Block Disk Store** -- block-level disk persistence
- **KV Quantization** -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)
- **Hybrid SSM Cache** -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)
- Automatic cache type selection based on model architecture
- Cache warming API (`POST /v1/cache/warm`) for pre-loading common prompts
- Cache stats API (`GET /v1/cache/stats`) for monitoring hit rates and memory usage

### Sampling & Generation Control

Full control over text generation:

- **Temperature** (0.0 - 2.0) -- creativity control
- **Top-P** (0.0 - 1.0) -- nucleus sampling
- **Top-K** (integer) -- top-K token filtering
- **Min-P** (0.0 - 1.0) -- minimum probability threshold
- **Repetition Penalty** -- penalize repeated tokens
- **Stop Sequences** -- custom stopping strings
- **Max Tokens** -- output length limit (up to 131072)
- **Request Timeout** -- per-request timeout override
- **Structured Output** -- `response_format` with `json_object` or `json_schema` modes for guaranteed valid JSON
- **Streaming** with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)
- **Usage stats** in streaming responses (`stream_options.include_usage`)

### Model Conversion & Quantization

Convert models directly in-app via the Tools tab:

- **16-bit to MLX** -- convert HuggingFace safetensors to MLX format
- **16-bit to quantized** -- quantize to 2-bit, 4-bit, or 8-bit MLX
- **GGUF to MLX** -- import GGUF models into MLX safetensors format
- **MLX to JANG** -- adaptive mixed-precision quantization (different bits per layer type)
- **Model Inspector** -- view config.json, architecture, layer structure
- **Model Doctor** -- diagnostic checks (load test, token count, memory estimation)
- Progress tracking with real-time status

### Image Generation

Generate images locally with Flux and Z-Image models:

- **Flux Schnell** -- 4-step fast generation
- **Flux Dev** -- 20-step high-quality generation
- **Z-Image Turbo** -- fast turbo generation (4-bit and 8-bit)
- **Flux Klein** -- lightweight 4B parameter model
- **Flux Kontext** -- subject-consistent editing
- **Flux Krea** -- aesthetic fine-tuned generation
- Configurable steps, guidance scale, height, width, seed, sampler
- Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde
- Quantized model support (2-bit to 8-bit)
- Image gallery with generation history, save, and settings persistence
- OpenAI-compatible `/v1/images/generations` endpoint with `usage` field

### Chat Interface

Full-featured conversation UI:

- **Persistent history** -- SQLite (WAL mode) with full message, metrics, and tool call history
- **Markdown rendering** -- GitHub-flavored markdown with syntax highlighting
- **Reasoning display** -- collapsible thinking sections for reasoning models
- **Tool call display** -- inline tool execution with status and results
- **Streaming metrics** -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate
- **System prompts** -- per-chat custom system message
- **Chat settings** -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences
- **Chat folders** -- hierarchical organization
- **Message search** -- full-text search across chat history
- **Export/Import** -- ShareGPT format
- **Voice chat** -- STT + TTS integration

### Model Management

- **HuggingFace browser** -- search, filter by text/image, and download models directly in-app
- **Download queue** -- multiple concurrent downloads with real-time progress bars and cancel support
- **Model size display** -- file sizes from safetensors metadata before downloading
- **Local model discovery** -- auto-scan `~/.mlxstudio/models`, `~/.cache/huggingface/hub`, `~/.exo/models`, and custom directories
- **Deduplication** -- strict format detection prevents false positive model matches
- **Zero-config detection** -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates
- **65+ model families** in the auto-detection registry with two-tier detection (config.json `model_type` primary, name regex fallback)

### Desktop Experience

- **5 app modes** -- Chat, Server, Image, Tools, API
- **Menu bar tray** -- live server status, GPU memory, running models, quick controls
- **Multi-session** -- run multiple models simultaneously on different ports
- **Dock icon** -- restore on click, close-to-tray support
- **Dark and light themes** -- system-respecting
- **Keyboard shortcuts** -- common actions
- **Toast notifications** -- user feedback
- **Update banner** -- new version detection

---

## Advanced Quantization

MLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as **JANG adaptive mixed-precision** -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.

- Convert in-app via the Tools tab, or via CLI: `vmlx convert model --jang-profile JANG_3M`
- Pre-quantized models available at [JANGQ-AI on HuggingFace](https://huggingface.co/JANGQ-AI)
- Stays quantized in GPU memory -- native MLX `QuantizedLinear` + `quantized_matmul`
- Compatible with all caching layers (prefix, paged, disk, KV quant)

See the [vMLX source repo](https://github.com/jjang-ai/vmlx#advanced-quantization) for profiles and conversion details.

### Smelt Mode (Partial Expert Loading)

For MoE models that don't fit in RAM, **Smelt** loads only a subset of experts per layer from SSD and keeps the backbone resident. Response quality stays coherent while RAM usage drops; throughput scales inversely with expert % loaded because expert swaps hit SSD on the hot path.

**Benchmarks on `Nemotron-Cascade-2-30B-A3B-JANG_4M`** (23 MoE layers × 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):

| `--smelt-experts` | Active RAM | Decode tok/s | RAM saving | Coherent |
|---|---:|---:|---|---|
| _off (baseline)_ | **17,408 MB** | **89.9** | — | ✓ |
| `50` | 9,529 MB | **66.5** | **−45%** | ✓ |
| `25` | 5,590 MB | * | **−68%** | ✓ |

\* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.

All three configurations produced coherent, non-looping output. No quality degradation observed.

> **Credit**: Smelt mode is inspired by [Anemll's **flash-moe**](https://github.com/Anemll/flash-moe) — a pure C / Objective‑C / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with `pread()` on demand. vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight — thanks to the flash-moe team for validating the approach.

**Smelt is mutually exclusive with VLM mode.** MLX Studio / vMLX v1.3.33+ automatically disables `--is-mllm` when smelt is active (with a warning) because the vision tower is not wired through the partial-expert loader — image input on a smelt-loaded VLM would produce garbage logits. Use a text-only model when running smelt, or disable smelt when running a VLM.

Requires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).

---

## System Requirements

| Requirement | Minimum |
|---|---|
| **macOS** | 14.0 Sonoma or later |
| **Chip** | Apple Silicon (M1 / M2 / M3 / M4) |
| **RAM** | 8 GB (16 GB+ recommended for larger models) |
| **Disk** | ~500 MB for app; models range from 1-50 GB each |

---

## Build from Source

```bash
git clone https://github.com/jjang-ai/vmlx.git
cd vmlx

# Python engine
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Electron app
cd panel && npm install && npm run build
npx electron-builder --mac --dir # .app bundle
npx electron-builder --mac dmg # DMG installer
```

---

## Links

| Resource | Link |
|---|---|
| **Source Code** | [github.com/jjang-ai/vmlx](https://github.com/jjang-ai/vmlx) |
| **PyPI** | [pypi.org/project/vmlx](https://pypi.org/project/vmlx/) |
| **MLX Models** | [huggingface.co/mlx-community](https://huggingface.co/mlx-community) |
| **JANG Models** | [huggingface.co/JANGQ-AI](https://huggingface.co/JANGQ-AI) |
| **Website** | [vmlx.net](https://vmlx.net) |

---

## License

Apache License 2.0

---


Built by Jinho Jangeric@jangq.aiJANGQ AISupport on Ko-fi

---

## 한국어 (Korean)

### MLX Studio — Apple Silicon을 위한 네이티브 macOS AI 앱

Mac에서 LLM, VLM, 이미지 생성 및 편집 모델을 완전히 로컬로 실행하세요.

> **JANG 2비트가 MLX 4/3/2비트보다 높은 성능** — 적응형 혼합 정밀도 양자화(JANG\_2S, JANG\_2.6)가 MiniMax M2.5, Qwen3 등에서 표준 MLX 양자화를 능가합니다. [jangq.ai](https://jangq.ai)에서 벤치마크 확인. [JANGQ-AI](https://huggingface.co/JANGQ-AI)에서 사전 양자화 모델 다운로드.

**설치:** [최신 DMG 다운로드](https://github.com/jjang-ai/mlxstudio/releases/latest) — 드래그 앤 드롭으로 설치.

### 주요 기능

| 기능 | 설명 |
|------|------|
| **채팅** | 대화 인터페이스, 도구 호출, 에이전트 코딩 |
| **이미지 생성** | Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein |
| **이미지 편집** | Qwen Image Edit (텍스트 지시 기반 편집) |
| **5단계 캐싱** | 프리픽스, 페이지드, KV 양자화, 디스크 캐시 |
| **API 서버** | OpenAI + Anthropic 호환 API |
| **30개 도구** | 파일, 웹 검색, Git, 터미널 내장 도구 |


개발자: 장진호 (eric@jangq.ai)

JANGQ AI
Ko-fi로 후원하기