{"id":47352173,"url":"https://github.com/jjang-ai/mlxstudio","last_synced_at":"2026-05-22T20:03:41.362Z","repository":{"id":342358704,"uuid":"1173721368","full_name":"jjang-ai/mlxstudio","owner":"jjang-ai","description":"MLX Studio - Home of JANG_Q - Image Gen/Edit + Chat/Code All in one - + OpenClaw (Anthropic API)","archived":false,"fork":false,"pushed_at":"2026-05-15T20:08:44.000Z","size":2170,"stargazers_count":720,"open_issues_count":9,"forks_count":47,"subscribers_count":7,"default_branch":"main","last_synced_at":"2026-05-15T23:31:37.070Z","etag":null,"topics":["ai","ai-agents","anthropic","anthropic-api","apple-silicon","inference","inference-engine","llm","lmstudio","macbook","macstudio","mlx","mlxllm","mlxstudio","omlx","omlx-alternative","openai-api"],"latest_commit_sha":null,"homepage":"https://mlx.studio","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jjang-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-05T17:17:53.000Z","updated_at":"2026-05-15T20:07:06.000Z","dependencies_parsed_at":"2026-04-01T23:05:37.966Z","dependency_job_id":null,"html_url":"https://github.com/jjang-ai/mlxstudio","commit_stats":null,"previous_names":["vmlxllm/vmlx-releases","jjang-ai/vmlx-releases"],"tags_count":128,"template":false,"template_full_name":null,"purl":"pkg:github/jjang-ai/mlxstudio","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjang-ai%2Fmlxstudio","tags_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/jjang-ai%2Fmlxstudio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjang-ai%2Fmlxstudio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjang-ai%2Fmlxstudio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jjang-ai","download_url":"https://codeload.github.com/jjang-ai/mlxstudio/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjang-ai%2Fmlxstudio/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33364334,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T12:23:38.849Z","status":"online","status_checked_at":"2026-05-22T02:00:06.671Z","response_time":265,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-agents","anthropic","anthropic-api","apple-silicon","inference","inference-engine","llm","lmstudio","macbook","macstudio","mlx","mlxllm","mlxstudio","omlx","omlx-alternative","openai-api"],"created_at":"2026-03-18T00:31:14.284Z","updated_at":"2026-05-22T20:03:41.301Z","avatar_url":"https://github.com/jjang-ai.png","language":null,"funding_links":["https://ko-fi.com/jangml"],"categories":["*Ops for AI","LLM \u0026 Inference"],"sub_categories":["Model Serving \u0026 Inference"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: 
dark)\" srcset=\"assets/logo-wide-dark.png\"\u003e\n    \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"assets/logo-wide-light.png\"\u003e\n    \u003cimg alt=\"MLX Studio\" src=\"assets/logo-wide-light.png\" width=\"400\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003eThe native macOS desktop app for local AI on Apple Silicon\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/jjang-ai/mlxstudio/releases/latest\"\u003e\u003cimg src=\"https://img.shields.io/github/v/release/jjang-ai/mlxstudio?style=flat-square\u0026label=Latest%20Release\u0026color=blue\" alt=\"Latest Release\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/jjang-ai/mlxstudio/releases\"\u003e\u003cimg src=\"https://img.shields.io/github/downloads/jjang-ai/mlxstudio/total?style=flat-square\u0026label=Downloads\u0026color=green\" alt=\"Downloads\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Platform-macOS%20ARM64-lightgrey?style=flat-square\u0026logo=apple\" alt=\"Platform\"\u003e\n  \u003ca href=\"https://pypi.org/project/vmlx/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/vmlx?style=flat-square\u0026label=vMLX%20Engine\u0026color=%234B8BBE\u0026logo=python\u0026logoColor=white\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/jjang-ai/mlxstudio/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://ko-fi.com/jangml\"\u003e\u003cimg src=\"https://img.shields.io/badge/Support-Ko--fi-FF5E5B?style=flat-square\u0026logo=ko-fi\u0026logoColor=white\" alt=\"Ko-fi\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/jjang-ai/vmlx/releases?q=tag%3Av2\u0026expanded=true\"\u003e\n    \u003cimg 
src=\"https://img.shields.io/badge/%E2%AC%87%EF%B8%8F_Get_vMLX_v2_(Swift)-Recommended-orange?style=for-the-badge\u0026logo=swift\u0026logoColor=white\" alt=\"Get vMLX v2 Swift\" height=\"40\"\u003e\n  \u003c/a\u003e\n  \u0026nbsp;\u0026nbsp;\n  \u003ca href=\"https://github.com/jjang-ai/mlxstudio/releases/latest\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Legacy_Python_Panel-1.4.0_DMG-lightgrey?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Legacy Python DMG\" height=\"40\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003evMLX v2 — native Swift + Metal, 50–95 t/s on M-series.\u003c/b\u003e\u003cbr\u003e\n  Zero PyTorch in the hot path. Pure SwiftUI. Drag and drop models.\u003cbr\u003e\n  The Python panel above remains available for legacy support.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#features\"\u003eFeatures\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#screenshots\"\u003eScreenshots\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#api-server\"\u003eAPI Server\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#image-generation\"\u003eImage Generation\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#advanced-quantization\"\u003eJANG Quantization\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#system-requirements\"\u003eRequirements\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#build-from-source\"\u003eBuild\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#한국어-korean\"\u003e한국어\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\nMLX Studio is a complete desktop app for running LLMs, VLMs, and image generation models locally on your Mac. No cloud, no API keys, no data leaving your machine. Supports every model on [mlx-community](https://huggingface.co/mlx-community) -- Qwen, Llama, Mistral, Gemma, Phi, DeepSeek, and thousands more. 
Built on [vMLX Engine](https://github.com/jjang-ai/vmlx) and Apple's [MLX](https://github.com/ml-explore/mlx) framework.\n\n\u003e **JANG 2-bit destroys MLX 4-bit on [MiniMax M2.5](https://huggingface.co/JANGQ-AI/MiniMax-M2.5-JANG_2L):**\n\u003e\n\u003e | Quantization | MMLU (200q) | Size |\n\u003e |---|---|---|\n\u003e | **JANG\\_2L (2-bit)** | **74%** | **89 GB** |\n\u003e | MLX 4-bit | 26.5% | 120 GB |\n\u003e | MLX 3-bit | 24.5% | 93 GB |\n\u003e | MLX 2-bit | 25% | 68 GB |\n\u003e\n\u003e Adaptive mixed-precision quantization keeps critical layers at higher precision while compressing the rest. Check scores at [jangq.ai](https://jangq.ai). Models at [JANGQ-AI](https://huggingface.co/JANGQ-AI).\n\n---\n\n## Install\n\n### Option 1: Download the App (Recommended)\n\n\u003e **[Download the latest DMG](https://github.com/jjang-ai/mlxstudio/releases/latest)** -- one file, ready to go.\n\n1. Download `vMLX-X.Y.Z-arm64.dmg`\n2. Open the DMG and drag to Applications\n3. Launch -- that's it\n\nAll releases are code-signed and notarized by Apple for macOS Gatekeeper. No Homebrew, no pip, no Xcode required.\n\n### Option 2: CLI via pip (Engine Only)\n\nThe vMLX inference engine is published on [PyPI as `vmlx`](https://pypi.org/project/vmlx/) -- same engine that powers the desktop app, available as a standalone CLI. This is real, published software with 1,894+ tests.\n\n```bash\n# Recommended: use uv (fast, no venv hassle)\nbrew install uv\nuv tool install vmlx\nvmlx serve mlx-community/Qwen3-8B-4bit\n\n# Or with pipx (isolates from system Python)\nbrew install pipx\npipx install vmlx\nvmlx serve mlx-community/Qwen3-8B-4bit\n\n# Or with pip in a virtual environment\npython3 -m venv ~/.vmlx-env \u0026\u0026 source ~/.vmlx-env/bin/activate\npip install vmlx\nvmlx serve mlx-community/Qwen3-8B-4bit\n```\n\n\u003e **Note:** On macOS 14+, `pip install vmlx` without a venv will fail with \"externally-managed-environment\". 
Use `uv`, `pipx`, or create a venv first.\n\nOnce running, your local OpenAI-compatible API server is live at `http://localhost:8000`. Point any OpenAI or Anthropic SDK client at it.\n\n---\n\n## Quick Start\n\n1. **Launch** MLX Studio from Applications\n2. **Pick a model** -- browse HuggingFace models in the Server tab, or enter a repo name (e.g., `mlx-community/Qwen3-8B-4bit`)\n3. **Start the session** -- the model downloads automatically and the server starts\n4. **Chat** -- switch to the Chat tab and start talking\n\nThat's it. The app manages the entire Python engine, model downloads, and server lifecycle for you.\n\n---\n\n## Screenshots\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/chat-tab.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eChat Interface\u003c/b\u003e\u003cbr\u003e\u003cem\u003eStreaming conversations with thinking mode, code highlighting, and markdown\u003c/em\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/agentic-chat.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eAgentic Coding\u003c/b\u003e\u003cbr\u003e\u003cem\u003eFull tool calling with file I/O, shell execution, and web search\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/image-tab.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eImage Generation \u0026 Editing\u003c/b\u003e\u003cbr\u003e\u003cem\u003eFlux Schnell, Dev, Z-Image Turbo, Klein + Qwen Image Edit\u003c/em\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/anthropic-api.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eAnthropic API Compatible\u003c/b\u003e\u003cbr\u003e\u003cem\u003eDrop-in /v1/messages endpoint for Anthropic SDK clients\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/tools-tab.png\" 
width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eDeveloper Tools\u003c/b\u003e\u003cbr\u003e\u003cem\u003eConvert, inspect, and diagnose models\u003c/em\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/gguf-to-mlx.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eModel Conversion\u003c/b\u003e\u003cbr\u003e\u003cem\u003eGGUF to MLX, 16-bit to quantized, and JANG adaptive mixed-precision\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/jangq-models.png\" width=\"450\"\u003e\u003cbr\u003e\u003cb\u003eHuggingFace Browser\u003c/b\u003e\u003cbr\u003e\u003cem\u003eSearch and download models directly in-app\u003c/em\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"assets/menu-bar.png\" width=\"300\"\u003e\u003cbr\u003e\u003cb\u003eMenu Bar\u003c/b\u003e\u003cbr\u003e\u003cem\u003eRunning models, GPU memory, and quick controls\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n## Features\n\n### Model Support (65+ Model Families)\n\nRun any MLX model from HuggingFace -- thousands of models, zero configuration:\n\n- **Text LLMs** -- Qwen 2/2.5/3/3.5/3.6, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral/Codestral, **Mistral-Medium-3.5** (ministral3, dense GQA + 256K YaRN + PIXTRAL vision), Mistral-Small-4 (MLA), Gemma 2/3/4, Phi-3/4, DeepSeek V2/V3/V4 (MLA), GLM-4/4.7/5, Nemotron, **Laguna** (poolside, 33B/3B SWA MoE), MiniMax M2.5/M2.7, Kimi K2.5/K2.6, Step, XVERSE, Yi, InternLM, ChatGLM, CodeLlama, and any mlx-lm compatible model\n- **Vision LLMs (VL)** -- Qwen-VL, Qwen2.5-VL, Qwen3.5-VL / Qwen3.6-VL, Pixtral, InternVL, LLaVA, Gemma 3n / 4-VL, Phi-3-Vision, Mistral-Medium-3.5 (PIXTRAL) -- send images and video directly in chat\n- **Multimodal Omni** -- **Nemotron-3-Nano-Omni** (text + image + audio + video) with Parakeet audio encoder + RADIO ViT vision tower; routed via OmniMultimodalDispatcher across `/v1/chat/completions`, 
`/v1/messages`, `/v1/responses`, and `/api/chat`\n- **Mixture-of-Experts** -- Qwen 3.5/3.6 MoE, Mixtral 8x7B/8x22B, DeepSeek V2/V3/V4, MiniMax M2.5/M2.7, Llama 4 Scout/Maverick, Laguna (256 routed experts top-8 + 1 shared)\n- **Hybrid SSM Models** -- Nemotron-H, Nemotron-3-Nano-Omni, Jamba, GatedDeltaNet, Qwen3.5-A3B hybrid, Granite MoE Hybrid, LFM2 (Mamba + Attention with dedicated hybrid cache + SSM companion + capture-during-prefill)\n- **Image Generation** -- Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein 4B/9B (via mflux)\n- **Image Editing** -- Qwen Image Edit (instruction-based editing, full precision)\n- **Audio** -- Kokoro TTS, Whisper STT, Qwen3-Audio (via mlx-audio)\n- **JANG Models** -- Adaptive mixed-precision quantized models from [JANGQ-AI](https://huggingface.co/JANGQ-AI), stay quantized in GPU memory via native `QuantizedLinear`\n- **GGUF Import** -- Convert GGUF models to MLX format directly in-app\n\n### OpenAI-Compatible API Server\n\nEvery session launches a full API server. 
Point any OpenAI SDK client at your local endpoint:\n\n- `POST /v1/chat/completions` -- Chat Completions API with streaming, tool calling, vision, structured output\n- `POST /v1/responses` -- OpenAI Responses API (agentic format) with streaming\n- `POST /v1/completions` -- Text completions\n- `POST /v1/images/generations` -- Image generation (Flux/Z-Image models, OpenAI format with `usage` field)\n- `POST /v1/images/edits` -- Image editing (Qwen Image Edit, instruction-based)\n- `POST /v1/embeddings` -- Text embeddings with dimension control and batch processing\n- `POST /v1/rerank` -- Document reranking\n- `POST /v1/audio/speech` -- Text-to-speech (Kokoro TTS)\n- `POST /v1/audio/transcriptions` -- Speech-to-text (Whisper)\n- `GET /v1/models` -- List loaded models\n- `GET /health` -- Server health with VRAM usage, queue length, load times\n\n### Anthropic API Compatibility\n\nDrop-in replacement for the Anthropic Claude API:\n\n- `POST /v1/messages` -- Anthropic Messages API format\n- Anthropic SDK tool calling format (auto-translated to internal format)\n- Vision/multimodal support via Anthropic content blocks\n- Use the Anthropic Python/TypeScript SDK -- just change the `base_url` to your local server\n- Copy-paste code snippets in the API tab for curl, Python, and JavaScript\n\n### Tool Calling \u0026 Agentic Workflows (14 Parsers)\n\nAuto-detected tool call parsers for every major model family:\n\n- **Qwen** (qwen3, qwen2.5) -- `\u003ctool_call\u003e` XML format\n- **Llama 3** -- `\u003cfunction=name\u003e` format\n- **Mistral** -- `[TOOL_CALLS]` format\n- **Hermes** -- `\u003ctool_call\u003e` JSON format\n- **DeepSeek** -- function call blocks\n- **GLM-4.7** -- GLM tool format\n- **MiniMax** -- MiniMax function calling\n- **Nemotron** -- NVIDIA Nemotron tool format\n- **Granite** -- IBM Granite format\n- **Functionary** -- Functionary v3 format\n- **XLAM** -- Salesforce xLAM format\n- **Kimi** -- Moonshot Kimi format\n- **Step-3.5** -- StepFun format\n- 
Auto-detection from `model_type` in config.json with regex name fallback\n\n**26+ Built-in Tools:**\n- **File I/O** -- read, write, edit, patch, copy, move, delete, create directory, list directory, file info, insert text, replace lines, directory tree\n- **Search** -- ripgrep file search with regex and glob, glob file finder, unified diff\n- **Execution** -- shell commands (60s timeout), background processes (5m auto-kill), process output polling\n- **Web** -- DuckDuckGo search, Brave Search API, URL fetch with HTML-to-text\n- **Developer** -- token counter, regex find-replace across files, git operations, clipboard read/write, diagnostics (TypeScript/ESLint/Python linting)\n- **Interactive** -- `ask_user` tool for human-in-the-loop interrupts\n- Per-category toggles: enable/disable file, search, shell, web tools independently\n- Auto-continue agent loops (up to 10 tool iterations per request)\n- **MCP (Model Context Protocol)** -- connect external tool servers, merge tool definitions, execute MCP tools via API\n\n### Reasoning Model Support (4 Parsers)\n\nCollapsible thinking blocks with dedicated parsing for reasoning models:\n\n- **Qwen3 / Qwen3.5** -- `\u003cthink\u003e...\u003c/think\u003e` blocks\n- **DeepSeek-R1** -- DeepSeek reasoning format\n- **OpenAI GPT-OSS / GLM-4.7** -- GPT-OSS thinking format\n- **Phi-4-reasoning** -- reasoning content extraction\n- Enable/disable thinking per request\n- Reasoning effort control (low/medium/high)\n- Streaming reasoning content with proper tokenization\n\n### Vision \u0026 Multimodal (VLM)\n\nFull multimodal input support for vision-language models:\n\n- **Images** -- PNG, JPEG, WebP via base64 or URL (up to 50 MB)\n- **Video** -- MP4, MOV, WebM via base64 or URL (up to 200 MB), smart frame extraction (8-64 frames), configurable FPS\n- **Audio** -- Base64 or URL audio input (Qwen3-Audio)\n- Image detail levels: auto, low, high\n- Dedicated MLLM cache for image/video embeddings (separate from KV cache)\n- Send images 
directly in chat to any VL model\n\n### Continuous Batching \u0026 Concurrency\n\nProduction-grade multi-user serving:\n\n- **Continuous batching** -- handle 32+ concurrent requests with dynamic slot allocation\n- **Prefill batching** -- batch prompt processing with configurable batch size (prevents Metal GPU timeouts)\n- **Completion batching** -- batch token generation across sequences\n- **Stream interval control** -- configure streaming frequency\n- **Request pooling** -- efficiently share GPU memory across concurrent sequences\n- **Rate limiting** -- optional per-client request limits\n- **API key authentication** -- optional `--api-key` flag for secured access\n\n### 5-Layer Cache Stack\n\nMulti-tier caching for maximum throughput and memory efficiency:\n\n- **L1: Memory-Aware Prefix Cache** -- token-level semantic caching with LRU eviction, configurable memory allocation\n- **L1 alt: Paged KV Cache** -- block-aware cache with reduced fragmentation for long contexts\n- **L2: Disk Cache** -- persistent spillover to disk for large context windows\n- **L2 alt: Block Disk Store** -- block-level disk persistence\n- **KV Quantization** -- q4/q8 quantized KV cache at storage boundary (2-4x memory savings, no accuracy loss)\n- **Hybrid SSM Cache** -- dedicated cache for Mamba + Attention architectures (Nemotron-H, Jamba, GatedDeltaNet)\n- Automatic cache type selection based on model architecture\n- Cache warming API (`POST /v1/cache/warm`) for pre-loading common prompts\n- Cache stats API (`GET /v1/cache/stats`) for monitoring hit rates and memory usage\n\n### Sampling \u0026 Generation Control\n\nFull control over text generation:\n\n- **Temperature** (0.0 - 2.0) -- creativity control\n- **Top-P** (0.0 - 1.0) -- nucleus sampling\n- **Top-K** (integer) -- top-K token filtering\n- **Min-P** (0.0 - 1.0) -- minimum probability threshold\n- **Repetition Penalty** -- penalize repeated tokens\n- **Stop Sequences** -- custom stopping strings\n- **Max Tokens** -- output 
length limit (up to 131072)\n- **Request Timeout** -- per-request timeout override\n- **Structured Output** -- `response_format` with `json_object` or `json_schema` modes for guaranteed valid JSON\n- **Streaming** with proper Unicode handling (emoji, CJK, Arabic multi-byte characters)\n- **Usage stats** in streaming responses (`stream_options.include_usage`)\n\n### Model Conversion \u0026 Quantization\n\nConvert models directly in-app via the Tools tab:\n\n- **16-bit to MLX** -- convert HuggingFace safetensors to MLX format\n- **16-bit to quantized** -- quantize to 2-bit, 4-bit, or 8-bit MLX\n- **GGUF to MLX** -- import GGUF models into MLX safetensors format\n- **MLX to JANG** -- adaptive mixed-precision quantization (different bits per layer type)\n- **Model Inspector** -- view config.json, architecture, layer structure\n- **Model Doctor** -- diagnostic checks (load test, token count, memory estimation)\n- Progress tracking with real-time status\n\n### Image Generation\n\nGenerate images locally with Flux and Z-Image models:\n\n- **Flux Schnell** -- 4-step fast generation\n- **Flux Dev** -- 20-step high-quality generation\n- **Z-Image Turbo** -- fast turbo generation (4-bit and 8-bit)\n- **Flux Klein** -- lightweight 4B parameter model\n- **Flux Kontext** -- subject-consistent editing\n- **Flux Krea** -- aesthetic fine-tuned generation\n- Configurable steps, guidance scale, height, width, seed, sampler\n- Multiple samplers: euler, euler_ancestral, heun, dpmpp_2m_sde, dpmpp_sde\n- Quantized model support (2-bit to 8-bit)\n- Image gallery with generation history, save, and settings persistence\n- OpenAI-compatible `/v1/images/generations` endpoint with `usage` field\n\n### Chat Interface\n\nFull-featured conversation UI:\n\n- **Persistent history** -- SQLite (WAL mode) with full message, metrics, and tool call history\n- **Markdown rendering** -- GitHub-flavored markdown with syntax highlighting\n- **Reasoning display** -- collapsible thinking sections for 
reasoning models\n- **Tool call display** -- inline tool execution with status and results\n- **Streaming metrics** -- live tokens/second, time-to-first-token (TTFT), prompt processing speed, prefix cache hit rate\n- **System prompts** -- per-chat custom system message\n- **Chat settings** -- per-chat overrides for temperature, top-p, top-k, min-p, repetition penalty, max tokens, stop sequences\n- **Chat folders** -- hierarchical organization\n- **Message search** -- full-text search across chat history\n- **Export/Import** -- ShareGPT format\n- **Voice chat** -- STT + TTS integration\n\n### Model Management\n\n- **HuggingFace browser** -- search, filter by text/image, and download models directly in-app\n- **Download queue** -- multiple concurrent downloads with real-time progress bars and cancel support\n- **Model size display** -- file sizes from safetensors metadata before downloading\n- **Local model discovery** -- auto-scan `~/.mlxstudio/models`, `~/.cache/huggingface/hub`, `~/.exo/models`, and custom directories\n- **Deduplication** -- strict format detection prevents false positive model matches\n- **Zero-config detection** -- reads model config.json to auto-set tool parsers, reasoning parsers, cache types, and chat templates\n- **65+ model families** in the auto-detection registry with two-tier detection (config.json `model_type` primary, name regex fallback)\n\n### Desktop Experience\n\n- **5 app modes** -- Chat, Server, Image, Tools, API\n- **Menu bar tray** -- live server status, GPU memory, running models, quick controls\n- **Multi-session** -- run multiple models simultaneously on different ports\n- **Dock icon** -- restore on click, close-to-tray support\n- **Dark and light themes** -- system-respecting\n- **Keyboard shortcuts** -- common actions\n- **Toast notifications** -- user feedback\n- **Update banner** -- new version detection\n\n---\n\n## Advanced Quantization\n\nMLX Studio supports standard MLX quantization (4-bit, 8-bit) as well as **JANG 
adaptive mixed-precision** -- an advanced format that assigns different bit widths to different layer types for better quality at the same model size.\n\n- Convert in-app via the Tools tab, or via CLI: `vmlx convert model --jang-profile JANG_3M`\n- Pre-quantized models available at [JANGQ-AI on HuggingFace](https://huggingface.co/JANGQ-AI)\n- Stays quantized in GPU memory -- native MLX `QuantizedLinear` + `quantized_matmul`\n- Compatible with all caching layers (prefix, paged, disk, KV quant)\n\nSee the [vMLX source repo](https://github.com/jjang-ai/vmlx#advanced-quantization) for profiles and conversion details.\n\n### Smelt Mode (Partial Expert Loading)\n\nFor MoE models that don't fit in RAM, **Smelt** loads only a subset of experts per layer from SSD and keeps the backbone resident. Response quality stays coherent while RAM usage drops; throughput scales inversely with expert % loaded because expert swaps hit SSD on the hot path.\n\n**Benchmarks on `Nemotron-Cascade-2-30B-A3B-JANG_4M`** (23 MoE layers × 128 experts, Apple M3 Ultra / 128 GB, dedicated machine, no parallel models):\n\n| `--smelt-experts` | Active RAM | Decode tok/s | RAM saving | Coherent |\n|---|---:|---:|---|---|\n| _off (baseline)_ | **17,408 MB** | **89.9** | — | ✓ |\n| `50` | 9,529 MB | **66.5** | **−45%** | ✓ |\n| `25` | 5,590 MB | * | **−68%** | ✓ |\n\n\\* Responses too short for reliable steady-state tok/s measurement at 25 %. Subjectively responsive.\n\nAll three configurations produced coherent, non-looping output. No quality degradation observed.\n\n\u003e **Credit**: Smelt mode is inspired by [Anemll's **flash-moe**](https://github.com/Anemll/flash-moe) — a pure C / Objective‑C / Metal inference engine that showed huge MoE models (Qwen3.5-397B) can run on modest Apple Silicon hardware by streaming expert weights from SSD with `pread()` on demand. 
vMLX Smelt takes a different implementation path: Python/MLX, tied to the JANG quantization format, and loading a fixed subset of experts per layer at startup (backbone resident, routing biased toward the loaded subset) rather than on-demand per-token. It plugs into the full vMLX server with continuous batching, paged cache, and OpenAI-compatible API. Different engine, same core insight — thanks to the flash-moe team for validating the approach.\n\n**Smelt is mutually exclusive with VLM mode.** MLX Studio / vMLX v1.3.33+ automatically disables `--is-mllm` when smelt is active (with a warning) because the vision tower is not wired through the partial-expert loader — image input on a smelt-loaded VLM would produce garbage logits. Use a text-only model when running smelt, or disable smelt when running a VLM.\n\nRequires an MoE model in JANG format. Not compatible with dense models (no experts to partial-load).\n\n---\n\n## System Requirements\n\n| Requirement | Minimum |\n|---|---|\n| **macOS** | 14.0 Sonoma or later |\n| **Chip** | Apple Silicon (M1 / M2 / M3 / M4) |\n| **RAM** | 8 GB (16 GB+ recommended for larger models) |\n| **Disk** | ~500 MB for app; models range from 1-50 GB each |\n\n---\n\n## Build from Source\n\n```bash\ngit clone https://github.com/jjang-ai/vmlx.git\ncd vmlx\n\n# Python engine\npython3 -m venv .venv \u0026\u0026 source .venv/bin/activate\npip install -e \".[dev]\"\n\n# Electron app\ncd panel \u0026\u0026 npm install \u0026\u0026 npm run build\nnpx electron-builder --mac --dir   # .app bundle\nnpx electron-builder --mac dmg     # DMG installer\n```\n\n---\n\n## Links\n\n| Resource | Link |\n|---|---|\n| **Source Code** | [github.com/jjang-ai/vmlx](https://github.com/jjang-ai/vmlx) |\n| **PyPI** | [pypi.org/project/vmlx](https://pypi.org/project/vmlx/) |\n| **MLX Models** | [huggingface.co/mlx-community](https://huggingface.co/mlx-community) |\n| **JANG Models** | [huggingface.co/JANGQ-AI](https://huggingface.co/JANGQ-AI) |\n| **Website** | 
[vmlx.net](https://vmlx.net) |\n\n---\n\n## License\n\nApache License 2.0\n\n---\n\n\u003cp align=\"center\"\u003e\n  Built by \u003ca href=\"https://github.com/jjang-ai\"\u003eJinho Jang\u003c/a\u003e \u0026bull; \u003ca href=\"mailto:eric@jangq.ai\"\u003eeric@jangq.ai\u003c/a\u003e \u0026bull; \u003ca href=\"https://jangq.ai\"\u003eJANGQ AI\u003c/a\u003e \u0026bull; \u003ca href=\"https://ko-fi.com/jangml\"\u003eSupport on Ko-fi\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## 한국어 (Korean)\n\n### MLX Studio -- Native macOS AI App for Apple Silicon\n\nRun LLMs, VLMs, and image generation and editing models entirely locally on your Mac.\n\n\u003e **JANG 2-bit outperforms MLX 4/3/2-bit** -- adaptive mixed-precision quantization (JANG\\_2S, JANG\\_2.6) outperforms standard MLX quantization on MiniMax M2.5, Qwen3, and other models. See benchmarks at [jangq.ai](https://jangq.ai). Download pre-quantized models from [JANGQ-AI](https://huggingface.co/JANGQ-AI).\n\n**Install:** [Download the latest DMG](https://github.com/jjang-ai/mlxstudio/releases/latest) -- install by drag and drop.\n\n### Key Features\n\n| Feature | Description |\n|------|------|\n| **Chat** | Conversation interface, tool calling, agentic coding |\n| **Image Generation** | Flux Schnell/Dev, Z-Image Turbo, FLUX.2 Klein |\n| **Image Editing** | Qwen Image Edit (text-instruction-based editing) |\n| **5-Layer Caching** | Prefix, paged, KV quantization, disk cache |\n| **API Server** | OpenAI + Anthropic compatible API |\n| **30 Tools** | Built-in tools for files, web search, Git, and terminal |\n\n\u003cp align=\"center\"\u003e\n  Developer: \u003ca href=\"https://github.com/jjang-ai\"\u003eJinho Jang\u003c/a\u003e (eric@jangq.ai)\u003cbr\u003e\n  \u003ca href=\"https://jangq.ai\"\u003eJANGQ AI\u003c/a\u003e \u0026bull;\n  \u003ca href=\"https://ko-fi.com/jangml\"\u003eSupport on Ko-fi\u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjang-ai%2Fmlxstudio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjjang-ai%2Fmlxstudio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjang-ai%2Fmlxstudio/lists"}