{"id":31539794,"url":"https://github.com/cubist38/mlx-openai-server","last_synced_at":"2026-04-01T23:05:50.522Z","repository":{"id":285350692,"uuid":"956296598","full_name":"cubist38/mlx-openai-server","owner":"cubist38","description":"A high-performance API server that provides OpenAI-compatible endpoints for MLX models. Developed using Python and powered by the FastAPI framework, it provides an efficient, scalable, and user-friendly solution for running MLX-based vision and language models locally with an OpenAI-compatible interface.","archived":false,"fork":false,"pushed_at":"2026-02-02T03:29:46.000Z","size":36073,"stargazers_count":205,"open_issues_count":21,"forks_count":38,"subscribers_count":8,"default_branch":"main","last_synced_at":"2026-02-02T14:45:14.095Z","etag":null,"topics":["apple-silicon","fastapi","flux","image-generation","mlx","mlx-lm","mlx-vlm","openai-compatible","queue","speech-recognition","structured-outputs","tool-calling","vision-api","whisper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cubist38.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2025-03-28T02:44:17.000Z","updated_at":"2026-02-02T03:29:50.000Z","dependencies_parsed_at":"2025-05-04T07:25:15.176Z","dependency_job_id":"512e5cbe-ad11-45fe-8bd4-44bc7dcbc58a","html_url":"https://github.com/cubist38/mlx-openai-server","commit_stats":null,"previous_names":["cubist38/mlx-server-oai-c
ompat","cubist38/mlx-openai-server"],"tags_count":56,"template":false,"template_full_name":null,"purl":"pkg:github/cubist38/mlx-openai-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cubist38%2Fmlx-openai-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cubist38%2Fmlx-openai-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cubist38%2Fmlx-openai-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cubist38%2Fmlx-openai-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cubist38","download_url":"https://codeload.github.com/cubist38/mlx-openai-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cubist38%2Fmlx-openai-server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29199519,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-07T14:35:27.868Z","status":"ssl_error","status_checked_at":"2026-02-07T14:25:51.081Z","response_time":63,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","fastapi","flux","image-generation","mlx","mlx-lm","mlx-vlm","openai-compatible","queue","speech-recognition","structured-outputs","tool-calling","vision-api","whisper"],"created_at":"2025-10-04T09:15:03.043Z","updated_at":"2026-04-01T23:05:50.495Z","avatar_url":"https://github.com/cubist38.png","language":"Python","funding_links":[],"categories":["others","Vision \u0026 Multimodal"],"sub_categories":[],"readme":"# mlx-openai-server\n\n[![MIT License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)\n\nA high-performance OpenAI-compatible API server for MLX models. 
Run text, vision, audio, and image generation models locally on Apple Silicon with a drop-in OpenAI replacement.\n\n\u003e **Note:** Requires **macOS with M-series chips** (MLX is optimized for Apple Silicon).\n\n---\n\n## Table of Contents\n\n- [5-Second Quick Start](#5-second-quick-start)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Server Parameters](#server-parameters)\n- [Launching Multiple Models](#launching-multiple-models)\n  - [Custom Model Name](#custom-model-name-single-model-mode)\n  - [Dynamic Model Swapping](#dynamic-model-swapping-on-demand-loading)\n- [Supported Model Types](#supported-model-types)\n- [Common Use Cases](#common-use-cases)\n- [Using the API](#using-the-api)\n- [Advanced Configuration](#advanced-configuration)\n- [Example Notebooks](#example-notebooks)\n- [Large Models](#large-models)\n- [Troubleshooting](#troubleshooting)\n- [Frequently Encountered Problems](#frequently-encountered-problems)\n- [Quick Reference Card](#quick-reference-card)\n- [Featured Launch: MiniMax-M2.5-Uncensored-4bit](#featured-launch-minimax-m25-uncensored-4bit)\n- [Featured Launch: GLM-4.7-Flash-Abliterated-8bit](#featured-launch-glm-47-flash-abliterated-8bit)\n- [Contributing](#contributing)\n- [Support](#support)\n\n---\n\n## 5-Second Quick Start\n\n```bash\nmlx-openai-server launch --model-path mlx-community/Qwen3-Coder-Next-4bit --model-type lm\n```\n\nThen point your OpenAI client to `http://localhost:8000/v1`. 
For full setup, see [Installation](#installation) and [Quick Start](#quick-start).\n\n---\n\n## Key Features\n\n- 🚀 **OpenAI-compatible API** - Drop-in replacement for OpenAI services\n- 🖼️ **Multimodal support** - Text, vision, audio, and image generation/editing\n- 🎨 **Flux-series models** - Image generation (schnell, dev, krea-dev, flux-2-klein) and editing (kontext, qwen-image-edit)\n- 🔌 **Easy integration** - Works with existing OpenAI client libraries\n- 📦 **Multi-model mode** - Run multiple models in one server via a YAML config; route requests by model ID\n- ⚡ **Performance** - Configurable quantization (4/8/16-bit), context length, and speculative decoding (lm)\n- 🎛️ **LoRA adapters** - Fine-tuned image generation and editing\n- 📈 **Queue management** - Built-in request queuing and monitoring\n\n---\n\n## Installation\n\n### Prerequisites\n- macOS with Apple Silicon (M-series)\n- Python 3.11+\n\n### Quick Install\n\n```bash\n# Create virtual environment\npython3.11 -m venv .venv\nsource .venv/bin/activate\n\n# Install core server from PyPI\nuv pip install mlx-openai-server\n\n# Or install from GitHub\nuv pip install git+https://github.com/cubist38/mlx-openai-server.git\n```\n\n### Optional: Whisper Support\nFor audio transcription models, install ffmpeg:\n```bash\nbrew install ffmpeg\n```\n\n---\n\n## Quick Start\n\n### Start the Server\n\n```bash\n# Text-only or multimodal models\nmlx-openai-server launch \\\n  --model-path \u003cpath-to-mlx-model\u003e \\\n  --model-type \u003clm|multimodal\u003e\n\n# Text-only with speculative decoding (faster generation using a smaller draft model)\nmlx-openai-server launch \\\n  --model-path \u003cpath-to-main-model\u003e \\\n  --model-type lm \\\n  --draft-model-path \u003cpath-to-draft-model\u003e \\\n  --num-draft-tokens 4\n\n# Image generation (Flux-series)\nmlx-openai-server launch \\\n  --model-type image-generation \\\n  --model-path \u003cpath-to-flux-model\u003e \\\n  --config-name flux-dev \\\n  --quantize 
8\n\n# Image editing\nmlx-openai-server launch \\\n  --model-type image-edit \\\n  --model-path \u003cpath-to-flux-model\u003e \\\n  --config-name flux-kontext-dev \\\n  --quantize 8\n\n# Embeddings\nmlx-openai-server launch \\\n  --model-type embeddings \\\n  --model-path \u003cembeddings-model-path\u003e\n\n# Whisper (audio transcription)\nmlx-openai-server launch \\\n  --model-type whisper \\\n  --model-path mlx-community/whisper-large-v3-mlx\n```\n\n### Server Parameters\n\n| Parameter | Required | Type | Default | Description |\n|-----------|----------|------|---------|-------------|\n| | | | | **Required parameters** |\n| `--model-path` | Yes | path | — | Path to MLX model (local or HuggingFace repo) |\n| `--model-type` | Yes | string | — | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, or `whisper` |\n| | | | | **Model configuration** |\n| `--config-name` | No* | string | — | Image models: `flux-schnell`, `flux-dev`, `flux-krea-dev`, `flux-kontext-dev`, `flux2-klein-4b`, `flux2-klein-9b`, `qwen-image`, `qwen-image-edit`, `z-image-turbo`, `fibo` |\n| `--quantize` | No | int | — | Quantization level: 4, 8, or 16 (image models) |\n| `--context-length` | No | int | — | Max sequence length for memory optimization |\n| | | | | **Sampling parameters** (used when API request omits them) |\n| `--max-tokens` | No | int | 100000 | Default maximum tokens to generate |\n| `--temperature` | No | float | 1.0 | Default sampling temperature |\n| `--top-p` | No | float | 1.0 | Default nucleus sampling (top-p) probability |\n| `--top-k` | No | int | 20 | Default top-k sampling parameter |\n| `--repetition-penalty` | No | float | 1.0 | Default repetition penalty for token generation |\n| | | | | **Speculative decoding** (lm only) |\n| `--draft-model-path` | No | path | — | Path to draft model for speculative decoding |\n| `--num-draft-tokens` | No | int | 2 | Draft tokens per step |\n| | | | | **Prompt cache** (lm only) |\n| `--prompt-cache-size` | No | int 
| 10 | Maximum number of prompt KV cache entries to store |\n| `--max-bytes` | No | int | (unbounded) | Maximum total bytes retained by prompt KV caches before eviction |\n| | | | | **Server options** |\n| `--host` | No | string | `127.0.0.1` | Host address to bind the server to |\n| `--port` | No | int | `8000` | Port to run the server on |\n| `--served-model-name` | No | string | — | Override the model name returned by `/v1/models` and accepted in request `model` field |\n| | | | | **Advanced options** |\n| `--lora-paths` | No | string | — | Comma-separated LoRA adapter paths (image models) |\n| `--lora-scales` | No | string | — | Comma-separated LoRA scales (must match paths) |\n| `--log-level` | No | string | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |\n| `--no-log-file` | No | flag | false | Disable file logging (console only) |\n\n*Required for `image-generation` and `image-edit` model types.\n\n## Launching Multiple Models\n\nYou can run several models in one server using a YAML config file. Each model gets its own handler; requests are routed by the **served model name** you use in the API (the `model` field in the request).\n\n**Video:** [Serving Multiple Models at Once? mlx-openai-server + OpenWebUI Test](https://www.youtube.com/watch?v=f7WXSOPZ5H4)\n\n### Start with a config file\n\n```bash\nmlx-openai-server launch --config config.yaml\n```\n\nYou must provide either `--config` (multi-handler) or `--model-path` (single model). You cannot mix them.\n\n### YAML config format\n\nCreate a YAML file with a `server` section (host, port, logging) and a `models` list. 
Each entry in `models` defines one model and supports the same options as the CLI (model path, type, context length, queue settings, etc.).\n\n| Key | Required | Description |\n|-----|----------|-------------|\n| `model_path` | Yes | Path or HuggingFace repo of the model |\n| `model_type` | No | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, `whisper` (default: `lm`) |\n| `served_model_name` | No | ID used in API requests; defaults to `model_path` if omitted |\n| `context_length` | No | Max context length (lm / multimodal) |\n| `queue_timeout`, `queue_size` | No | Per-model queue settings |\n| `prompt_cache_size` | No | Max prompt KV cache entries (lm only; default: 10) |\n| `prompt_cache_max_bytes` | No | Max total bytes for prompt KV caches before eviction (lm only) |\n| `on_demand` | No | Enable dynamic swapping — model is loaded on first request, unloaded after idle (default: `false`) |\n| `on_demand_idle_timeout` | No | Seconds to wait before unloading an idle on-demand model (default: `60`) |\n\nExample `config.yaml`:\n\n```yaml\nserver:\n  host: \"0.0.0.0\"\n  port: 8000\n  log_level: INFO\n  # log_file: logs/app.log     # uncomment to log to file\n  # no_log_file: true           # uncomment to disable file logging\n\nmodels:\n  # Language model\n  - model_path: mlx-community/MiniMax-M2.5-4bit\n    model_type: lm\n    served_model_name: Minimax-M2.5    # optional alias (defaults to model_path)\n    enable_auto_tool_choice: true\n    tool_call_parser: minimax_m2\n    reasoning_parser: minimax_m2\n\n  - model_path: black-forest-labs/FLUX.2-klein-4B\n    model_type: image-generation\n    config_name: flux2-klein-4b\n    quantize: 4\n    served_model_name: flux2-klein-4b\n    on_demand: true\n    on_demand_idle_timeout: 120  # seconds before unloading (default: 60)\n```\n\nA full example is in `examples/config.yaml`.\n\n### Custom Model Name (Single-Model Mode)\n\nUse `--served-model-name` to override the model identifier returned by 
`/v1/models` and accepted in the `model` request field:\n\n```bash\nmlx-openai-server launch \\\n  --model-path mlx-community/Qwen3-Coder-Next-4bit \\\n  --served-model-name my-local-model\n```\n\nClients can then use `\"model\": \"my-local-model\"` in their requests. If omitted, the model path is used as the identifier.\n\n### Dynamic Model Swapping (On-Demand Loading)\n\n\u003e **This feature is only available in multi-model mode** (`--config`). It is not supported with `--model-path` single-model launches.\n\nFor large models you don't want to keep in memory permanently, set `on_demand: true` in the YAML config. The model will appear in `/v1/models` but won't be loaded until a request arrives. After the request completes and the model is idle, it is automatically unloaded.\n\nOnly one on-demand model is loaded at a time — requesting a different on-demand model will unload the current one first.\n\n```yaml\n# config.yaml\nserver:\n  host: \"0.0.0.0\"\n  port: 8000\n\nmodels:\n  # Always loaded at startup\n  - model_path: mlx-community/GLM-4.7-Flash-8bit\n    model_type: lm\n    served_model_name: glm-4.7-flash\n\n  # Loaded on first request, unloaded after 120s idle\n  - model_path: black-forest-labs/FLUX.2-klein-4B\n    model_type: image-generation\n    config_name: flux2-klein-4b\n    quantize: 4\n    served_model_name: flux2-klein-4b\n    on_demand: true\n    on_demand_idle_timeout: 120\n```\n\n```bash\nmlx-openai-server launch --config config.yaml\n```\n\n\u003e **Note:** The first request to an on-demand model will be slower as the model needs to be loaded into memory. Subsequent requests (within the idle timeout) are served at normal speed.\n\n### Multi-handler process isolation (HandlerProcessProxy)\n\nIn multi-handler mode, each model runs in a **dedicated subprocess** spawned via `multiprocessing.get_context(\"spawn\")`. 
The main FastAPI process uses a `HandlerProcessProxy` to forward requests to the child process over multiprocessing queues.\n\nThis design prevents MLX Metal/GPU semaphore leaks on macOS. When MLX arrays or Metal runtime state are shared across forked processes, the resource tracker can report leaked semaphore objects at shutdown ([ml-explore/mlx#2457](https://github.com/ml-explore/mlx/issues/2457)). Using **spawn** instead of the default fork gives each model a clean Metal context, avoiding those warnings.\n\n```\n┌─────────────────────────────────────┐     ┌─────────────────────────────────────┐\n│  Main Process (FastAPI)             │     │  Child Process (Handler)             │\n│  ┌───────────────────────────────┐  │     │  ┌───────────────────────────────┐  │\n│  │  HandlerProcessProxy          │  │     │  │  Concrete handler (e.g.       │  │\n│  │  • request_queue ────────────┼──┼─────┼─\u003e│    MLXLMHandler)              │  │\n│  │  • response_queue \u003c──────────┼──┼\u003c────┼──│  • Model (MLX_LM)              │  │\n│  │  • generate_*() forwards RPC  │  │     │  │  • InferenceWorker (thread)   │  │\n│  └───────────────────────────────┘  │     │  └───────────────────────────────┘  │\n└─────────────────────────────────────┘     └─────────────────────────────────────┘\n```\n\nThe proxy exposes the same interface as the concrete handlers (`generate_text_stream`, `generate_embeddings_response`, etc.), so API endpoints work without changes. Requests and responses are serialized across the process boundary via queues; non-picklable objects (e.g. uploaded files) are pre-processed in the main process before being sent as file paths.\n\n### Using the API with multiple models\n\nSet the `model` field in your request to the **model name** (the `served_model_name` from the config, or `model_path` if you did not set `served_model_name`). 
The server looks up the handler for that name and runs the request on the correct model.\n\n```python\nimport openai\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\n# Use the first model (glm-4.7-flash)\nr1 = client.chat.completions.create(\n    model=\"glm-4.7-flash\",\n    messages=[{\"role\": \"user\", \"content\": \"Say hello in one word.\"}],\n)\nprint(r1.choices[0].message.content)\n\n# Use the second model (model_path is the ID when served_model_name is not set)\nr2 = client.chat.completions.create(\n    model=\"mlx-community/Qwen3-Coder-Next-4bit\",\n    messages=[{\"role\": \"user\", \"content\": \"Say hello in one word.\"}],\n)\nprint(r2.choices[0].message.content)\n```\n\n- **GET `/v1/models`** returns all configured models (their IDs), including on-demand models that have not been loaded yet.\n- If you send a `model` that is not in the config, the server returns **404** with an error listing available models.\n\n---\n\n## Supported Model Types\n\n1. **Text-only** (`lm`) - Language models via `mlx-lm`\n2. **Multimodal** (`multimodal`) - Text, images, audio via `mlx-vlm`\n3. **Image generation** (`image-generation`) - Flux-series, Qwen Image, Z-Image Turbo, Fibo\n4. **Image editing** (`image-edit`) - Flux kontext, Qwen Image Edit\n5. **Embeddings** (`embeddings`) - Text embeddings via `mlx-embeddings`\n6. 
**Whisper** (`whisper`) - Audio transcription (requires ffmpeg)\n\n### Image Model Configurations\n\n**Generation:**\n- `flux-schnell` - Fast (4 steps, no guidance)\n- `flux-dev` - Balanced (25 steps, 3.5 guidance)\n- `flux-krea-dev` - High quality (28 steps, 4.5 guidance)\n- `flux2-klein-4b` / `flux2-klein-9b` - Flux 2 Klein models\n- `qwen-image` - Qwen image generation (50 steps, 4.0 guidance)\n- `z-image-turbo` - Z-Image Turbo\n- `fibo` - Fibo model\n\n**Editing:**\n- `flux-kontext-dev` - Context-aware editing (28 steps, 2.5 guidance)\n- `flux2-klein-edit-4b` / `flux2-klein-edit-9b` - Flux 2 Klein editing\n- `qwen-image-edit` - Qwen image editing (50 steps, 4.0 guidance)\n\n---\n\n## Common Use Cases\n\n| Use Case | One-liner Launch |\n|----------|------------------|\n| **Text generation** | `mlx-openai-server launch --model-type lm --model-path \u003cpath\u003e` |\n| **Vision Q\u0026A** | `mlx-openai-server launch --model-type multimodal --model-path \u003cpath\u003e` |\n| **Image generation** | `mlx-openai-server launch --model-type image-generation --model-path \u003cpath\u003e --config-name flux-dev` |\n| **Image editing** | `mlx-openai-server launch --model-type image-edit --model-path \u003cpath\u003e --config-name flux-kontext-dev` |\n| **Audio transcription** | `mlx-openai-server launch --model-type whisper --model-path mlx-community/whisper-large-v3-mlx` |\n| **Embeddings** | `mlx-openai-server launch --model-type embeddings --model-path \u003cpath\u003e` |\n\n---\n\n## Using the API\n\nThe server provides OpenAI-compatible endpoints. Use standard OpenAI client libraries.\n\n\u003e **Model name in requests:** The `model` field should be the model path you passed to `--model-path` (e.g. `mlx-community/Qwen3-Coder-Next-4bit`), the `--served-model-name` you set, or the `served_model_name` from your YAML config. No API key is required — use any non-empty string (e.g. 
`\"not-needed\"`).\n\n### Supported Endpoints\n\n| Endpoint | Model Types | Description |\n|----------|-------------|-------------|\n| `POST /v1/chat/completions` | lm, multimodal | Chat completions (streaming supported) |\n| `POST /v1/responses` | lm, multimodal | OpenAI Responses API |\n| `POST /v1/images/generations` | image-generation | Image generation |\n| `POST /v1/images/edits` | image-edit | Image editing |\n| `POST /v1/embeddings` | embeddings | Text embeddings |\n| `POST /v1/audio/transcriptions` | whisper | Audio transcription |\n| `GET /v1/models` | all | List available models |\n\n### Text Completion\n\n```python\nimport openai\n\nclient = openai.OpenAI(\n    base_url=\"http://localhost:8000/v1\",\n    api_key=\"not-needed\"\n)\n\nresponse = client.chat.completions.create(\n    model=\"local-model\",\n    messages=[{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]\n)\nprint(response.choices[0].message.content)\n```\n\n### Vision (Multimodal)\n\n```python\nimport openai\nimport base64\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nwith open(\"image.jpg\", \"rb\") as f:\n    base64_image = base64.b64encode(f.read()).decode('utf-8')\n\nresponse = client.chat.completions.create(\n    model=\"local-multimodal\",\n    messages=[{\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"What's in this image?\"},\n            {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{base64_image}\"}}\n        ]\n    }]\n)\nprint(response.choices[0].message.content)\n```\n\n### Image Generation\n\n```python\nimport openai\nimport base64\nfrom io import BytesIO\nfrom PIL import Image\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nresponse = client.images.generate(\n    prompt=\"A serene landscape with mountains and a lake at sunset\",\n    model=\"local-image-generation-model\",\n    
size=\"1024x1024\"\n)\n\nimage_data = base64.b64decode(response.data[0].b64_json)\nimage = Image.open(BytesIO(image_data))\nimage.show()\n```\n\n### Image Editing\n\n```python\nimport openai\nimport base64\nfrom io import BytesIO\nfrom PIL import Image\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nwith open(\"image.png\", \"rb\") as f:\n    result = client.images.edit(\n        image=f,\n        prompt=\"make it like a photo in 1800s\",\n        model=\"flux-kontext-dev\"\n    )\n\nimage_data = base64.b64decode(result.data[0].b64_json)\nimage = Image.open(BytesIO(image_data))\nimage.show()\n```\n\n### Function Calling\n\n```python\nimport openai\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nmessages = [{\"role\": \"user\", \"content\": \"What is the weather in Tokyo?\"}]\ntools = [{\n    \"type\": \"function\",\n    \"function\": {\n        \"name\": \"get_weather\",\n        \"description\": \"Get the weather in a given city\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"city\": {\"type\": \"string\", \"description\": \"The city name\"}\n            }\n        }\n    }\n}]\n\ncompletion = client.chat.completions.create(\n    model=\"local-model\",\n    messages=messages,\n    tools=tools,\n    tool_choice=\"auto\"\n)\n\nif completion.choices[0].message.tool_calls:\n    tool_call = completion.choices[0].message.tool_calls[0]\n    print(f\"Function: {tool_call.function.name}\")\n    print(f\"Arguments: {tool_call.function.arguments}\")\n```\n\n### Embeddings\n\n```python\nimport openai\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nresponse = client.embeddings.create(\n    model=\"local-model\",\n    input=[\"The quick brown fox jumps over the lazy dog\"]\n)\n\nprint(f\"Embedding dimension: {len(response.data[0].embedding)}\")\n```\n\n### Responses API\n\nThe 
server exposes the OpenAI [Responses API](https://platform.openai.com/docs/api-reference/responses) at `POST /v1/responses`. Use `client.responses.create()` with the OpenAI SDK for text and multimodal (lm/multimodal) models.\n\n**Text input (non-streaming):**\n\n```python\nimport openai\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nresponse = client.responses.create(\n    model=\"local-model\",\n    input=\"Tell me a three sentence bedtime story about a unicorn.\"\n)\n# response.output contains reasoning and message items\nfor item in response.output:\n    if item.type == \"message\":\n        for part in item.content:\n            if getattr(part, \"text\", None):\n                print(part.text)\n```\n\n**Text input (streaming):**\n\n```python\nresponse = client.responses.create(\n    model=\"local-model\",\n    input=\"Tell me a three sentence bedtime story about a unicorn.\",\n    stream=True\n)\nfor chunk in response:\n    print(chunk)\n```\n\n**Image input (vision / multimodal):**\n\n```python\nresponse = client.responses.create(\n    model=\"local-multimodal\",\n    input=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"input_text\", \"text\": \"What is in this image?\"},\n                {\n                    \"type\": \"input_image\",\n                    \"image_url\": \"path/to/image.jpg\",\n                    \"detail\": \"low\"\n                }\n            ]\n        }\n    ]\n)\n```\n\n**Function calling:**\n\n```python\ntools = [{\n    \"type\": \"function\",\n    \"name\": \"get_current_weather\",\n    \"description\": \"Get the current weather in a given location\",\n    \"parameters\": {\n        \"type\": \"object\",\n        \"properties\": {\n            \"location\": {\"type\": \"string\", \"description\": \"The city and state\"},\n            \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n        },\n        
\"required\": [\"location\", \"unit\"]\n    }\n}]\n\nresponse = client.responses.create(\n    model=\"local-model\",\n    tools=tools,\n    input=\"What is the weather like in Boston today?\",\n    tool_choice=\"auto\"\n)\n```\n\n**Structured outputs (Pydantic):**\n\n```python\nfrom pydantic import BaseModel\n\nclass Address(BaseModel):\n    street: str\n    city: str\n    state: str\n    zip: str\n\nresponse = client.responses.parse(\n    model=\"local-model\",\n    input=[{\"role\": \"user\", \"content\": \"Format: 1 Hacker Wy Menlo Park CA 94025\"}],\n    text_format=Address\n)\naddress = response.output_parsed  # Pydantic model instance\nprint(address)\n```\n\nSee `examples/responses_api.ipynb` for full examples including streaming, image input, tool calls, and structured outputs.\n\n### Structured Outputs (JSON Schema)\n\n```python\nimport openai\nimport json\n\nclient = openai.OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\n\nresponse_format = {\n    \"type\": \"json_schema\",\n    \"json_schema\": {\n        \"name\": \"Address\",\n        \"schema\": {\n            \"type\": \"object\",\n            \"properties\": {\n                \"street\": {\"type\": \"string\"},\n                \"city\": {\"type\": \"string\"},\n                \"state\": {\"type\": \"string\"},\n                \"zip\": {\"type\": \"string\"}\n            },\n            \"required\": [\"street\", \"city\", \"state\", \"zip\"]\n        }\n    }\n}\n\ncompletion = client.chat.completions.create(\n    model=\"local-model\",\n    messages=[{\"role\": \"user\", \"content\": \"Format: 1 Hacker Wy Menlo Park CA 94025\"}],\n    response_format=response_format\n)\n\naddress = json.loads(completion.choices[0].message.content)\nprint(json.dumps(address, indent=2))\n```\n\n## Advanced Configuration\n\n### Parser Configuration\n\nFor models requiring custom parsing (tool calls, reasoning):\n\n```bash\nmlx-openai-server launch \\\n  --model-path \u003cpath-to-model\u003e 
\\\n  --model-type lm \\\n  --tool-call-parser qwen3 \\\n  --reasoning-parser qwen3 \\\n  --enable-auto-tool-choice\n```\n\n**Qwen3.5 models** (multimodal):\n\n```bash\nmlx-openai-server launch \\\n  --model-path mlx-community/Qwen3.5-122B-A10B-4bit \\\n  --model-type multimodal \\\n  --reasoning-parser qwen3_5 \\\n  --tool-call-parser qwen3_coder\n```\n\nAvailable parsers: `qwen3`, `qwen3_5`, `glm4_moe`, `qwen3_coder`, `qwen3_moe`, `qwen3_next`, `qwen3_vl`, `harmony`, `minimax_m2`\n\n### Message Converters\n\nMessage converters are **auto-detected** from parser selection. When you set `tool_call_parser` (or `reasoning_parser`), the server uses the same name for message preprocessing when a compatible converter exists. You do not need to pass `--message-converter`.\n\nAuto-detected converters: `glm4_moe`, `minimax_m2`, `minimax`, `nemotron3_nano`, `qwen3_coder`, `longcat_flash_lite`, `step_35`\n\n### Custom Chat Templates\n\n```bash\nmlx-openai-server launch \\\n  --model-path \u003cpath-to-model\u003e \\\n  --model-type lm \\\n  --chat-template-file /path/to/template.jinja\n```\n\n### Speculative Decoding (lm)\n\nUse a smaller draft model to propose tokens and verify them with the main model for faster text generation. Supported only for `--model-type lm`.\n\n```bash\nmlx-openai-server launch \\\n  --model-path mlx-community/MyModel-8B-4bit \\\n  --model-type lm \\\n  --draft-model-path mlx-community/MyModel-1B-4bit \\\n  --num-draft-tokens 4\n```\n\n- **`--draft-model-path`**: Path or HuggingFace repo of the draft model (smaller size model).\n- **`--num-draft-tokens`**: Number of tokens the draft model generates per verification step (default: 2). 
Higher values can increase throughput at the cost of more draft compute.\n\n## Example Notebooks\n\nCheck the `examples/` directory for comprehensive guides:\n\n| Category | Notebooks | Description |\n|----------|-----------|-------------|\n| **Text \u0026 Chat** | `responses_api.ipynb`, `simple_rag_demo.ipynb` | Responses API (text, image, tools, streaming, structured outputs); RAG pipeline demo |\n| **Vision** | `vision_examples.ipynb` | Vision capabilities |\n| **Audio** | `audio_examples.ipynb`, `transcription_examples.ipynb` | Audio processing and transcription |\n| **Embeddings** | `embedding_examples.ipynb`, `lm_embeddings_examples.ipynb`, `vlm_embeddings_examples.ipynb` | Text, LM, and VLM embeddings |\n| **Images** | `image_generations.ipynb`, `image_edit.ipynb` | Image generation and editing |\n| **Advanced** | `structured_outputs_examples.ipynb` | JSON schema / structured outputs |\n\n## Large Models\n\nFor models that don't fit in RAM, improve performance on macOS 15.0+:\n\n```bash\nbash configure_mlx.sh\n```\n\nThis raises the system's wired memory limit for better performance.\n\n---\n\n## Troubleshooting\n\n| Issue | Solution |\n|-------|----------|\n| **Memory problems** | Use `--quantize 4` or `8` for image models; reduce `--context-length` for lm/multimodal. Run `configure_mlx.sh` on macOS 15+ to raise wired memory limits. |\n| **Model download issues** | Ensure `transformers` and `huggingface_hub` are installed. Check network access; some models require Hugging Face login. |\n| **Port already in use** | Use `--port` to specify a different port (e.g. `--port 8001`). |\n| **Quantization questions** | For lm/multimodal, use pre-quantized models from [mlx-community](https://huggingface.co/mlx-community). For image models, use `--quantize 4` or `8`. |\n| **Metal/semaphore warnings** | Use multi-handler mode (`--config`); each model runs in a spawned subprocess to avoid Metal context issues. 
|\n\n---\n\n## Frequently Encountered Problems\n\n### Model loading errors (e.g. \"parameters not in model\")\n\nIf you see errors like **\"Received N parameters not in model\"** or weight/parameter mismatches when loading a newly released model, the most common cause is an outdated version of the underlying MLX model library. New models often require the latest architecture support from `mlx-lm`, `mlx-vlm`, or other backend packages.\n\n**Fix:** Install the latest version directly from the source repository:\n\n```bash\n# For text models (lm)\nuv pip install git+https://github.com/ml-explore/mlx-lm.git\n\n# For multimodal models\nuv pip install git+https://github.com/Blaizzy/mlx-vlm.git\n\n# For embeddings\nuv pip install git+https://github.com/Blaizzy/mlx-embeddings.git\n```\n\nThe git versions often contain support for new model architectures before a PyPI release is published. After upgrading, restart the server and try loading the model again.\n\n---\n\n## Quick Reference Card\n\n```bash\n# Text (language model)\nmlx-openai-server launch --model-type lm --model-path \u003cpath\u003e\n\n# Vision (multimodal)\nmlx-openai-server launch --model-type multimodal --model-path \u003cpath\u003e\n\n# Image generation\nmlx-openai-server launch --model-type image-generation --model-path \u003cpath\u003e --config-name flux-dev\n\n# Image editing\nmlx-openai-server launch --model-type image-edit --model-path \u003cpath\u003e --config-name flux-kontext-dev\n\n# Embeddings\nmlx-openai-server launch --model-type embeddings --model-path \u003cpath\u003e\n\n# Whisper (audio transcription)\nmlx-openai-server launch --model-type whisper --model-path mlx-community/whisper-large-v3-mlx\n```\n\n---\n\n## Featured Launch: MiniMax-M2.5-Uncensored-4bit\n\nWant a frontier-style assistant on Apple Silicon without the usual heavyweight setup? 
[mlx-community/MiniMax-M2.5-Uncensored-4bit](https://huggingface.co/mlx-community/MiniMax-M2.5-Uncensored-4bit) is a 4-bit quantized, uncensored MiniMax-M2.5 release that pairs especially well with `mlx-openai-server` for coding, tool use, search, and agent-style workflows.\n\n### Launch It in One Command\n\n```bash\nmlx-openai-server launch \\\n  --model-path mlx-community/MiniMax-M2.5-Uncensored-4bit \\\n  --model-type lm \\\n  --reasoning-parser minimax_m2 \\\n  --tool-call-parser minimax_m2 \\\n  --trust-remote-code\n```\n\nOnce it is running, point your OpenAI client to `http://localhost:8000/v1` and use it like any other chat-completions endpoint.\n\n### Why This Model Stands Out\n\n- **4-bit efficiency** for lower memory use and faster local inference\n- **Uncensored behavior** for research, creative, and less-filtered assistant use cases\n- **MiniMax-native parsing** with `minimax_m2` for cleaner reasoning and tool-call handling\n- **Drop-in compatibility** with OpenAI SDKs, OpenWebUI, and agent frameworks\n\n---\n\n## Featured Launch: GLM-4.7-Flash-Abliterated-8bit\n\nLooking for a fast, uncensored reasoning model on Apple Silicon? 
[mlx-community/glm-4.7-flash-abliterated-8bit](https://huggingface.co/mlx-community/glm-4.7-flash-abliterated-8bit) is an 8-bit quantized MLX conversion of [huihui-ai/Huihui-GLM-4.7-Flash-abliterated](https://huggingface.co/huihui-ai/Huihui-GLM-4.7-Flash-abliterated), offering strong reasoning and tool-calling capabilities with efficient memory usage.\n\n### Launch It in One Command\n\n```bash\nmlx-openai-server launch \\\n  --model-path mlx-community/glm-4.7-flash-abliterated-8bit \\\n  --reasoning-parser glm47_flash \\\n  --tool-call-parser glm4_moe\n```\n\nOnce it is running, point your OpenAI client to `http://localhost:8000/v1` and use it like any other chat-completions endpoint.\n\n### Why This Model Stands Out\n\n- **8-bit quantized** for a good balance between quality and memory efficiency on Apple Silicon\n- **Abliterated**: fewer refusals for research, creative, and less-filtered use cases\n- **Built-in reasoning** with a dedicated `glm47_flash` parser for chain-of-thought outputs\n- **Tool calling** via the `glm4_moe` parser for agent-style workflows\n- **Drop-in compatibility** with OpenAI SDKs, OpenWebUI, and agent frameworks\n\n---\n\n## Contributing\n\nWe welcome contributions! Please:\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes with tests\n4. Submit a pull request\n\nFollow [Conventional Commits](https://www.conventionalcommits.org/) for commit messages.\n\n## Support\n\n- **Documentation**: This README and example notebooks\n- **Issues**: [GitHub Issues](https://github.com/cubist38/mlx-openai-server/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/cubist38/mlx-openai-server/discussions)\n- **Video Tutorials**: [Setup Demo](https://youtu.be/J1gkEMvmTSE), [RAG Demo](https://youtu.be/ANUEZkmR-0s), [Testing Qwen3-Coder-Next-4bit with Qwen-Code](https://youtu.be/X5Hsd3QR_E8), [Serving Multiple Models at Once? 
mlx-openai-server + OpenWebUI Test](https://www.youtube.com/watch?v=f7WXSOPZ5H4)\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\nBuilt on top of:\n- [MLX](https://github.com/ml-explore/mlx) - Apple's ML framework\n- [mlx-lm](https://github.com/ml-explore/mlx-lm) - Language models\n- [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) - Multimodal models\n- [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) - Embeddings\n- [mflux](https://github.com/filipstrand/mflux) - Flux image models\n- [mlx-whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper) - Audio transcription\n- [mlx-community](https://huggingface.co/mlx-community) - Model repository\n\n---\n\n[![GitHub stars](https://img.shields.io/github/stars/cubist38/mlx-openai-server?style=social\u0026label=Star)](https://github.com/cubist38/mlx-openai-server)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcubist38%2Fmlx-openai-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcubist38%2Fmlx-openai-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcubist38%2Fmlx-openai-server/lists"}