{"id":13907625,"url":"https://github.com/Blaizzy/mlx-vlm","last_synced_at":"2025-07-18T06:30:32.640Z","repository":{"id":233573420,"uuid":"787462297","full_name":"Blaizzy/mlx-vlm","owner":"Blaizzy","description":"MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.","archived":false,"fork":false,"pushed_at":"2025-07-11T23:19:20.000Z","size":37690,"stargazers_count":1481,"open_issues_count":76,"forks_count":146,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-07-12T02:21:35.458Z","etag":null,"topics":["apple-silicon","florence2","idefics","llava","llm","local-ai","mlx","molmo","paligemma","pixtral","vision-framework","vision-language-model","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Blaizzy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"Blaizzy","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2024-04-16T15:10:12.000Z","updated_at":"2025-07-11T23:49:22.000Z","dependencies_parsed_at":"2024-04-16T20:06:33.026Z","dependency_job_id":"1cd81abc-d804-4f1d-b5be-c1ad911c59a8","html_url":"https://github.com/Blaizzy/mlx-vlm","commit_stats":{"total_commits":156,"total_committers":12,"mean_commits":13.0,"dds":"0.10256410256410253","last_synced_commit":"2a97
875e3283fd13358763fe085b52551d6ff9ad"},"previous_names":["blaizzy/mlx-vlm"],"tags_count":45,"template":false,"template_full_name":null,"purl":"pkg:github/Blaizzy/mlx-vlm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-vlm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-vlm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-vlm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-vlm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Blaizzy","download_url":"https://codeload.github.com/Blaizzy/mlx-vlm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-vlm/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265710530,"owners_count":23815373,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","florence2","idefics","llava","llm","local-ai","mlx","molmo","paligemma","pixtral","vision-framework","vision-language-model","vision-transformer"],"created_at":"2024-08-06T23:02:02.775Z","updated_at":"2025-07-18T06:30:32.629Z","avatar_url":"https://github.com/Blaizzy.png","language":"Python","funding_links":["https://github.com/sponsors/Blaizzy"],"categories":["Python","HarmonyOS","微调 Fine-Tuning","Libraries and Tools","Building","Inference engines","Training","🤖 AI \u0026 Machine Learning","LLM \u0026 Inference","Repos"],"sub_categories":["Windows Manager","2024","LLM 
Models","FineTune"],"readme":"[![Upload Python Package](https://github.com/Blaizzy/mlx-vlm/actions/workflows/python-publish.yml/badge.svg)](https://github.com/Blaizzy/mlx-vlm/actions/workflows/python-publish.yml)\n# MLX-VLM\n\nMLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.\n\n## Table of Contents\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Command Line Interface (CLI)](#command-line-interface-cli)\n  - [Chat UI with Gradio](#chat-ui-with-gradio)\n  - [Python Script](#python-script)\n  - [Server (FastAPI)](#server-fastapi)\n- [Multi-Image Chat Support](#multi-image-chat-support)\n  - [Usage Examples](#usage-examples)\n- [Video Understanding](#video-understanding)\n  - [Supported Models](#supported-models)\n- [Fine-tuning](#fine-tuning)\n\n## Installation\n\nThe easiest way to get started is to install the `mlx-vlm` package using pip:\n\n```sh\npip install -U mlx-vlm\n```\n\n## Usage\n\n### Command Line Interface (CLI)\n\nGenerate output from a model using the CLI:\n\n```sh\n# Text generation from an image\nmlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg\n\n# Text generation from audio (New)\nmlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you hear\" --audio /path/to/audio.wav\n\n# Multi-modal generation (Image + Audio)\nmlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you see and hear\" --image /path/to/image.jpg --audio /path/to/audio.wav\n```\n\n### Chat UI with Gradio\n\nLaunch a chat interface using Gradio:\n\n```sh\nmlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit\n```\n\n### Python Script\n\nHere's an example of how to use MLX-VLM in a Python script:\n\n```python\nimport mlx.core as mx\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import 
load_config\n\n# Load the model\nmodel_path = \"mlx-community/Qwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = load_config(model_path)\n\n# Prepare input\nimage = [\"http://images.cocodataset.org/val2017/000000039769.jpg\"]\n# image = [Image.open(\"...\")] can also be used with PIL.Image.Image objects\nprompt = \"Describe this image.\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(image)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, verbose=False)\nprint(output)\n```\n\n#### Audio Example\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\n# Load model with audio support\nmodel_path = \"mlx-community/gemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare audio input\naudio = [\"/path/to/audio1.wav\", \"/path/to/audio2.mp3\"]\nprompt = \"Describe what you hear in these audio files.\"\n\n# Apply chat template with audio\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_audios=len(audio)\n)\n\n# Generate output with audio\noutput = generate(model, processor, formatted_prompt, audio=audio, verbose=False)\nprint(output)\n```\n\n#### Multi-Modal Example (Image + Audio)\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\n# Load multi-modal model\nmodel_path = \"mlx-community/gemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare inputs\nimage = [\"/path/to/image.jpg\"]\naudio = [\"/path/to/audio.wav\"]\nprompt = \"Describe what you see and hear.\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt,\n    num_images=len(image),\n    num_audios=len(audio)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, 
audio=audio, verbose=False)\nprint(output)\n```\n\n### Server (FastAPI)\n\nStart the server:\n```sh\nmlx_vlm.server\n```\n\nThe server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).\n\n#### Available Endpoints\n\n- `/generate` - Main generation endpoint with support for images, audio, and text\n- `/chat` - Chat-style interaction endpoint\n- `/responses` - OpenAI-compatible endpoint\n- `/health` - Check server status\n- `/unload` - Unload current model from memory\n\n#### Usage Examples\n\n##### Basic Image Generation\n```sh\ncurl -X POST \"http://localhost:8000/generate\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-community/Qwen2.5-VL-32B-Instruct-8bit\",\n    \"image\": [\"/path/to/repo/examples/images/renewables_california.png\"],\n    \"prompt\": \"This is today'\\''s chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?\",\n    \"system\": \"You are a helpful assistant.\",\n    \"stream\": true,\n    \"max_tokens\": 1000\n  }'\n```\n\n##### Audio Support (New)\n```sh\ncurl -X POST \"http://localhost:8000/generate\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-community/gemma-3n-E2B-it-4bit\",\n    \"audio\": [\"/path/to/audio1.wav\", \"https://example.com/audio2.mp3\"],\n    \"prompt\": \"Describe what you hear in these audio files\",\n    \"stream\": true,\n    \"max_tokens\": 500\n  }'\n```\n\n##### Multi-Modal (Image + Audio)\n```sh\ncurl -X POST \"http://localhost:8000/generate\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-community/gemma-3n-E2B-it-4bit\",\n    \"image\": [\"/path/to/image.jpg\"],\n    \"audio\": [\"/path/to/audio.wav\"],\n    \"prompt\": \"Describe what you see and hear\",\n    \"max_tokens\": 1000\n  }'\n```\n\n##### Chat Endpoint\n```sh\ncurl -X POST \"http://localhost:8000/chat\" \\\n  -H 
\"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-community/Qwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"What is in this image?\",\n        \"images\": [\"/path/to/image.jpg\"]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n##### OpenAI-Compatible Endpoint\n```sh\ncurl -X POST \"http://localhost:8000/responses\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"mlx-community/Qwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"input_text\", \"text\": \"What is in this image?\"},\n          {\"type\": \"input_image\", \"image\": \"/path/to/image.jpg\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n#### Request Parameters\n\n- `model`: Model identifier (required)\n- `prompt`: Text prompt for generation\n- `image`: List of image URLs or local paths (optional)\n- `audio`: List of audio URLs or local paths (optional, new)\n- `system`: System prompt (optional)\n- `messages`: Chat messages for chat/OpenAI endpoints\n- `max_tokens`: Maximum tokens to generate\n- `temperature`: Sampling temperature\n- `top_p`: Top-p sampling parameter\n- `stream`: Enable streaming responses\n\n\n## Multi-Image Chat Support\n\nMLX-VLM supports analyzing multiple images simultaneously with select models. 
This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.\n\n\n### Usage Examples\n\n#### Python Script\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\nmodel_path = \"mlx-community/Qwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\nimages = [\"path/to/image1.jpg\", \"path/to/image2.jpg\"]\nprompt = \"Compare these two images.\"\n\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(images)\n)\n\noutput = generate(model, processor, formatted_prompt, images, verbose=False)\nprint(output)\n```\n\n#### Command Line\n\n```sh\nmlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Compare these images\" --image path/to/image1.jpg path/to/image2.jpg\n```\n\n## Video Understanding\n\nMLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.\n\n### Supported Models\n\nThe following models support video chat:\n\n1. Qwen2-VL\n2. Qwen2.5-VL\n3. Idefics3\n4. 
LLaVA\n\nMore models are coming soon.\n\n### Usage Examples\n\n#### Command Line\n```sh\nmlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Describe this video\" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0\n```\n\nThis example shows how to run video understanding with MLX-VLM from the command line.\n\n## Fine-tuning\n\nMLX-VLM supports fine-tuning models with LoRA and QLoRA.\n\n### LoRA \u0026 QLoRA\n\nTo learn more about LoRA, refer to the [LORA.MD](./mlx_vlm/LORA.MD) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBlaizzy%2Fmlx-vlm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBlaizzy%2Fmlx-vlm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBlaizzy%2Fmlx-vlm/lists"}