https://github.com/Blaizzy/mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- Host: GitHub
- URL: https://github.com/Blaizzy/mlx-vlm
- Owner: Blaizzy
- License: MIT
- Created: 2024-04-16T15:10:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-11T23:19:20.000Z (3 months ago)
- Last Synced: 2025-07-12T02:21:35.458Z (3 months ago)
- Topics: apple-silicon, florence2, idefics, llava, llm, local-ai, mlx, molmo, paligemma, pixtral, vision-framework, vision-language-model, vision-transformer
- Language: Python
- Homepage:
- Size: 35.9 MB
- Stars: 1,481
- Watchers: 20
- Forks: 146
- Open Issues: 76
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- awesome-LLM-resources - MLX-VLM - MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. (Fine-Tuning)
- awesome_ai_agents - Mlx-Vlm - MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. (Building / LLM Models)
- Awesome-LLMOps - MLX-VLM - MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX. (Training / FineTune)
README
# MLX-VLM

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Command Line Interface (CLI)](#command-line-interface-cli)
- [Chat UI with Gradio](#chat-ui-with-gradio)
- [Python Script](#python-script)
- [Multi-Image Chat Support](#multi-image-chat-support)
- [Supported Models](#supported-models)
- [Usage Examples](#usage-examples)
- [Fine-tuning](#fine-tuning)

## Installation
The easiest way to get started is to install the `mlx-vlm` package using pip:
```sh
pip install -U mlx-vlm
```

## Usage
### Command Line Interface (CLI)
Generate output from a model using the CLI:
```sh
# Image generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --prompt "Describe this image." --image http://images.cocodataset.org/val2017/000000039769.jpg

# Audio generation (New)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multi-modal generation (Image + Audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
```

### Chat UI with Gradio
Launch a chat interface using Gradio:
```sh
mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit
```

### Python Script
Here's an example of how to use MLX-VLM in a Python script:
```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Describe this image."# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
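
As the comment in the example above notes, PIL images can be passed in place of URLs or paths. A minimal sketch of the same call, reusing the model, processor, and config loaded above (the file path is a placeholder and Pillow is assumed to be installed):

```python
from PIL import Image

# A PIL.Image.Image object works in the image list just like a URL or local path.
local_image = [Image.open("/path/to/local_image.jpg")]

formatted_prompt = apply_chat_template(
    processor, config, "Describe this image.", num_images=len(local_image)
)
output = generate(model, processor, formatted_prompt, local_image, verbose=False)
print(output)
```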
#### Audio Example

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."# Apply chat template with audio
formatted_prompt = apply_chat_template(
processor, config, prompt, num_audios=len(audio)
)

# Generate output with audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)
```

#### Multi-Modal Example (Image + Audio)
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load multi-modal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare inputs
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""# Apply chat template
formatted_prompt = apply_chat_template(
processor, config, prompt,
num_images=len(image),
num_audios=len(audio)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)
```

### Server (FastAPI)
Start the server:
```sh
mlx_vlm.server
```

The server provides multiple endpoints for different use cases and supports dynamic model loading/unloading with caching (one model at a time).
#### Available Endpoints
- `/generate` - Main generation endpoint with support for images, audio, and text
- `/chat` - Chat-style interaction endpoint
- `/responses` - OpenAI-compatible endpoint
- `/health` - Check server status
- `/unload` - Unload current model from memory
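
To exercise the `/health` and `/unload` endpoints from Python, here is a minimal sketch using the `requests` library. The HTTP methods (GET for `/health`, POST for `/unload`) are assumptions; the base URL matches the curl examples that follow:

```python
import requests

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples below

# Check that the server is running (assumes /health answers a plain GET).
health = requests.get(f"{BASE_URL}/health")
print(health.status_code, health.text)

# Free the currently cached model (assumes /unload is a POST with no body).
unload = requests.post(f"{BASE_URL}/unload")
print(unload.status_code, unload.text)
```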
#### Usage Examples

##### Basic Image Generation
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
"image": ["/path/to/repo/examples/images/renewables_california.png"],
"prompt": "This is today'\''s chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?",
"system": "You are a helpful assistant.",
"stream": true,
"max_tokens": 1000
}'
```

##### Audio Support (New)
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"audio": ["/path/to/audio1.wav", "https://example.com/audio2.mp3"],
"prompt": "Describe what you hear in these audio files",
"stream": true,
"max_tokens": 500
}'
```

##### Multi-Modal (Image + Audio)
```sh
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/gemma-3n-E2B-it-4bit",
"image": ["/path/to/image.jpg"],
"audio": ["/path/to/audio.wav"],
"prompt": "",
"max_tokens": 1000
}'
```

##### Chat Endpoint
```sh
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": "What is in this image?",
"images": ["/path/to/image.jpg"]
}
],
"max_tokens": 100
}'
```

##### OpenAI-Compatible Endpoint
```sh
curl -X POST "http://localhost:8000/responses" \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
"messages": [
{
"role": "user",
"content": [
{"type": "input_text", "text": "What is in this image?"},
{"type": "input_image", "image": "/path/to/image.jpg"}
]
}
],
"max_tokens": 100
}'
```

#### Request Parameters
- `model`: Model identifier (required)
- `prompt`: Text prompt for generation
- `image`: List of image URLs or local paths (optional)
- `audio`: List of audio URLs or local paths (optional, new)
- `system`: System prompt (optional)
- `messages`: Chat messages for chat/OpenAI endpoints
- `max_tokens`: Maximum tokens to generate
- `temperature`: Sampling temperature
- `top_p`: Top-p sampling parameter
- `stream`: Enable streaming responses
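
The same parameters can also be sent from Python instead of curl. Below is a minimal sketch using the `requests` library with only the parameters documented above; the response schema is not documented here, so the example simply prints the raw body:

```python
import requests

# Assumes the server was started with `mlx_vlm.server` and is listening
# on the default http://localhost:8000 used in the curl examples above.
payload = {
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "prompt": "What is in this image?",
    "image": ["/path/to/image.jpg"],
    "max_tokens": 100,
    "stream": False,  # set to true to receive a streamed response
}

response = requests.post("http://localhost:8000/generate", json=payload)
response.raise_for_status()

# The response schema is not documented here, so just print the raw body.
print(response.text)
```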
## Multi-Image Chat Support

MLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.
### Usage Examples
#### Python Script
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)
```

#### Command Line
```sh
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg
```

## Video Understanding
MLX-VLM also supports video analysis such as captioning, summarization, and more, with select models.
### Supported Models
The following models support video chat:
1. Qwen2-VL
2. Qwen2.5-VL
3. Idefics3
4. LLaVA

More supported models are coming soon.
### Usage Examples
#### Command Line
```sh
mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0
```

These examples demonstrate how to use video input with MLX-VLM for more complex visual reasoning tasks.
## Fine-tuning
MLX-VLM supports fine-tuning models with LoRA and QLoRA.
## LoRA & QLoRA
To learn more about LoRA, please refer to the [LoRA.md](./mlx_vlm/LORA.MD) file.