https://github.com/davepoon/mlx-vlm-smolvlm-realtime-webcam
Real-time webcam demo with SmolVLM (mlx-community/SmolVLM-Instruct-4bit) and MLX-VLM
- Host: GitHub
- URL: https://github.com/davepoon/mlx-vlm-smolvlm-realtime-webcam
- Owner: davepoon
- Created: 2025-06-12T06:47:19.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-12T12:08:22.000Z (4 months ago)
- Last Synced: 2025-06-12T12:25:14.625Z (4 months ago)
- Topics: apple-silicon, idefics, llms, mlx, mlx-vlm, vision-framework, vision-transformer
- Language: Python
- Homepage:
- Size: 380 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
# SmolVLM Real-Time Webcam Demo with MLX-VLM
A real-time webcam application powered by **SmolVLM** (Small Vision Language Model) running on Apple Silicon using **MLX-VLM**. It provides a simple web interface for analyzing webcam footage in real time with a vision language model (`mlx-community/SmolVLM-Instruct-4bit`).

This repository is a simple demo of real-time object detection using MLX-VLM with `mlx-community/SmolVLM-Instruct-4bit`, optimized for an M1 MacBook Pro. For better output quality you can switch to SmolVLM-Instruct-8bit, though it may need a faster Apple Silicon chip to stay responsive.
## Features
- **Real-time Webcam Analysis** - Capture and analyze webcam frames instantly
- **SmolVLM Integration** - Powered by efficient SmolVLM models via MLX-VLM
- **Web Interface** - Simple, responsive web UI with modern design
- **Real-time Processing** - Fast inference on Apple Silicon devices
- **Customizable Settings** - Adjust prompts, temperature, tokens, and auto-analysis
- **Mobile Friendly** - Responsive design works on various screen sizes
- **Auto Analysis** - Optional automatic frame analysis at set intervals
## Quick Start
### Prerequisites
- **Apple Silicon Mac** (M1, M2, M3, or newer)
- **Python 3.10+**
- **Webcam or camera access**
### Installation
1. **Clone or download the application**
```bash
# If you have the file locally, navigate to the directory
cd /path/to/your/mlx-projects
```
2. **Install dependencies**
```bash
# Install MLX-VLM (the key dependency)
pip install mlx-vlm
# Install web server dependencies
pip install flask flask-socketio
# Install image processing
pip install pillow
```
3. **Run the application**
```bash
python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit --port 8080
```
4. **Open your browser**
- Navigate to: `http://localhost:8080`
- Click "Start Camera" to enable webcam
- Click "Analyze Frame" to get AI descriptions
## Requirements
### Essential Dependencies
```bash
pip install mlx-vlm flask flask-socketio pillow
```
### Supported Models
- `mlx-community/SmolVLM-Instruct-4bit` (default, recommended)
- `mlx-community/SmolVLM-Instruct`
- Other SmolVLM models from mlx-community
## Usage
### Basic Usage
```bash
python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit
```
### Advanced Options
```bash
python mlx_smolvlm_webcam.py \
--model mlx-community/SmolVLM-Instruct-4bit \
--host 127.0.0.1 \
--port 8080 \
--debug
```
### Command Line Arguments
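The flags documented in this section could be wired up with the standard `argparse` module roughly as follows. This is a minimal sketch under assumptions: `build_parser` is a hypothetical helper name, and the real `mlx_smolvlm_webcam.py` may define its options differently. Only the flag names and defaults are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags and defaults listed in this README."""
    parser = argparse.ArgumentParser(description="SmolVLM real-time webcam demo")
    parser.add_argument(
        "--model",
        default="mlx-community/SmolVLM-Instruct-4bit",
        help="Hugging Face model ID",
    )
    parser.add_argument("--host", default="127.0.0.1", help="Server host")
    parser.add_argument("--port", type=int, default=8080, help="Server port")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--port", "8081", "--debug"])
print(f"{args.model} on http://{args.host}:{args.port} (debug={args.debug})")
```

Declaring `--port` with `type=int` means a non-numeric value is rejected at startup rather than when the server tries to bind.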
- `--model`: HuggingFace model ID (default: `mlx-community/SmolVLM-Instruct-4bit`)
- `--host`: Server host (default: `127.0.0.1`)
- `--port`: Server port (default: `8080`)
- `--debug`: Enable debug mode
## Web Interface Features
### Camera Controls
- **Start Camera**: Enable webcam access
- **Analyze Frame**: Capture and analyze the current frame
- **Pause/Resume**: Toggle the camera feed
### Settings Panel
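The ranges given in this section (max tokens 5-50, temperature 0.1-1.0, a fixed set of auto-analyze intervals) suggest that a server would want to clamp whatever the browser sends before passing it to the model. The sketch below is one way to do that; `AnalysisSettings` and `sanitize_settings` are hypothetical names, and whether the actual app validates settings server-side is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Ranges and interval choices as listed in the settings panel.
MAX_TOKENS_RANGE = (5, 50)
TEMPERATURE_RANGE = (0.1, 1.0)
AUTO_INTERVALS = (0.5, 1, 1.5, 2, 2.5, 3, 5, 10)


@dataclass
class AnalysisSettings:
    prompt: str
    max_tokens: int
    temperature: float
    auto_interval: Optional[float]  # None means manual analysis


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sanitize_settings(prompt, max_tokens, temperature, auto_interval=None):
    """Return settings forced into the documented ranges."""
    if auto_interval is not None and auto_interval not in AUTO_INTERVALS:
        auto_interval = None  # unknown interval: fall back to manual
    return AnalysisSettings(
        prompt=prompt.strip(),
        max_tokens=int(clamp(max_tokens, *MAX_TOKENS_RANGE)),
        temperature=float(clamp(temperature, *TEMPERATURE_RANGE)),
        auto_interval=auto_interval,
    )
```

Clamping rather than rejecting keeps the UI responsive when a slider reports an out-of-range value.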
- **Custom Prompt**: Customize what you want the AI to describe
- **Max Tokens**: Control response length (5-50)
- **Temperature**: Adjust creativity/randomness (0.1-1.0)
- **Auto Analyze**: Run analysis automatically every 0.5, 1, 1.5, 2, 2.5, 3, 5, or 10 seconds, or leave it on Manual
### Example Prompts
- "Describe what you see in this image in detail"
- "What objects are visible in this scene?"
- "Analyze the emotions and expressions of people in this image"
- "Describe the lighting and composition of this scene"
- "What activities are taking place in this image?"
## Troubleshooting
### Common Issues
**"Module not found: flask_socketio"**
```bash
pip install flask-socketio
```
**"Model type idefics3 not supported"**
- Make sure you're using `mlx-vlm` not `mlx-lm`
```bash
pip uninstall mlx-lm
pip install mlx-vlm
```
**"Port already in use"**
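If the default port is taken, a short standard-library check can find a free one before launching. This is an illustrative sketch only; `port_is_free` and `first_free_port` are hypothetical helpers and are not part of the demo script.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8080, attempts: int = 20, host: str = "127.0.0.1"):
    """Return the first bindable port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    return None
```

The result can then be passed to the script through `--port`. Another process could still claim the port between the check and the launch, so treat the result as a hint.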
```bash
# Use a different port
python mlx_smolvlm_webcam.py --port 8081
```
**Camera permission denied**
- Allow camera access in your browser
- Check System Settings > Privacy & Security > Camera (System Preferences > Security & Privacy > Camera on older macOS versions)

**Model loading fails**
```bash
# Clear HuggingFace cache and retry
rm -rf ~/.cache/huggingface/
python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit
```
### Performance Tips
1. **Use 4-bit models** for faster inference:
```bash
--model mlx-community/SmolVLM-Instruct-4bit
```
2. **Adjust image size** - App automatically resizes to 512px max dimension
3. **Lower max tokens** for faster responses
4. **Use auto-analyze sparingly** to avoid overwhelming the model
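
The 512px limit mentioned in tip 2 amounts to an aspect-preserving downscale. The sketch below shows only the arithmetic; `fit_within` is a hypothetical helper, and the app's own resizing code may differ.

```python
def fit_within(width: int, height: int, max_dim: int = 512) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dim.

    The aspect ratio is preserved and smaller images are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    new_width = max(1, round(width * max_dim / longest))
    new_height = max(1, round(height * max_dim / longest))
    return new_width, new_height


print(fit_within(1280, 720))  # a 720p webcam frame -> (512, 288)
```

With Pillow, `Image.thumbnail((512, 512))` performs an equivalent in-place downscale, so a helper like this is mainly useful for reasoning about how many pixels reach the model.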
## Architecture
- **Backend**: Flask + SocketIO for real-time communication
- **Frontend**: Modern HTML5 + JavaScript with WebSocket
- **AI Model**: SmolVLM via MLX-VLM for Apple Silicon optimization
- **Image Processing**: PIL for image handling and resizing
## Features in Detail
### Real-time Analysis
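A common way for a browser to hand a captured frame to a Python backend is as a base64 data URL (for example from `canvas.toDataURL("image/jpeg")`). Whether this app uses that exact format is an assumption; the sketch below only illustrates the decoding step, and `decode_data_url` is a hypothetical helper.

```python
import base64
import binascii


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    mime_type = header[len("data:"):-len(";base64")]
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return mime_type, raw
```

The returned bytes can be opened with Pillow via `Image.open(io.BytesIO(raw))` before resizing and inference.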
The application captures webcam frames and sends them to SmolVLM for analysis. The AI provides detailed descriptions of what it sees, including objects, people, activities, and scenes.
### Modern Web Interface
- Gradient backgrounds and modern CSS
- Responsive design for different screen sizes
- Real-time status indicators
- Smooth animations and transitions
### Flexible Configuration
- Adjustable AI parameters (temperature, max tokens)
- Custom prompts for specific use cases
- Auto-analysis for continuous monitoring
## Example Outputs
**Scene Description**:
> "I can see a person sitting at a desk with a laptop computer. There are books and papers scattered on the desk, and a window with natural lighting in the background. The person appears to be working or studying."

**Object Detection**:
> "In this image, I can identify: a laptop computer, several books, a coffee mug, a desk lamp, and a potted plant on the windowsill."
## Contributing
Feel free to submit issues, feature requests, or pull requests to improve this application.
## License
This project is open source. Please check individual dependencies for their respective licenses.
## Acknowledgments
- **SmolVLM**: HuggingFace's efficient vision language model
- **MLX**: Apple's machine learning framework for Apple Silicon
- **MLX-VLM**: MLX integration for vision language models
- **Inspired by**: https://github.com/ngxson/smolvlm-realtime-webcam

---
**Enjoy analyzing the world through AI!**