https://github.com/maharshpatelx/qwen-clip-multimodal
Multimodal Vision-AI: CLIP eyes + Qwen2.5 brain, 155 K-step pipeline & demo.
https://github.com/maharshpatelx/qwen-clip-multimodal
clip computer-vision deep-learning multimodal pytorch qwen2-5 vision-ai
Last synced: 2 months ago
JSON representation
Multimodal Vision-AI: CLIP eyes + Qwen2.5 brain, 155 K-step pipeline & demo.
- Host: GitHub
- URL: https://github.com/maharshpatelx/qwen-clip-multimodal
- Owner: MaharshPatelX
- License: mit
- Created: 2025-06-05T21:10:27.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-13T03:56:31.000Z (4 months ago)
- Last Synced: 2025-06-13T04:38:26.908Z (4 months ago)
- Topics: clip, computer-vision, deep-learning, multimodal, pytorch, qwen2-5, vision-ai
- Language: Python
- Homepage:
- Size: 179 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ๐ค Qwen-CLIP Multimodal: Train Your Own Vision AI
[๐ Read the blog post that inspired this project](https://maharshpatelx.medium.com/building-vision-ai-that-actually-works-my-155-000-step-journey-from-curiosity-to-code-1abee45d9dc4)**Train a smart AI that can see images and talk about them!**
This project helps you create an AI model that can:
- Look at pictures and describe what it sees
- Answer questions about images
- Chat about what's in photos[](https://github.com/MaharshPatelX/qwen-clip-multimodal/blob/main/LICENSE)
[](https://www.python.org/downloads/)
[](https://pytorch.org/)---
## ๐ฏ What This Does
**Simple Explanation:**
- Takes a picture + teaches the AI what's in it
- After training, you can show it new pictures and ask questions
- The AI will describe the image or answer your questions**Example:**
- You: "What's in this picture?" + ๐ผ๏ธ[photo of a cat]
- AI: "This is a fluffy orange cat sitting on a windowsill"---
## ๐ Quick Start (Easy Steps)
### Step 1: Setup Your Computer
```bash
#1. Setup env
curl -Ls https://astral.sh/uv/install.sh | shpip install uv
echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc#2. Activate env
source .venv/bin/activate# 3. Download this project
git clone https://github.com/MaharshPatelX/qwen-clip-multimodal.git
cd qwen-clip-multimodal# 4. Install Python packages (this takes a few minutes)
pip install -r requirements.txt
```### Step 2: Download Training Data
```bash
# Download datasets (this will take some time - about 20GB total)
python scripts/download_datasets.py --all
```**What this downloads:**
- ๐ธ **COCO Dataset**: 330,000 images with descriptions (19GB)
- ๐ฌ **LLaVA Instructions**: 150,000 conversation examples (1GB)### Step 3: Train Your AI (2 Phases)
**Phase 1: Teach AI to connect images with words (Pre-training)**
```bash
python examples/train_model.py --config configs/coco_pretraining.json
```
- **Time**: 6-12 hours (depending on your GPU)
- **What happens**: AI learns basic image understanding**Phase 2: Teach AI to follow instructions (Instruction Tuning)**
```bash
python examples/train_model.py --config configs/llava_instruction.json
```
- **Time**: 3-6 hours
- **What happens**: AI learns to chat and answer questions### Step 4: Test Your Trained AI
```python
from inference import MultimodalInferencePipeline
from models import MultimodalLLM# Load your trained model
model = MultimodalLLM.from_pretrained("./outputs/llava_instruction/best_model")
pipeline = MultimodalInferencePipeline(model)# Ask about an image
answer = pipeline.chat("What do you see in this image?", image="path/to/your/image.jpg")
print(answer)
```---
## ๐ What You Need (Hardware)
### ๐ข Minimum (For Testing)
- **GPU**: 8GB VRAM (like RTX 3070)
- **RAM**: 16GB
- **Storage**: 50GB free space
- **Time**: 1-2 days total training### ๐ก Recommended (For Good Results)
- **GPU**: 16GB VRAM (like RTX 4080)
- **RAM**: 32GB
- **Storage**: 100GB free space
- **Time**: 12-18 hours total training### ๐ข Best (For Fast Training)
- **GPU**: 24GB+ VRAM (like RTX 4090)
- **RAM**: 64GB+
- **Storage**: 200GB free space
- **Time**: 6-10 hours total training---
## ๐ Training Explained (Simple)
### What Happens During Training?
**Phase 1: Image-Text Connection (Pre-training)**
1. Show AI 330,000 images with their descriptions
2. AI learns: "This visual pattern = this word description"
3. Like teaching a child to recognize objects**Phase 2: Conversation Skills (Instruction Tuning)**
1. Show AI 150,000 conversation examples
2. AI learns: "When human asks X, I should respond Y"
3. Like teaching AI good manners and how to chat### Why Two Phases?
- **Phase 1**: Builds basic understanding (like learning vocabulary)
- **Phase 2**: Teaches proper conversation (like learning social skills)---
## ๐ Step-by-Step Training Guide
### Option 1: Quick Test (1 hour)
```bash
# Use tiny test dataset (just to see if it works)
python examples/train_model.py --debug
```
- **Dataset**: 2 images only
- **Time**: 2 minutes
- **Purpose**: Check if everything is working### Option 2: Small Training (6 hours)
```bash
# Use small portions of real datasets
python examples/train_model.py --small-scale
```
- **Dataset**: 10,000 images
- **Time**: 4-6 hours
- **Purpose**: Get decent results without huge time investment### Option 3: Full Training (1-2 days)
```bash
# Step 1: Pre-training on COCO (teaches basic vision)
python examples/train_model.py --config configs/coco_pretraining.json# Step 2: Instruction tuning on LLaVA (teaches conversation)
python examples/train_model.py --config configs/llava_instruction.json
```
- **Dataset**: 480,000+ images and conversations
- **Time**: 12-24 hours total
- **Purpose**: Best possible results---
## ๐ง Configuration Files Explained
### configs/coco_pretraining.json
- **Purpose**: Phase 1 training (image understanding)
- **Dataset**: COCO Captions (330K images)
- **Focus**: Learning to connect images with text### configs/llava_instruction.json
- **Purpose**: Phase 2 training (conversation skills)
- **Dataset**: LLaVA Instructions (150K conversations)
- **Focus**: Learning to chat and answer questions### configs/debug.json
- **Purpose**: Quick testing
- **Dataset**: 2 test images
- **Focus**: Make sure code works---
## ๐ฎ How to Use Your Trained AI
### 1. Image Captioning (Describe Pictures)
```python
caption = pipeline.caption_image("photo.jpg", "Describe this image")
# Output: "A brown dog playing in a green park with trees in the background"
```### 2. Visual Question Answering
```python
answer = pipeline.answer_question("photo.jpg", "What color is the car?")
# Output: "The car is red"
```### 3. Multimodal Chat
```python
response = pipeline.chat("What do you think about this scene?", image="photo.jpg")
# Output: "This looks like a peaceful morning scene with beautiful lighting..."
```### 4. Web API (Let others use your AI)
```bash
# Start web server
python inference/api.py# Now anyone can use your AI at: http://localhost:8000
```---
## ๐ Expected Results
### After Full Training, Your AI Should:
- **Describe images accurately**: 85%+ correct descriptions
- **Answer questions**: 60%+ accuracy on visual questions
- **Hold conversations**: Natural chat about images
- **Speed**: Respond in 1-2 seconds### Performance Comparison:
- **Debug training**: Just for testing (not useful for real use)
- **Small training**: Decent results for basic tasks
- **Full training**: Professional-level performance---
## ๐ ๏ธ Troubleshooting (When Things Go Wrong)
### "Out of Memory" Error
**Problem**: Your GPU doesn't have enough space
**Solutions**:
1. Reduce batch size: Change `"batch_size": 8` to `"batch_size": 4` in config
2. Use smaller model: Change `"load_in_8bit": true` in config
3. Close other programs using GPU### Training is Very Slow
**Problem**: Taking too long
**Solutions**:
1. Use smaller dataset first (try `--small-scale`)
2. Reduce number of epochs in config
3. Use faster GPU if possible### "File Not Found" Error
**Problem**: Can't find dataset files
**Solutions**:
1. Run download script again: `python scripts/download_datasets.py --all`
2. Check internet connection
3. Make sure you have enough storage space### AI Gives Bad Responses
**Problem**: AI responses don't make sense
**Solutions**:
1. Train longer (more epochs)
2. Use larger dataset
3. Check if training finished successfully---
## ๐ What's in This Project?
```
qwen-clip-multimodal/
โโโ ๐ scripts/ # Helper scripts
โ โโโ download_datasets.py # Downloads training data
โโโ ๐ configs/ # Training settings
โ โโโ coco_pretraining.json # Phase 1 training
โ โโโ llava_instruction.json # Phase 2 training
โ โโโ debug.json # Quick testing
โโโ ๐ models/ # AI brain components
โ โโโ multimodal_model.py # Main AI model
โ โโโ clip_encoder.py # Vision part (sees images)
โ โโโ qwen_decoder.py # Language part (makes text)
โโโ ๐ data/ # Dataset processing
โโโ ๐ training/ # Training logic
โโโ ๐ inference/ # Using the trained AI
โโโ ๐ examples/ # Training scripts
โ โโโ train_model.py # Main training script
โโโ ๐ test_data/ # Small test files
```---
## ๐ค Getting Help
### If You're Stuck:
1. **Check the error message**: Usually tells you what went wrong
2. **Try the debug mode first**: `python examples/train_model.py --debug`
3. **Read the troubleshooting section above**
4. **Ask for help**: Create an issue on GitHub### Common Questions:
**Q: How long does training take?**
A: 6-24 hours depending on your GPU and dataset size**Q: Do I need a powerful computer?**
A: Yes, you need a GPU with at least 8GB memory**Q: Can I use this commercially?**
A: Yes, this project is open source**Q: How good will my AI be?**
A: With full training, it should work as well as commercial AI assistants---
## ๐ฏ Next Steps After Training
### 1. Save Your Model to HuggingFace (Share with others)
```python
# Your trained model can be uploaded to share with the world
model.push_to_hub("your-username/your-model-name")
```### 2. Create a Web App
- Use the included API server
- Build a website where people can upload images
- Let users chat with your AI### 3. Improve Further
- Train on more data
- Fine-tune for specific tasks (medical images, art, etc.)
- Combine with other AI models---
## ๐ What Makes This Special?
### Compared to Other Projects:
- โ **Easy to understand**: Written in simple language
- โ **Complete pipeline**: Download data โ Train โ Use
- โ **Production ready**: Can handle real-world use
- โ **Well documented**: Every step explained
- โ **Free to use**: No expensive API calls### Technical Features:
- Uses proven models (CLIP + Qwen2.5)
- Two-stage training for best results
- Memory efficient (works on consumer GPUs)
- Supports multiple dataset formats
- Built-in evaluation metrics---
## ๐งช Testing Your Trained Model
After training, you can test your model to see it in action:
### **Quick Test Script**
```bash
# Test your trained model
python success_test.py
```### **What to Expect from Stage 1 Models**
**โ Working Capabilities:**
- **Text Generation**: Excellent quality text completion
- **Image Processing**: Extracts visual features correctly
- **Basic Multimodal**: Connects images with text concepts**โ ๏ธ Stage 1 Limitations (Normal):**
- **Simple Responses**: Basic image understanding only
- **Short Outputs**: Focused on alignment, not conversation
- **Object-Level**: Recognizes main subjects, not detailed descriptions### **Sample Outputs**
**Text-Only Generation:**
```
Input: "A cat is"
Output: "playing with a toy car that moves along a straight"Input: "The image shows"
Output: "a triangle with vertices at points A, B,"
```**Multimodal Generation:**
```
Input: " cat"
Output: [Basic object recognition responses]Input: " What do you see?"
Output: [Simple descriptive responses]
```### **Model Performance Metrics**
After completing training with 155,000+ steps:
- **โ Vision Encoder**: Working (768D features extracted)
- **โ Language Decoder**: Excellent (natural text generation)
- **โ Fusion Module**: Functional (896D projected features)
- **โ Multimodal Pipeline**: Operational
- **๐ Training Loss**: Reduced from ~17 to ~0.1### **Backup and Recovery**
Your trained model is automatically saved to:
```
outputs/coco_pretraining/checkpoints/stage1_step_XXXXX/
```To create a backup:
```bash
python scripts/backup_model.py --backup
```To convert checkpoint to usable model:
```bash
python scripts/convert_checkpoint.py
```---
## ๐ง Advanced Configuration
### **Memory Optimization**
For systems with limited VRAM:
```json
{
"data": {
"batch_size": 2, // Reduce batch size
"num_workers": 2 // Reduce workers
},
"training": {
"gradient_accumulation_steps": 16, // Increase accumulation
"bf16": true // Use mixed precision
},
"model": {
"use_lora": true, // Enable LoRA for efficiency
"load_in_8bit": true // Use 8-bit quantization
}
}
```### **Training Stages Explained**
**Stage 1: Vision-Language Alignment**
- **Duration**: 3 epochs (~6-12 hours)
- **Purpose**: Teach model to connect visual features with text tokens
- **Frozen**: Language model weights (only fusion module trains)
- **Expected Loss**: 17.0 โ 0.1
- **Capabilities**: Basic image recognition, simple text association**Stage 2: End-to-End Fine-tuning**
- **Duration**: 2 epochs (~3-6 hours)
- **Purpose**: Improve conversation quality and instruction following
- **Unfrozen**: All model weights (or LoRA adapters)
- **Expected Loss**: Further reduction
- **Capabilities**: Better conversations, detailed descriptions### **Hardware Requirements by Configuration**
| Configuration | GPU Memory | Training Time | Dataset Size | Use Case |
|---------------|------------|---------------|--------------|----------|
| **Debug** | 6GB+ | 2 minutes | 2 samples | Code testing |
| **Small-Scale** | 8GB+ | 4-6 hours | 10K samples | Limited resources |
| **COCO Pre-training** | 16GB+ | 6-12 hours | 414K samples | Production Stage 1 |
| **Full Pipeline** | 24GB+ | 12-24 hours | 564K samples | Complete training |### **Common Issues and Solutions**
**Issue**: CUDA Out of Memory
```bash
# Solution: Reduce batch size and increase gradient accumulation
"batch_size": 2,
"gradient_accumulation_steps": 16
```**Issue**: Empty Responses During Testing
```bash
# This is normal for Stage 1 models
# Stage 1 focuses on alignment, not generation quality
# Continue to Stage 2 for better responses
```**Issue**: Training Interrupted
```bash
# Your progress is saved! Resume from latest checkpoint:
python scripts/convert_checkpoint.py
```**Issue**: Model Loading Errors
```bash
# Convert checkpoint to proper format:
cp -r outputs/coco_pretraining/checkpoints/stage1_step_XXXXX outputs/coco_pretraining/best_model
```---
## ๐ Training Progress Tracking
Monitor your training with these key metrics:
**Stage 1 Success Indicators:**
- โ Loss drops from ~17 to <1.0
- โ Learning rate follows cosine schedule
- โ No NaN or infinite values
- โ Gradient norms remain stable
- โ Vision encoder produces consistent features**Stage 2 Success Indicators:**
- โ Further loss reduction
- โ Improved validation metrics
- โ Better response quality in testing
- โ Stable training without crashes**Weights & Biases Integration:**
```bash
# View training progress online
# Automatic logging of loss, learning rate, and metrics
# Compare different training runs
# Monitor GPU utilization and memory usage
```---
## ๐ฏ Next Steps After Training
### **1. Model Evaluation**
```bash
# Test model capabilities
python success_test.py# Comprehensive evaluation
python evaluation/evaluate_model.py --model outputs/coco_pretraining/best_model
```### **2. Stage 2 Training**
```bash
# Continue with instruction tuning
python examples/train_model.py --config configs/llava_instruction.json
```### **3. Model Deployment**
```bash
# Start inference API
python inference/api.py# Test API endpoints
curl -X POST "http://localhost:8000/caption" -F "image=@test.jpg" -F "prompt=Describe this"
```### **4. Model Sharing**
```python
# Upload to HuggingFace Hub
from models import MultimodalLLM
model = MultimodalLLM.from_pretrained("outputs/coco_pretraining/best_model")
model.push_to_hub("your-username/qwen-clip-multimodal")
```---
## ๐ Support & Community
- **Documentation**: Everything you need is in this README
- **Issues**: Report bugs or ask questions on GitHub
- **Improvements**: Contributions welcome!
- **Testing**: Use the provided test scripts to verify your model works### **Troubleshooting Resources**
1. **Check Training Logs**: Look for error patterns in `training.log`
2. **Test Components**: Use `diagnostic_test.py` to isolate issues
3. **Memory Issues**: Reduce batch size or enable quantization
4. **Generation Issues**: Stage 1 models have limited generation - this is normal---
**๐ Ready to build your own vision AI? Start with Step 1 above!**
**๐ฌ Already trained? Test your model with the scripts in this guide!**
*Happy AI training! ๐*