https://github.com/tirthraj1605/multimodal_visual_knowledge_assistant
Multimodal_Visual_Knowledge_Assistant is a deep learning project that classifies images and generates contextual text using CLIP and GPT-2. It supports domains like Medical, Fashion, Microscopy, and Nature to provide smart visual-textual insights.
- Host: GitHub
- URL: https://github.com/tirthraj1605/multimodal_visual_knowledge_assistant
- Owner: Tirthraj1605
- Created: 2025-04-13T06:31:07.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-04-13T06:42:44.000Z (12 months ago)
- Last Synced: 2025-04-15T01:18:08.011Z (12 months ago)
- Topics: clip, gpt-2, huggingface-transformers, ipython-display, pillow, python3, torchvision
- Language: Jupyter Notebook
- Homepage:
- Size: 1.17 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🧠 Multimodal_Visual_Knowledge_Assistant
**Multimodal_Visual_Knowledge_Assistant** is a smart AI system that uses computer vision and natural language processing to understand images and generate meaningful text-based responses. By combining the power of **CLIP** and **GPT-2**, this project analyzes images across various domains—Medical, Fashion, Microscopy, and Nature—and produces relevant insights or creative descriptions.
## 📸 Output Demo


---
## 🚀 Features
- 🔍 **Image Classification** using OpenAI’s CLIP model.
- 🧠 **Text Generation** using GPT-2 for dynamic, context-aware language.
- 📷 Visual display of images, predictions, and generated texts.
- 🔄 Supports domain-specific prompts for deeper semantic relevance.
- 💡 Easy to scale by adding new categories and label sets.
---
## 📁 Project Structure
---
## ⚙️ Tech Stack
| Task | Tool / Library |
|------------------------|------------------------------------|
| Multimodal Encoding | `CLIP` (`openai/clip-vit-base-patch32`) |
| Language Generation | `GPT-2` (`gpt2`) |
| Deep Learning Framework| `PyTorch` |
| Tokenization & Modeling| `Hugging Face Transformers` |
| Image Display & Parsing| `Pillow`, `IPython.display` |
---
## 🧩 Use Case Categories
| Category | Example Labels | Prompt for GPT-2 |
|-------------|----------------------------------------------------------------------------|-----------------------------------------------------------------|
| **Medical** | MRI with tumor, Normal MRI, X-ray with fracture, CT scan | *"This image appears to be a medical scan. Based on the content, we can infer:"* |
| **Fashion** | Red dress, Man in suit, Runway fashion, Casual outfit | *"This image appears to be a fashion photo. Here's a marketing copy:"* |
| **Microscopy** | Cell structure, Bacteria culture, Virus, Tissue sample | *"This is a microscopy image. Scientifically, this could imply:"* |
| **Nature** | Mountain landscape, Forest, Beach, Ocean pollution | *"This is a natural scene. Here's a creative caption or fact:"* |
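The table above maps naturally onto a small configuration dictionary. The sketch below is illustrative only; the dictionary name and structure are assumptions, not the repository's actual code:

```python
# Hypothetical mapping of domains to CLIP candidate labels and GPT-2 prompts,
# mirroring the use-case table above. Names are illustrative.
DOMAINS = {
    "medical": {
        "labels": ["MRI with tumor", "Normal MRI", "X-ray with fracture", "CT scan"],
        "prompt": "This image appears to be a medical scan. Based on the content, we can infer:",
    },
    "fashion": {
        "labels": ["Red dress", "Man in suit", "Runway fashion", "Casual outfit"],
        "prompt": "This image appears to be a fashion photo. Here's a marketing copy:",
    },
    "microscopy": {
        "labels": ["Cell structure", "Bacteria culture", "Virus", "Tissue sample"],
        "prompt": "This is a microscopy image. Scientifically, this could imply:",
    },
    "nature": {
        "labels": ["Mountain landscape", "Forest", "Beach", "Ocean pollution"],
        "prompt": "This is a natural scene. Here's a creative caption or fact:",
    },
}
```

Adding a new category is then a matter of appending one entry with its labels and prompt, which is how the "easy to scale" feature above can be realized.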
---
## 🧪 How It Works
1. **Load Models**
   Load CLIP and GPT-2 from Hugging Face's model hub.
2. **Classify Image**
   CLIP processes the image and predicts the most probable label from a set of predefined domain-specific labels.
3. **Generate Text**
   GPT-2 generates a natural language output based on the predicted label and a domain-specific prompt.
4. **Display Results**
   The notebook displays the original image, the predicted label with its confidence score, and the generated descriptive text.
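The steps above can be sketched roughly as follows, using the model IDs from the tech stack table. Function names, signatures, and generation settings here are assumptions for illustration, not the notebook's actual code:

```python
# Sketch of the CLIP -> GPT-2 pipeline described above (illustrative only).
from typing import List, Tuple

import torch
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

CLIP_ID = "openai/clip-vit-base-patch32"
GPT2_ID = "gpt2"


def load_models():
    """Step 1: load CLIP and GPT-2 from the Hugging Face model hub."""
    return (
        CLIPModel.from_pretrained(CLIP_ID),
        CLIPProcessor.from_pretrained(CLIP_ID),
        GPT2LMHeadModel.from_pretrained(GPT2_ID),
        GPT2Tokenizer.from_pretrained(GPT2_ID),
    )


def classify_image(clip_model, clip_processor, image, labels: List[str]) -> Tuple[str, float]:
    """Step 2: score the image against each candidate label with CLIP."""
    inputs = clip_processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image  # shape: (1, len(labels))
    probs = logits.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return labels[best], float(probs[best])


def generate_text(gpt2_model, gpt2_tokenizer, prompt: str, label: str) -> str:
    """Step 3: continue the domain prompt plus the predicted label with GPT-2."""
    input_ids = gpt2_tokenizer.encode(f"{prompt} {label}.", return_tensors="pt")
    output = gpt2_model.generate(
        input_ids,
        max_new_tokens=60,
        do_sample=True,
        top_p=0.95,
        pad_token_id=gpt2_tokenizer.eos_token_id,
    )
    return gpt2_tokenizer.decode(output[0], skip_special_tokens=True)
```

A full run would call `load_models()` once, then `classify_image(...)` and `generate_text(...)` per image, and display the image and both outputs with `IPython.display` (step 4).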
---
## 📥 Installation
Make sure you have Python 3.8+ installed, then install the required dependencies:
```bash
pip install torch torchvision
pip install transformers
pip install pillow
```
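As an optional sanity check (not part of the repository), the following snippet confirms the dependencies installed above are importable:

```python
# Optional post-install check: report any required package that is missing.
import importlib.util

required = ["torch", "torchvision", "transformers", "PIL"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are available.")
```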
---
## 👨‍💻 Author
**Tirthraj Bhalodiya**
[LinkedIn](https://www.linkedin.com/in/tirthraj-bhalodiya-97534b227/)
[GitHub](https://github.com/Tirthraj1605)