https://github.com/notshrirang/loomrag
Multimodal Retrieval-Augmented Generation that "weaves" together text and images seamlessly.
- Host: GitHub
- URL: https://github.com/notshrirang/loomrag
- Owner: NotShrirang
- License: apache-2.0
- Created: 2024-12-27T06:45:21.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-03-29T13:43:06.000Z (7 months ago)
- Last Synced: 2025-03-29T14:31:40.263Z (7 months ago)
- Topics: clip, data-annotation, deep-learning, embeddings, faiss, faiss-cpu, fine-tuning, huggingface, langchain, machine-learning, multimodal, multimodal-rag, multimodal-retrieval-augmented-generation, openai, python, pytorch, retrieval-augmented-generation, transformer, transformers, whisper
- Language: Python
- Homepage: https://huggingface.co/spaces/NotShrirang/LoomRAG
- Size: 15.6 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# LoomRAG: Multimodal Retrieval-Augmented Generation for AI-Powered Search







This project implements **LoomRAG**, a Multimodal Retrieval-Augmented Generation (RAG) system that leverages **OpenAI's CLIP** model for cross-modal image retrieval and semantic search, and **OpenAI's Whisper** model for audio processing. Users can submit text, image, or audio queries and retrieve multimodal responses through vector embeddings. The system features an annotation interface for creating custom datasets and supports CLIP fine-tuning with configurable parameters for domain-specific applications. It also supports uploading images, PDFs, and audio files (including real-time recording) through a Streamlit-based interface for intelligent retrieval.
Experience the project in action:
[Try LoomRAG on Hugging Face Spaces](https://huggingface.co/spaces/NotShrirang/LoomRAG)
---
## Implementation Screenshots

| Data Upload Page | Data Search / Retrieval |
| ---------------- | ----------------------- |
| Data Annotation Page | CLIP Fine-Tuning |

---
## Features
- **Cross-Modal Retrieval**: Search with text to retrieve both text and image results using deep learning
- **Image-Based Search**: Search the database by uploading an image to find similar content
- **Embedding-Based Search**: Uses OpenAI's CLIP and Whisper models, together with SentenceTransformers embedding models, to embed the input data
- **CLIP Fine-Tuning**: Supports custom model training with configurable parameters, including test dataset split size, learning rate, optimizer, and weight decay
- **Fine-Tuned Model Integration**: Seamlessly load and use fine-tuned CLIP models for enhanced search and retrieval
- **Upload Options**: Allows users to upload images, PDFs, and audio files for AI-powered processing and retrieval
- **Audio Integration**: Upload audio files or record audio directly through the interface
- **URL Integration**: Add images directly from URLs and scrape website data, including text and images
- **Web Scraping**: Automatically extract and index content from websites for comprehensive search capabilities
- **Image Annotation**: Annotate uploaded images through an intuitive interface
- **Augmented Text Generation**: Enhances text results using LLMs for contextually rich outputs
- **Streamlit Interface**: Provides a user-friendly web interface for interacting with the system

---
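Cross-modal retrieval works because CLIP maps text and images into a shared embedding space, where relevance is a dot product between L2-normalized vectors. A minimal sketch of that retrieval step, using toy hand-made vectors in place of real CLIP embeddings:

```python
import numpy as np

def retrieve(query_emb, item_embs, k=2):
    """Return indices of the k items most similar to the query.

    Assumes all embeddings are L2-normalized, so cosine similarity
    reduces to a dot product (the usual way CLIP embeddings are compared).
    """
    sims = item_embs @ query_emb      # cosine similarities
    return np.argsort(-sims)[:k]      # indices of the top-k items

# Toy 4-d "embeddings"; in LoomRAG these would come from CLIP.
items = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. an image of a sunset
    [0.0, 1.0, 0.0, 0.0],   # e.g. a paragraph about finance
    [0.9, 0.1, 0.0, 0.0],   # e.g. a caption mentioning sunsets
])
items = items / np.linalg.norm(items, axis=1, keepdims=True)

query = np.array([1.0, 0.0, 0.0, 0.0])  # embedding of "sunset over mountains"

print(retrieve(query, items))  # → [0 2]: the sunset image and caption rank first
```

Because both modalities live in the same space, the same `retrieve` call works whether the query embedding came from text, an image, or transcribed audio.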
## Roadmap

- [x] Fine-tuning CLIP for domain-specific datasets
- [x] Image-based search and retrieval
- [x] Adding support for audio modalities

---
## Architecture Overview

*Architecture Diagram*

1. **Data Indexing**:
- Text, images, and PDFs are preprocessed and embedded using the CLIP model
- Embeddings are stored in a vector database for fast and efficient retrieval
- Support for direct URL-based image indexing and website content scraping

2. **Query Processing**:
- Text queries / image-based queries are converted into embeddings for semantic search
- Uploaded images, audio files and PDFs are processed and embedded for comparison
- The system performs a nearest neighbor search in the vector database to retrieve relevant text, images, and audio

3. **Response Generation**:
- For text results: Optionally refined or augmented using a language model
- For image results: Directly returned or enhanced with image captions
- For audio results: Returned with relevant metadata and transcriptions where applicable
- For PDFs: Extracts text content and provides relevant sections

4. **Image Annotation**:
- Dedicated annotation page for managing uploaded images
- Support for creating and managing multiple datasets simultaneously
- Flexible annotation workflow for efficient data labeling
- Dataset organization and management capabilities

5. **Model Fine-Tuning**:
- Custom CLIP model training on annotated images
- Configurable training parameters for optimization
- Integration of fine-tuned models into the search pipeline

---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/NotShrirang/LoomRAG.git
cd LoomRAG
```

2. Create a virtual environment and install dependencies:

```bash
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

---
## Usage
1. **Running the Streamlit Interface**:
- Start the Streamlit app:
```bash
streamlit run app.py
```

- Access the interface in your browser to:
- Submit natural language queries
- Upload images or PDFs to retrieve contextually relevant results
- Upload or record audio files
- Add images using URLs
- Scrape and index website content
- Search using uploaded images
- Annotate uploaded images
- Fine-tune CLIP models with custom parameters
- Use fine-tuned models for improved search results

2. **Example Queries**:
- **Text Query**: "sunset over mountains"
Output: An image of a sunset over mountains along with descriptive text
- **PDF Upload**: Upload a PDF of a scientific paper
Output: Extracted key sections or contextually relevant images

---
## Configuration

- **Vector Database**: Uses FAISS for efficient similarity search
- **Model**: Uses OpenAI's CLIP for neural embedding generation
- **Augmentation**: Optional LLM-based augmentation for text responses
- **Fine-Tuning**: Configurable parameters for model training and optimization

---
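CLIP fine-tuning minimizes a symmetric contrastive (InfoNCE) loss over matched image–text pairs: each image should score highest against its own caption, and vice versa. The sketch below shows that objective in plain NumPy; it illustrates the loss only and is not this project's actual training code:

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss used to fine-tune CLIP.

    Row i of `img_embs` and row i of `txt_embs` form a matched pair;
    all other rows in the batch act as negatives.
    """
    # Normalize, then compute the pairwise similarity matrix.
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Cross-entropy with targets on the diagonal, in both directions.
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return (xent(logits) + xent(logits.T)) / 2

emb = np.eye(4, 8)                                  # four orthogonal "embeddings"
aligned = clip_contrastive_loss(emb, emb)           # perfectly matched pairs
mismatched = clip_contrastive_loss(emb, emb[::-1])  # deliberately misaligned pairs
print(aligned < 0.001, aligned < mismatched)  # → True True
```

In real training, `img_embs` and `txt_embs` come from the CLIP image and text encoders, and the learning rate, optimizer, and weight decay exposed in the fine-tuning page control how this loss is minimized.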
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any feature requests or bug fixes.
---
## License
This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.
---
## Acknowledgments
- [OpenAI CLIP](https://openai.com/research/clip)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [FAISS](https://github.com/facebookresearch/faiss)
- [Hugging Face](https://huggingface.co/)