https://github.com/parthapray/gradio_docling_rag_langchain
This repo provides RAG using Docling, LangChain, Milvus, sentence transformers, and HuggingFace LLMs.
- Host: GitHub
- URL: https://github.com/parthapray/gradio_docling_rag_langchain
- Owner: ParthaPRay
- License: mit
- Created: 2024-12-29T12:12:55.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-29T14:23:56.000Z (over 1 year ago)
- Last Synced: 2025-01-09T23:54:47.364Z (over 1 year ago)
- Topics: docling, html, image, large-language-models, milvus, pdf, pptx, retrieval-augmented-generation, sentence-transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Gradio Docling + LangChain + RAG Implementation
This project demonstrates how to implement **Retrieval-Augmented Generation (RAG)** using **Gradio**, **LangChain**, **Docling**, **Milvus**, and **HuggingFace**. The system supports multiple file types, including **PDF, Images, HTML, and PPTX**, for document conversion, chunking, and question answering.
---
## Features
1. **File Formats Supported**:
   - **PDF**: Text extracted using the `PyPdfiumDocumentBackend`.
   - **Images**: OCR can be implemented if actual text extraction is desired.
   - **HTML**: Direct parsing of HTML content.
   - **PPTX**: PowerPoint files processed for extracting slide content.
   - **Other formats under testing**:
     - `.txt`, `.md`, `.asciidoc`
   - **Not supported**: `.docx` (Word), `.xlsx` (Excel).
2. **Chunking**:
   - Utilizes LangChain's `RecursiveCharacterTextSplitter` for splitting documents into manageable chunks.
   - Configurable `chunk_size` and `chunk_overlap` for optimal performance.
3. **RAG Workflow**:
   - Extracts text from uploaded files using `Docling`.
   - Splits documents into chunks for embedding and retrieval.
   - Uses **Milvus** as a vector store for embedding-based retrieval.
   - Employs HuggingFace LLMs (e.g., Mistral-7B-Instruct) for answering user queries based on the retrieved context.
4. **Gradio Interface**:
   - **Upload & Split**: Allows users to upload files and split them into chunks.
   - **RAG Q&A**: Enables users to ask questions based on the uploaded content.
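The two-tab interface described above could be wired up roughly as follows. This is a minimal sketch, not the repository's actual code: `split_files`, `answer`, and `build_ui` are hypothetical names, and both handlers are placeholders where the real conversion, chunking, and RAG calls would go.

```python
def split_files(paths):
    """Placeholder for the Upload & Split handler (hypothetical name)."""
    if not paths:
        return "No files uploaded."
    return f"Received {len(paths)} file(s); conversion and chunking would run here."


def answer(question):
    """Placeholder for the RAG Q&A handler (hypothetical name)."""
    if not question or not question.strip():
        return "Please enter a question."
    return f"(an answer to {question.strip()!r} would be generated here)"


def build_ui():
    """Assemble a two-tab Gradio Blocks app around the handlers above."""
    import gradio as gr  # imported lazily so the handlers load without Gradio

    with gr.Blocks(title="Docling RAG demo") as demo:
        with gr.Tab("Upload & Split"):
            files = gr.File(label="Documents", file_count="multiple", type="filepath")
            split_btn = gr.Button("Split Documents")
            split_out = gr.Textbox(label="Status")
            split_btn.click(split_files, inputs=files, outputs=split_out)
        with gr.Tab("RAG Q&A"):
            question = gr.Textbox(label="Question")
            ask_btn = gr.Button("Ask")
            answer_out = gr.Textbox(label="Answer")
            ask_btn.click(answer, inputs=question, outputs=answer_out)
    return demo

# To start the local server: build_ui().launch()
```

Keeping the handlers as plain functions means they can be exercised without starting the web server.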
---
## Requirements
Install the following dependencies to run the project:
```bash
pip install docling docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus gradio
```
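After installing, a quick check like the one below can confirm that each package is importable. It is only a convenience sketch: `MODULES` and `missing_packages` are hypothetical names, and the mapping assumes the usual import name for each pip package.

```python
from importlib.util import find_spec

# pip package name -> top-level import name (assumed standard names)
MODULES = {
    "docling": "docling",
    "docling-core": "docling_core",
    "python-dotenv": "dotenv",
    "langchain-text-splitters": "langchain_text_splitters",
    "langchain-huggingface": "langchain_huggingface",
    "langchain-milvus": "langchain_milvus",
    "gradio": "gradio",
}


def missing_packages(modules=MODULES):
    """Return the pip names whose top-level module cannot be located."""
    return [pip_name for pip_name, module in modules.items()
            if find_spec(module) is None]


print(missing_packages() or "all dependencies found")
```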
---
## Architecture
1. **Document Conversion**:
   - The `DocumentConverter` from Docling automatically detects file formats and converts them into a unified text representation.
   - Extracted text is exported in Markdown format for consistency.
2. **Embedding and Retrieval**:
   - Documents are embedded using the `sentence-transformers/all-MiniLM-L6-v2` model.
   - Embeddings are stored in a temporary **Milvus** vector database for fast retrieval.
3. **RAG Chain**:
   - Retrieved context is formatted into a prompt using LangChain's `PromptTemplate`.
   - HuggingFace's endpoint is used to query LLMs like `Mistral-7B-Instruct`.
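The three stages above might fit together as in the sketch below. It is an illustration under stated assumptions rather than the repository's code: `build_rag`, `format_docs`, `PROMPT_TEXT`, the prompt wording, the local Milvus file path, `k=4`, and the default `repo_id` (including its version suffix) are all assumptions, and a Hugging Face API token is assumed to be available in the environment.

```python
PROMPT_TEXT = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def format_docs(docs):
    """Join retrieved chunks into one context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag(path, milvus_uri="./rag_demo.db",
              repo_id="mistralai/Mistral-7B-Instruct-v0.3"):  # assumed model id
    """Return an ask(question) function backed by a one-document index."""
    # Third-party imports stay local so this module loads without them.
    from docling.document_converter import DocumentConverter
    from langchain_core.documents import Document
    from langchain_core.prompts import PromptTemplate
    from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint
    from langchain_milvus import Milvus
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Document conversion: Docling detects the format, exports Markdown.
    markdown = DocumentConverter().convert(path).document.export_to_markdown()

    # 2. Chunk, embed, and store in a local (file-backed) Milvus collection.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    docs = [Document(page_content=chunk, metadata={"source": str(path)})
            for chunk in splitter.split_text(markdown)]
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2")
    store = Milvus.from_documents(
        docs, embeddings,
        connection_args={"uri": milvus_uri},
        index_params={"index_type": "FLAT"},
        drop_old=True,
    )
    retriever = store.as_retriever(search_kwargs={"k": 4})

    # 3. RAG chain: fill the prompt with retrieved context, then query the LLM.
    prompt = PromptTemplate.from_template(PROMPT_TEXT)
    llm = HuggingFaceEndpoint(repo_id=repo_id, task="text-generation",
                              max_new_tokens=512)

    def ask(question):
        context = format_docs(retriever.invoke(question))
        return llm.invoke(prompt.format(context=context, question=question))

    return ask
```

A call such as `ask = build_rag("report.pdf")` followed by `ask("What is this document about?")` would then run the full retrieve-and-generate loop; `report.pdf` is a placeholder file name.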
---
## Usage
### 1. Run the Application
Run the following command to launch the Gradio app:
```bash
python gradio_docling_RAG_langchain.py
```
### 2. Upload and Split Documents
- Navigate to the **Upload & Split** tab.
- Upload supported file types (PDF, Images, HTML, PPTX).
- Click "Split Documents" to chunk the uploaded files.
### 3. Ask Questions
- Go to the **RAG Q&A** tab.
- Input a question related to the uploaded documents.
- Click "Ask" to get an answer based on the document content.
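To give a feel for what happens when "Ask" is clicked, here is a toy version of the retrieval step. It is deliberately simplified: a bag-of-words vector stands in for the sentence-transformers embedding, and a sorted Python list stands in for Milvus. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "Milvus stores the chunk embeddings.",
    "Docling converts PDF files to Markdown.",
    "Gradio provides the web interface.",
]
print(top_k("Which tool converts PDF files?", chunks, k=1))
# ['Docling converts PDF files to Markdown.']
```

The selected chunks are what gets passed to the LLM as context alongside the question.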
---
## File Handling
### Supported Formats
| File Type | Handling Mechanism | Notes |
|-----------|---------------------------------|--------------------------------------|
| PDF | PyPdfiumDocumentBackend | Extracts text from PDF files. |
| Images | OCR-based (if implemented) | Extracts text from images (requires OCR). |
| HTML | Direct Parsing | Extracts text from HTML files. |
| PPTX | Slide Content Parsing | Extracts text from PowerPoint slides.|
### Unsupported Formats
- `.docx` (Word documents): Currently not supported.
- `.xlsx` (Excel files): Parsing not implemented.
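A simple extension check can reject unsupported uploads before they reach the converter. The sketch below mirrors the lists in this README; the function name `classify` and the specific image extensions are assumptions, since the README does not list which image types are accepted.

```python
from pathlib import Path

SUPPORTED = {".pdf", ".html", ".pptx",
             ".png", ".jpg", ".jpeg"}  # image extensions are assumed
UNDER_TESTING = {".txt", ".md", ".asciidoc"}
UNSUPPORTED = {".docx", ".xlsx"}


def classify(path):
    """Return 'supported', 'under testing', 'unsupported', or 'unknown'."""
    ext = Path(path).suffix.lower()
    if ext in SUPPORTED:
        return "supported"
    if ext in UNDER_TESTING:
        return "under testing"
    if ext in UNSUPPORTED:
        return "unsupported"
    return "unknown"


for name in ["Report.PDF", "notes.md", "sheet.xlsx"]:
    print(name, "->", classify(name))
```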
---
## Example Workflow
1. **Upload a PDF file**.
2. **Split into chunks**:
   - A PDF with 10 pages is chunked into 1000-character segments with a 200-character overlap.
3. **Ask a question**:
   - Query: "What is the content of the first page?"
   - The system retrieves relevant chunks and generates an answer using the LLM.
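To estimate how many chunks step 2 produces, the sketch below uses a plain fixed-size sliding window. This is a simplified stand-in: `RecursiveCharacterTextSplitter` prefers to break on natural separators, so its real chunk counts will differ somewhat. The figure of 2,000 characters per page is an assumption for illustration, and `window_chunks` and `expected_count` are hypothetical names.

```python
import math


def window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop before a window would contain nothing but overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def expected_count(n_chars, chunk_size=1000, chunk_overlap=200):
    """Closed-form chunk count for the sliding window above."""
    if n_chars == 0:
        return 0
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((n_chars - chunk_overlap) / step))


# 10 pages at an assumed 2,000 characters per page
text = "x" * 20_000
print(len(window_chunks(text)), expected_count(len(text)))  # 25 25
```

Each new window advances by `chunk_size - chunk_overlap` = 800 characters, which is why 20,000 characters yield 25 chunks rather than 20.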
---
## Future Enhancements
- Finalize support for `.txt` and `.md` file formats (currently under testing).
- Expand compatibility with `.docx` and `.xlsx`.
- Integrate advanced OCR libraries for better image processing.
- Implement additional chunking strategies.
---
## Developed By
**Partha Pratim Ray**
GitHub: [ParthaPRay](https://github.com/ParthaPRay)