https://github.com/parthapray/docling_rag_langchain_colab
This repo contains code for RAG using Docling in a Colab notebook, with LangChain, Milvus, a HuggingFace embedding model, and an LLM
- Host: GitHub
- URL: https://github.com/parthapray/docling_rag_langchain_colab
- Owner: ParthaPRay
- License: mit
- Created: 2024-12-29T09:26:56.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-12-29T11:20:53.000Z (9 months ago)
- Last Synced: 2025-05-21T09:11:51.509Z (5 months ago)
- Topics: all-minilm-l6-v2, chunking, colab-notebook, docling, huggingface, langchain, large-language-models, milvus, pdf, retrieval-augmented-generation, sentence-transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 20.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Docling RAG with LangChain on Colab
This repository demonstrates how to set up a Retrieval-Augmented Generation (RAG) pipeline using **Docling**, **LangChain**, and **Colab**. This setup allows for efficient document processing, embedding generation, vector storage, and querying with a Language Model (LLM). The following sections elaborate on the workflow, components, and implementation details.
---
PDF Supported
---
## **Overview**
The pipeline combines:
1. **Docling**: For document loading and conversion into structured formats (Markdown, JSON, YAML, etc.).
2. **LangChain**: For splitting documents, generating embeddings, and querying with a retriever and LLM.
3. **Colab**: For interactive development and leveraging HuggingFace APIs.

---
## **Workflow Diagram**
```mermaid
flowchart TD
A[Load PDF Documents] --> B[DoclingPDFLoader]
B --> C[Convert to Markdown]
C --> D[Text Splitting with LangChain]
D --> E[Generate Embeddings with HuggingFace]
E --> F[Store Vectors in Milvus Vector Store]
F --> G[Create Retriever]
G --> H[Define RAG Prompt Template]
H --> I[Query with HuggingFace LLM]
I --> J[Retrieve Answers Based on Context]
```

---
## **Setup Instructions**
### **Prerequisites**
- A Colab environment with internet access.
- HuggingFace account for API tokens.
- Python libraries: `docling`, `langchain`, `python-dotenv`, and vector storage tools like `Milvus`.

---
### **Installation**
Install the required libraries:
```bash
%pip install -qq docling docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus
```

---
### **Usage**
#### **1. Loading Documents**
- Documents are loaded using `DoclingPDFLoader`, which supports single or multiple PDF paths.
- Converts PDF content into structured text formats (e.g., Markdown).

#### **2. Splitting Documents**
- Use `RecursiveCharacterTextSplitter` to divide content into chunks.
- Adjust `chunk_size` and `chunk_overlap` for better granularity.

#### **3. Generating Embeddings**
- Use `HuggingFaceEmbeddings` to generate semantic vector embeddings.
- Select the embedding model (e.g., `sentence-transformers/all-MiniLM-L6-v2`).

#### **4. Vector Storage**
- Store document embeddings in a `Milvus` vector database.
- The temporary database is created using Python's `TemporaryDirectory`.

#### **5. Retrieval-Augmented Generation (RAG)**
- Use the retriever to fetch relevant context based on input queries.
- Define a prompt template for the LLM to generate responses.

#### **6. Query Execution**
- Execute queries using the RAG chain to retrieve context-aware answers.

---
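To make step 2 concrete, here is a simplified, pure-Python illustration of overlap-based chunking. It is not LangChain's actual splitter (which also recurses over separators such as paragraphs and sentences); it only sketches how `chunk_size` and `chunk_overlap` interact:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("Docling converts PDF content into structured Markdown text.", 20, 5)
# Each chunk repeats the last 5 characters of the previous one, so text cut
# at a chunk boundary still appears intact in at least one chunk.
```

Overlap trades a little storage for retrieval quality: a sentence severed at a boundary remains searchable in the neighboring chunk.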
## **Code Highlights**
### **Document Loading**
```python
from docling.document_converter import DocumentConverter
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document as LCDocument


class DoclingPDFLoader(BaseLoader):
    def __init__(self, file_path):
        # Accept a single path or a list of paths
        self._file_paths = file_path if isinstance(file_path, list) else [file_path]
        self._converter = DocumentConverter()

    def lazy_load(self):
        for source in self._file_paths:
            dl_doc = self._converter.convert(source).document
            text = dl_doc.export_to_markdown(strict_text=True)
            yield LCDocument(page_content=text)
```

### **Text Splitting**
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

loader = DoclingPDFLoader(file_path="path/to/file.pdf")  # placeholder path
splits = text_splitter.split_documents(loader.load())
```

### **Embedding Generation**
```python
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

HF_EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
```

### **Vector Storage**
```python
from tempfile import mkdtemp

from langchain_milvus import Milvus

# Local Milvus Lite database file, created inside a temporary directory
MILVUS_URI = f"{mkdtemp()}/milvus_demo.db"

vectorstore = Milvus.from_documents(
    splits,
    embeddings,
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
)
```

### **RAG Chain**
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    # Join retrieved documents into one context string
    return "\n\n".join(doc.page_content for doc in docs)

retriever = vectorstore.as_retriever()
prompt = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\nQuery: {question}\nAnswer:\n"
)

rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
```

---
## **Key Features**
1. **Modular Design**: Easily customizable for different documents and queries.
2. **RAG Pipeline**: Combines retrieval and generation for context-aware answers.
3. **Embedding Flexibility**: Supports various HuggingFace embedding models.
4. **Vector Store Integration**: Efficient vector management with Milvus.
5. **Interactive Queries**: Leverages HuggingFace LLM for accurate responses.

---
## **Example Queries**
```python
rag_chain.invoke("Does Docling implement a linear pipeline of operations?")
rag_chain.invoke("How many pages were human annotated for DocLayNet?")
```

---
## **References**
- [Docling Official Documentation](https://ds4sd.github.io/docling/examples/rag_langchain/)
- [LangChain Documentation](https://docs.langchain.com/)
- [HuggingFace Models](https://huggingface.co/)

---
## **Author**
**Partha Pratim Ray**
GitHub: [ParthaPRay](https://github.com/ParthaPRay)