https://github.com/muskanpaliwal/rag-tool-zenml

A ZenML-based RAG system for document Q&A with multi-format support. My exploration project to understand how RAG systems work under the hood.

document-processor embeddings faiss langchain-python nlp python3 ragsys search sentence-transformers vector-database zenml zenml-pipelines

# ZenML RAG System

A Retrieval-Augmented Generation (RAG) system built with ZenML pipelines for document question-answering.

## Overview

This RAG system allows you to:

1. Process documents (PDF, DOCX, TXT, HTML, etc.) into a vector database
2. Query the processed documents using natural language
3. Retrieve the most relevant document chunks for your queries

## Features

- **Multi-format Document Support**: Process PDFs, Word documents, text files, HTML, and more
- **Smart Text Chunking**: Split documents intelligently with customizable chunk sizes
- **Efficient Embedding**: Generate embeddings using SentenceTransformers models
- **Fast Vector Search**: Use FAISS for efficient similarity search
- **Hybrid Search**: Combine semantic search with keyword matching for better results (sketched after this list)
- **ZenML Integration**: Leverage ZenML for pipeline orchestration and reproducibility
- **CLI Interface**: Simple command-line interface for document processing and querying
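
Hybrid search deserves a brief sketch. The idea is to blend the semantic similarity score with a simple keyword-overlap score, so exact query terms still count toward the ranking. The fusion below is purely illustrative; the project's actual weighting lives in its vector store code and may differ.

```python
# Purely illustrative score fusion for hybrid search; the real implementation
# (see src/data/vector_store.py) may weight or normalize things differently.
def hybrid_score(semantic_score: float, query: str, chunk: str, alpha: float = 0.7) -> float:
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    keyword_score = len(query_terms & chunk_terms) / max(len(query_terms), 1)
    return alpha * semantic_score + (1 - alpha) * keyword_score

# A chunk that repeats the query terms gets a boost over one that does not
print(hybrid_score(0.55, "zenml pipelines", "ZenML pipelines orchestrate steps"))  # 0.685
print(hybrid_score(0.55, "zenml pipelines", "FAISS performs vector search"))       # 0.385
```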

## Project Structure

```
rag_system/
├── src/
│   ├── utils/
│   │   ├── documents_processor.py   # Document loading and processing
│   │   ├── text_splitter.py         # Text chunking
│   │   └── vector_utils.py          # Vector operations utilities
│   ├── models/
│   │   └── embeddings.py            # Embedding models
│   ├── data/
│   │   └── vector_store.py          # Vector storage and retrieval
│   └── pipelines/
│       ├── document_pipeline.py     # Document processing pipeline
│       └── query_pipeline.py        # Query pipeline
├── rag_system.py                    # Main RAG system interface
├── main.py                          # Command-line entry point
└── README.md                        # Documentation
```
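
The two modules under `src/pipelines/` follow ZenML's step-and-pipeline pattern. The sketch below shows roughly how such a document pipeline is wired together; the step names and bodies are hypothetical stand-ins, not the project's actual code.

```python
from zenml import pipeline, step

# Illustrative only: the real steps live in src/pipelines/document_pipeline.py
# and may use different names, signatures, and return types.
@step
def split_documents(document_path: str) -> list:
    # Load files from document_path and split their text into chunks.
    return [f"example chunk from {document_path}"]

@step
def embed_and_store(chunks: list, storage_path: str) -> str:
    # Embed the chunks and persist them to a vector store at storage_path.
    return storage_path

@pipeline
def document_pipeline(document_path: str, storage_path: str):
    chunks = split_documents(document_path)
    embed_and_store(chunks, storage_path)

if __name__ == "__main__":
    # Requires an initialized ZenML repository (`zenml init`) and an active stack.
    document_pipeline(document_path="path/to/documents/", storage_path="./vector_db")
```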

## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/rag-system.git
cd rag-system
```

2. Install the required dependencies:

```bash
pip install zenml langchain sentence-transformers faiss-cpu pypdf
```

3. For additional document format support:

```bash
pip install unstructured
```
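
A quick import check confirms the core dependencies installed correctly; each of these packages exposes a standard `__version__` attribute.

```python
# Sanity check that the core dependencies import and report their versions
import faiss
import sentence_transformers
import zenml

print("faiss:", faiss.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("zenml:", zenml.__version__)
```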

## Usage

### Command Line Interface

The `main.py` script provides a simple command-line interface with three modes of operation:

```bash
# Process documents
python main.py process --document-path path/to/documents/ --storage-path ./vector_db

# Query documents
python main.py query --storage-path ./vector_db --query "What is the main topic of these documents?"

# Interactive mode (ask multiple questions)
python main.py interactive --storage-path ./vector_db
```

### Options

- `--document-path`, `-d`: Path to document or directory to process
- `--storage-path`, `-s`: Path to store the vector database (default: temporary directory)
- `--query`, `-q`: Query string for searching documents
- `--chunk-size`: Size of document chunks (default: 1000)
- `--chunk-overlap`: Overlap between chunks (default: 200)
- `--top-k`, `-k`: Number of results to return for queries (default: 3)
- `--embedding-model`, `-m`: Name of embedding model to use (default: "all-MiniLM-L6-v2")
- `--hybrid-search`: Use hybrid search combining semantic and keyword matching

### Programmatic Usage

You can also use the RAG system programmatically in your Python code:

```python
from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process a document or directory
result = rag.process_documents(
    document_path="path/to/documents/",
    chunk_size=1000,
    chunk_overlap=200
)
print(f"Processed {result['num_chunks']} document chunks")

# Query the processed documents
answer = rag.query(
    query="What is the main topic discussed in these documents?",
    top_k=3,
    hybrid_search=True
)

# Print results
for result in answer['results']:
    print(f"Rank {result['rank']} (Score: {result['score']:.4f})")
    print(f"Content: {result['content']}")
    print(f"Source: {result['source']}")
```

## Customization

### Embedding Models

You can use different SentenceTransformers models by changing the `embedding_model` parameter; a short encoding sketch follows the list:

- `all-MiniLM-L6-v2` (default): Fast and balanced
- `all-mpnet-base-v2`: Higher quality but slower
- `paraphrase-multilingual-MiniLM-L12-v2`: For multilingual support
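
Any SentenceTransformers model can be passed by name. A minimal encoding sketch with the default model shows the embedding shape you get back:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    ["ZenML orchestrates ML pipelines.", "FAISS performs similarity search."],
    normalize_embeddings=True,  # unit-length vectors, so inner product == cosine similarity
)
print(embeddings.shape)  # (2, 384)
```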

### Vector Search

- Change `index_type` to `"IP"` (inner product) to rank by cosine similarity instead of the default L2 distance; the two match only when embeddings are normalized (see the sketch below)
- Use `hybrid_search=True` to combine semantic search with keyword matching
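
Here is a small FAISS sketch of the two index types on random data. Whether the project's `index_type` option maps exactly onto these index classes is an assumption, but the FAISS calls themselves are standard:

```python
import faiss
import numpy as np

dim = 384  # matches all-MiniLM-L6-v2 embeddings
rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, dim)).astype("float32")
faiss.normalize_L2(vectors)            # unit-normalize so inner product == cosine similarity

index_l2 = faiss.IndexFlatL2(dim)      # default: Euclidean (L2) distance
index_ip = faiss.IndexFlatIP(dim)      # "IP": inner product
index_l2.add(vectors)
index_ip.add(vectors)

scores, ids = index_ip.search(vectors[:1], 3)
print(ids[0], scores[0])               # nearest neighbours and their similarity scores
```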

### Document Chunking

- Modify `chunk_size` and `chunk_overlap` to optimize for your specific documents (see the sketch after this list)
- For longer documents, increase chunk size
- For technical documents, decrease chunk size and increase overlap
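
As a concrete reference point, here is roughly how a character-based splitter behaves with the default settings. This calls LangChain's splitter directly; the project's `text_splitter.py` may wrap it differently, and on newer LangChain versions the import lives in `langchain_text_splitters`.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample = "Retrieval-augmented generation grounds answers in your own documents. " * 100

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(sample)

print(f"{len(chunks)} chunks; first chunk is {len(chunks[0])} characters long")
```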

## Integration with LLMs

To create a complete RAG system, integrate with an LLM:

```python
import os

from rag_system import RAGSystem
import openai  # or any other LLM API (this example uses the pre-1.0 openai SDK)

# Initialize the RAG system
rag = RAGSystem(storage_path="./vector_db")

# Process documents if the vector store does not exist yet
if not os.path.exists("./vector_db"):
    rag.process_documents("path/to/documents/")

# Query function with LLM integration
def answer_question(query, top_k=3):
    # Get relevant context from the RAG system
    results = rag.query(query, top_k=top_k)

    # Prepare context for the LLM
    context = "\n\n".join([r["content"] for r in results["results"]])

    # Create prompt with context
    prompt = f"Answer the question based on the following context:\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"

    # Call the LLM API
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    return {
        "answer": response.choices[0].message["content"],
        "sources": [r["source"] for r in results["results"]]
    }

# Example usage
result = answer_question("What are the key benefits described in the document?")
print(result["answer"])
print(f"Sources: {result['sources']}")
```
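
The snippet above calls the legacy (pre-1.0) `openai` SDK. If you have `openai>=1.0` installed, the same request goes through a client object instead; a minimal sketch of the equivalent call:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```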

## Troubleshooting

### Common Issues

1. **FileNotFoundError**: Ensure the document path is correct and accessible.
2. **Memory Issues**: For large documents, reduce batch size or chunk size.
3. **CUDA Errors**: Set the device to `'cpu'` in the embeddings module if you encounter GPU-related errors (see the sketch after this list).
4. **Unsupported File Types**: Ensure you have the necessary dependencies for all file types (e.g., `unstructured` for Word documents).
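
For the CUDA case (issue 3), forcing CPU inference is a one-line change when constructing the SentenceTransformers model. How the project's embeddings module exposes this is an assumption, but the constructor argument itself is standard:

```python
from sentence_transformers import SentenceTransformer

# device="cpu" skips CUDA entirely, even when a GPU is present
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
```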

## License

This project is licensed under the MIT License.