https://github.com/patrickjaja/workoflow-rag

Last synced: 9 months ago
JSON representation

Host: GitHub
URL: https://github.com/patrickjaja/workoflow-rag
Owner: patrickjaja
Created: 2025-06-23T19:08:25.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-26T08:24:27.000Z (about 1 year ago)
Last Synced: 2025-06-26T09:27:51.031Z (about 1 year ago)
Language: Python
Size: 76.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Hybrid Search RAG API

A production-ready Retrieval-Augmented Generation (RAG) system with advanced hybrid search capabilities, combining semantic understanding with keyword precision for superior document retrieval. Designed for on-premise deployment and seamless n8n integration.

## Features

- **Hybrid Search**: Combines semantic (dense) and keyword (sparse) search for optimal results
- **Advanced Ranking**: Multi-stage ranking with Reciprocal Rank Fusion and LLM-based reranking
- **Multi-Format Support**: Processes PDF, TXT, CSV, and JSON files with format-specific strategies
- **Intelligent Chunking**: Document-type specific chunking with configurable overlap
- **Azure OpenAI Integration**: Uses text-embedding-3-large (3072 dimensions) and gpt-4o-mini
- **MinIO Storage**: S3-compatible object storage with automatic deduplication
- **Qdrant Vector Database**: High-performance vector search with named vectors
- **RESTful API**: FastAPI-based endpoints with comprehensive error handling
- **Docker Deployment**: Fully containerized with health checks and monitoring
- **Enhanced Name Search**: Special handling for person/entity name queries

## Architecture

```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ FastAPI │────▶│ MinIO │ │ Qdrant │
│ API │ │ Storage │ │ Vector │
└─────────────┘ └─────────────┘ │ Database │
│ └─────────────┘
│ ▲
▼ │
┌─────────────┐ ┌─────────────┐ │
│ Unstructured│────▶│ Azure │────────────┘
│ Processor │ │ OpenAI │
└─────────────┘ └─────────────┘
```

## Core Concepts

### Hybrid Search Architecture

This system implements a sophisticated hybrid search approach that combines the strengths of both semantic and keyword-based search:

#### 1. **Dense Vectors (Semantic Search)**
- Uses Azure OpenAI's `text-embedding-3-large` model (3072 dimensions)
- Captures semantic meaning and context
- Excellent for conceptual queries and paraphrasing
- Handles synonyms and related concepts naturally

#### 2. **Sparse Vectors (Keyword Search)**
- Uses Qdrant's built-in sparse vector implementation
- Preserves exact keyword matching capabilities
- Critical for technical terms, names, and specific identifiers
- Ensures important keywords aren't lost in semantic abstraction

### Document Ranking & Reranking Process

```
┌─────────────────┐
│ User Query │
└────────┬────────┘
│
┌────┴────┐
│ Process │
└────┬────┘
│
┌────┴────┐ ┌────┴────┐
│ Dense │ │ Sparse │
│ Search │ │ Search │
│(Semantic)│ │(Keyword)│
└────┬────┘ └────┬────┘
│ │
└──────────┬─────────────┘
│
┌──────┴──────┐
│ RRF │
│ Fusion │
└──────┬──────┘
│
┌──────┴──────┐
│ Optional │
│ LLM Rerank │
└──────┬──────┘
│
┌──────┴──────┐
│ Results │
└─────────────┘
```

#### Stage 1: Initial Retrieval
1. **Query Processing**: User query is simultaneously:
- Embedded into a dense vector for semantic search
- Tokenized into sparse vectors for keyword search

2. **Parallel Search**: Both search methods run concurrently in Qdrant:
- Dense search finds semantically similar documents
- Sparse search finds keyword matches

3. **Reciprocal Rank Fusion (RRF)**:
```
RRF_score = Σ(1 / (k + rank_i))
```
- Combines results from both searches
- `k=60` (constant) prevents bias toward top results
- Creates unified ranking preserving both semantic and keyword relevance

#### Stage 2: LLM-Based Reranking (Optional)
1. **Context Enrichment**: Top-K results are sent to GPT-4o-mini
2. **Relevance Assessment**: LLM evaluates each result against the query
3. **Smart Reordering**: Results are reranked based on:
- Contextual understanding
- Query intent matching
- Information completeness

### Key Advantages

#### 1. **Superior Retrieval Quality**
- **Best of Both Worlds**: Captures both meaning and precision
- **Robust to Query Variations**: Works with natural language and specific terms
- **Context-Aware**: Understands document relationships and intent

#### 2. **Enhanced Name Search**
- Special handling for person/entity queries
- Exact match prioritization for names
- Prevents semantic drift for proper nouns

#### 3. **Flexibility**
- **Configurable Alpha Weight** (`HYBRID_ALPHA=0.7`): Tune semantic vs keyword importance
- **Search Type Selection**: Choose hybrid, dense-only, or sparse-only per query
- **Optional Reranking**: Balance speed vs accuracy based on use case

#### 4. **Performance Optimization**
- **Parallel Processing**: Dense and sparse searches run concurrently
- **Batch Embeddings**: Efficient processing of multiple documents
- **Fallback Strategies**: Graceful degradation if one method fails

#### 5. **Document Intelligence**
- **Format-Specific Processing**: Optimal handling for PDFs, CSVs, JSON, TXT
- **Smart Chunking**: Preserves context with configurable overlap
- **Metadata Preservation**: Maintains source, position, and type information

## Quick Start

### 1. Prerequisites

- Docker and Docker Compose
- Python 3.11+ (for local testing)
- 8GB+ RAM recommended

### 2. Clone and Setup

```bash
git clone
cd python-rag

# Ensure .env file has your Azure credentials
# (Already configured in the provided .env)
```

### 3. Start Services

```bash
# Start all services
#docker-compose up -d

# Check service health
docker-compose ps

# View logs
docker-compose logs -f
```

### 4. Test the API

```bash
# Install test dependencies
pip install aiohttp

# Run test suite
python test_api.py
```

## API Endpoints

### Health Check
```bash
GET /health
```

### Upload File
```bash
POST /upload
Content-Type: multipart/form-data

# Example with curl:
curl -X POST -F "file=@document.pdf" http://localhost:8000/upload
```

### Search
```bash
POST /search
Content-Type: application/json

{
"query": "your search query",
"top_k": 10,
"search_type": "hybrid" # Options: "hybrid", "dense", "sparse"
}
```

### Ask
```bash
POST /ask
Content-Type: application/json

{
"query": "your search query"
}
```

### Refresh Index
```bash
POST /index/refresh
```

### Get Statistics
```bash
GET /stats
```

## n8n Integration

This API is designed to work as a tool in n8n workflows for RAG patterns.

### n8n HTTP Request Node Configuration

1. **Search Endpoint**:
- Method: POST
- URL: `http://your-host:8000/search`
- Body Type: JSON
- Body:
```json
{
"query": "{{ $json.query }}",
"top_k": 10,
"search_type": "hybrid"
}
```

2. **Upload Endpoint**:
- Method: POST
- URL: `http://your-host:8000/upload`
- Body Type: Form-Data
- Send Binary Data: Yes

### Example n8n Workflow

```json
{
"nodes": [
{
"name": "RAG Search",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "POST",
"url": "http://localhost:8000/search",
"jsonParameters": true,
"options": {},
"bodyParametersJson": {
"query": "{{ $json.userQuery }}",
"top_k": 5
}
}
}
]
}
```

## Configuration

Key settings in `.env` that control ranking behavior:

```bash
# Chunking
CHUNK_SIZE=512 # Characters per chunk
CHUNK_OVERLAP=50 # Overlap between chunks

# Search & Ranking
HYBRID_ALPHA=0.7 # Dense vs sparse weight (0.7 = 70% semantic, 30% keyword)
TOP_K_RESULTS=10 # Final results to return
RERANK_TOP_K=20 # Candidates for LLM reranking
ENABLE_RERANKING=true # Toggle LLM-based reranking
RRF_K=60 # Reciprocal Rank Fusion constant

# Azure OpenAI
AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_LLM_DEPLOYMENT=gpt-4o-mini
```

## Advanced Usage

### Custom Filters

Search with metadata filters:

```json
{
"query": "workshop",
"filters": {
"file_type": "pdf",
"filename": "Workshop_Manual.pdf"
}
}
```

### Batch Processing

Process multiple files from MinIO:

```bash
# Upload files to MinIO bucket
# Then refresh index
curl -X POST http://localhost:8000/index/refresh
```

## Performance Optimization

### Search Quality Tuning

1. **Hybrid Alpha (`HYBRID_ALPHA`)**:
- `0.0`: Pure keyword search (best for exact matches)
- `0.5`: Balanced semantic and keyword
- `0.7`: Default - emphasizes semantic understanding
- `1.0`: Pure semantic search (best for concepts)

2. **Chunk Configuration**:
- **Size**: Larger chunks (1024) preserve context, smaller (256) increase precision
- **Overlap**: Higher overlap (100) prevents boundary loss, lower (0) maximizes coverage

3. **Reranking Strategy**:
- Enable for critical queries requiring highest accuracy
- Disable for real-time applications needing sub-second response
- Adjust `RERANK_TOP_K` to balance quality vs API costs

### Performance Tips

1. **Embedding Batch Size**: Adjust `EMBEDDING_BATCH_SIZE` for API rate limits
2. **Concurrent Searches**: Hybrid search runs dense and sparse in parallel
3. **Caching**: Results are cached for repeated queries
4. **Index Optimization**: Regular index refresh maintains search quality

## Monitoring

### Check Logs
```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f app
```

### MinIO Console
Access at: http://localhost:9001
- Username: minioadmin
- Password: minioadmin

### Qdrant Dashboard
Access at: http://localhost:6333/dashboard

## Troubleshooting

### Services Not Starting
```bash
# Check Docker resources
docker system df

# Restart services
docker-compose down
docker-compose up -d
```

### Slow Embeddings
- Check Azure OpenAI rate limits
- Reduce `EMBEDDING_BATCH_SIZE`
- Enable request caching

### Search Quality Issues

1. **Poor Semantic Results**:
- Increase `HYBRID_ALPHA` toward 1.0
- Check embedding model deployment
- Verify chunk size isn't too small

2. **Missing Exact Matches**:
- Decrease `HYBRID_ALPHA` toward 0.0
- Ensure sparse vectors are being generated
- Check tokenization isn't removing important terms

3. **Irrelevant Results**:
- Enable reranking with `ENABLE_RERANKING=true`
- Increase `RERANK_TOP_K` for more candidates
- Adjust chunk overlap for better context

## n8n Integration

The API provides REST endpoints that can be easily integrated with n8n workflows:

1. Use HTTP Request nodes to interact with the API
2. Available endpoints:
- `/search`: Search your knowledge base
- `/ask`: Get AI-powered answers
- `/stats`: Monitor your RAG system

## Use Cases

### Ideal For
- **Technical Documentation**: Balances technical terms with conceptual search
- **Knowledge Management**: Handles diverse query types from different users
- **Customer Support**: Finds answers using both keywords and intent
- **Research Libraries**: Combines citation search with topic exploration
- **Enterprise Search**: Handles acronyms, names, and concepts equally well

### Example Scenarios

1. **Technical Query**: "SSL certificate error"
- Sparse search ensures "SSL" and "certificate" are found
- Dense search includes related concepts like "TLS" or "security"

2. **Conceptual Query**: "How to improve team communication"
- Dense search dominates, finding semantically related content
- Sparse search still catches exact phrase matches

3. **Name Search**: "John Smith project updates"
- Enhanced name detection prioritizes exact "John Smith" matches
- Semantic search finds related project content

## Development

### Local Development
```bash
# Install dependencies
pip install -r requirements.txt

# Run locally (requires services running)
cd app
uvicorn main:app --reload
```

### Adding New File Types
1. Extend `DocumentProcessor` in `services/document_processor.py`
2. Add parsing logic for the new type
3. Update chunking strategy if needed

## Enterprise Advantages

### Why Hybrid Search RAG?

1. **Accuracy**: Traditional semantic-only RAG systems can miss critical exact matches (product codes, names, technical terms). Our hybrid approach ensures nothing is lost.

2. **Flexibility**: Single embedding models can't handle all query types equally well. By combining approaches, we excel at both natural language questions and specific keyword searches.

3. **Performance**: Parallel processing and intelligent caching provide fast responses even with large document collections.

4. **Control**: On-premise deployment with configurable ranking weights gives you full control over search behavior and data security.

5. **Integration**: REST API design makes it easy to integrate with existing workflows, especially n8n automation.

## License

This project is provided as-is for on-premise deployment.

## Support

For issues or questions:
1. Check the logs first
2. Ensure all services are healthy
3. Verify Azure credentials are correct
4. Check example data format matches your use case

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/patrickjaja/workoflow-rag

Awesome Lists containing this project

README