https://github.com/patrickjaja/workoflow-rag
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/patrickjaja/workoflow-rag
- Owner: patrickjaja
- Created: 2025-06-23T19:08:25.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-06-26T08:24:27.000Z (11 months ago)
- Last Synced: 2025-06-26T09:27:51.031Z (11 months ago)
- Language: Python
- Size: 76.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Hybrid Search RAG API
A production-ready Retrieval-Augmented Generation (RAG) system with advanced hybrid search capabilities, combining semantic understanding with keyword precision for superior document retrieval. Designed for on-premise deployment and seamless n8n integration.
## Features
- **Hybrid Search**: Combines semantic (dense) and keyword (sparse) search for optimal results
- **Advanced Ranking**: Multi-stage ranking with Reciprocal Rank Fusion and LLM-based reranking
- **Multi-Format Support**: Processes PDF, TXT, CSV, and JSON files with format-specific strategies
- **Intelligent Chunking**: Document-type specific chunking with configurable overlap
- **Azure OpenAI Integration**: Uses text-embedding-3-large (3072 dimensions) and gpt-4o-mini
- **MinIO Storage**: S3-compatible object storage with automatic deduplication
- **Qdrant Vector Database**: High-performance vector search with named vectors
- **RESTful API**: FastAPI-based endpoints with comprehensive error handling
- **Docker Deployment**: Fully containerized with health checks and monitoring
- **Enhanced Name Search**: Special handling for person/entity name queries
## Architecture
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ FastAPI │────▶│ MinIO │ │ Qdrant │
│ API │ │ Storage │ │ Vector │
└─────────────┘ └─────────────┘ │ Database │
│ └─────────────┘
│ ▲
▼ │
┌─────────────┐ ┌─────────────┐ │
│ Unstructured│────▶│ Azure │────────────┘
│ Processor │ │ OpenAI │
└─────────────┘ └─────────────┘
```
## Core Concepts
### Hybrid Search Architecture
This system implements a sophisticated hybrid search approach that combines the strengths of both semantic and keyword-based search:
#### 1. **Dense Vectors (Semantic Search)**
- Uses Azure OpenAI's `text-embedding-3-large` model (3072 dimensions)
- Captures semantic meaning and context
- Excellent for conceptual queries and paraphrasing
- Handles synonyms and related concepts naturally
#### 2. **Sparse Vectors (Keyword Search)**
- Uses Qdrant's built-in sparse vector implementation
- Preserves exact keyword matching capabilities
- Critical for technical terms, names, and specific identifiers
- Ensures important keywords aren't lost in semantic abstraction
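To make the sparse side concrete, here is a minimal sketch of turning text into a sparse vector and scoring a keyword match. It assumes plain term-frequency weights and hashed token indices; the README does not say how this project builds its sparse vectors, so `VOCAB_SIZE`, `tokenize`, `token_index`, `to_sparse_vector`, and `sparse_dot` are all hypothetical names used only for illustration.

```python
import hashlib
import re
from collections import Counter

VOCAB_SIZE = 2**20  # hypothetical size of the hashed index space


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_index(token: str) -> int:
    """Map a token to a stable integer index via hashing."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % VOCAB_SIZE


def to_sparse_vector(text: str) -> dict[int, float]:
    """Return {index: term frequency}, storing only non-zero dimensions."""
    vector: dict[int, float] = {}
    for token, count in Counter(tokenize(text)).items():
        idx = token_index(token)
        vector[idx] = vector.get(idx, 0.0) + float(count)
    return vector


def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product over the indices that both sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[idx] for idx, value in a.items() if idx in b)


query = to_sparse_vector("SSL certificate error")
doc_a = to_sparse_vector("Renew the SSL certificate before it expires")
doc_b = to_sparse_vector("Improve team communication")
print(sparse_dot(query, doc_a), sparse_dot(query, doc_b))
```

Only shared tokens contribute to the score, which is why exact identifiers such as "SSL" survive even when a dense embedding would blur them into related concepts.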
### Document Ranking & Reranking Process
```
┌─────────────────┐
│ User Query │
└────────┬────────┘
│
┌────┴────┐
│ Process │
└────┬────┘
│
┌────┴────┐ ┌────┴────┐
│ Dense │ │ Sparse │
│ Search │ │ Search │
│(Semantic)│ │(Keyword)│
└────┬────┘ └────┬────┘
│ │
└──────────┬─────────────┘
│
┌──────┴──────┐
│ RRF │
│ Fusion │
└──────┬──────┘
│
┌──────┴──────┐
│ Optional │
│ LLM Rerank │
└──────┬──────┘
│
┌──────┴──────┐
│ Results │
└─────────────┘
```
#### Stage 1: Initial Retrieval
1. **Query Processing**: User query is simultaneously:
- Embedded into a dense vector for semantic search
- Tokenized into sparse vectors for keyword search
2. **Parallel Search**: Both search methods run concurrently in Qdrant:
- Dense search finds semantically similar documents
- Sparse search finds keyword matches
3. **Reciprocal Rank Fusion (RRF)**:
```
RRF_score = Σ(1 / (k + rank_i))
```
- Combines results from both searches
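- `k=60` is a smoothing constant: it flattens the score gap between adjacent ranks so that a single top hit in one list does not dominate the fused ranking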
- Creates unified ranking preserving both semantic and keyword relevance
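The RRF formula above can be sketched in a few lines of Python. This is an illustration of the formula only, not the project's actual code; the function name `reciprocal_rank_fusion` and the document IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the best hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense search result
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical sparse search result

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (`doc_a`, `doc_c`) accumulate two terms and rise above documents found by only one method, regardless of how the raw dense and sparse scores are scaled.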
#### Stage 2: LLM-Based Reranking (Optional)
1. **Context Enrichment**: Top-K results are sent to GPT-4o-mini
2. **Relevance Assessment**: LLM evaluates each result against the query
3. **Smart Reordering**: Results are reranked based on:
- Contextual understanding
- Query intent matching
- Information completeness
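One way the reranking stage could be wired up is sketched below. The prompt wording, the JSON reply format, and the names `build_rerank_prompt`, `rerank`, and `fake_llm` are assumptions for illustration; the README does not document the actual prompt or how the GPT-4o-mini response is parsed. The LLM is passed in as a plain callable so the sketch runs without any API access.

```python
import json
from typing import Callable


def build_rerank_prompt(query: str, candidates: list[str]) -> str:
    """Number the candidate passages and ask for an ordering as JSON."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    return (
        "Rank the passages by relevance to the query.\n"
        f"Query: {query}\n"
        f"Passages:\n{numbered}\n"
        "Answer with a JSON list of passage numbers, most relevant first."
    )


def rerank(query: str, candidates: list[str], llm: Callable[[str], str]) -> list[str]:
    """Reorder candidates by the LLM's answer, keeping the fused order on failure."""
    reply = llm(build_rerank_prompt(query, candidates))
    try:
        order = json.loads(reply)
    except json.JSONDecodeError:
        return list(candidates)  # unusable reply: keep the original order
    if not isinstance(order, list):
        return list(candidates)
    picked: list[int] = []
    for i in order:
        is_index = isinstance(i, int) and not isinstance(i, bool)
        if is_index and 0 <= i < len(candidates) and i not in picked:
            picked.append(i)
    # Passages the model skipped keep their original relative order at the end.
    rest = [i for i in range(len(candidates)) if i not in picked]
    return [candidates[i] for i in picked + rest]


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call; always prefers passage 2, then 0."""
    return "[2, 0]"


passages = ["TLS overview", "Team lunch menu", "Fixing SSL certificate errors"]
print(rerank("SSL certificate error", passages, fake_llm))
```

Falling back to the fused order when the reply cannot be parsed keeps the search usable even if the model returns malformed output.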
### Key Advantages
#### 1. **Superior Retrieval Quality**
- **Best of Both Worlds**: Captures both meaning and precision
- **Robust to Query Variations**: Works with natural language and specific terms
- **Context-Aware**: Understands document relationships and intent
#### 2. **Enhanced Name Search**
- Special handling for person/entity queries
- Exact match prioritization for names
- Prevents semantic drift for proper nouns
#### 3. **Flexibility**
- **Configurable Alpha Weight** (`HYBRID_ALPHA=0.7`): Tune semantic vs keyword importance
- **Search Type Selection**: Choose hybrid, dense-only, or sparse-only per query
- **Optional Reranking**: Balance speed vs accuracy based on use case
#### 4. **Performance Optimization**
- **Parallel Processing**: Dense and sparse searches run concurrently
- **Batch Embeddings**: Efficient processing of multiple documents
- **Fallback Strategies**: Graceful degradation if one method fails
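The concurrency and fallback behavior described above could look roughly like the following `asyncio` sketch. The two search coroutines are stand-ins that return fixed data (the sparse one fails on purpose), and `hybrid_search` is a hypothetical name; the project's real search code may be organized differently.

```python
import asyncio


async def dense_search(query: str) -> list[str]:
    """Stand-in for the dense (semantic) query."""
    await asyncio.sleep(0.01)
    return ["doc_a", "doc_b"]


async def sparse_search(query: str) -> list[str]:
    """Stand-in for the sparse (keyword) query; fails to demonstrate fallback."""
    await asyncio.sleep(0.01)
    raise RuntimeError("sparse index unavailable")


async def hybrid_search(query: str) -> dict[str, list[str]]:
    """Run both searches concurrently and keep whichever results succeeded."""
    results = await asyncio.gather(
        dense_search(query), sparse_search(query), return_exceptions=True
    )
    usable: dict[str, list[str]] = {}
    for name, result in zip(("dense", "sparse"), results):
        if isinstance(result, BaseException):
            print(f"{name} search failed: {result}")
        else:
            usable[name] = result
    if not usable:
        raise RuntimeError("both searches failed")
    return usable


print(asyncio.run(hybrid_search("SSL certificate error")))
```

With `return_exceptions=True`, a failure in one search does not cancel the other, so the caller can still rank the results that did come back.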
#### 5. **Document Intelligence**
- **Format-Specific Processing**: Optimal handling for PDFs, CSVs, JSON, TXT
- **Smart Chunking**: Preserves context with configurable overlap
- **Metadata Preservation**: Maintains source, position, and type information
## Quick Start
### 1. Prerequisites
- Docker and Docker Compose
- Python 3.11+ (for local testing)
- 8GB+ RAM recommended
### 2. Clone and Setup
```bash
git clone <repository-url>
cd python-rag
# Ensure .env file has your Azure credentials
# (Already configured in the provided .env)
```
### 3. Start Services
```bash
# Start all services
docker-compose up -d
# Check service health
docker-compose ps
# View logs
docker-compose logs -f
```
### 4. Test the API
```bash
# Install test dependencies
pip install aiohttp
# Run test suite
python test_api.py
```
## API Endpoints
### Health Check
```bash
GET /health
```
### Upload File
```bash
POST /upload
Content-Type: multipart/form-data
# Example with curl:
curl -X POST -F "file=@document.pdf" http://localhost:8000/upload
```
### Search
```bash
POST /search
Content-Type: application/json
{
"query": "your search query",
"top_k": 10,
"search_type": "hybrid" # Options: "hybrid", "dense", "sparse"
}
```
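For callers outside n8n, a small standard-library client for the search endpoint might look like this. It is a sketch under assumptions: `BASE_URL` reuses the host and port from the curl example, the helper names `build_search_request` and `search` are made up, and the response is decoded as generic JSON because its schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed from the curl example above


def build_search_request(
    query: str, top_k: int = 10, search_type: str = "hybrid"
) -> urllib.request.Request:
    """Build the POST /search request without sending it."""
    if search_type not in {"hybrid", "dense", "sparse"}:
        raise ValueError(f"unsupported search_type: {search_type}")
    body = json.dumps({"query": query, "top_k": top_k, "search_type": search_type})
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 10, search_type: str = "hybrid") -> object:
    """Send the request and decode the JSON response (requires a running API)."""
    request = build_search_request(query, top_k, search_type)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Inspect the request that would be sent; call search(...) once the API is up.
request = build_search_request("SSL certificate error", top_k=5)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Separating request construction from sending makes the payload easy to check before pointing the client at a live deployment.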
### Ask
```bash
POST /ask
Content-Type: application/json
{
"query": "your search query"
}
```
### Refresh Index
```bash
POST /index/refresh
```
### Get Statistics
```bash
GET /stats
```
## n8n Integration
This API is designed to work as a tool in n8n workflows for RAG patterns.
### n8n HTTP Request Node Configuration
1. **Search Endpoint**:
- Method: POST
- URL: `http://your-host:8000/search`
- Body Type: JSON
- Body:
```json
{
"query": "{{ $json.query }}",
"top_k": 10,
"search_type": "hybrid"
}
```
2. **Upload Endpoint**:
- Method: POST
- URL: `http://your-host:8000/upload`
- Body Type: Form-Data
- Send Binary Data: Yes
### Example n8n Workflow
```json
{
"nodes": [
{
"name": "RAG Search",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "POST",
"url": "http://localhost:8000/search",
"jsonParameters": true,
"options": {},
"bodyParametersJson": {
"query": "{{ $json.userQuery }}",
"top_k": 5
}
}
}
]
}
```
## Configuration
Key settings in `.env` that control ranking behavior:
```bash
# Chunking
CHUNK_SIZE=512 # Characters per chunk
CHUNK_OVERLAP=50 # Overlap between chunks
# Search & Ranking
HYBRID_ALPHA=0.7 # Dense vs sparse weight (0.7 = 70% semantic, 30% keyword)
TOP_K_RESULTS=10 # Final results to return
RERANK_TOP_K=20 # Candidates for LLM reranking
ENABLE_RERANKING=true # Toggle LLM-based reranking
RRF_K=60 # Reciprocal Rank Fusion constant
# Azure OpenAI
AZURE_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_LLM_DEPLOYMENT=gpt-4o-mini
```
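To show how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sliding-window chunker over characters. It is a simplified sketch: the project applies format-specific chunking strategies that are not described in detail here, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many characters later
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)  # defaults mirror CHUNK_SIZE=512, CHUNK_OVERLAP=50
print([len(chunk) for chunk in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

With the defaults, each new chunk begins 462 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.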
## Advanced Usage
### Custom Filters
Search with metadata filters:
```json
{
"query": "workshop",
"filters": {
"file_type": "pdf",
"filename": "Workshop_Manual.pdf"
}
}
```
### Batch Processing
Process multiple files from MinIO:
```bash
# Upload files to MinIO bucket
# Then refresh index
curl -X POST http://localhost:8000/index/refresh
```
## Performance Optimization
### Search Quality Tuning
1. **Hybrid Alpha (`HYBRID_ALPHA`)**:
- `0.0`: Pure keyword search (best for exact matches)
- `0.5`: Balanced semantic and keyword
- `0.7`: Default - emphasizes semantic understanding
- `1.0`: Pure semantic search (best for concepts)
2. **Chunk Configuration**:
- **Size**: Larger chunks (1024) preserve context, smaller (256) increase precision
- **Overlap**: Higher overlap (100) reduces the chance that a passage is split across a chunk boundary; lower overlap (0) avoids duplicated text and keeps the index smaller
3. **Reranking Strategy**:
- Enable for critical queries requiring highest accuracy
- Disable for real-time applications needing sub-second response
- Adjust `RERANK_TOP_K` to balance quality vs API costs
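The effect of `HYBRID_ALPHA` can be illustrated with a weighted score fusion. This sketch assumes min-max normalization followed by a weighted sum, which is one common way to apply such a weight; the README does not specify how alpha is combined with RRF in this project, and `min_max`, `weighted_fusion`, and the sample scores are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores into [0, 1] so dense and sparse scores become comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.7
) -> list[tuple[str, float]]:
    """Combine normalized scores as alpha * dense + (1 - alpha) * sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1.0 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in dense_n.keys() | sparse_n.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}  # made-up similarities
sparse_scores = {"doc_c": 12.0, "doc_a": 3.0, "doc_d": 1.0}   # made-up keyword scores

for alpha in (0.0, 0.7, 1.0):
    top_doc, top_score = weighted_fusion(dense_scores, sparse_scores, alpha)[0]
    print(f"alpha={alpha}: top result {top_doc} ({top_score:.3f})")
```

At `alpha=0.0` the keyword favorite `doc_c` wins, while at `0.7` and `1.0` the semantic favorite `doc_a` takes the top spot, which matches the tuning guidance in the list above.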
### Performance Tips
1. **Embedding Batch Size**: Adjust `EMBEDDING_BATCH_SIZE` for API rate limits
2. **Concurrent Searches**: Hybrid search runs dense and sparse in parallel
3. **Caching**: Results are cached for repeated queries
4. **Index Optimization**: Regular index refresh maintains search quality
## Monitoring
### Check Logs
```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f app
```
### MinIO Console
Access at: http://localhost:9001
- Username: minioadmin
- Password: minioadmin
### Qdrant Dashboard
Access at: http://localhost:6333/dashboard
## Troubleshooting
### Services Not Starting
```bash
# Check Docker resources
docker system df
# Restart services
docker-compose down
docker-compose up -d
```
### Slow Embeddings
- Check Azure OpenAI rate limits
- Reduce `EMBEDDING_BATCH_SIZE`
- Enable request caching
### Search Quality Issues
1. **Poor Semantic Results**:
- Increase `HYBRID_ALPHA` toward 1.0
- Check embedding model deployment
- Verify chunk size isn't too small
2. **Missing Exact Matches**:
- Decrease `HYBRID_ALPHA` toward 0.0
- Ensure sparse vectors are being generated
- Check tokenization isn't removing important terms
3. **Irrelevant Results**:
- Enable reranking with `ENABLE_RERANKING=true`
- Increase `RERANK_TOP_K` for more candidates
- Adjust chunk overlap for better context
## n8n Integration
The API provides REST endpoints that can be easily integrated with n8n workflows:
1. Use HTTP Request nodes to interact with the API
2. Available endpoints:
- `/search`: Search your knowledge base
- `/ask`: Get AI-powered answers
- `/stats`: Monitor your RAG system
## Use Cases
### Ideal For
- **Technical Documentation**: Balances technical terms with conceptual search
- **Knowledge Management**: Handles diverse query types from different users
- **Customer Support**: Finds answers using both keywords and intent
- **Research Libraries**: Combines citation search with topic exploration
- **Enterprise Search**: Handles acronyms, names, and concepts equally well
### Example Scenarios
1. **Technical Query**: "SSL certificate error"
- Sparse search ensures "SSL" and "certificate" are found
- Dense search includes related concepts like "TLS" or "security"
2. **Conceptual Query**: "How to improve team communication"
- Dense search dominates, finding semantically related content
- Sparse search still catches exact phrase matches
3. **Name Search**: "John Smith project updates"
- Enhanced name detection prioritizes exact "John Smith" matches
- Semantic search finds related project content
## Development
### Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run locally (requires services running)
cd app
uvicorn main:app --reload
```
### Adding New File Types
1. Extend `DocumentProcessor` in `services/document_processor.py`
2. Add parsing logic for the new type
3. Update chunking strategy if needed
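As a rough picture of these steps, the sketch below registers a parser for a new extension and dispatches on it. The `DocumentProcessor` class here is a hypothetical stand-in written for this example: the real class in `services/document_processor.py` may expose a different interface, and `register`, `process`, and `parse_markdown` are assumed names.

```python
from pathlib import Path
from typing import Callable

Parser = Callable[[str], list[str]]


class DocumentProcessor:
    """Hypothetical stand-in; the real class in the repository may differ."""

    def __init__(self) -> None:
        self._parsers: dict[str, Parser] = {".txt": self._parse_txt}

    def register(self, extension: str, parser: Parser) -> None:
        """Associate a file extension such as '.md' with a parsing function."""
        self._parsers[extension.lower()] = parser

    def process(self, filename: str, content: str) -> list[str]:
        """Pick the parser by file extension and return the text segments."""
        extension = Path(filename).suffix.lower()
        parser = self._parsers.get(extension)
        if parser is None:
            raise ValueError(f"unsupported file type: {extension or filename}")
        return parser(content)

    @staticmethod
    def _parse_txt(content: str) -> list[str]:
        """Treat blank-line separated paragraphs as segments."""
        return [part.strip() for part in content.split("\n\n") if part.strip()]


def parse_markdown(content: str) -> list[str]:
    """Split on heading lines so each section becomes one segment."""
    sections: list[str] = []
    current: list[str] = []
    for line in content.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


processor = DocumentProcessor()
processor.register(".md", parse_markdown)
print(processor.process("notes.md", "# Intro\nHello\n# Usage\nRun it"))
```

Keeping the parsers in a lookup table means a new format only needs one parsing function and one registration call, leaving the existing formats untouched.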
## Enterprise Advantages
### Why Hybrid Search RAG?
1. **Accuracy**: Traditional semantic-only RAG systems can miss critical exact matches (product codes, names, technical terms). The hybrid approach adds a keyword search path so that these matches are still retrieved.
2. **Flexibility**: Single embedding models can't handle all query types equally well. By combining approaches, we excel at both natural language questions and specific keyword searches.
3. **Performance**: Parallel processing and intelligent caching provide fast responses even with large document collections.
4. **Control**: On-premise deployment with configurable ranking weights gives you full control over search behavior and data security.
5. **Integration**: REST API design makes it easy to integrate with existing workflows, especially n8n automation.
## License
This project is provided as-is for on-premise deployment.
## Support
For issues or questions:
1. Check the logs first
2. Ensure all services are healthy
3. Verify Azure credentials are correct
4. Check example data format matches your use case