https://github.com/randomtask2000/hybrid-dense-reranker
RAG - Hybrid Dense + Reranker
- Host: GitHub
- URL: https://github.com/randomtask2000/hybrid-dense-reranker
- Owner: randomtask2000
- Created: 2025-06-22T22:18:31.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-07-07T06:44:36.000Z (7 months ago)
- Last Synced: 2025-07-07T07:43:00.715Z (7 months ago)
- Language: Python
- Size: 20.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hybrid Dense Reranker - Pure Anthropic
## Configuration
This application uses Anthropic's Claude for intelligent reranking and locally computed TF-IDF vectors for embeddings, so Anthropic is the only external AI service it depends on.
## Solution
Since Anthropic doesn't provide an embeddings API (they focus on text generation with Claude), this application combines:
- **TF-IDF embeddings** for initial document retrieval
- **Anthropic Claude** for intelligent relevance scoring and reranking
### Features:
1. **TF-IDF Embeddings**:
   - Fast, local embedding generation using scikit-learn
   - No external API calls for embeddings
   - Efficient for document retrieval
2. **Anthropic Claude Integration**:
   - Uses Claude-3-Sonnet for intelligent relevance scoring
   - Analyzes query-document relevance with natural language understanding
   - Combines TF-IDF and Claude scores for optimal results
3. **Hybrid Scoring**:
   - 30% TF-IDF similarity score
   - 70% Claude relevance score
   - Results sorted by combined score
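The 30/70 weighting above amounts to a weighted sum of the two scores. The following is a minimal sketch of that arithmetic, not the project's actual code: the names `combine_scores` and `rank` are hypothetical, and the rounding to three decimals is an assumption made only to match the style of the sample response shown later in this README.

```python
TFIDF_WEIGHT = 0.3   # weight stated in this README
CLAUDE_WEIGHT = 0.7  # weight stated in this README


def combine_scores(tfidf_score: float, claude_score: float) -> float:
    """Weighted sum of the two relevance signals (hypothetical helper)."""
    return TFIDF_WEIGHT * tfidf_score + CLAUDE_WEIGHT * claude_score


def rank(results: list[dict]) -> list[dict]:
    """Attach a combined_score to each result and sort best-first."""
    scored = [
        {
            **r,
            # Rounding is an assumption for readability, not documented behaviour.
            "combined_score": round(
                combine_scores(r["tfidf_score"], r["claude_score"]), 3
            ),
        }
        for r in results
    ]
    return sorted(scored, key=lambda r: r["combined_score"], reverse=True)
```

With `tfidf_score=0.85` and `claude_score=0.92`, the weighted sum is 0.3 × 0.85 + 0.7 × 0.92 = 0.899, which agrees with the sample response in the API Response Format section.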
## Setup Instructions
### Quick Setup (Recommended)
For a quick automated setup, run the provided setup script:
**On macOS/Linux:**
```bash
# Make the script executable (if not already)
chmod +x setup_venv.sh
# Run the setup script
./setup_venv.sh
```
**On Windows:**
```cmd
REM Run the batch script
setup_venv.bat
```
This script will:
- Create a virtual environment
- Install all dependencies
- Create a `.env` file from the template
- Provide next steps
### Manual Setup
### 1. Clone and Navigate to Project
```bash
git clone git@github.com:randomtask2000/Hybrid-Dense-Reranker.git
cd Hybrid-Dense-Reranker
```
### 2. Create Virtual Environment
Create a Python virtual environment to isolate project dependencies:
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
```
You should see `(venv)` in your terminal prompt when the virtual environment is active.
### 3. Install Dependencies
With the virtual environment activated, install the required packages:
```bash
pip install -r requirements.txt
```
### 4. Configure Environment Variables
Copy the example environment file and configure your API key:
```bash
cp .env.example .env
```
Edit the `.env` file and replace `your-anthropic-api-key-here` with your actual Anthropic API key:
```bash
# Get your API key from: https://console.anthropic.com/
ANTHROPIC_API_KEY=your-actual-api-key-here
# Corpus Configuration (optional)
CORPUS_SOURCE=default # Options: 'default' or 'mormon'
CHUNK_SIZE=1000 # Maximum characters per chunk (for Mormon corpus)
CHUNK_OVERLAP=100 # Character overlap between chunks
```
**Alternative method** - Set environment variable directly:
```bash
export ANTHROPIC_API_KEY='your-anthropic-api-key-here'
```
To make it permanent, add it to your shell profile (the example below assumes zsh; for bash, use `~/.bashrc` instead):
```bash
echo 'export ANTHROPIC_API_KEY="your-anthropic-api-key-here"' >> ~/.zshrc
source ~/.zshrc
```
### 5. Test the Setup
Run the test script to verify everything works:
```bash
python test_embedding.py
```
### 6. Run the Application
Make sure your virtual environment is activated, then run:
```bash
# Ensure virtual environment is activated
source venv/bin/activate # On macOS/Linux
# venv\Scripts\activate # On Windows
# Run the application
python app.py
```
### 7. Virtual Environment Management
**Deactivating the virtual environment:**
```bash
deactivate
```
**Reactivating the virtual environment:**
```bash
# Navigate to project directory
cd /path/to/Hybrid-Dense-Reranker
# Activate virtual environment
source venv/bin/activate # On macOS/Linux
# venv\Scripts\activate # On Windows
```
**Installing additional packages:**
```bash
# With virtual environment activated
pip install package-name
# Update requirements.txt if needed
pip freeze > requirements.txt
```
## Corpus Configuration
The application supports configurable corpus sources, allowing you to switch between different document collections:
### Available Corpus Sources
1. **Default Corpus** (`CORPUS_SOURCE=default`):
   - Contains sample legal documents
   - Includes contracts, compliance memos, and risk assessments
   - Ready to use out of the box
2. **Mormon Corpus** (`CORPUS_SOURCE=mormon`):
   - Loads text from `data/mormon13short.txt`
   - Automatically chunks the Book of Mormon text into manageable pieces
   - Configurable chunk size and overlap
### Configuration Options
Set these environment variables in your `.env` file:
```bash
# Corpus source selection
CORPUS_SOURCE=default # Options: 'default' or 'mormon'
# Text chunking configuration (applies to Mormon corpus)
CHUNK_SIZE=1000 # Maximum characters per chunk
CHUNK_OVERLAP=100 # Characters to overlap between chunks
```
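As an illustration of how these three variables might be read and validated, here is a small sketch using only the standard library. It is not taken from the project: `CorpusConfig` and `load_corpus_config` are hypothetical names, the fallback values simply mirror the example values above (the application's real defaults are not documented here), and rejecting an unknown `CORPUS_SOURCE` is an assumption about sensible behaviour.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusConfig:
    source: str
    chunk_size: int
    chunk_overlap: int


def load_corpus_config(env=None) -> CorpusConfig:
    """Read corpus settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    source = env.get("CORPUS_SOURCE", "default").strip().lower()
    if source not in {"default", "mormon"}:
        # Assumption: the real app may handle unknown values differently.
        raise ValueError(f"Unsupported CORPUS_SOURCE: {source!r}")

    # Fallbacks mirror the example values in this README (assumed defaults).
    chunk_size = int(env.get("CHUNK_SIZE", "1000"))
    chunk_overlap = int(env.get("CHUNK_OVERLAP", "100"))
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")

    return CorpusConfig(source, chunk_size, chunk_overlap)
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests, for example `load_corpus_config({"CORPUS_SOURCE": "mormon", "CHUNK_SIZE": "500"})`.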
### Using the Mormon Corpus
To use the Mormon corpus:
1. Ensure `data/mormon13short.txt` exists in your project
2. Set `CORPUS_SOURCE=mormon` in your `.env` file
3. Configure chunk size and overlap as needed
4. Restart the application
The application will automatically:
- Parse verse references (e.g., "1 Nephi 1:1")
- Create chunks based on your size settings
- Maintain context with configurable overlap
- Fall back to default corpus if the file is not found
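To make the interaction between chunk size and overlap concrete, the sketch below splits text into fixed-size character windows where each window starts `chunk_size - chunk_overlap` characters after the previous one. This is a simplified assumption: the application also parses verse references, so its real chunk boundaries may differ, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For instance, `chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)` yields `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.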
### Example Queries by Corpus
**Default Corpus (Legal Documents):**
```bash
curl -X POST http://localhost:5000/rag-query \
-H "Content-Type: application/json" \
-d '{"query": "contract liability and legal risks"}'
```
**Mormon Corpus:**
```bash
curl -X POST http://localhost:5000/rag-query \
-H "Content-Type: application/json" \
-d '{"query": "Nephi and his teachings about faith"}'
```
## Usage
The application provides a RAG (Retrieval-Augmented Generation) endpoint:
```bash
curl -X POST http://localhost:5000/rag-query \
-H "Content-Type: application/json" \
-d '{"query": "What are the security risks?"}'
```
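The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above (same URL, header, and JSON body); the helper names `build_rag_request` and `rag_query` and the 30-second timeout are illustrative assumptions, not part of the project.

```python
import json
import urllib.request

RAG_URL = "http://localhost:5000/rag-query"  # endpoint from the curl example


def build_rag_request(query: str, url: str = RAG_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rag_query(query: str, url: str = RAG_URL, timeout: float = 30.0) -> list:
    """Send the query and decode the JSON response (requires the app to be running)."""
    with urllib.request.urlopen(build_rag_request(query, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the application running locally, `rag_query("What are the security risks?")` should return the list of scored documents described under API Response Format.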
## API Response Format
The application returns enhanced results with multiple scoring methods:
```json
[
  {
    "title": "Security Memo",
    "content": "Ensure all employees use 2FA to reduce unauthorized access risks.",
    "tfidf_score": 0.85,
    "claude_score": 0.92,
    "combined_score": 0.899
  }
]
```
## How It Works
1. **Initial Retrieval**: TF-IDF embeddings find potentially relevant documents
2. **Claude Analysis**: Each retrieved document is analyzed by Claude for relevance
3. **Hybrid Scoring**: Combines TF-IDF similarity with Claude's understanding
4. **Intelligent Ranking**: Results sorted by combined score for optimal relevance
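The four steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the project's implementation: a small standard-library TF-IDF stands in for the scikit-learn vectorizer, the Claude call is replaced by a caller-supplied `claude_scorer` function returning a value in [0, 1], and the names `hybrid_rank` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens: list[list[str]]) -> dict[str, float]:
    """Smoothed inverse document frequency computed over the corpus."""
    n = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """L2-normalised TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def hybrid_rank(
    query: str,
    docs: list[dict],
    claude_scorer: Callable[[str, str], float],
    top_k: int = 3,
) -> list[dict]:
    tokens = [tokenize(d["content"]) for d in docs]
    idf = fit_idf(tokens)
    query_vec = vectorize(tokenize(query), idf)

    # Step 1: initial retrieval by TF-IDF cosine similarity.
    candidates = sorted(
        ((cosine(query_vec, vectorize(t, idf)), d) for t, d in zip(tokens, docs)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]

    # Steps 2 and 3: score each candidate with the reranker, then blend 30/70.
    results = []
    for tfidf_score, doc in candidates:
        claude_score = claude_scorer(query, doc["content"])
        results.append({
            **doc,
            "tfidf_score": tfidf_score,
            "claude_score": claude_score,
            "combined_score": 0.3 * tfidf_score + 0.7 * claude_score,
        })

    # Step 4: final ranking by combined score.
    return sorted(results, key=lambda r: r["combined_score"], reverse=True)
```

Because only the `top_k` candidates reach `claude_scorer`, the number of reranker calls per query stays fixed regardless of corpus size. In a real deployment the scorer would wrap a call to Claude; a stub function is enough to exercise the ranking logic.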
## Benefits
- **Pure Anthropic**: Uses only Anthropic's Claude for AI processing
- **Cost Effective**: TF-IDF embeddings are free and fast
- **Intelligent**: Claude provides nuanced relevance understanding
- **Scalable**: TF-IDF retrieval narrows the candidate set first, so Claude only scores the retrieved documents rather than the whole collection
## Testing
For comprehensive testing instructions, including integration tests and performance tests, see [TESTING.md](TESTING.md).
### Quick Test Run
```bash
# Validate setup
python validate_test_setup.py
# Run all tests
python run_integration_tests.py
```
### Corpus Configuration Testing
Test the new corpus configuration functionality:
```bash
# Quick validation of corpus configuration
python test_corpus_quick.py
# Comprehensive corpus configuration tests
python run_corpus_tests.py
# Unit tests for corpus functionality
python test_corpus_config.py
# Integration tests for corpus workflow
python test_corpus_integration.py
```
### Test Different Corpus Sources
```bash
# Test with default corpus
CORPUS_SOURCE=default python test_corpus_quick.py
# Test with Mormon corpus (if file exists)
CORPUS_SOURCE=mormon CHUNK_SIZE=500 python test_corpus_quick.py
```