{"id":29364219,"url":"https://github.com/randomtask2000/hybrid-dense-reranker","last_synced_at":"2025-07-09T10:07:53.311Z","repository":{"id":303356870,"uuid":"1006732642","full_name":"randomtask2000/Hybrid-Dense-Reranker","owner":"randomtask2000","description":"RAG - Hybrid Dense + Reranker","archived":false,"fork":false,"pushed_at":"2025-07-07T06:44:36.000Z","size":21,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-07T07:43:00.715Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/randomtask2000.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-22T22:18:31.000Z","updated_at":"2025-07-07T06:44:39.000Z","dependencies_parsed_at":"2025-07-07T07:43:04.147Z","dependency_job_id":"c127f1fc-906c-42e5-bb21-5472113f4656","html_url":"https://github.com/randomtask2000/Hybrid-Dense-Reranker","commit_stats":null,"previous_names":["randomtask2000/hybrid-dense-reranker"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/randomtask2000/Hybrid-Dense-Reranker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/randomtask2000%2FHybrid-Dense-Reranker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/randomtask2000%2FHybrid-Dense-Reranker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/randomtask2000%2FHybrid-Dense-Reranker/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/randomtask2000%2FHybrid-Dense-Reranker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/randomtask2000","download_url":"https://codeload.github.com/randomtask2000/Hybrid-Dense-Reranker/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/randomtask2000%2FHybrid-Dense-Reranker/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264437465,"owners_count":23608183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-09T10:07:51.867Z","updated_at":"2025-07-09T10:07:53.305Z","avatar_url":"https://github.com/randomtask2000.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hybrid Dense Reranker - Pure Anthropic\n\n## Configuration\n\nThis application uses Anthropic's Claude for intelligent reranking and TF-IDF for embeddings, providing a pure Anthropic-based solution.\n\n## Solution\n\nSince Anthropic doesn't provide an embeddings API (they focus on text generation with Claude), this application combines:\n- **TF-IDF embeddings** for initial document retrieval\n- **Anthropic Claude** for intelligent relevance scoring and reranking\n\n### Features:\n\n1. **TF-IDF Embeddings**:\n   - Fast, local embedding generation using scikit-learn\n   - No external API calls for embeddings\n   - Efficient for document retrieval\n\n2. 
**Anthropic Claude Integration**:\n   - Uses Claude-3-Sonnet for intelligent relevance scoring\n   - Analyzes query-document relevance with natural language understanding\n   - Combines TF-IDF and Claude scores for optimal results\n\n3. **Hybrid Scoring**:\n   - 30% TF-IDF similarity score\n   - 70% Claude relevance score\n   - Results sorted by combined score\n\n## Setup Instructions\n\n### Quick Setup (Recommended)\n\nFor a quick automated setup, run the provided setup script:\n\n**On macOS/Linux:**\n```bash\n# Make the script executable (if not already)\nchmod +x setup_venv.sh\n\n# Run the setup script\n./setup_venv.sh\n```\n\n**On Windows:**\n```cmd\nREM Run the batch script\nsetup_venv.bat\n```\n\nThis script will:\n- Create a virtual environment\n- Install all dependencies\n- Create a `.env` file from the template\n- Provide next steps\n\n### Manual Setup\n\n### 1. Clone and Navigate to Project\n\n```bash\ngit clone git@github.com:randomtask2000/Hybrid-Dense-Reranker.git\ncd Hybrid-Dense-Reranker\n```\n\n### 2. Create Virtual Environment\n\nCreate a Python virtual environment to isolate project dependencies:\n\n```bash\n# Create virtual environment\npython -m venv venv\n\n# Activate virtual environment\n# On macOS/Linux:\nsource venv/bin/activate\n# On Windows:\n# venv\\Scripts\\activate\n```\n\nYou should see `(venv)` in your terminal prompt when the virtual environment is active.\n\n### 3. Install Dependencies\n\nWith the virtual environment activated, install the required packages:\n\n```bash\npip install -r requirements.txt\n```\n\n### 4. 
Configure Environment Variables\n\nCopy the example environment file and configure your API key:\n\n```bash\ncp .env.example .env\n```\n\nEdit the `.env` file and replace `your-anthropic-api-key-here` with your actual Anthropic API key:\n\n```bash\n# Get your API key from: https://console.anthropic.com/\nANTHROPIC_API_KEY=your-actual-api-key-here\n\n# Corpus Configuration (optional)\nCORPUS_SOURCE=default  # Options: 'default' or 'mormon'\nCHUNK_SIZE=1000        # Maximum characters per chunk (for Mormon corpus)\nCHUNK_OVERLAP=100      # Character overlap between chunks\n```\n\n**Alternative method** - Set environment variable directly:\n\n```bash\nexport ANTHROPIC_API_KEY='your-anthropic-api-key-here'\n```\n\nTo make it permanent, add it to your shell profile:\n\n```bash\necho 'export ANTHROPIC_API_KEY=\"your-anthropic-api-key-here\"' \u003e\u003e ~/.zshrc\nsource ~/.zshrc\n```\n\n### 5. Test the Setup\n\nRun the test script to verify everything works:\n\n```bash\npython test_embedding.py\n```\n\n### 6. Run the Application\n\nMake sure your virtual environment is activated, then run:\n\n```bash\n# Ensure virtual environment is activated\nsource venv/bin/activate  # On macOS/Linux\n# venv\\Scripts\\activate   # On Windows\n\n# Run the application\npython app.py\n```\n\n### 7. 
Virtual Environment Management\n\n**Deactivating the virtual environment:**\n```bash\ndeactivate\n```\n\n**Reactivating the virtual environment:**\n```bash\n# Navigate to project directory\ncd /path/to/Hybrid-Dense-Reranker\n\n# Activate virtual environment\nsource venv/bin/activate  # On macOS/Linux\n# venv\\Scripts\\activate   # On Windows\n```\n\n**Installing additional packages:**\n```bash\n# With virtual environment activated\npip install package-name\n\n# Update requirements.txt if needed\npip freeze \u003e requirements.txt\n```\n\n## Corpus Configuration\n\nThe application supports configurable corpus sources, allowing you to switch between different document collections:\n\n### Available Corpus Sources\n\n1. **Default Corpus** (`CORPUS_SOURCE=default`):\n   - Contains sample legal documents\n   - Includes contracts, compliance memos, and risk assessments\n   - Ready to use out of the box\n\n2. **Mormon Corpus** (`CORPUS_SOURCE=mormon`):\n   - Loads text from `data/mormon13short.txt`\n   - Automatically chunks the Book of Mormon text into manageable pieces\n   - Configurable chunk size and overlap\n\n### Configuration Options\n\nSet these environment variables in your `.env` file:\n\n```bash\n# Corpus source selection\nCORPUS_SOURCE=default  # Options: 'default' or 'mormon'\n\n# Text chunking configuration (applies to Mormon corpus)\nCHUNK_SIZE=1000        # Maximum characters per chunk\nCHUNK_OVERLAP=100      # Characters to overlap between chunks\n```\n\n### Using the Mormon Corpus\n\nTo use the Mormon corpus:\n\n1. Ensure `data/mormon13short.txt` exists in your project\n2. Set `CORPUS_SOURCE=mormon` in your `.env` file\n3. Configure chunk size and overlap as needed\n4. 
Restart the application\n\nThe application will automatically:\n- Parse verse references (e.g., \"1 Nephi 1:1\")\n- Create chunks based on your size settings\n- Maintain context with configurable overlap\n- Fall back to default corpus if the file is not found\n\n### Example Queries by Corpus\n\n**Default Corpus (Legal Documents):**\n```bash\ncurl -X POST http://localhost:5000/rag-query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"contract liability and legal risks\"}'\n```\n\n**Mormon Corpus:**\n```bash\ncurl -X POST http://localhost:5000/rag-query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"Nephi and his teachings about faith\"}'\n```\n\n## Usage\n\nThe application provides a RAG (Retrieval-Augmented Generation) endpoint:\n\n```bash\ncurl -X POST http://localhost:5000/rag-query \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"What are the security risks?\"}'\n```\n\n## API Response Format\n\nThe application returns enhanced results with multiple scoring methods:\n```json\n[\n  {\n    \"title\": \"Security Memo\",\n    \"content\": \"Ensure all employees use 2FA to reduce unauthorized access risks.\",\n    \"tfidf_score\": 0.85,\n    \"claude_score\": 0.92,\n    \"combined_score\": 0.899\n  }\n]\n```\n\n## How It Works\n\n1. **Initial Retrieval**: TF-IDF embeddings find potentially relevant documents\n2. **Claude Analysis**: Each retrieved document is analyzed by Claude for relevance\n3. **Hybrid Scoring**: Combines TF-IDF similarity with Claude's understanding\n4. 
**Intelligent Ranking**: Results sorted by combined score for optimal relevance\n\n## Benefits\n\n- **Pure Anthropic**: Uses only Anthropic's Claude for AI processing\n- **Cost Effective**: TF-IDF embeddings are free and fast\n- **Intelligent**: Claude provides nuanced relevance understanding\n- **Scalable**: Can handle large document collections efficiently\n\n## Testing\n\nFor comprehensive testing instructions, including integration tests and performance tests, see [TESTING.md](TESTING.md).\n\n### Quick Test Run\n```bash\n# Validate setup\npython validate_test_setup.py\n\n# Run all tests\npython run_integration_tests.py\n```\n\n### Corpus Configuration Testing\n\nTest the new corpus configuration functionality:\n\n```bash\n# Quick validation of corpus configuration\npython test_corpus_quick.py\n\n# Comprehensive corpus configuration tests\npython run_corpus_tests.py\n\n# Unit tests for corpus functionality\npython test_corpus_config.py\n\n# Integration tests for corpus workflow\npython test_corpus_integration.py\n```\n\n### Test Different Corpus Sources\n\n```bash\n# Test with default corpus\nCORPUS_SOURCE=default python test_corpus_quick.py\n\n# Test with Mormon corpus (if file exists)\nCORPUS_SOURCE=mormon CHUNK_SIZE=500 python test_corpus_quick.py\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frandomtask2000%2Fhybrid-dense-reranker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frandomtask2000%2Fhybrid-dense-reranker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frandomtask2000%2Fhybrid-dense-reranker/lists"}