{"id":30425072,"url":"https://github.com/robertoperuzzo/tomm-202509","last_synced_at":"2025-08-22T11:44:34.627Z","repository":{"id":310082048,"uuid":"1038606054","full_name":"robertoperuzzo/tomm-202509","owner":"robertoperuzzo","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-15T15:02:53.000Z","size":13,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-15T17:20:56.575Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/robertoperuzzo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-15T14:07:15.000Z","updated_at":"2025-08-15T15:02:56.000Z","dependencies_parsed_at":"2025-08-15T17:21:02.868Z","dependency_job_id":null,"html_url":"https://github.com/robertoperuzzo/tomm-202509","commit_stats":null,"previous_names":["robertoperuzzo/tomm-202509"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/robertoperuzzo/tomm-202509","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertoperuzzo%2Ftomm-202509","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertoperuzzo%2Ftomm-202509/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertoperuzzo%2Ftomm-202509/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertoperuzzo%2Ftomm-202509/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/owners/robertoperuzzo","download_url":"https://codeload.github.com/robertoperuzzo/tomm-202509/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robertoperuzzo%2Ftomm-202509/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271632878,"owners_count":24793774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-22T11:44:33.610Z","updated_at":"2025-08-22T11:44:34.552Z","avatar_url":"https://github.com/robertoperuzzo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Semantic Search RAG Demo Setup Guide\n\nThis implementation provides a complete semantic search system using Cheshire Cat AI and Typesense for RAG (Retrieval-Augmented Generation) with PDF processing capabilities.\n\n## Requirements\n\n### Python Dependencies\n```\n# Core dependencies\ntypesense==0.15.0\nPyPDF2==3.0.1\nsentence-transformers==2.2.2\nnumpy==1.24.3\ncheshire-cat-api==1.3.1\npydantic==1.10.7\nPyYAML==6.0\n\n# For token-based chunking (recommended)\ntransformers==4.30.0\ntorch\u003e=1.9.0\n\n# Additional utilities\npython-dotenv==1.0.0\nasyncio-mqtt==0.13.0\n```\n\n### System Requirements\n- Python 3.8+\n- Docker and Docker Compose\n- Typesense and Cheshire Cat services 
(provided via `compose.yml`)\n\n## Installation Steps\n\n### 1. Install Python Dependencies\n```bash\npip install typesense PyPDF2 sentence-transformers numpy cheshire-cat-api pydantic PyYAML python-dotenv transformers torch\n```\n\n### 2. Start services (Typesense + Cheshire Cat) with Docker Compose\nUsing the provided `compose.yml` at the project root:\n```bash\ndocker compose -f compose.yml up -d\n```\n\nOptional health checks:\n```bash\ncurl -s http://localhost:8108/health\ncurl -s http://localhost:1865/ | head\n```\n\nCheshire Cat is included in `compose.yml` and will be available at `http://localhost:1865`.\n\n### 3. Configuration\n\nCreate a configuration file using the built-in template:\n```bash\npython semantic_search_rag.py --create-config\n```\n\nThis creates `config.yaml`:\n```yaml\ntypesense_host: localhost\ntypesense_port: 8108\ntypesense_protocol: http\ntypesense_api_key: your-api-key-here\ntypesense_collection: documents\ncheshire_cat_host: localhost\ncheshire_cat_port: 1865\ncheshire_cat_user_id: user\nembedding_model: sentence-transformers/all-MiniLM-L6-v2\n\n# Chunking strategy configuration - KEY FOR EXPERIMENTATION\nchunk_strategy: token_based  # token_based, word_based, sentence_based\nchunk_size: 512\nchunk_overlap: 128\nmin_chunk_size: 32\n\n# Alternative: use research-based predefined configs\n# chunk_config_name: fact_based_medium  # See available options below\n\nmax_results: 5\nsimilarity_threshold: 0.7\n```\n\n### Predefined Chunk Configurations\n\nThe system includes research-based chunk configurations optimized for different use cases:\n\n```bash\n# List all available configurations\npython semantic_search_rag.py --list-chunk-configs\n```\n\nAvailable configurations:\n- `fact_based_small`: 64 tokens - optimal for fact-based queries\n- `fact_based_medium`: 128 tokens - good balance for fact retrieval\n- `context_small`: 256 tokens - small context preservation\n- `context_medium`: 512 tokens - medium context for broader queries\n- 
`context_large`: 1024 tokens - large context for complex topics\n- `sentence_based`: 5 sentences per chunk with 1 sentence overlap\n- `paragraph_based`: ~10 sentences for paragraph-like chunks\n\nAlternatively, use environment variables in a `.env` file:\n```env\nTYPESENSE_HOST=localhost\nTYPESENSE_PORT=8108\nTYPESENSE_API_KEY=your-api-key-here\nCHESHIRE_CAT_HOST=localhost\nCHESHIRE_CAT_PORT=1865\nEMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2\nCHUNK_SIZE=500\nMAX_SEARCH_RESULTS=5\n```\n\n## Usage\n\n### Chunk Size Experimentation (NEW!)\n\n**1. List available chunk configurations:**\n```bash\npython semantic_search_rag.py --list-chunk-configs\n```\n\n**2. Run comprehensive chunk size experiment:**\n```bash\npython semantic_search_rag.py --pdf-dir ./documents --experiment \\\n  --experiment-queries \"What are the main topics?\" \"List specific facts\" \"Explain the methodology\"\n```\n\n**3. Test specific chunk configuration:**\n```bash\n# Use fact-based small chunks (64 tokens)\npython semantic_search_rag.py --config config.yaml --pdf-dir ./documents --chunk-config fact_based_small\n\n# Use large context chunks (1024 tokens)\npython semantic_search_rag.py --config config.yaml --pdf-dir ./documents --chunk-config context_large\n```\n\n### Command Line Interface\n\n1. **Create configuration templates:**\n```bash\npython semantic_search_rag.py --create-config\n# This creates: config.yaml, config_fact_based_64.yaml, config_context_512.yaml, etc.\n```\n\n2. **Process PDF documents with specific chunk size:**\n```bash\n# Using predefined config for fact-based queries\npython semantic_search_rag.py --config config_fact_based_128.yaml --pdf-dir /path/to/pdf/directory\n\n# Using custom token-based chunking\npython semantic_search_rag.py --config config.yaml --pdf-dir /path/to/pdf/directory\n```\n\n3. 
**Perform semantic search:**\n```bash\npython semantic_search_rag.py --config config.yaml --query \"What is machine learning?\"\n```\n\n### Programmatic Usage with Chunk Experimentation\n\n```python\nimport asyncio\nfrom semantic_search_rag import SemanticSearchRAG, PDFExtractor\n\nasync def chunk_size_comparison():\n    # Test different chunk configurations\n    configs = ['fact_based_small', 'fact_based_medium', 'context_medium', 'context_large']\n\n    results = {}\n    for config_name in configs:\n        # Initialize with specific chunk config\n        rag_system = SemanticSearchRAG()\n        rag_system.config.chunk_config_name = config_name\n        rag_system.config.typesense_collection = f\"test_{config_name}\"\n\n        # Reinitialize with new config\n        rag_system.pdf_extractor = PDFExtractor(config_name=config_name)\n        rag_system.typesense_manager.collection_name = f\"test_{config_name}\"\n        rag_system.typesense_manager.create_collection(rag_system.embedding_manager.dimension)\n\n        # Process documents\n        rag_system.process_pdfs(\"./documents\")\n\n        # Test queries\n        fact_query = \"What specific numbers or statistics are mentioned?\"\n        context_query = \"Explain the overall methodology and approach?\"\n\n        fact_result = await rag_system.search_and_generate(fact_query)\n        context_result = await rag_system.search_and_generate(context_query)\n\n        results[config_name] = {\n            'config': PDFExtractor.CHUNK_CONFIGS[config_name],\n            'fact_query_results': len(fact_result['context_chunks']),\n            'context_query_results': len(context_result['context_chunks']),\n            'fact_response_length': len(fact_result['response']),\n            'context_response_length': len(context_result['response'])\n        }\n\n        print(f\"\\n=== {config_name.upper()} ===\")\n        print(f\"Config: {PDFExtractor.CHUNK_CONFIGS[config_name]['description']}\")\n        print(f\"Fact 
query - Found chunks: {len(fact_result['context_chunks'])}\")\n        print(f\"Context query - Found chunks: {len(context_result['context_chunks'])}\")\n\n    return results\n\n# Run automated experiment\nasync def run_experiment():\n    rag_system = SemanticSearchRAG()\n\n    # Run comprehensive experiment\n    experiment_results = rag_system.run_chunk_size_experiment(\n        \"./documents\",\n        [\n            \"What are the specific facts and numbers mentioned?\",\n            \"Explain the main concepts and their relationships?\",\n            \"What methodology was used in this research?\",\n            \"List all mentioned authors or sources?\"\n        ]\n    )\n\n    print(\"Experiment completed! Results saved to chunk_experiment_results.json\")\n\n    # Analyze results\n    for config_name, config_results in experiment_results['results'].items():\n        config_info = config_results['config']\n        print(f\"\\n{config_name} ({config_info['description']}):\")\n        print(f\"  Total chunks created: {config_results['total_chunks']}\")\n        print(f\"  Average chunk length: {config_results['avg_chunk_length']:.0f} characters\")\n\n        for query, query_results in config_results['query_results'].items():\n            print(f\"  '{query[:30]}...': {query_results['num_results']} results (avg score: {query_results['avg_score']:.3f})\")\n\nif __name__ == \"__main__\":\n    asyncio.run(run_experiment())\n```\n\n## Configuration Parameters\n\n### Chunking Strategy Settings (NEW!)\n- `chunk_strategy`: Chunking method (`token_based`, `word_based`, `sentence_based`)\n- `chunk_size`: Size per chunk (tokens/words/sentences based on strategy)\n- `chunk_overlap`: Overlap between consecutive chunks (same units)\n- `min_chunk_size`: Minimum chunk size threshold\n- `chunk_config_name`: Use predefined research-based configuration\n\n**Research-Based Recommendations:**\n- **64-128 tokens**: Best for fact-based queries, specific information retrieval\n- 
**512-1024 tokens**: Better for context-heavy queries, conceptual understanding\n- **Sentence-based**: Good for maintaining semantic coherence\n- **Token-based**: Most precise, requires transformers library\n\n### Typesense Settings\n- `typesense_host`: Typesense server host\n- `typesense_port`: Typesense server port\n- `typesense_api_key`: API key for Typesense authentication\n- `typesense_collection`: Collection name for documents\n\n### Cheshire Cat Settings\n- `cheshire_cat_host`: Cheshire Cat server host\n- `cheshire_cat_port`: Cheshire Cat server port\n- `cheshire_cat_user_id`: User ID for chat sessions\n\n### Processing Settings\n- `embedding_model`: Sentence transformer model name\n- `max_results`: Maximum search results to return\n- `similarity_threshold`: Minimum similarity score threshold\n\n## Features\n\n### Advanced Chunking Strategies\n- **Token-based chunking**: Precise control using transformer tokenizers\n- **Word-based chunking**: Simple word count-based splitting\n- **Sentence-based chunking**: Maintains semantic boundaries\n- **Research-based presets**: Optimized configurations for different query types\n- **Overlap control**: Configurable overlap to preserve context across chunks\n\n### Chunk Size Experimentation Framework\n- **Automated experimentation**: Test multiple chunk sizes with same documents\n- **Performance comparison**: Compare precision/recall across configurations\n- **Query-specific optimization**: Different chunk sizes for different query types\n- **Results logging**: Detailed JSON output for analysis\n- **Predefined configurations**: Research-backed chunk size recommendations\n\n### PDF Processing\n- Extracts text from PDF files page by page\n- Creates overlapping text chunks for better context\n- Handles multiple PDFs in batch processing\n- Preserves document metadata (source, page numbers)\n\n### Semantic Search\n- Uses sentence transformers for text embeddings\n- Vector similarity search with Typesense\n- Configurable 
similarity thresholds\n- Returns ranked results with scores\n\n### RAG Integration\n- Integrates with Cheshire Cat AI for response generation\n- Provides relevant context from search results\n- Generates comprehensive answers based on retrieved documents\n- Handles cases where context is insufficient\n\n## Chunk Size Research Implementation\n\nThis implementation applies the following chunk-size research findings:\n\n### Precision vs. Recall Trade-offs\n- **Small chunks (64-128 tokens)**: Higher precision for fact-based queries\n  - Better for: \"What is the exact number?\", \"Who said this quote?\"\n  - Use: `fact_based_small` or `fact_based_medium` configs\n\n- **Large chunks (512-1024 tokens)**: Higher recall for context queries\n  - Better for: \"Explain the methodology\", \"What are the main concepts?\"\n  - Use: `context_medium` or `context_large` configs\n\n### Experimental Validation\n```bash\n# Test the research hypothesis with your documents\npython semantic_search_rag.py --pdf-dir ./docs --experiment \\\n  --experiment-queries \\\n    \"What specific statistics are mentioned?\" \\\n    \"Explain the overall approach and methodology?\" \\\n    \"Who are the key authors cited?\" \\\n    \"What are the main theoretical frameworks discussed?\"\n\n# The hypothesis predicts (verify against your own results):\n# - fact_based_small: High precision for specific facts\n# - context_large: Better recall for conceptual queries\n# - Quantitative metrics: chunk counts, similarity scores, response lengths\n```\n\n### Query-Adaptive Chunking\nThe system allows you to use different chunk sizes for different query types:\n\n```python\nimport asyncio\nfrom semantic_search_rag import SemanticSearchRAG\n\nasync def main():\n    # For fact-based queries\n    fact_rag = SemanticSearchRAG('config_fact_based_64.yaml')\n    fact_result = await fact_rag.search_and_generate(\"What is the exact methodology used?\")\n\n    # For context-heavy queries\n    context_rag = SemanticSearchRAG('config_context_1024.yaml')\n    context_result = await context_rag.search_and_generate(\"Explain the theoretical framework and its 
implications?\")\n\n# search_and_generate is a coroutine, so it must run inside an event loop\nasyncio.run(main())\n```\n\n## Architecture Overview\n\n```\nPDFs → PDF Extractor → Text Chunks → Embedding Manager → Typesense (Vector DB)\n                                                              ↓\nUser Query → Embedding Manager → Semantic Search ← Typesense\n                    ↓\n            Context + Query → Cheshire Cat AI → Generated Response\n```\n\n## Troubleshooting\n\n### Common Issues\n\n1. **Cheshire Cat Connection Error**\n   - Ensure Cheshire Cat is running on the correct port\n   - Check if the API endpoint is accessible\n\n2. **Typesense Connection Error**\n   - Verify Typesense server is running\n   - Check API key configuration\n   - Ensure proper network connectivity\n\n3. **Embedding Model Download**\n   - First run may take time to download the embedding model\n   - Ensure internet connectivity for model download\n\n4. **PDF Processing Errors**\n   - Check that PDF files are not corrupted\n   - Ensure PDFs contain extractable text (not just images)\n\n### Performance Optimization\n\n- Use GPU-enabled sentence transformers for faster embedding generation\n- Adjust chunk size based on your document types\n- Configure Typesense memory settings for large document collections\n- Consider using lighter embedding models for faster inference\n\n## License\n\nThis project is licensed under the MIT License. See the `LICENSE` file for details.\n\nPlease ensure compliance with the licenses of all dependencies:\n- Typesense: GPL-3.0 (server); the Python client is Apache 2.0\n- Cheshire Cat AI: GPL-3.0\n- Sentence Transformers: Apache 2.0","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobertoperuzzo%2Ftomm-202509","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobertoperuzzo%2Ftomm-202509","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobertoperuzzo%2Ftomm-202509/lists"}