{"id":28744756,"url":"https://github.com/netologist/secure-rag-system","last_synced_at":"2025-06-16T12:05:16.894Z","repository":{"id":295507618,"uuid":"990292484","full_name":"netologist/secure-rag-system","owner":"netologist","description":"A production-ready Retrieval-Augmented Generation (RAG) system built with Pydantic AI and Chroma that prioritizes data security and privacy for enterprise environments.","archived":false,"fork":false,"pushed_at":"2025-05-25T22:27:35.000Z","size":81,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-25T23:25:08.418Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/netologist.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-25T22:09:36.000Z","updated_at":"2025-05-25T22:27:38.000Z","dependencies_parsed_at":"2025-05-25T23:25:24.578Z","dependency_job_id":"1f13967a-bf9a-490f-a736-2dee53972f96","html_url":"https://github.com/netologist/secure-rag-system","commit_stats":null,"previous_names":["netologist/secure-rag-system"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/netologist/secure-rag-system","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netologist%2Fsecure-rag-system","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netologist%2Fsecure-rag-system/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netologist
%2Fsecure-rag-system/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netologist%2Fsecure-rag-system/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/netologist","download_url":"https://codeload.github.com/netologist/secure-rag-system/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/netologist%2Fsecure-rag-system/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260158326,"owners_count":22967226,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-16T12:05:16.307Z","updated_at":"2025-06-16T12:05:16.872Z","avatar_url":"https://github.com/netologist.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🔒 Secure RAG System\n\nA production-ready Retrieval-Augmented Generation (RAG) system built with Pydantic AI and Chroma that prioritizes data security and privacy for enterprise environments.\n\n## 🎯 Overview\n\nThis RAG system ensures your sensitive company data never leaves your infrastructure while still leveraging powerful AI capabilities. 
Raw documents are processed locally, embeddings are generated on-premises, and only minimal context is sent to external AI services.\n\n## ✨ Key Features\n\n- **🔐 Data Sovereignty**: Raw documents never leave your system\n- **🏠 Local Processing**: Embeddings generated locally with SentenceTransformers\n- **📊 Local Vector DB**: Chroma database runs entirely on your infrastructure\n- **🛡️ Minimal Data Exposure**: Only selected context sent to AI APIs\n- **⚡ Multiple Security Levels**: Air-gapped, cache-first, or fallback options\n- **🚀 Production Ready**: Built with Pydantic AI for type safety and reliability\n- **📈 Scalable**: Easy to extend and modify for enterprise needs\n\n## 🔒 Security Architecture\n\n```mermaid\ngraph LR\n    A[Company Documents] --\u003e B[Local Chunking]\n    B --\u003e C[Local Embeddings]\n    C --\u003e D[Local Vector DB]\n    D --\u003e E[Context Retrieval]\n    E --\u003e F[Minimal Context]\n    F --\u003e G[AI API]\n    G --\u003e H[Secure Response]\n    \n    style A fill:#f9f,stroke:#333,stroke-width:2px\n    style D fill:#bbf,stroke:#333,stroke-width:2px\n    style F fill:#bfb,stroke:#333,stroke-width:2px\n```\n\n### Security Levels\n\n| Level | Description | Internet Required | Security Rating |\n|-------|-------------|-------------------|-----------------|\n| 🥇 **Air-gapped** | Pre-downloaded model, 100% offline | Never | Maximum |\n| 🥈 **Cache-first** | Download once, then offline | First run only | High |\n| 🥉 **TF-IDF Fallback** | Simple embeddings, no downloads | Never | Basic |\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Python 3.13+\n- OpenAI API key (or compatible API)\n- 2GB+ RAM for embedding models\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/netologist/secure-rag-system.git\ncd secure-rag-system\n\n# Install dependencies with uv (creates .venv)\nuv sync\n\n# Activate the virtual environment\nsource .venv/bin/activate\n\n# Run\npython -m main\n\n# Or install manually\npip install pydantic-ai chromadb 
openai sentence-transformers scikit-learn\n```\n\n### Basic Usage\n\n```python\nimport asyncio\nfrom secure_rag_system import SecureRAGSystem\n\nasync def main():\n    # Initialize the system\n    rag = SecureRAGSystem(\"my_company_docs\")\n    \n    # Add documents\n    documents = [\n        \"Your company policy document...\",\n        \"Technical documentation...\",\n        \"HR guidelines...\"\n    ]\n    rag.add_documents(documents)\n    \n    # Query the system\n    answer = await rag.query(\"What is our remote work policy?\")\n    print(answer)\n\n# Run the system\nasyncio.run(main())\n```\n\n## ⚙️ Configuration\n\n### Environment Variables\n\n```bash\n# Required\nexport OPENAI_API_KEY=\"your-openai-api-key\"\n\n# Optional\nexport CHROMA_DB_PATH=\"./chroma_db\"\nexport EMBEDDING_MODEL_PATH=\"./models/all-MiniLM-L6-v2\"\nexport MAX_CHUNK_SIZE=\"500\"\nexport TOP_K_RESULTS=\"3\"\n```\n\n### Maximum Security Setup (Air-gapped)\n\nFor maximum security, pre-download the embedding model:\n\n```bash\n# Install Hugging Face CLI\npip install huggingface_hub\n\n# Download model locally\nhuggingface-cli download sentence-transformers/all-MiniLM-L6-v2 \\\n    --local-dir ./models/all-MiniLM-L6-v2\n\n# The system will automatically detect and use the local model\n```\n\n## 📖 Advanced Usage\n\n### Custom Document Processing\n\n```python\nfrom typing import List\n\nfrom secure_rag_system import SecureRAGSystem\n\nclass CustomRAGSystem(SecureRAGSystem):\n    def _chunk_document(self, document: str, chunk_size: int = 500):\n        # Implement custom chunking logic\n        # E.g., semantic chunking, sentence-based splitting\n        return custom_chunks  # placeholder: return your own list of chunks\n    \n    def add_pdf_documents(self, pdf_paths: List[str]):\n        # Add PDF processing capability\n        # extract_text_from_pdf is a placeholder for your own PDF text extractor\n        documents = []\n        for pdf_path in pdf_paths:\n            text = extract_text_from_pdf(pdf_path)\n            documents.append(text)\n        self.add_documents(documents)\n```\n\n### Batch Processing\n\n```python\n# Process large 
document collections\nasync def process_document_library(all_documents):\n    rag = SecureRAGSystem(\"document_library\")\n    \n    # Process documents in batches\n    batch_size = 10\n    for i in range(0, len(all_documents), batch_size):\n        batch = all_documents[i:i + batch_size]\n        rag.add_documents(batch)\n        print(f\"Processed batch {i//batch_size + 1}\")\n    \n    return rag\n```\n\n### Integration with Different AI Providers\n\n```python\nfrom pydantic_ai import Agent\nfrom pydantic_ai.models.anthropic import AnthropicModel\nfrom pydantic_ai.models.openai import OpenAIModel\n\n# Use Anthropic Claude\nrag.agent = Agent(\n    model=AnthropicModel('claude-3-sonnet-20240229'),\n    system_prompt=\"Your custom prompt...\"\n)\n\n# Use different OpenAI models\nrag.agent = Agent(\n    model=OpenAIModel('gpt-4-turbo'),\n    system_prompt=\"Your custom prompt...\"\n)\n```\n\n## 🔧 API Reference\n\n### SecureRAGSystem Class\n\n#### Constructor\n```python\nSecureRAGSystem(collection_name: str = \"company_docs\")\n```\n\n#### Methods\n\n| Method | Description | Parameters |\n|--------|-------------|------------|\n| `add_documents()` | Add documents to the system | `documents: List[str]`, `metadatas: List[dict]` |\n| `query()` | Query the system | `question: str`, `top_k: int = 3` |\n| `get_stats()` | Get system statistics | None |\n\n#### Security Methods\n\n| Method | Description | Returns |\n|--------|-------------|---------|\n| `_setup_embedding_model()` | Configure embedding model security | `SentenceTransformer` or `TFIDFEmbedder` |\n| `_create_tfidf_embedder()` | Create offline TF-IDF embedder | `SimpleTFIDFEmbedder` |\n\n## 🏗️ System Architecture\n\n### Components\n\n1. **Document Processor**: Chunks and preprocesses documents\n2. **Embedding Engine**: Generates vector representations locally\n3. **Vector Database**: Stores and indexes embeddings (Chroma)\n4. **Retrieval Engine**: Finds relevant document chunks\n5. **AI Agent**: Generates responses using Pydantic AI\n6. 
**Security Layer**: Keeps raw documents local; only the retrieved context is sent to the AI API\n\n### Data Flow\n\n1. **Ingestion**: Documents → Chunking → Local Embeddings\n2. **Storage**: Embeddings → Local Chroma Database\n3. **Retrieval**: Query → Vector Search → Context Selection\n4. **Generation**: Context + Query → AI API → Response\n\n## 🔍 Monitoring and Observability\n\n### Built-in Statistics\n\n```python\n# Get system statistics\nstats = rag.get_stats()\nprint(f\"Total documents: {stats['total_documents']}\")\nprint(f\"Database path: {stats['database_path']}\")\n```\n\n### Custom Metrics\n\n```python\nimport time\n\nfrom secure_rag_system import SecureRAGSystem\n\n# Add custom monitoring\nclass MonitoredRAGSystem(SecureRAGSystem):\n    def __init__(self, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n        self.query_count = 0\n        self.response_times = []\n    \n    async def query(self, question: str, top_k: int = 3):\n        # perf_counter is monotonic, so it is safe for measuring elapsed time\n        start_time = time.perf_counter()\n        result = await super().query(question, top_k)\n        end_time = time.perf_counter()\n        \n        self.query_count += 1\n        self.response_times.append(end_time - start_time)\n        \n        return result\n```\n\n## 🛡️ Security Best Practices\n\n### Data Handling\n- Never log sensitive document content\n- Use environment variables for API keys\n- Regularly rotate API keys\n- Implement access controls on the Chroma database\n\n### Network Security\n- Run on isolated networks when possible\n- Use VPNs for remote access\n- Monitor API calls to external services\n- Implement rate limiting\n\n### Compliance\n- Maintain audit logs of document access\n- Implement data retention policies\n- Conduct regular security assessments\n- Document data flow for compliance reviews\n\n## 🚨 Troubleshooting\n\n### Common Issues\n\n#### Model Download Fails\n```bash\n# Check internet connectivity\nping huggingface.co\n\n# Use manual download\nwget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/pytorch_model.bin\n```\n\n#### Chroma Database Issues\n```python\n# Reset 
database if corrupted\n# Note: Chroma only permits reset() when the client was created with allow_reset=True\nrag.chroma_client.reset()\n\n# Or start fresh with a new collection\nrag = SecureRAGSystem(\"new_collection\")\n```\n\n#### Memory Issues\n```python\n# Reduce chunk size for large documents\nrag.add_documents(docs, chunk_size=200)\n\n# Process documents in smaller batches\nbatch_size = 5\nfor i in range(0, len(documents), batch_size):\n    rag.add_documents(documents[i:i + batch_size])\n```\n\n### Performance Optimization\n\n#### Embedding Performance\n- Use GPU acceleration when available\n- Batch process documents\n- Optimize chunk sizes for your use case\n\n#### Vector Search Performance\n- Adjust `top_k` based on your needs\n- Use appropriate distance metrics\n- Consider index optimization for large datasets\n\n### Development Setup\n\n```bash\n# Clone for development\ngit clone https://github.com/netologist/secure-rag-system.git\ncd secure-rag-system\n\n# Install dependencies with uv (creates .venv)\nuv sync\n\n# Activate the virtual environment\nsource .venv/bin/activate\n\n# Run\npython -m main\n```\n\n## 🔗 Related Projects\n\n- [Pydantic AI](https://github.com/pydantic/pydantic-ai) - Type-safe AI agents\n- [Chroma](https://github.com/chroma-core/chroma) - Vector database\n- [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) - Embedding models\n\n\n## 📊 Roadmap\n\n- [ ] Multi-modal document support (images, tables)\n- [ ] Advanced chunking strategies\n- [ ] Integration with more vector databases\n- [ ] Kubernetes deployment manifests\n- [ ] Federated learning capabilities\n- [ ] Real-time document updates\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetologist%2Fsecure-rag-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetologist%2Fsecure-rag-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetologist%2Fsecure-rag-system/lists"}