https://github.com/netologist/secure-rag-system
A production-ready Retrieval-Augmented Generation (RAG) system built with Pydantic AI and Chroma that prioritizes data security and privacy for enterprise environments.
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/netologist/secure-rag-system
- Owner: netologist
- Created: 2025-05-25T22:09:36.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-25T22:27:35.000Z (about 1 year ago)
- Last Synced: 2025-05-25T23:25:08.418Z (about 1 year ago)
- Language: Python
- Size: 79.1 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Secure RAG System
A production-ready Retrieval-Augmented Generation (RAG) system built with Pydantic AI and Chroma that prioritizes data security and privacy for enterprise environments.
## Overview
This RAG system keeps raw documents, embeddings, and the vector index on your own infrastructure while still using an external AI model to generate answers. Documents are chunked and embedded locally, and only the few context chunks retrieved for a query are sent to the external AI service.
## Key Features
- **Data Sovereignty**: Raw documents never leave your system
- **Local Processing**: Embeddings generated locally with SentenceTransformers
- **Local Vector DB**: Chroma database runs entirely on your infrastructure
- **Minimal Data Exposure**: Only selected context sent to AI APIs
- **Multiple Security Levels**: Air-gapped, cache-first, or fallback options
- **Production Ready**: Built with Pydantic AI for type safety and reliability
- **Scalable**: Easy to extend and modify for enterprise needs
## Security Architecture
```mermaid
graph LR
A[Company Documents] --> B[Local Chunking]
B --> C[Local Embeddings]
C --> D[Local Vector DB]
D --> E[Context Retrieval]
E --> F[Minimal Context]
F --> G[AI API]
G --> H[Secure Response]
style A fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#bbf,stroke:#333,stroke-width:2px
style F fill:#bfb,stroke:#333,stroke-width:2px
```
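### Security Levels
The levels below describe how the embedding model is obtained. Answer generation still calls the configured AI API with the retrieved context.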
| Level | Description | Internet Required | Security Rating |
|-------|-------------|-------------------|-----------------|
| **Air-gapped** | Pre-downloaded model, 100% offline | Never | Maximum |
| **Cache-first** | Download once, then offline | First run only | High |
| **TF-IDF Fallback** | Simple embeddings, no downloads | Never | Basic |
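
The three levels can be read as a fallback chain. The sketch below is only an assumption about how such a choice might be made, not the project's actual `_setup_embedding_model()` logic; the function name `select_security_level` and the `allow_download` flag are hypothetical.

```python
from pathlib import Path


def select_security_level(model_dir: str, allow_download: bool = False) -> str:
    """Pick an embedding strategy (hypothetical helper, not the project's API).

    - "air-gapped": a pre-downloaded model directory exists and is non-empty
    - "cache-first": no local copy, but a one-time download is permitted
    - "tfidf-fallback": no local copy and no downloads allowed
    """
    path = Path(model_dir)
    if path.is_dir() and any(path.iterdir()):
        return "air-gapped"
    if allow_download:
        return "cache-first"
    return "tfidf-fallback"


if __name__ == "__main__":
    # The path mirrors the EMBEDDING_MODEL_PATH example used in this README
    print(select_security_level("./models/all-MiniLM-L6-v2"))
```

Checking for a local model first means the most restrictive option wins whenever it is available.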
## Quick Start
### Prerequisites
- Python 3.13+
- OpenAI API key (or compatible API)
- 2GB+ RAM for embedding models
### Installation
```bash
# Clone the repository
git clone https://github.com/netologist/secure-rag-system.git
cd secure-rag-system
# Install dependencies into a virtual environment with uv
uv sync
# Activate the virtual environment
source .venv/bin/activate
# Run
python -m main
# Or install the dependencies manually with pip
pip install pydantic-ai chromadb openai sentence-transformers scikit-learn
```
### Basic Usage
```python
import asyncio

from secure_rag_system import SecureRAGSystem


async def main():
    # Initialize the system
    rag = SecureRAGSystem("my_company_docs")

    # Add documents
    documents = [
        "Your company policy document...",
        "Technical documentation...",
        "HR guidelines...",
    ]
    rag.add_documents(documents)

    # Query the system
    answer = await rag.query("What is our remote work policy?")
    print(answer)


# Run the system
asyncio.run(main())
```
## Configuration
### Environment Variables
```bash
# Required
export OPENAI_API_KEY="your-openai-api-key"
# Optional
export CHROMA_DB_PATH="./chroma_db"
export EMBEDDING_MODEL_PATH="./models/all-MiniLM-L6-v2"
export MAX_CHUNK_SIZE="500"
export TOP_K_RESULTS="3"
```
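
As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. `RAGSettings` and `load_settings` are hypothetical names, and the fallback values are assumptions copied from the example above rather than confirmed defaults of the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RAGSettings:
    """Hypothetical container for the environment variables listed above."""

    openai_api_key: str
    chroma_db_path: str
    embedding_model_path: str
    max_chunk_size: int
    top_k_results: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> RAGSettings:
    """Read settings from `env` (defaults to os.environ); fallbacks are assumed values."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return RAGSettings(
        openai_api_key=api_key,
        chroma_db_path=env.get("CHROMA_DB_PATH", "./chroma_db"),
        embedding_model_path=env.get("EMBEDDING_MODEL_PATH", "./models/all-MiniLM-L6-v2"),
        max_chunk_size=int(env.get("MAX_CHUNK_SIZE", "500")),
        top_k_results=int(env.get("TOP_K_RESULTS", "3")),
    )
```

Failing fast on a missing key surfaces configuration mistakes at startup instead of on the first query.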
### Maximum Security Setup (Air-gapped)
For maximum security, pre-download the embedding model:
```bash
# Install Hugging Face CLI
pip install huggingface_hub
# Download model locally
huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 \
--local-dir ./models/all-MiniLM-L6-v2
# The system will automatically detect and use the local model
```
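
Once the model is on disk, the Hugging Face libraries can additionally be told not to attempt any network access. The sketch below assumes the model was downloaded to the path shown above; `enforce_offline_mode` and `load_local_model` are hypothetical helper names, while `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are the offline switches documented by `huggingface_hub` and `transformers`.

```python
import os


def enforce_offline_mode() -> None:
    """Tell the Hugging Face libraries not to touch the network.

    Call this before importing sentence_transformers or transformers,
    because the flags are read when those libraries are imported.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: no Hub requests
    os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use local files only


def load_local_model(model_dir: str = "./models/all-MiniLM-L6-v2"):
    """Load the pre-downloaded model from disk (hypothetical helper)."""
    enforce_offline_mode()
    # Imported lazily so the offline flags are already set
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_dir)
```

With both flags set, a missing or incomplete local model raises an error instead of silently triggering a download.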
## Advanced Usage
### Custom Document Processing
```python
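import re
from typing import List

from pypdf import PdfReader  # assumption: pypdf is installed; any PDF text extractor would do
from secure_rag_system import SecureRAGSystem


def extract_text_from_pdf(pdf_path: str) -> str:
    # Illustrative helper, not part of this project: join the text of every page
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)


class CustomRAGSystem(SecureRAGSystem):
    def _chunk_document(self, document: str, chunk_size: int = 500) -> List[str]:
        # Example custom logic: sentence-based splitting that packs whole
        # sentences into chunks of at most chunk_size characters (assumed unit)
        chunks, current = [], ""
        for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
            if current and len(current) + len(sentence) + 1 > chunk_size:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
        return chunks

    def add_pdf_documents(self, pdf_paths: List[str]) -> None:
        # Extract text locally, then reuse the normal ingestion path
        self.add_documents([extract_text_from_pdf(path) for path in pdf_paths])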
```
### Batch Processing
```python
from typing import List

from secure_rag_system import SecureRAGSystem


# Process large document collections
async def process_document_library(all_documents: List[str]) -> SecureRAGSystem:
    rag = SecureRAGSystem("document_library")

    # Process documents in batches
    batch_size = 10
    for i in range(0, len(all_documents), batch_size):
        batch = all_documents[i:i + batch_size]
        rag.add_documents(batch)
        print(f"Processed batch {i // batch_size + 1}")

    return rag
```
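
The slicing pattern above can be factored into a small reusable generator. This is a minimal sketch; `iter_batches` is a hypothetical helper name and not part of the project.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    # With a real system this loop body would be rag.add_documents(batch)
    for number, batch in enumerate(iter_batches(["a", "b", "c", "d", "e"], 2), start=1):
        print(f"Processed batch {number}: {batch}")
```

Because it is a generator, only one batch is materialized at a time, which helps when the source list is large.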
### Integration with Different AI Providers
```python
from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModel
from pydantic_ai.models.openai import OpenAIModel

# `rag` is an existing SecureRAGSystem instance

# Use Anthropic Claude
rag.agent = Agent(
    model=AnthropicModel('claude-3-sonnet-20240229'),
    system_prompt="Your custom prompt...",
)

# Use a different OpenAI model
rag.agent = Agent(
    model=OpenAIModel('gpt-4-turbo'),
    system_prompt="Your custom prompt...",
)
```
## API Reference
### SecureRAGSystem Class
#### Constructor
```python
SecureRAGSystem(collection_name: str = "company_docs")
```
#### Methods
| Method | Description | Parameters |
|--------|-------------|------------|
| `add_documents()` | Add documents to the system | `documents: List[str]`, `metadatas: List[dict]` |
| `query()` | Query the system | `question: str`, `top_k: int = 3` |
| `get_stats()` | Get system statistics | None |
#### Security Methods
| Method | Description | Returns |
|--------|-------------|---------|
| `_setup_embedding_model()` | Configure embedding model security | `SentenceTransformer` or `TFIDFEmbedder` |
| `_create_tfidf_embedder()` | Create offline TF-IDF embedder | `SimpleTFIDFEmbedder` |
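
To give a feel for what an offline TF-IDF embedder does, here is a dependency-free sketch. It is not the project's `_create_tfidf_embedder()` implementation; the class name `TinyTFIDFEmbedder` and its methods are hypothetical, and the smoothed IDF formula follows the form scikit-learn uses by default.

```python
import math
import re
from collections import Counter
from typing import Dict, List


class TinyTFIDFEmbedder:
    """Illustrative TF-IDF embedder (hypothetical; not the project's class)."""

    def __init__(self) -> None:
        self.vocabulary: Dict[str, int] = {}
        self.idf: List[float] = []

    @staticmethod
    def _tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def fit(self, corpus: List[str]) -> "TinyTFIDFEmbedder":
        # Document frequency: in how many documents each term appears
        doc_freq: Counter = Counter()
        for text in corpus:
            doc_freq.update(set(self._tokenize(text)))
        self.vocabulary = {term: i for i, term in enumerate(sorted(doc_freq))}
        n_docs = len(corpus)
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        self.idf = [0.0] * len(self.vocabulary)
        for term, i in self.vocabulary.items():
            self.idf[i] = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        return self

    def encode(self, text: str) -> List[float]:
        # Term counts weighted by IDF, then scaled to unit length
        counts = Counter(t for t in self._tokenize(text) if t in self.vocabulary)
        vector = [0.0] * len(self.vocabulary)
        for term, count in counts.items():
            i = self.vocabulary[term]
            vector[i] = count * self.idf[i]
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector


def cosine(a: List[float], b: List[float]) -> float:
    # Vectors from encode() are unit length, so the dot product is the cosine
    return sum(x * y for x, y in zip(a, b))
```

Such vectors only capture shared vocabulary, not meaning, which is consistent with the "Basic" rating this fallback receives in the security levels table.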
## System Architecture
### Components
1. **Document Processor**: Chunks and preprocesses documents
2. **Embedding Engine**: Generates vector representations locally
3. **Vector Database**: Stores and indexes embeddings (Chroma)
4. **Retrieval Engine**: Finds relevant document chunks
5. **AI Agent**: Generates responses using Pydantic AI
6. **Security Layer**: Ensures data never leaves your control
### Data Flow
1. **Ingestion**: Documents → Chunking → Local Embeddings
2. **Storage**: Embeddings → Local Chroma Database
3. **Retrieval**: Query → Vector Search → Context Selection
4. **Generation**: Context + Query → AI API → Response
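
The four stages above can be sketched end to end in plain Python. This is a toy model under stated assumptions: word overlap stands in for the local embeddings and Chroma search, an in-memory list stands in for the database, and the helper names (`chunk_text`, `retrieve`, `build_prompt`) are hypothetical. Its purpose is to show that only the selected chunks and the question end up in the text sent to the AI API.

```python
import re
from typing import List, Set, Tuple


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Ingestion: split a document into fixed-size character windows."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    """Retrieval: rank chunks by Jaccard word overlap (stand-in for vector search)."""
    query_terms = tokens(query)
    scored: List[Tuple[float, str]] = []
    for chunk in chunks:
        chunk_terms = tokens(chunk)
        union = query_terms | chunk_terms
        score = len(query_terms & chunk_terms) / len(union) if union else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the query
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(query: str, context: List[str]) -> str:
    """Generation input: only the selected chunks and the question are included."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"
```

Everything before `build_prompt` runs locally; the returned string is the only data that would cross the network boundary.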
## Monitoring and Observability
### Built-in Statistics
```python
# Get system statistics
stats = rag.get_stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Database path: {stats['database_path']}")
```
### Custom Metrics
```python
import time

from secure_rag_system import SecureRAGSystem


# Add custom monitoring
class MonitoredRAGSystem(SecureRAGSystem):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.query_count = 0
        self.response_times = []

    async def query(self, question: str, top_k: int = 3):
        # Time each query with a monotonic clock
        start_time = time.perf_counter()
        result = await super().query(question, top_k)
        elapsed = time.perf_counter() - start_time

        self.query_count += 1
        self.response_times.append(elapsed)
        return result
```
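
The collected `response_times` list is easier to act on once it is summarized. A minimal sketch using only the standard library follows; `summarize_latencies` is a hypothetical helper, not part of the project.

```python
import statistics
from typing import Dict, List


def summarize_latencies(response_times: List[float]) -> Dict[str, float]:
    """Summarize query durations in seconds (hypothetical helper)."""
    if not response_times:
        return {"count": 0, "mean": 0.0, "median": 0.0, "max": 0.0}
    return {
        "count": len(response_times),
        "mean": statistics.fmean(response_times),
        "median": statistics.median(response_times),
        "max": max(response_times),
    }
```

It could be called as `summarize_latencies(rag.response_times)` on a `MonitoredRAGSystem` instance and exported to whatever metrics system is already in use.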
## Security Best Practices
### Data Handling
- Never log sensitive document content
- Use environment variables for API keys
- Regularly rotate API keys
- Implement access controls on the Chroma database
### Network Security
- Run on isolated networks when possible
- Use VPNs for remote access
- Monitor API calls to external services
- Implement rate limiting
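
One common way to implement the rate limiting mentioned above is a token bucket placed in front of outbound API calls. The sketch below is a generic illustration, not something the project ships; the `TokenBucket` class and its parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested deterministically
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `bucket.allow()` before each external request and either wait or reject the query when it returns `False`.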
### Compliance
- Maintain audit logs of document access
- Implement data retention policies
- Conduct regular security assessments
- Document data flow for compliance reviews
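
An audit trail can record who accessed what without storing the sensitive text itself. The sketch below identifies a document by its SHA-256 hash and appends one JSON object per line. The field names and the helpers `audit_record` and `append_audit_log` are assumptions for illustration only.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict


def audit_record(user: str, action: str, document_text: str) -> Dict[str, str]:
    """Build an audit entry that identifies a document by hash, never by content."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "document_sha256": hashlib.sha256(document_text.encode("utf-8")).hexdigest(),
    }


def append_audit_log(path: str, record: Dict[str, str]) -> None:
    """Append one JSON object per line (JSON Lines) to the audit file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Hashing keeps the log useful for compliance reviews while honoring the rule above about never logging document content.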
## Troubleshooting
### Common Issues
#### Model Download Fails
```bash
# Check internet connectivity
ping huggingface.co
# Use manual download
wget https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/pytorch_model.bin
```
#### Chroma Database Issues
```python
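# Reset the database if it is corrupted.
# Warning: reset() deletes every collection, and Chroma only permits it
# when the client was created with allow_reset=True in its settings.
rag.chroma_client.reset()

# Or start over with a fresh collection
rag = SecureRAGSystem("new_collection")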
```
#### Memory Issues
```python
# Reduce chunk size for large documents
rag.add_documents(docs, chunk_size=200)

# Process documents in smaller batches
batch_size = 5
for i in range(0, len(documents), batch_size):
    rag.add_documents(documents[i:i + batch_size])
```
### Performance Optimization
#### Embedding Performance
- Use GPU acceleration when available
- Batch process documents
- Optimize chunk sizes for your use case
#### Vector Search Performance
- Adjust `top_k` based on your needs
- Use appropriate distance metrics
- Consider index optimization for large datasets
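
The choice of distance metric can change which chunks rank first, because cosine distance ignores vector length while Euclidean distance does not. The sketch below demonstrates this with plain lists; the function names are hypothetical and the numbers are made up for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; depends only on the angle between vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Vector, vectors: List[Vector], k: int,
          distance: Callable[[Vector, Vector], float]) -> List[int]:
    """Return the indices of the k nearest vectors under the given distance."""
    ranked = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return ranked[:k]


if __name__ == "__main__":
    query = [1.0, 0.0]
    stored = [[10.0, 1.0], [0.5, 0.5]]  # same direction but long, versus short and off-axis
    print(top_k(query, stored, 1, cosine_distance))
    print(top_k(query, stored, 1, euclidean_distance))
```

If embeddings are normalized to unit length, the two metrics produce the same ranking, so the choice matters mainly for unnormalized vectors.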
### Development Setup
```bash
# Clone for development
git clone https://github.com/netologist/secure-rag-system.git
cd secure-rag-system
# Install development dependencies
uv sync
# Activate the virtual environment
source .venv/bin/activate
source .venv/bin/activate
# Run
python -m main
```
## Related Projects
- [Pydantic AI](https://github.com/pydantic/pydantic-ai) - Type-safe AI agents
- [Chroma](https://github.com/chroma-core/chroma) - Vector database
- [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) - Embedding models
## Roadmap
- [ ] Multi-modal document support (images, tables)
- [ ] Advanced chunking strategies
- [ ] Integration with more vector databases
- [ ] Kubernetes deployment manifests
- [ ] Federated learning capabilities
- [ ] Real-time document updates