https://github.com/kanugurajesh/assistly
An AI-powered customer support system that automatically classifies tickets and provides intelligent responses using Retrieval-Augmented Generation
- Host: GitHub
- URL: https://github.com/kanugurajesh/assistly
- Owner: kanugurajesh
- Created: 2025-09-10T16:14:05.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-09-14T14:37:33.000Z (26 days ago)
- Last Synced: 2025-09-14T16:28:58.720Z (26 days ago)
- Topics: fastembed, firecrawl, openai, python3, qdrant, streamlit
- Language: Python
- Homepage:
- Size: 440 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Atlan Customer Support Copilot
An advanced AI-powered customer support system that automatically classifies tickets and provides intelligent responses using state-of-the-art Retrieval-Augmented Generation (RAG) with hybrid search, query enhancement, and optimized chunking strategies.
## Features
### Core Functionality
- **Bulk Ticket Classification**: Automatically classify 30+ sample tickets with AI-powered categorization
- **Interactive AI Agent**: Real-time chat interface for new ticket submission and response
- **Conversational Memory**: Context-aware conversations using LangChain ChatMessageHistory with in-memory storage
- **Smart Classification**: Topic tags, sentiment analysis, and priority assignment
- **Advanced RAG Responses**: Intelligent answers powered by hybrid search and enhanced retrieval
- **Source Citations**: All responses include links to relevant documentation
- **Search Transparency**: Real-time indicators showing search methods used (vector, keyword, or hybrid)
- **Dynamic Settings Management**: Comprehensive settings page for real-time pipeline configuration

### Advanced RAG Features
- **Hybrid Search**: Combines vector similarity and BM25 keyword search for optimal relevance
- **Query Enhancement**: GPT-4o powered query expansion for technical terms (configurable)
- **Enhanced Chunking**: Code block preservation with intelligent markdown structure awareness
- **Smart Reranking**: Configurable weighted merging of vector and keyword search results
- **Quality Metrics**: Chunk quality indicators including code detection and header analysis
- **Real-time Configuration**: Dynamic settings updates without application restart
- **Settings Import/Export**: JSON-based configuration backup and sharing

### Classification Schema
- **Topic Tags**: How-to, Product, Connector, Lineage, API/SDK, SSO, Glossary, Best practices, Sensitive data
- **Sentiment**: Frustrated, Curious, Angry, Neutral
- **Priority**: P0 (High), P1 (Medium), P2 (Low)

## Major Design Decisions & Trade-offs
### 1. Multi-Stage Pipeline Architecture
**Decision**: Separate the data pipeline (scraping → storage → vectorization) from the deployment application.

**Why**:
- **Data Persistence**: Web scraping is expensive and rate-limited. MongoDB storage allows reprocessing embeddings without re-scraping.
- **Deployment Flexibility**: App folder contains only deployment dependencies, enabling clean Streamlit Cloud deployment.
- **Development Efficiency**: Can iterate on AI logic without re-running expensive data collection.
- **A/B Testing**: Separate collections enable comparison between basic and enhanced RAG implementations.

**Trade-off**: Increased complexity vs. reliability, cost efficiency, and experimentation capability.
### 2. Advanced Technology Stack Choices
#### Hybrid Search: Vector + BM25 vs. Pure Vector Search
**Decision**: Implement hybrid search combining vector similarity and BM25 keyword search.

**Why**:
- **Technical Term Precision**: BM25 excels at exact matches for technical terms, APIs, and product names.
- **Semantic Understanding**: Vector search captures conceptual relationships and context.
- **Complementary Strengths**: Vector search for "how to authenticate" + BM25 for "SAML SSO" = comprehensive coverage.
- **Fallback Strategy**: Graceful degradation to vector-only if BM25 fails.

**Trade-off**: System complexity and processing overhead vs. significantly improved retrieval quality for technical documentation.
#### Query Enhancement: GPT-4o Expansion vs. Direct Search
**Decision**: Optional GPT-4o query enhancement with a configurable toggle.

**Why**:
- **Technical Term Expansion**: "SSO" → "SAML single sign-on authentication setup"
- **Context Enrichment**: "API rate limits" → "REST API rate limiting configuration and best practices"
- **Acronym Resolution**: Critical for technical documentation where acronyms are prevalent.
- **Cost Control**: A configurable feature allows optimization for different use cases.

**Trade-off**: Additional API costs and latency vs. dramatically improved retrieval for technical queries. A sketch of this enhancement step follows.
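As a rough illustration, query enhancement can be a single chat completion that rewrites the user's question before retrieval. This is a minimal sketch, not the project's actual prompt; the prompt wording is an assumption, while the model and the SSO example come from this README.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enhance_query(query: str) -> str:
    """Expand acronyms and technical terms before retrieval (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's support question for documentation search. "
                "Expand acronyms (e.g. SSO -> SAML single sign-on) and add "
                "relevant technical terms. Return only the rewritten query."
            )},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.strip()

# enhance_query("How to setup SSO?")
# -> e.g. "How to configure SAML single sign-on authentication in Atlan?"
```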
#### Enhanced Chunking: Code-Aware vs. Simple Character Splitting
**Decision**: Advanced recursive splitting with code block preservation and quality metrics.

**Why**:
- **Code Integrity**: Preserves fenced code blocks as single units to maintain functional examples.
- **Structure Awareness**: Respects markdown headers, lists, and procedures.
- **Quality Tracking**: Metadata enables optimization and debugging of retrieval quality.
- **Context Preservation**: Smart boundaries prevent splitting related instructions.

**Trade-off**: Processing complexity and storage overhead vs. significantly better content quality and retrieval accuracy.
### 3. Feature Toggle Architecture
**Decision**: Configurable enhancement toggles rather than a fixed implementation.

**Why**:
- **Deployment Flexibility**: Different environments can optimize for cost vs. quality.
- **Performance Tuning**: Disable expensive features for high-volume scenarios.
- **Gradual Rollout**: Test advanced features incrementally in production.
- **User Choice**: Let users balance speed vs. comprehensive results.

**Trade-off**: Configuration complexity vs. deployment flexibility and performance optimization.
### 4. Smart Reranking Strategy
**Decision**: Configurable weighted fusion of vector and BM25 results with intelligent deduplication.

**Why**:
- **Flexible Relevance**: Configurable weights allow optimization for different use cases.
- **Exact Match Boost**: BM25 results receive configurable weight for technical precision.
- **Deduplication**: Documents found by both methods receive relevance boost.
- **Empirical Optimization**: Default weights can be tuned based on specific documentation types.

**Trade-off**: Algorithm complexity vs. superior result ranking and relevance. A sketch of the fusion logic follows.
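A minimal sketch of the weighted-fusion idea: scores from each method are normalized, weighted, and summed, so documents found by both methods naturally accumulate a higher combined score (the deduplication boost described above). The weight names mirror the configuration options documented later; the normalization scheme is an assumption and presumes non-negative scores.

```python
def fuse_results(vector_hits, bm25_hits, vector_weight=1.0, keyword_weight=0.5):
    """Merge (doc_id, score) lists from vector and BM25 search into one ranking."""
    def normalize(hits):
        if not hits:
            return {}
        top = max(score for _, score in hits) or 1.0
        return {doc_id: score / top for doc_id, score in hits}

    combined: dict[str, float] = {}
    # Documents returned by both methods receive contributions from each,
    # which acts as the relevance boost for multi-method matches.
    for doc_id, score in normalize(vector_hits).items():
        combined[doc_id] = combined.get(doc_id, 0.0) + vector_weight * score
    for doc_id, score in normalize(bm25_hits).items():
        combined[doc_id] = combined.get(doc_id, 0.0) + keyword_weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```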
### 5. Dual Collection Strategy
**Decision**: Separate "enhanced" and "standard" Qdrant collections for A/B testing.**Why**:
- **Performance Comparison**: Direct measurement of advanced features' impact.
- **Risk Mitigation**: Fallback to standard collection if enhanced features fail.
- **Feature Validation**: Quantitative assessment of enhancement value.
- **Gradual Migration**: Safe transition from basic to advanced implementations.

**Trade-off**: Storage overhead and maintenance complexity vs. risk reduction and optimization capability.
### 6. Technology Stack for Advanced RAG
#### MongoDB + Qdrant vs. Single Database
**Decision**: Dual storage with enhanced Qdrant collections for hybrid search.

**Why**:
- **Data Integrity**: MongoDB preserves original content for reprocessing and debugging.
- **Hybrid Performance**: Qdrant's vector capabilities + in-memory BM25 for keyword search.
- **Collection Management**: Separate enhanced collections for advanced features.
- **Backup Strategy**: Multiple data preservation layers prevent data loss.

**Trade-off**: Infrastructure complexity vs. performance, flexibility, and data safety.
#### OpenAI GPT-4o vs. Local Models
**Decision**: OpenAI GPT-4o for classification, response generation, and query enhancement.

**Why**:
- **Quality**: Superior reasoning for complex ticket classification and technical query expansion.
- **JSON Reliability**: Consistent structured output for automated processing.
- **Context Window**: Large context enables conversation memory and comprehensive responses.
- **Development Speed**: No model training, fine-tuning, or hosting infrastructure needed.

**Trade-off**: Ongoing API costs vs. response quality, development speed, and advanced capabilities.
#### FastEmbed BGE-small + rank-bm25 vs. Single Approach
**Decision**: Hybrid embedding strategy with local FastEmbed and in-memory BM25.

**Why**:
- **Cost Efficiency**: Free local embeddings vs. OpenAI embedding API costs.
- **Privacy**: Document content never leaves local environment.
- **Performance**: 384-dim embeddings balance quality with speed.
- **Hybrid Capability**: BM25 enables exact term matching for technical precision.

**Trade-off**: Implementation complexity vs. cost savings, privacy, and enhanced search capabilities. A short embedding example appears below.
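For reference, generating the 384-dimensional embeddings locally with FastEmbed takes only a few lines. The model name matches the one this project uses; the sample texts are illustrative.

```python
from fastembed import TextEmbedding

# Downloads the model once, then runs fully locally (no API calls).
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

chunks = ["How to configure SAML SSO in Atlan", "Snowflake connector setup"]
vectors = list(model.embed(chunks))  # one 384-dimensional vector per chunk
print(len(vectors[0]))               # 384
```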
## Architecture
### Complete System Architecture with Component Interactions
```
USER INTERFACE LAYER
+-----------------------------------------------------------------------------+
|  User Browser Session                                                       |
|  Dashboard          Chat Agent          Settings           Analytics Page   |
|  - Bulk Class.      - Real-time Chat    - Dynamic Config   - Performance    |
|  - 30+ Tickets      - Memory Context    - Import/Export    - Search Stats   |
|  - Statistics       - Source Cites     - Validation       - Usage Metrics  |
+------------------------------------+----------------------------------------+
                                     | HTTP Requests
                                     v
STREAMLIT APPLICATION LAYER
+-----------------------------------------------------------------------------+
|  main.py (Port 8501)                                                        |
|  UI Controls            Session State         Event Handlers                |
|  - Input Forms          - User Session        - Button Clicks               |
|  - Display Logic        - Memory Store        - Text Input                  |
|  - File Uploads         - Chat History        - Page Navigation             |
+------------------------------------+----------------------------------------+
                                     | Function Calls
                                     v
AI PROCESSING LAYER (rag_pipeline.py)
+-----------------------------------------------------------------------------+
|  Advanced RAG Pipeline Engine                                               |
|  Classification         Query Pipeline        Response Generator            |
|  - Topic Tags           - Enhancement         - Template Rendering          |
|  - Sentiment            - Hybrid Search       - Citation Assembly           |
|  - Priority             - Smart Rerank        - Context Integration         |
+----------+------------------------+----------------------+------------------+
           |                        |                      |
           v                        v                      v
EXTERNAL AI APIs                DATA STORAGE LAYER
+-----------------------+  +--------------------------------------------------+
| OpenAI GPT-4o         |  | MongoDB Atlas             Qdrant Cloud           |
| - Classification      |  | - Raw Documents           - Vector Collections   |
|   (structured JSON)   |  |   (markdown text,           (atlan_docs_enhanced,|
| - Query Enhancement   |  |    metadata, timestamps)     384-dim embeddings, |
|   (term expansion)    |  | - Backup Files               payloads)           |
| - RAG Responses       |  |   (JSON dumps, recovery)  - In-Memory BM25 Index |
|   (contextual, cited) |  |                             (keyword search,     |
|                       |  |                              rank-bm25 library)  |
+-----------------------+  +------------------------+-------------------------+
                                                    ^
                                                    | Manual Execution
PIPELINE SCRIPTS LAYER
+-----------------------------------------------------------------------------+
|  scrape.py                           qdrant_ingestion.py                    |
|  - Firecrawl API (rate limits,       - Text Chunking (1200 tokens,          |
|    content extraction)                 200 overlap, code-aware)             |
|  - MongoDB Save (documents,          - FastEmbed BGE (local, 384 dims)      |
|    metadata, backup)                 - Quality Metrics (code detect,        |
|                                        headers)                             |
|                                      - Qdrant Upload (collections,          |
|                                        vectors, payloads)                   |
+-----------------------------------------------------------------------------+
           ^
           |
DATA SOURCES
+-----------------------------------------------------------------------------+
|  docs.atlan.com (~1078 pages)           developer.atlan.com (~611 pages)    |
|  - Product documentation                - API documentation                 |
|  - User guides                          - SDK references                    |
|  - Feature explanations                 - Code examples                     |
|  - Best practices                       - Technical specifications          |
+-----------------------------------------------------------------------------+

KEY INTERACTION FLOWS:

DATA PIPELINE FLOW:
docs.atlan.com -> Firecrawl API -> scrape.py -> MongoDB -> qdrant_ingestion.py -> Qdrant

REAL-TIME SEARCH FLOW:
User Query -> Query Enhancement (GPT-4o) -> Hybrid Search (Vector+BM25) ->
Smart Reranking -> Context Assembly -> Response Generation (GPT-4o) -> User

CHAT INTERACTION FLOW:
User Input -> Streamlit UI -> rag_pipeline.py -> Classification (GPT-4o) ->
RAG/Routing Decision -> Search & Generate -> Display with Citations

CONFIGURATION FLOW:
.env Variables -> Feature Toggles -> Pipeline Behavior -> Performance Optimization

ERROR HANDLING & RECOVERY:
- MongoDB backup files for data recovery
- Graceful degradation: Hybrid -> Vector-only -> Routing
- Rate limiting with exponential backoff
- Session state management for UI persistence
```

### System Components
- **Data Pipeline** (Root scripts): Web scraping → Storage → Vector preparation
- **Deployment** (App folder): Streamlit application with AI capabilities
- **AI Services**: OpenAI for classification and response generation
- **Storage**: MongoDB for documents, Qdrant for vector search

## Tech Stack
### AI/ML
- **OpenAI GPT-4o**: LLM for classification, response generation, and query enhancement
- **FastEmbed BAAI/bge-small-en-v1.5**: Vector embeddings for semantic search (384 dimensions)
- **Qdrant Cloud**: Vector database with hybrid search capabilities
- **rank-bm25**: BM25 algorithm for keyword search and hybrid retrieval
- **LangChain**: Enhanced text processing with advanced chunking strategies

### Application
- **Streamlit**: Interactive web application framework
- **Python**: Core application logic and AI pipeline
- **MongoDB**: Document storage for scraped content

### UI/UX
- **Streamlit Components**: Dashboard and chat interface
- **Custom CSS**: Styled components and responsive design
- **Interactive Elements**: Real-time classification and response generation

### Data Sources & Pipeline
- **Firecrawl API**: Automated web scraping service for documentation
- **docs.atlan.com**: Product documentation and user guides
- **developer.atlan.com**: API and SDK documentation
- **MongoDB**: Persistent storage for all scraped content with metadata
- **Qdrant**: Vector database for semantic search and RAG retrieval

### Pipeline Stages
1. **Web Scraping**: Firecrawl crawls documentation sites and extracts content
2. **Document Storage**: Raw content stored in MongoDB with full metadata
3. **Vector Processing**: Content chunked and embedded using FastEmbed BGE-small
4. **RAG Deployment**: Streamlit app queries Qdrant for relevant context

## Prerequisites
### For Deployment (Streamlit App)
- Python 3.8+ and pip
- OpenAI API key
- Qdrant Cloud instance (vector database)
- MongoDB Atlas instance (document storage)
- Firecrawl API key (if running custom scraping)

### Project Structure
- **Root directory**: Data pipeline scripts (scrape.py, qdrant_ingestion.py)
- **app/ directory**: Streamlit deployment application with its own requirements and .env

## Quick Start
### 1. Clone and Setup Environment
```bash
git clone https://github.com/kanugurajesh/Assistly
cd Assistly
```

**Project Structure Overview:**
```
crawling/
├── app/                    # Streamlit deployment
│   ├── main.py             # Main Streamlit application
│   ├── rag_pipeline.py     # AI pipeline implementation
│   ├── requirements.txt    # App dependencies
│   ├── .env.example        # Environment template
│   └── sample_tickets.json
├── memory_manager.py       # Conversational memory management
├── scrape.py               # Firecrawl web scraping
├── qdrant_ingestion.py     # Vector database ingestion
├── requirements.txt        # Data pipeline dependencies
└── README.md
```

Create a `.env` file in the `app/` directory (copy from `app/.env.example`):
```env
OPENAI_API_KEY=your_openai_api_key
QDRANT_URI=your_qdrant_cloud_endpoint
QDRANT_API_KEY=your_qdrant_api_key
MONGODB_URI=your_mongodb_atlas_connection_string
FIRECRAWL_API_KEY=your_firecrawl_api_key
```

> **Deployment-Ready Structure**: The `.env` file is located in the `app/` directory to enable standalone deployment. Root directory scripts (scrape.py, qdrant_ingestion.py) automatically load environment variables from `app/.env`, ensuring consistent configuration across the entire project while maintaining deployment flexibility for platforms like Streamlit Cloud.
### Environment Variables Reference
**Required for Core Functionality:**
- `OPENAI_API_KEY`: OpenAI API key for GPT-4o classification, response generation, and query enhancement
  - Obtain from: https://platform.openai.com/api-keys
  - Required permissions: GPT-4o model access and sufficient credits
  - Usage: Classification, RAG responses, and optional query enhancement
- `QDRANT_URI`: Qdrant Cloud vector database endpoint URL
  - Format: `https://your-cluster-id.europe-west3-0.gcp.cloud.qdrant.io:6333`
  - Obtain from: Qdrant Cloud dashboard after cluster creation
  - Usage: Vector similarity search and hybrid search operations
- `QDRANT_API_KEY`: Authentication key for Qdrant Cloud instance
  - Obtain from: Qdrant Cloud cluster settings → API Keys
  - Usage: Secure access to vector database operations
- `MONGODB_URI`: MongoDB Atlas connection string
  - Format: `mongodb+srv://username:password@cluster.mongodb.net/database`
  - Obtain from: MongoDB Atlas dashboard → Connect → Application
  - Usage: Document storage, metadata persistence, and backup operations

**Optional (Data Pipeline Only):**
- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (only needed for custom data ingestion)
  - Obtain from: https://www.firecrawl.dev/
  - Usage: Automated web scraping of documentation sites
  - Note: Pre-processed data is included, so this is optional for basic deployment

### 2. Install Dependencies
**For deployment (Streamlit app):**
```bash
pip install -r app/requirements.txt
```

**For data pipeline (if running scraping/ingestion):**
```bash
pip install -r requirements.txt
```

### 3. Data Pipeline Setup (Optional - for custom data)
**Step 1: Web Scraping with Firecrawl**
```bash
# Basic scraping (pre-completed for Atlan docs)
python scrape.py https://docs.atlan.com --limit 3000
python scrape.py https://developer.atlan.com --limit 1000

# Custom scraping examples
python scrape.py https://your-docs.com --limit 500 --collection custom_docs
```

> **Note**: The limits above (3000 for docs.atlan.com, 1000 for developer.atlan.com) are set higher than the actual page counts (~1078 and ~611 respectively) to ensure complete data scraping during development. You can use lower limits based on your needs; the crawler stops once all available pages are scraped, regardless of the limit setting.
*All scraped content is automatically stored in MongoDB with metadata and backup files; a sketch of that storage step follows.*
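The Firecrawl SDK's call signatures vary across versions, so rather than guess them, here is a sketch of the storage half only: given crawled pages, upsert them into MongoDB keyed by URL so re-scraping overwrites existing documents. The database/collection names follow this README; the page field names are assumptions.

```python
import os
from datetime import datetime, timezone
from pymongo import MongoClient

collection = MongoClient(os.environ["MONGODB_URI"])["Cluster0"]["scraped_pages"]

def store_pages(pages):
    """pages: iterable of dicts like {"url": ..., "markdown": ..., "title": ...}."""
    for page in pages:
        collection.update_one(
            {"source_url": page["url"]},  # keyed by URL: re-scrapes overwrite
            {"$set": {
                "markdown": page["markdown"],
                "title": page.get("title", ""),
                "scraped_at": datetime.now(timezone.utc),
            }},
            upsert=True,
        )
```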
**Step 2: Enhanced Vector Database Ingestion**
```bash
# Create enhanced collection with advanced chunking
python qdrant_ingestion.py --qdrant-collection "atlan_docs_enhanced" --recreate

# Advanced ingestion with source filtering
python qdrant_ingestion.py --source-url "https://docs.atlan.com" --qdrant-collection "atlan_docs_enhanced"

# Incremental updates (recommended for production)
python qdrant_ingestion.py --qdrant-collection "atlan_docs_enhanced"
```
*Enhanced chunking preserves code blocks, creates quality metrics, and generates embeddings with FastEmbed BGE-small for hybrid search; the sketch below shows the core of this step.*

**Note**: The application comes with pre-processed data, so this step is only needed for custom datasets or updates. For advanced configuration options, see the "Advanced Pipeline Options" section below.
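At its core, ingestion embeds each chunk with FastEmbed and upserts it into Qdrant together with its metadata payload. A compressed sketch, assuming a recent `qdrant-client` (for `collection_exists`) and with batching and error handling omitted; collection and payload fields follow the names used in this README:

```python
import os
import uuid

from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url=os.environ["QDRANT_URI"], api_key=os.environ["QDRANT_API_KEY"])
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

if not client.collection_exists("atlan_docs_enhanced"):
    client.create_collection(
        collection_name="atlan_docs_enhanced",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

def ingest(chunks):
    """chunks: list of dicts like {"text": ..., "source_url": ..., "title": ...}."""
    vectors = list(model.embed([c["text"] for c in chunks]))
    points = [
        PointStruct(id=str(uuid.uuid4()), vector=vec.tolist(), payload=chunk)
        for vec, chunk in zip(vectors, chunks)
    ]
    client.upsert(collection_name="atlan_docs_enhanced", points=points)
```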
## Advanced Pipeline Options
### Scraping Configuration (scrape.py)
**Basic Command Structure:**
```bash
python scrape.py [OPTIONS]
```

**Available Options:**
- `--limit <number>`: Maximum pages to crawl (default: 3000)
- `--collection <name>`: MongoDB collection name (default: atlan_developer_docs)

**Common Scraping Scenarios:**
```bash
# Scrape with custom page limit
python scrape.py https://docs.atlan.com --limit 500

# Scrape to custom MongoDB collection
python scrape.py https://docs.atlan.com --collection custom_docs

# Scrape developer docs with different limits
python scrape.py https://developer.atlan.com --limit 200 --collection dev_docs
```

### Ingestion Configuration (qdrant_ingestion.py)
**Advanced Command Structure:**
```bash
python qdrant_ingestion.py [OPTIONS]
```

**Available Options:**
- `--source-url <url>`: Filter documents by specific source URL
- `--collection <name>`: MongoDB collection name (default: atlan_developer_docs)
- `--qdrant-collection <name>`: Qdrant collection name (default: atlan_docs)
- `--recreate`: Delete and recreate Qdrant collection (removes existing data)
- `--no-incremental`: Process all documents (skip duplicate checking)

**Advanced Ingestion Examples:**
```bash
# Process only developer documentation
python qdrant_ingestion.py --source-url "https://developer.atlan.com"# Process only general documentation
python qdrant_ingestion.py --source-url "https://docs.atlan.com"# Recreate collection (fresh start)
python qdrant_ingestion.py --recreate# Process all documents without incremental checking
python qdrant_ingestion.py --no-incremental# Process custom collection with filtering
python qdrant_ingestion.py --collection custom_docs --source-url "https://example.com"# Create custom Qdrant collection
python qdrant_ingestion.py --qdrant-collection "developer_vectors"# Process custom MongoDB collection to custom Qdrant collection
python qdrant_ingestion.py --collection dev_docs --qdrant-collection "dev_vectors"# Full rebuild with specific source and custom collection
python qdrant_ingestion.py --recreate --source-url "https://developer.atlan.com" --qdrant-collection "dev_only"
```

## Document Filtering & Collection Management
### Source URL Filtering Benefits
**Selective Processing:**
- Update only specific documentation domains
- Test pipeline with subset of data
- Separate processing schedules for different sites

**Document Type Classification:**
- Automatic categorization: `developer.atlan.com` → "developer" type
- All other sources → "docs" type
- Enables filtered search and analytics

**Performance Optimization:**
- Process only changed documentation
- Reduce vector database update time
- Minimize embedding generation costs

**Custom Qdrant Collections:**
- Separate vector collections for different projects
- Independent collection lifecycle management
- Isolated testing and production environments
- Multiple documentation versions in parallel

### Collection Management Workflows
**Development & Testing:**
```bash
# Create test collection with limited data
python scrape.py https://docs.atlan.com --limit 50 --collection test_docs
python qdrant_ingestion.py --collection test_docs --qdrant-collection test_vectors --recreate
```

**Production Updates:**
```bash
# Re-scrape updated content (overwrites existing URLs)
python scrape.py https://docs.atlan.com --limit 3000

# Incremental ingestion (only new/changed documents)
python qdrant_ingestion.py
```

**Multi-Source Management:**
```bash
# Separate ingestion for different documentation types
python qdrant_ingestion.py --source-url "https://developer.atlan.com"
python qdrant_ingestion.py --source-url "https://docs.atlan.com"
```

### Incremental Processing
**How It Works:**
- Checks MongoDB document IDs already in Qdrant
- Skips processing of existing documents
- Only processes new or updated content (a sketch of the duplicate check appears below)
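One way to implement the duplicate check, sketched under the assumption that each Qdrant payload stores its MongoDB document id in a `doc_id` field; the real script may key on a different field or id scheme:

```python
def already_ingested_ids(client, collection_name):
    """Scroll through the Qdrant collection and collect stored MongoDB doc ids."""
    seen, offset = set(), None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=256,
            offset=offset,
            with_payload=["doc_id"],  # assumed payload field
            with_vectors=False,
        )
        seen.update(p.payload.get("doc_id") for p in points)
        if offset is None:  # no more pages to scroll
            return seen

# Skip documents Qdrant already knows about:
# new_docs = [d for d in mongo_docs
#             if str(d["_id"]) not in already_ingested_ids(client, "atlan_docs_enhanced")]
```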
**When to Use `--no-incremental`:**
- After modifying chunking parameters
- When reprocessing is needed due to embedding model changes
- For debugging or validation purposes

### 4. Run the Application
**Run the Streamlit app:**
```bash
cd app
streamlit run main.py
```

The application opens automatically in your browser at `http://localhost:8501`.
### 5. Streamlit Deployment
**Deploy to Streamlit Community Cloud:**
1. Push your repository to GitHub
2. Visit [share.streamlit.io](https://share.streamlit.io)
3. Connect your GitHub repository
4. Set main file path: `app/main.py`
5. Add environment variables in Streamlit Cloud settings
6. Deploy your application

## Usage Guide
### Dashboard Page
1. Navigate to "π Dashboard" in the sidebar
2. Click "Load & Classify All Tickets"
3. View AI-generated classifications for all 30+ sample tickets
4. Analyze summary statistics and topic distributions
5. Search and examine individual ticket classifications

### Interactive Agent Page
1. Navigate to "Chat Agent" in the sidebar
2. Enter your question in the chat interface
3. Toggle "Show internal analysis" to view classification details
4. Get intelligent responses with source citations
5. Experience context-aware conversations with memory
6. Use the "Conversation Management" sidebar to view memory stats or clear history
7. Try sample questions or submit your own tickets

### Analytics Page
1. Navigate to "Analytics" in the sidebar
2. **Performance Metrics**: View real-time search performance statistics
- Response times for different search methods (vector, hybrid, keyword)
- Query enhancement usage and effectiveness metrics
- Average retrieval quality scores and user satisfaction
3. **Usage Analytics**: Monitor system utilization patterns
- Daily/weekly query volume trends
- Most common topic classifications and routing decisions
- Memory usage statistics across active sessions
4. **Search Method Distribution**: Analyze search strategy effectiveness
- Breakdown of vector vs. hybrid vs. keyword search usage
- Success rates and fallback patterns for different methods
- Quality metrics per search type with comparative analysis
5. **System Health Monitoring**: Track infrastructure performance
- Qdrant collection performance and vector database health
- MongoDB connection status and query response times
- OpenAI API usage, rate limits, and cost optimization insights

## Conversational Memory Features
### Context-Aware Conversations
The system maintains conversation history to provide context-aware responses:
- **Follow-up Questions**: Ask related questions without repeating context
- **Reference Previous Answers**: The AI remembers what it told you earlier
- **Natural Flow**: Conversations feel more natural and coherent

### Memory Management
**Conversation Management Sidebar:**
- **Memory Statistics**: View active sessions and total message count
- **Current Session Info**: See number of exchanges in current conversation
- **Clear History**: Manually reset conversation memory when needed

**Automatic Features:**
- **Session Isolation**: Each browser session has its own conversation memory
- **Message Limits**: Automatically trims to last 20 messages to prevent token overflow
- **Auto Expiry**: Sessions expire after 60 minutes of inactivity
- **Smart Trimming**: Removes oldest messages while preserving conversation pairs

### Example Conversation Flow
```
User: "How do I connect Snowflake to Atlan?"
AI: "To connect Snowflake to Atlan, you need to configure..." [provides detailed steps]User: "What permissions do I need for this?"
AI: "For the Snowflake connection we discussed, you'll need..." [remembers previous context]User: "Are there any security considerations?"
AI: "Yes, for your Snowflake-Atlan integration, consider..." [builds on conversation]
```### Technical Implementation
- **Backend**: LangChain's `InMemoryChatMessageHistory` for pure RAM storage
- **No Database**: Conversations stored in Python dictionaries (no external dependencies)
- **Session Management**: UUID-based session identification with Streamlit session state
- **Context Integration**: Previous conversation included in RAG prompts for better responses

### Memory Manager Implementation (`memory_manager.py`)
The `ConversationMemoryManager` class provides advanced conversation memory features:

**Core Features:**
- **Session Isolation**: Each browser session maintains separate conversation history
- **Automatic Cleanup**: Configurable auto-cleanup removes expired sessions (default: every 100 operations)
- **Message Trimming**: Automatically limits conversations to last 20 messages per session
- **Session Timeout**: Sessions expire after 60 minutes of inactivity
- **Smart Trimming**: Preserves conversation pairs (human + AI messages) when trimming

**Configuration Options:**
- `max_messages_per_session`: Maximum messages per conversation (default: 20)
- `session_timeout_minutes`: Session expiration time (default: 60 minutes)
- `auto_cleanup_interval`: Operations between automatic cleanup (default: 100)

**Memory Statistics:**
- Active session count and total message tracking
- Per-session message counts and last activity timestamps
- Memory usage optimization with automatic garbage collection
- Real-time memory health monitoring via the Analytics page (a condensed sketch of the manager appears below)
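A condensed sketch of how such a manager can be built on LangChain's `InMemoryChatMessageHistory`. The class name and defaults mirror the description above; the actual `memory_manager.py` may differ in details such as the cleanup cadence.

```python
from datetime import datetime, timedelta

from langchain_core.chat_history import InMemoryChatMessageHistory

class ConversationMemoryManager:
    def __init__(self, max_messages_per_session=20, session_timeout_minutes=60):
        self.max_messages = max_messages_per_session
        self.timeout = timedelta(minutes=session_timeout_minutes)
        self.sessions = {}      # session_id -> InMemoryChatMessageHistory
        self.last_active = {}   # session_id -> datetime of last use

    def get_history(self, session_id: str) -> InMemoryChatMessageHistory:
        self._expire_stale()
        self.last_active[session_id] = datetime.now()
        history = self.sessions.setdefault(session_id, InMemoryChatMessageHistory())
        # Keep only the most recent messages; an even cap keeps human/AI pairs together.
        if len(history.messages) > self.max_messages:
            history.messages[:] = history.messages[-self.max_messages:]
        return history

    def _expire_stale(self):
        # Drop sessions idle past the timeout (the 60-minute expiry described above).
        cutoff = datetime.now() - self.timeout
        for sid in [s for s, t in self.last_active.items() if t < cutoff]:
            self.sessions.pop(sid, None)
            self.last_active.pop(sid, None)
```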
1. Navigate to "βοΈ Settings" in the sidebar
2. **Collection Management**: Select from available Qdrant collections with real-time discovery
3. **Collection Information**: View collection points, vector size, and distance metrics
4. Configure search parameters (TOP_K, score thresholds, configurable hybrid weights)
5. Adjust model settings (temperature, max tokens, model selection)
6. Toggle features (hybrid search, query enhancement)
7. Customize UI preferences (show analysis default)
8. Apply settings in real-time without restarting the application
9. Export/import settings configurations as JSON files
10. View configuration warnings for potentially problematic settings
11. **Troubleshooting**: Built-in connection diagnostics and collection validation

## Advanced AI Pipeline Details
### Enhanced Classification Logic
The system analyzes tickets using structured prompts to generate:
1. **Topic Tags**: Multiple relevant categories with high accuracy
2. **Sentiment**: Emotional tone analysis for prioritization
3. **Priority**: Business impact assessment with context awareness (an illustrative classification call appears below)
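Illustratively, the classification step can ask GPT-4o for a strict JSON object matching the schema above. The prompt wording here is an assumption; the tag, sentiment, and priority values come from this README.

```python
import json

from openai import OpenAI

client = OpenAI()

CLASSIFY_PROMPT = """Classify this support ticket. Respond with JSON only:
{"topic_tags": [...], "sentiment": "...", "priority": "..."}
Topic tags: How-to, Product, Connector, Lineage, API/SDK, SSO, Glossary, Best practices, Sensitive data.
Sentiment: Frustrated, Curious, Angry, Neutral. Priority: P0, P1, P2."""

def classify_ticket(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        response_format={"type": "json_object"},  # forces parseable JSON output
        messages=[
            {"role": "system", "content": CLASSIFY_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```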
### Advanced RAG Response Logic
- **RAG Topics**: How-to, Product, Best practices, API/SDK, SSO → Generate answers using hybrid search
- **Routing Topics**: Connector, Lineage, Glossary, Sensitive data → Route to specialized teams
- **Query Processing**: Optional GPT-4o enhancement expands technical terms
- **Search Strategy**: Hybrid vector + keyword search with smart reranking
- **Response Generation**: Context-aware answers with source attribution

### Advanced RAG Pipeline Components
#### 1. Query Enhancement Pipeline (Optional)
- **Input**: Raw user query (e.g., "How to setup SSO?")
- **Processing**: GPT-4o expands technical terms and acronyms
- **Output**: Enhanced query (e.g., "How to configure SAML single sign-on authentication in Atlan?")
- **Benefits**: Better retrieval for technical documentation
- **Toggle**: Configurable via `ENABLE_QUERY_ENHANCEMENT`

#### 2. Hybrid Search System
- **Vector Search**: Semantic similarity using FastEmbed BGE-small (384 dim)
- **Keyword Search**: BM25 algorithm for exact term matching
- **Fusion Strategy**: Configurable weighted combination with smart deduplication
- **Reranking**: Boosts documents found by both methods
- **Fallback**: Graceful degradation to vector-only if BM25 fails

#### 3. Enhanced Chunking Strategy
- **Structure Preservation**: Special handling for code blocks and headers
- **Smart Separators**: 15+ separator types for optimal boundaries
- **Quality Metrics**: Tracks code presence, headers, and word count
- **Metadata Enhancement**: Chunk-level quality indicators
- **Context Maintenance**: Preserves related content together

### Enhanced Chunking Strategy
- **Chunk Size**: 1200 tokens with 200 token overlap
- **Method**: Advanced recursive character splitting with enhanced separators
- **Code Preservation**: Special handling for fenced code blocks and indented code
- **Structure Awareness**: Preserves headers, lists, procedures, and markdown formatting
- **Quality Metrics**: Tracks code blocks, headers, word count, and chunk quality scores
- **Smart Boundaries**: 15+ separator types for optimal semantic chunking (see the splitter sketch below)
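These parameters map directly onto LangChain's `RecursiveCharacterTextSplitter`; a minimal sketch with an abbreviated separator list (the full 15+ set appears later in this README). Note that by default the splitter counts characters rather than tokens.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(markdown_text: str) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1200,
        chunk_overlap=200,
        keep_separator=True,
        # Abbreviated separator list, ordered from coarsest to finest boundary.
        separators=["\n\n\n", "\n\n", "\n```\n", "\n# ", "\n## ", "\n- ", "\n", ". ", " ", ""],
    )
    return splitter.split_text(markdown_text)
```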
### Hybrid Search System
- **Vector Search**: BAAI/bge-small-en-v1.5 (384 dimensions) with cosine similarity
- **Keyword Search**: BM25 algorithm for exact term matching
- **Search Fusion**: Configurable weighted combination of vector and keyword results
- **Smart Reranking**: Deduplication and relevance scoring with boost for multi-method matches
- **Score Threshold**: 0.3 minimum similarity for vector results
- **Top-K Retrieval**: 5 most relevant chunks from hybrid results

### Conversational Memory System
- **Memory Backend**: LangChain's `InMemoryChatMessageHistory` for pure RAM storage
- **Session Management**: Unique session IDs for each browser session with automatic timeout
- **Context Window**: Last 5 message exchanges included in RAG prompts for continuity
- **Memory Features**:
- Automatic message trimming (max 20 messages per session)
- Session cleanup and expiration (60-minute timeout)
- Manual conversation clearing via UI
- Memory usage statistics and monitoring
- **No External Dependencies**: Pure in-memory storage without databases

## Configuration Options
### Environment Variables (app/.env)
- `OPENAI_API_KEY`: Required for GPT-4o classification, response generation, and query enhancement
- `QDRANT_URI`: Qdrant Cloud vector database endpoint for hybrid search
- `QDRANT_API_KEY`: Authentication for Qdrant Cloud instance
- `MONGODB_URI`: MongoDB Atlas connection string for document storage
- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (data pipeline only)

### Advanced RAG Configuration (app/rag_pipeline.py)
- `ENABLE_QUERY_ENHANCEMENT`: Toggle GPT-4o query expansion (default: False)
- `ENABLE_HYBRID_SEARCH`: Toggle vector + BM25 hybrid search (default: True)
- `HYBRID_VECTOR_WEIGHT`: Configurable weight for vector search results (default: 1.0)
- `HYBRID_KEYWORD_WEIGHT`: Configurable weight for BM25 keyword results (default: 0.0)
- `COLLECTION_NAME`: Qdrant collection name (default: "atlan_docs_enhanced")
- `SCORE_THRESHOLD`: Minimum similarity threshold (default: 0.3)
- `TOP_K`: Number of search results to retrieve (default: 5)
- `MAX_TOKENS`: Maximum response length (default: 1000)
- `TEMPERATURE`: Response creativity level (default: 0.3)
- `LLM_MODEL`: OpenAI model for responses (default: "gpt-4o")

### Dynamic Settings Management
- **Real-time Updates**: All configuration changes apply immediately without restart
- **Collection Management**: Dynamic Qdrant collection discovery and switching
- **Settings Validation**: Built-in warnings for potentially problematic configurations
- **Import/Export**: JSON-based settings backup and sharing capabilities
- **UI Integration**: Settings page with tabbed interface for different parameter categories
- **Configuration Persistence**: Settings stored in session state and applied to pipeline
- **Connection Diagnostics**: Real-time collection validation and troubleshooting
- **Fallback Handling**: Graceful degradation when settings cause issues (a session-state sketch follows)
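In Streamlit, real-time settings like these are typically held in `st.session_state` and exported/imported as JSON. A sketch of that pattern; the widget labels and settings keys are illustrative, not the app's actual names.

```python
import json

import streamlit as st

DEFAULTS = {"top_k": 5, "score_threshold": 0.3, "enable_hybrid_search": True}
settings = st.session_state.setdefault("settings", dict(DEFAULTS))

# Changes apply immediately on rerun: no application restart needed.
settings["top_k"] = st.slider("TOP_K results", 1, 20, settings["top_k"])
if settings["score_threshold"] > 0.8:
    st.warning("A very high score threshold may filter out all results.")

# Export / import settings as JSON files
st.download_button("Export settings", json.dumps(settings, indent=2), "settings.json")
uploaded = st.file_uploader("Import settings", type="json")
if uploaded is not None:
    settings.update(json.loads(uploaded.read()))
```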
### Data Pipeline Configuration
- **Scraping Parameters**: Use `--limit` and `--collection` options in scrape.py for custom URLs and crawl limits
- **Source Filtering**: Use `--source-url` in qdrant_ingestion.py for selective document processing
- **Collection Management**: Use `--qdrant-collection`, `--recreate` and `--no-incremental` options for collection lifecycle
- **Custom Collections**: Use `--qdrant-collection` to create separate vector collections for different projects
- **Chunk Configuration**: Adjust size and overlap in qdrant_ingestion.py (default: 1200 tokens, 200 overlap)
- **Vector Search**: Modify threshold and top-K in app/rag_pipeline.py (default: 0.3 threshold, 5 chunks)

## Qdrant Collections & Chunking Strategies
This project implements two different Qdrant collections with distinct chunking strategies to optimize for different use cases and document types.
### Collections Overview
| Collection | Chunking Strategy | Branch Availability | Best For |
|------------|------------------|-------------------|----------|
| **`atlan_docs`** | Basic Chunking | Development branch | Plain text, fast processing |
| **`atlan_docs_enhanced`** | Enhanced Chunking | Main & advanced-rag-enhancements branches | Technical docs with code |

### Basic Chunking Strategy (`atlan_docs`)
**Implementation Location**: Available in the development branch of this repository
**Technical Details**:
- Uses simple `RecursiveCharacterTextSplitter` with basic separators: `["\n\n", "\n", " ", ""]`
- Chunk size: ~1200 characters with 200 character overlap
- Keeps separators in chunks (`keep_separator=True`)
- Fast, straightforward processing

**Metadata Structure**:
```json
{
"text": "chunk content...",
"source_url": "https://docs.atlan.com/...",
"title": "Document Title",
"doc_type": "docs" | "developer",
"chunk_index": 0,
"total_chunks": 5
}
```

**Characteristics**:
- ✅ **Pros**: Simple, fast, works well for plain text documents
- ❌ **Cons**:
  - Doesn't preserve code blocks (may split fenced code blocks)
  - Markdown structure (headers, lists) may be broken
  - Chunks may cut through semantic boundaries → lower retrieval quality

### Enhanced Chunking Strategy (`atlan_docs_enhanced`)
**Implementation Location**: Current implementation in main and advanced-rag-enhancements branches
**Technical Details**:
- **Code Block Preservation**: Uses `preserve_code_blocks()` function to surround code blocks with newlines
- **Rich Separators**: 15+ separator types for optimal semantic boundaries:
````python
separators=[
"\n\n\n", # Major section breaks
"\n\n", # Paragraph breaks
"\n```\n", "```\n", # Code block boundaries
"\n# ", "\n## ", "\n### ", "\n#### ", # Headers
"\n- ", "\n* ", "\n1. ", "\n2. ", # Lists
"\n", ". ", "? ", "! ", "; ", ", ", # Sentences & punctuation
" ", "" # Words & characters
]
````
- **Quality Metrics**: Analyzes chunk content for optimization (see the sketch below)
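The `preserve_code_blocks()` helper and the quality analysis might look roughly like this. This is a sketch inferred from the description above, not the file's actual contents; the quality thresholds are assumptions.

```python
import re

def preserve_code_blocks(text: str) -> str:
    """Pad fenced code blocks with blank lines so the splitter treats them as units."""
    return re.sub(r"```.*?```", lambda m: f"\n\n{m.group(0)}\n\n", text, flags=re.DOTALL)

def chunk_quality(chunk: str) -> dict:
    """Per-chunk metadata used for filtering and debugging retrieval quality."""
    has_code = "```" in chunk
    has_headers = bool(re.search(r"^#{1,4} ", chunk, flags=re.MULTILINE))
    word_count = len(chunk.split())
    return {
        "word_count": word_count,
        "has_code": has_code,
        "has_headers": has_headers,
        # Illustrative heuristic for the "high" | "medium" label stored in payloads.
        "chunk_quality": "high" if (has_code or has_headers) and word_count > 50 else "medium",
    }
```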
**Enhanced Metadata Structure**:
```json
{
"text": "chunk content...",
"source_url": "https://docs.atlan.com/...",
"title": "Document Title",
"doc_type": "docs" | "developer",
"chunk_index": 0,
"total_chunks": 5,
"word_count": 150,
"has_code": true,
"has_headers": true,
"chunk_quality": "high" | "medium"
}
```

**Characteristics**:
- ✅ **Pros**:
- Preserves semantic meaning better (respects headers, lists, sentences)
- Code blocks are chunked as whole units (maintains functional examples)
- Quality metadata enables downstream filtering and optimization
- Better context preservation for technical documentation
- ❌ **Cons**:
- More complex processing (slower)
  - Slightly larger metadata footprint

### Key Differences Summary
| Aspect | Basic Chunking | Enhanced Chunking |
|--------|----------------|-------------------|
| **Speed** | Fast | Moderate |
| **Code Preservation** | ❌ May split code blocks | ✅ Preserves complete code blocks |
| **Markdown Awareness** | ❌ Basic line/paragraph splitting | ✅ Respects headers, lists, structure |
| **Quality Tracking** | ❌ No quality metrics | ✅ Chunk quality indicators |
| **Use Case** | Raw text ingestion | Developer documentation |
| **Retrieval Quality** | Good for simple text | Superior for technical content |

### Collection Selection Guidance
**Choose `atlan_docs` (Basic) when**:
- Processing large volumes of plain text
- Speed is critical over quality
- Documents don't contain code examples
- Simple question-answering scenarios

**Choose `atlan_docs_enhanced` (Enhanced) when**:
- Processing technical documentation
- Documents contain code examples and structured content
- Quality of retrieval is more important than speed
- Need chunk-level quality metrics for optimization

**Switching Collections**:
1. Navigate to "Settings" in the sidebar
2. Use the "Collection Management" section
3. Select from available Qdrant collections
4. Apply changes in real-time without restart

### Performance Implications
- **Basic Chunking**: ~40% faster processing, smaller storage footprint
- **Enhanced Chunking**: Higher retrieval accuracy for technical queries, better context preservation

Choose based on your specific use case: speed vs. quality trade-off.
### Common Pipeline Workflows
**Complete Fresh Setup:**
```bash
# Scrape new documentation
python scrape.py https://new-docs.com --limit 500 --collection new_docs

# Create fresh vector database
python qdrant_ingestion.py --collection new_docs --recreate
```

**Incremental Updates (Recommended):**
```bash
# Re-scrape updated content (overwrites existing URLs)
python scrape.py https://docs.atlan.com --limit 700

# Incremental ingestion (only new/changed documents)
python qdrant_ingestion.py
```

**Domain-Specific Processing:**
```bash
# Update only developer documentation vectors
python qdrant_ingestion.py --source-url "https://developer.atlan.com"# Update only general documentation vectors
python qdrant_ingestion.py --source-url "https://docs.atlan.com"
```

**Testing and Development:**
```bash
# Create test dataset
python scrape.py https://docs.atlan.com --limit 20 --collection test_data

# Test ingestion pipeline with custom Qdrant collection
python qdrant_ingestion.py --collection test_data --qdrant-collection test_vectors --recreate
```

**Multiple Project Management:**
```bash
# Project A: Customer documentation
python scrape.py https://customer-docs.com --collection customer_docs
python qdrant_ingestion.py --collection customer_docs --qdrant-collection customer_vectors

# Project B: Internal documentation
python scrape.py https://internal-docs.com --collection internal_docs
python qdrant_ingestion.py --collection internal_docs --qdrant-collection internal_vectors
```

### Application Customization
- **Classification Prompts**: Edit prompts in app/rag_pipeline.py for custom categorization
- **Response Templates**: Modify RAG and routing responses in app/main.py
- **UI Styling**: Update custom CSS in app/main.py for branding

## Performance Metrics & Advanced Features
### Enhanced Data Pipeline Efficiency
- **Smart Scraping**: Firecrawl with automated content extraction and metadata preservation
- **Persistent Storage**: MongoDB with backup capabilities and incremental processing
- **Advanced Vector Ingestion**: Batch processing with enhanced chunking and quality metrics
- **Hybrid Search Performance**: Combined vector + keyword search with intelligent reranking

### Advanced Response Quality Measures
- **Multi-Method Retrieval**: Hybrid search combines semantic and keyword matching
- **Query Enhancement**: GPT-4o expands technical terms for better retrieval (configurable)
- **Smart Reranking**: Configurable weighted fusion of vector and BM25 results
- **Source Attribution**: All RAG responses include original documentation URLs
- **Relevance Scoring**: Vector similarity + BM25 scoring with threshold 0.3
- **Context Quality**: Top-5 chunks from hybrid results for comprehensive answers
- **Search Transparency**: Real-time indicators showing search methods used

### Advanced Scalability Features
- **Feature Toggles**: Configurable query enhancement and hybrid search
- **Collection Management**: Separate enhanced and standard collections
- **Incremental Processing**: Skip already processed documents for efficiency
- **Quality Metrics**: Chunk-level quality indicators (code detection, headers, word count)
- **Error Resilience**: Graceful fallbacks for all advanced features
- **Performance Monitoring**: Search method tracking and optimization insights

## Troubleshooting
### Data Pipeline Issues
**1. Firecrawl Scraping Problems**
- Verify Firecrawl API key in environment variables
- Check rate limits and adjust scraping delays in scrape.py
- Monitor MongoDB connection for storage issues

**2. MongoDB Storage Issues**
- Validate MongoDB Atlas connection string
- Check database and collection permissions
- Verify network access to MongoDB cluster

**3. Vector Database Problems**
- Verify Qdrant Cloud instance is accessible
- Check embedding dimensions match (384 for BGE-small)
- Validate collection exists and has correct configuration

## Backup and Recovery Procedures
### MongoDB Data Backup
The system includes automatic backup mechanisms for data protection.

**Automatic Backup Features:**
- **Scraping Backup**: All scraped content is automatically saved as backup files during data collection
- **Document Persistence**: MongoDB Atlas provides built-in automated backups (snapshots every 24 hours)
- **Metadata Preservation**: Full document metadata, URLs, and timestamps stored for recovery

**Manual Backup Procedures:**
```bash
# Export specific collection to JSON backup
# Use MongoDB Compass or Atlas export functionality
# Or use mongodump for command-line backup:
mongodump --uri "your_mongodb_uri" --collection scraped_pages --db Cluster0 --out ./backup/# Export custom collection with date stamp
mongodump --uri "your_mongodb_uri" --collection custom_docs --db Cluster0 --out ./backup/$(date +%Y%m%d)/
```

**Recovery Procedures:**
```bash
# Restore from MongoDB Atlas snapshot (via Atlas UI)
# 1. Go to Atlas Dashboard -> Clusters -> Backup
# 2. Select snapshot date and restore to new cluster
# 3. Update MONGODB_URI in environment variables

# Restore from local backup
mongorestore --uri "your_mongodb_uri" --db Cluster0 ./backup/dump/Cluster0/
```

### Vector Database Recovery
**Qdrant Collection Backup:**
- Vector collections can be recreated using `qdrant_ingestion.py --recreate`
- All embedding data is regenerated from MongoDB source documents
- Collection metadata and configuration preserved in code

**Recovery Process:**
```bash
# Full vector database recreation from MongoDB
python qdrant_ingestion.py --recreate --qdrant-collection atlan_docs_enhanced

# Partial recovery for specific sources
python qdrant_ingestion.py --source-url "https://docs.atlan.com" --recreate

# Verify collection health after recovery
python -c "
from app.rag_pipeline import RAGPipeline
pipeline = RAGPipeline()
print(f'Collection status: {pipeline.qdrant_client.get_collection(\"atlan_docs_enhanced\")}')"
```

### Disaster Recovery Checklist
1. **Environment Variables**: Ensure `.env` files are backed up securely
2. **MongoDB**: Verify Atlas automated backups are enabled
3. **Source Code**: Regular git commits with configuration files
4. **API Keys**: Secure storage of all service credentials
5. **Documentation**: Keep setup instructions updated for recovery scenarios

### Application Issues
**4. Streamlit Deployment**
- Ensure all environment variables are set in app/.env
- Check that app/requirements.txt includes all dependencies
- Verify OpenAI API key has sufficient credits

**5. Classification Errors**
- Review prompt templates in app/rag_pipeline.py
- Check JSON parsing logic for malformed responses
- Monitor OpenAI API rate limits

### Debugging Tips
- Check Streamlit logs for detailed error messages
- Validate environment variable loading in app directory
- Test individual pipeline components (MongoDB, Qdrant, OpenAI)
- Monitor API usage and rate limits across all services

## Development Notes
### Advanced Project Structure Philosophy
- **Separation of Concerns**: Clean separation between data pipeline (root) and deployment (app/)
- **Environment Isolation**: Each tier has independent requirements and configuration
- **Feature Modularity**: Advanced RAG features can be toggled independently
- **Data Persistence**: MongoDB enables reprocessing and experimentation
- **Deployment Ready**: Enhanced app/ folder with advanced search capabilities

### Enhanced Architecture Decisions
1. **Advanced Data Pipeline**: Firecrawl → MongoDB → Enhanced Qdrant → Advanced Streamlit
2. **Hybrid Search System**: Vector + BM25 keyword search with intelligent fusion
3. **Query Enhancement**: Optional GPT-4o query expansion for technical terms
4. **Enhanced Chunking**: Code-aware splitting with quality metrics
5. **Smart Configuration**: Feature toggles for different deployment scenarios
6. **Dual Collection Strategy**: Standard vs enhanced collections for comparison
7. **Performance Optimization**: Configurable search weights and thresholds

### Advanced Trade-offs & Design Decisions
- **Query Enhancement**: Optional GPT-4o expansion vs direct search (configurable)
- **Hybrid Search**: Vector + keyword complexity vs pure vector simplicity
- **Enhanced Chunking**: Structure preservation vs simple character splitting
- **Feature Toggles**: Flexibility vs configuration complexity
- **Dual Collections**: Comparison capability vs storage overhead
- **Search Transparency**: User insight vs UI complexity
- **Performance vs Features**: Configurable enhancement levels for different use cases

### Production-Ready Enhancements
- **Collection Management**: Enhanced vs standard collections for A/B testing
- **Feature Flags**: Runtime configuration of advanced features
- **Quality Metrics**: Chunk-level quality indicators for optimization
- **Search Analytics**: Real-time method tracking and performance insights
- **Graceful Degradation**: Fallbacks ensure system reliability

### Enhanced RAG Implementation (advanced-rag-enhancements branch)
| Feature | Implementation |
|---------|----------------|
| **Search Method** | Hybrid vector + BM25 keyword search |
| **Query Processing** | Optional GPT-4o query enhancement |
| **Chunking** | Code-aware splitting with quality metrics |
| **Results** | Smart reranking with configurable fusion weights |
| **UI Feedback** | Search method indicators + transparency |
| **Collection** | `atlan_docs_enhanced` with enhanced metadata |
| **Configurability** | Feature toggles for all enhancements |
| **Performance** | Graceful degradation and fallbacks |
| **Settings Management** | Dynamic configuration with real-time updates |
| **Configuration** | Import/export, validation, and persistence |

### Key Improvements
1. ✅ **Better Technical Term Handling**: Hybrid search excels at exact matches
2. ✅ **Enhanced Code Examples**: Preserved code blocks in chunking
3. ✅ **Query Expansion**: GPT-4o expands acronyms and technical terms
4. ✅ **Search Transparency**: Users see which methods found their answers
5. ✅ **Quality Metrics**: Chunk-level indicators for optimization
6. ✅ **Configurable Features**: Toggle enhancements based on needs
7. ✅ **Dynamic Settings Management**: Real-time configuration without restart
8. ✅ **Settings Import/Export**: JSON-based configuration sharing and backup
9. ✅ **Collection Management**: Real-time Qdrant collection discovery and switching
10. ✅ **Connection Diagnostics**: Built-in troubleshooting for collection issues

## Developer Utilities
### Database Utility Functions (`utils.py`)
The project includes utility functions for MongoDB operations in the data pipeline:

**Core Functions:**
- `get_mongodb_client()`: Creates an authenticated MongoDB client using environment variables
  - Returns: `MongoClient` instance configured with `MONGODB_URI`
  - Handles connection string validation and error handling
  - Usage: For direct database operations and debugging
- `get_mongodb_collection(database_name, collection_name)`: Gets MongoDB client, database, and collection
  - Args:
    - `database_name`: Target database (default: "Cluster0")
    - `collection_name`: Target collection (default: "scraped_pages")
  - Returns: Tuple of `(client, database, collection)`
  - Usage: For pipeline scripts that need database access
- `close_mongodb_client(client)`: Safely closes MongoDB connections
  - Args: `client` - MongoDB client instance to close
  - Includes error handling for cleanup operations
  - Usage: Ensures proper resource cleanup in scripts

**Constants:**
- `DEFAULT_DATABASE = "Cluster0"`: Default MongoDB database name
- `DEFAULT_COLLECTION = "scraped_pages"`: Default collection for scraped content

**Example Usage:**
```python
from utils import get_mongodb_collection, close_mongodb_client

# Get database components
client, db, collection = get_mongodb_collection("Cluster0", "custom_docs")

# Perform operations
documents = collection.find({"source_url": {"$regex": "docs.atlan.com"}})

# Cleanup
close_mongodb_client(client)
```
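For context, an implementation consistent with the signatures and constants above could look like this. This is a sketch; the actual `utils.py` may differ.

```python
import os

from pymongo import MongoClient

DEFAULT_DATABASE = "Cluster0"
DEFAULT_COLLECTION = "scraped_pages"

def get_mongodb_client() -> MongoClient:
    """Create an authenticated client from the MONGODB_URI environment variable."""
    uri = os.environ.get("MONGODB_URI")
    if not uri:
        raise ValueError("MONGODB_URI is not set")
    return MongoClient(uri)

def get_mongodb_collection(database_name=DEFAULT_DATABASE, collection_name=DEFAULT_COLLECTION):
    """Return (client, database, collection) for pipeline scripts."""
    client = get_mongodb_client()
    database = client[database_name]
    return client, database, database[collection_name]

def close_mongodb_client(client: MongoClient) -> None:
    """Close the client without letting cleanup errors crash the pipeline."""
    try:
        client.close()
    except Exception as exc:
        print(f"Warning: error closing MongoDB client: {exc}")
```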
## Contributing
1. Fork the repository
2. Create a feature branch
3. Implement changes with tests
4. Update documentation
5. Submit a pull request

## License
[MIT License](LICENSE)
## Support
For issues and questions:
1. Check the troubleshooting section
2. Review API documentation
3. Create an issue with detailed reproduction steps