{"id":31051740,"url":"https://github.com/kanugurajesh/assistly","last_synced_at":"2026-04-16T04:03:20.795Z","repository":{"id":314755726,"uuid":"1054286996","full_name":"kanugurajesh/Assistly","owner":"kanugurajesh","description":"An AI-powered customer support system that automatically classifies tickets and provides intelligent responses using Retrieval-Augmented Generation","archived":false,"fork":false,"pushed_at":"2025-10-02T10:57:30.000Z","size":524,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-11T14:34:11.513Z","etag":null,"topics":["fastembed","firecrawl","openai","python3","qdrant","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kanugurajesh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-10T16:14:05.000Z","updated_at":"2025-09-19T15:06:49.000Z","dependencies_parsed_at":"2025-09-14T16:29:00.361Z","dependency_job_id":"43232c6d-ffd8-41b2-93ee-b1a625a7abbb","html_url":"https://github.com/kanugurajesh/Assistly","commit_stats":null,"previous_names":["kanugurajesh/assistly"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kanugurajesh/Assistly","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanugurajesh%2FAssistly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanugurajesh%2FAssistly/tags","r
eleases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanugurajesh%2FAssistly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanugurajesh%2FAssistly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kanugurajesh","download_url":"https://codeload.github.com/kanugurajesh/Assistly/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kanugurajesh%2FAssistly/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31870516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastembed","firecrawl","openai","python3","qdrant","streamlit"],"created_at":"2025-09-15T00:52:24.872Z","updated_at":"2026-04-16T04:03:20.761Z","avatar_url":"https://github.com/kanugurajesh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Atlan Customer Support Copilot\n\nAn advanced AI-powered customer support system that automatically classifies tickets and provides intelligent responses using state-of-the-art Retrieval-Augmented Generation (RAG) with hybrid search, query enhancement, and optimized chunking strategies.\n\n## 🌟 Features\n\n### Core Functionality\n- **Bulk Ticket Classification**: 
Automatically classify 30+ sample tickets with AI-powered categorization\n- **Interactive AI Agent**: Real-time chat interface for new ticket submission and response\n- **Conversational Memory**: Context-aware conversations using LangChain ChatMessageHistory with in-memory storage\n- **Smart Classification**: Topic tags, sentiment analysis, and priority assignment\n- **Advanced RAG Responses**: Intelligent answers powered by hybrid search and enhanced retrieval\n- **Source Citations**: All responses include links to relevant documentation\n- **Search Transparency**: Real-time indicators showing search methods used (vector, keyword, or hybrid)\n- **Dynamic Settings Management**: Comprehensive settings page for real-time pipeline configuration\n\n### Advanced RAG Features\n- **Hybrid Search**: Combines vector similarity and BM25 keyword search for optimal relevance\n- **Query Enhancement**: GPT-4o powered query expansion for technical terms (configurable)\n- **Enhanced Chunking**: Code block preservation with intelligent markdown structure awareness\n- **Smart Reranking**: Configurable weighted merging of vector and keyword search results\n- **Quality Metrics**: Chunk quality indicators including code detection and header analysis\n- **Real-time Configuration**: Dynamic settings updates without application restart\n- **Settings Import/Export**: JSON-based configuration backup and sharing\n\n### Classification Schema\n- **Topic Tags**: How-to, Product, Connector, Lineage, API/SDK, SSO, Glossary, Best practices, Sensitive data\n- **Sentiment**: Frustrated, Curious, Angry, Neutral\n- **Priority**: P0 (High), P1 (Medium), P2 (Low)\n\n## 🎯 Major Design Decisions \u0026 Trade-offs\n\n### 1. Multi-Stage Pipeline Architecture\n**Decision**: Separate data pipeline (scraping → storage → vectorization) from deployment application.\n\n**Why**:\n- **Data Persistence**: Web scraping is expensive and rate-limited. 
MongoDB storage allows reprocessing embeddings without re-scraping.\n- **Deployment Flexibility**: App folder contains only deployment dependencies, enabling clean Streamlit Cloud deployment.\n- **Development Efficiency**: Can iterate on AI logic without re-running expensive data collection.\n- **A/B Testing**: Separate collections enable comparison between basic and enhanced RAG implementations.\n\n**Trade-off**: Increased complexity vs. reliability, cost efficiency, and experimentation capability.\n\n### 2. Advanced Technology Stack Choices\n\n#### Hybrid Search: Vector + BM25 vs. Pure Vector Search\n**Decision**: Implement hybrid search combining vector similarity and BM25 keyword search.\n\n**Why**:\n- **Technical Term Precision**: BM25 excels at exact matches for technical terms, APIs, and product names.\n- **Semantic Understanding**: Vector search captures conceptual relationships and context.\n- **Complementary Strengths**: Vector search for \"how to authenticate\" + BM25 for \"SAML SSO\" = comprehensive coverage.\n- **Fallback Strategy**: Graceful degradation to vector-only if BM25 fails.\n\n**Trade-off**: System complexity and processing overhead vs. significantly improved retrieval quality for technical documentation.\n\n#### Query Enhancement: GPT-4o Expansion vs. Direct Search\n**Decision**: Optional GPT-4o query enhancement with configurable toggle.\n\n**Why**:\n- **Technical Term Expansion**: \"SSO\" → \"SAML single sign-on authentication setup\"\n- **Context Enrichment**: \"API rate limits\" → \"REST API rate limiting configuration and best practices\"\n- **Acronym Resolution**: Critical for technical documentation where acronyms are prevalent.\n- **Cost Control**: Configurable feature allows optimization for different use cases.\n\n**Trade-off**: Additional API costs and latency vs. dramatically improved retrieval for technical queries.\n\n#### Enhanced Chunking: Code-Aware vs. 
Simple Character Splitting\n**Decision**: Advanced recursive splitting with code block preservation and quality metrics.\n\n**Why**:\n- **Code Integrity**: Preserves ```code blocks``` as single units to maintain functional examples.\n- **Structure Awareness**: Respects markdown headers, lists, and procedures.\n- **Quality Tracking**: Metadata enables optimization and debugging of retrieval quality.\n- **Context Preservation**: Smart boundaries prevent splitting related instructions.\n\n**Trade-off**: Processing complexity and storage overhead vs. significantly better content quality and retrieval accuracy.\n\n### 3. Feature Toggle Architecture\n**Decision**: Configurable enhancement toggles rather than fixed implementation.\n\n**Why**:\n- **Deployment Flexibility**: Different environments can optimize for cost vs. quality.\n- **Performance Tuning**: Disable expensive features for high-volume scenarios.\n- **Gradual Rollout**: Test advanced features incrementally in production.\n- **User Choice**: Let users balance speed vs. comprehensive results.\n\n**Trade-off**: Configuration complexity vs. deployment flexibility and performance optimization.\n\n### 4. Smart Reranking Strategy\n**Decision**: Configurable weighted fusion of vector and BM25 results with intelligent deduplication.\n\n**Why**:\n- **Flexible Relevance**: Configurable weights allow optimization for different use cases.\n- **Exact Match Boost**: BM25 results receive configurable weight for technical precision.\n- **Deduplication**: Documents found by both methods receive relevance boost.\n- **Empirical Optimization**: Default weights can be tuned based on specific documentation types.\n\n**Trade-off**: Algorithm complexity vs. superior result ranking and relevance.\n\n### 5. 
Dual Collection Strategy\n**Decision**: Separate \"enhanced\" and \"standard\" Qdrant collections for A/B testing.\n\n**Why**:\n- **Performance Comparison**: Direct measurement of advanced features' impact.\n- **Risk Mitigation**: Fallback to standard collection if enhanced features fail.\n- **Feature Validation**: Quantitative assessment of enhancement value.\n- **Gradual Migration**: Safe transition from basic to advanced implementations.\n\n**Trade-off**: Storage overhead and maintenance complexity vs. risk reduction and optimization capability.\n\n### 6. Technology Stack for Advanced RAG\n\n#### MongoDB + Qdrant vs. Single Database\n**Decision**: Dual storage with enhanced Qdrant collections for hybrid search.\n\n**Why**:\n- **Data Integrity**: MongoDB preserves original content for reprocessing and debugging.\n- **Hybrid Performance**: Qdrant's vector capabilities + in-memory BM25 for keyword search.\n- **Collection Management**: Separate enhanced collections for advanced features.\n- **Backup Strategy**: Multiple data preservation layers prevent data loss.\n\n**Trade-off**: Infrastructure complexity vs. performance, flexibility, and data safety.\n\n#### OpenAI GPT-4o vs. Local Models\n**Decision**: OpenAI GPT-4o for classification, response generation, and query enhancement.\n\n**Why**:\n- **Quality**: Superior reasoning for complex ticket classification and technical query expansion.\n- **JSON Reliability**: Consistent structured output for automated processing.\n- **Context Window**: Large context enables conversation memory and comprehensive responses.\n- **Development Speed**: No model training, fine-tuning, or hosting infrastructure needed.\n\n**Trade-off**: Ongoing API costs vs. response quality, development speed, and advanced capabilities.\n\n#### FastEmbed BGE-small + rank-bm25 vs. Single Approach\n**Decision**: Hybrid embedding strategy with local FastEmbed and in-memory BM25.\n\n**Why**:\n- **Cost Efficiency**: Free local embeddings vs. 
OpenAI embedding API costs.\n- **Privacy**: Document content never leaves local environment.\n- **Performance**: 384-dim embeddings balance quality with speed.\n- **Hybrid Capability**: BM25 enables exact term matching for technical precision.\n\n**Trade-off**: Implementation complexity vs. cost savings, privacy, and enhanced search capabilities.\n\n## 🏗️ Architecture\n\n### Complete System Architecture with Component Interactions\n\n```\n                           🌐 USER INTERFACE LAYER\n    ┌─────────────────────────────────────────────────────────────────────────────┐\n    │                    👤 User Browser Session                                  │\n    │  ┌─────────────────────────────────────────────────────────────────────┐    │\n    │  │  📊 Dashboard   💬 Chat Agent   ⚙️ Settings   📈 Analytics Page   │    │\n    │  │  • Bulk Class.  • Real-time Chat • Dynamic Config • Performance     │    │\n    │  │  • 30+ Tickets  • Memory Context  • Import/Export  • Search Stats   │    │\n    │  │  • Statistics   • Source Cites    • Validation    • Usage Metrics   │    │\n    │  └─────────────────────────────────────────────────────────────────────┘    │\n    └─────────────────────────┬───────────────────────────────────────────────────┘\n                             │ HTTP Requests\n                             ▼\n                   🖥️ STREAMLIT APPLICATION LAYER\n    ┌─────────────────────────────────────────────────────────────────────────────┐\n    │                         main.py (Port 8501)                                 │\n    │  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────────┐    │\n    │  │   UI Controls   │   │  Session State  │   │    Event Handlers       │    │\n    │  │ • Input Forms   │   │ • User Session  │   │  • Button Clicks        │    │\n    │  │ • Display Logic │   │ • Memory Store  │   │  • Text Input           │    │\n    │  │ • File Uploads  │   │ • Chat History  │   │  • Page Navigation      │    │\n    │  
└─────────────────┘   └─────────────────┘   └─────────────────────────┘    │\n    └─────────────────────────┬───────────────────────────────────────────────────┘\n                              │ Function Calls\n                              ▼\n                   🧠 AI PROCESSING LAYER (rag_pipeline.py)\n    ┌──────────────────────────────────────────────────────────────────────────┐\n    │                      Advanced RAG Pipeline Engine                        │\n    │  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────────┐ │\n    │  │Classification   │   │  Query Pipeline │   │   Response Generator    │ │\n    │  │ • Topic Tags    │   │ • Enhancement   │   │ • Template Rendering    │ │\n    │  │ • Sentiment     │   │ • Hybrid Search │   │ • Citation Assembly     │ │\n    │  │ • Priority      │   │ • Smart Rerank  │   │ • Context Integration   │ │\n    │  └─────────────────┘   └─────────────────┘   └─────────────────────────┘ │\n    └──────┬──────────────────┬───────────────────────────┬────────────────────┘\n           │                  │                           │\n           ▼                  ▼                           ▼\n        🤖 EXTERNAL AI APIs                🗄️ DATA STORAGE LAYER\n    ┌─────────────────┐     ┌────────────────────────────────────────────────────┐\n    │   OpenAI GPT-4o │     │                 Database Services                  │\n    │ ┌─────────────┐ │     │  ┌─────────────────┐   ┌─────────────────────────┐ │\n    │ │Classification│ │──▶│  │   MongoDB Atlas │   │     Qdrant Cloud        │ │\n    │ │• JSON Output │ │    │  │ ┌─────────────┐ │   │ ┌─────────────────────┐ │ │\n    │ │• Structured │ │     │  │ │Raw Documents│ │   │ │Vector Collections   │ │ │\n    │ └─────────────┘ │     │  │ │• Markdown Text│   │ │• atlan_docs_enhanced│ │ │\n    │ ┌─────────────┐ │     │  │ │• Metadata   │ │   │ │• Embeddings (384d)  │ │ │\n    │ │Query Enhance│ │     │  │ │• Timestamps │ │   │ │• Payloads           │ │ │\n    │ │• Term 
Expand│ │     │  │ └─────────────┘ │   │ └─────────────────────┘ │ │\n    │ │• Tech Terms │ │     │  │ ┌─────────────┐ │   │ ┌─────────────────────┐ │ │\n    │ └─────────────┘ │     │  │ │Backup Files │ │   │ │In-Memory BM25 Index │ │ │\n    │ ┌─────────────┐ │     │  │ │• JSON Dumps │ │   │ │• Keyword Search     │ │ │\n    │ │RAG Response │ │     │  │ │• Recovery   │ │   │ │• TF-IDF Scoring     │ │ │\n    │ │• Contextual │ │     │  │ └─────────────┘ │   │ │• rank-bm25 Library  │ │ │\n    │ │• Cited      │ │     │  └─────────────────┘   │ └─────────────────────┘ │ │\n    │ └─────────────┘ │     └──────────┬─────────────────────┬─────────────── ── ┘\n    └─────────────────┘                │                    │\n           ▲                           │                    │\n           │ HTTPS/REST API            │                    │\n           │                           ▼                    ▼\n                              📝 PIPELINE SCRIPTS LAYER\n    ┌───────────────────────────────────────────────────────────────────────────┐\n    │                        Data Processing Pipeline                           │\n    │  ┌─────────────────┐   ┌────────────────────────────────────────────────┐ │\n    │  │   scrape.py     │   │              qdrant_ingestion.py               │ │\n    │  │ ┌─────────────┐ │   │ ┌─────────────┐   ┌─────────────┐              │ │\n    │  │ │Firecrawl API│ │   │ │Text Chunking│   │FastEmbed BGE│              │ │\n    │  │ │• Rate Limits│ │   │ │• 1200 tokens│   │• Local Gen  │              │ │\n    │  │ │• Content    │ │   │ │• 200 overlap│   │• 384 dims   │              │ │\n    │  │ │  Extraction │ │   │ │• Code Aware │   │• Privacy    │              │ │\n    │  │ └─────────────┘ │   │ └─────────────┘   └─────────────┘              │ │\n    │  │ ┌─────────────┐ │   │ ┌─────────────┐   ┌─────────────┐              │ │\n    │  │ │MongoDB Save │ │   │ │Quality      │   │Qdrant Upload│              │ │\n    │  │ │• Documents  │ │   │ 
│Metrics      │   │• Collections│              │ │\n    │  │ │• Metadata   │ │   │ │• Code Detect│   │• Vectors    │              │ │\n    │  │ │• Backup     │ │   │ │• Headers    │   │• Payloads   │              │ │\n    │  │ └─────────────┘ │   │ └─────────────┘   └─────────────┘              │ │\n    │  └─────────────────┘   └────────────────────────────────────────────────┘ │\n    └───────────────────────────────────────────────────────────────────────────┘\n           ▲                                       ▲\n           │ Manual Execution                      │ Manual Execution\n           │                                       │\n                              🌐 DATA SOURCES\n    ┌────────────────────────────────────────────────────────────────────────────┐\n    │                           External Documentation                           │\n    │  ┌─────────────────────────────────────┐  ┌───────────────────────────────┐│\n    │  │         docs.atlan.com              │  │     developer.atlan.com       ││\n    │  │ • Product Documentation (~1078 pages)│  │ • API Documentation (~611)    ││ \n    │  │ • User Guides                       │  │ • SDK References              ││\n    │  │ • Feature Explanations              │  │ • Code Examples               ││\n    │  │ • Best Practices                    │  │ • Technical Specifications    ││\n    │  └─────────────────────────────────────┘  └───────────────────────────────┘│\n    └────────────────────────────────────────────────────────────────────────────┘\n\n            🔄 KEY INTERACTION FLOWS:\n\n            📥 DATA PIPELINE FLOW:\n            docs.atlan.com → Firecrawl API → scrape.py → MongoDB → qdrant_ingestion.py → Qdrant\n\n            🔍 REAL-TIME SEARCH FLOW:\n            User Query → Query Enhancement (GPT-4o) → Hybrid Search (Vector+BM25) →\n            Smart Reranking → Context Assembly → Response Generation (GPT-4o) → User\n\n            💬 CHAT INTERACTION FLOW:\n            User Input → Streamlit UI → 
rag_pipeline.py → Classification (GPT-4o) →\n            RAG/Routing Decision → Search \u0026 Generate → Display with Citations\n\n            ⚙️ CONFIGURATION FLOW:\n            .env Variables → Feature Toggles → Pipeline Behavior → Performance Optimization\n\n            🔒 ERROR HANDLING \u0026 RECOVERY:\n            • MongoDB Backup Files for Data Recovery\n            • Graceful Degradation: Hybrid → Vector-only → Routing\n            • Rate Limiting with Exponential Backoff\n            • Session State Management for UI Persistence\n```\n\n### System Components\n- **Data Pipeline** (Root scripts): Web scraping → Storage → Vector preparation\n- **Deployment** (App folder): Streamlit application with AI capabilities\n- **AI Services**: OpenAI for classification and response generation\n- **Storage**: MongoDB for documents, Qdrant for vector search\n\n## 🛠️ Tech Stack\n\n### AI/ML\n- **OpenAI GPT-4o**: LLM for classification, response generation, and query enhancement\n- **FastEmbed BAAI/bge-small-en-v1.5**: Vector embeddings for semantic search (384 dimensions)\n- **Qdrant Cloud**: Vector database with hybrid search capabilities\n- **rank-bm25**: BM25 algorithm for keyword search and hybrid retrieval\n- **LangChain**: Enhanced text processing with advanced chunking strategies\n\n### Application\n- **Streamlit**: Interactive web application framework\n- **Python**: Core application logic and AI pipeline\n- **MongoDB**: Document storage for scraped content\n\n### UI/UX\n- **Streamlit Components**: Dashboard and chat interface\n- **Custom CSS**: Styled components and responsive design\n- **Interactive Elements**: Real-time classification and response generation\n\n### Data Sources \u0026 Pipeline\n- **Firecrawl API**: Automated web scraping service for documentation\n- **docs.atlan.com**: Product documentation and user guides\n- **developer.atlan.com**: API and SDK documentation\n- **MongoDB**: Persistent storage for all scraped content with metadata\n- **Qdrant**: 
Vector database for semantic search and RAG retrieval\n\n### Pipeline Stages\n1. **Web Scraping**: Firecrawl crawls documentation sites and extracts content\n2. **Document Storage**: Raw content stored in MongoDB with full metadata\n3. **Vector Processing**: Content chunked and embedded using FastEmbed BGE-small\n4. **RAG Deployment**: Streamlit app queries Qdrant for relevant context\n\n## 📋 Prerequisites\n\n### For Deployment (Streamlit App)\n- Python 3.8+ and pip\n- OpenAI API key\n- Qdrant Cloud instance (vector database)\n- MongoDB Atlas instance (document storage)\n- Firecrawl API key (if running custom scraping)\n\n### Project Structure\n- **Root directory**: Data pipeline scripts (scrape.py, qdrant_ingestion.py)\n- **app/ directory**: Streamlit deployment application with own requirements and .env\n\n## 🚀 Quick Start\n\n### 1. Clone and Setup Environment\n\n```bash\ngit clone https://github.com/kanugurajesh/Assistly\ncd Assistly\n```\n\n**Project Structure Overview:**\n```\ncrawling/\n├── app/                    # Streamlit deployment\n│   ├── main.py            # Main Streamlit application\n│   ├── rag_pipeline.py    # AI pipeline implementation\n│   ├── requirements.txt   # App dependencies\n│   ├── .env.example       # Environment template\n│   └── sample_tickets.json\n├── memory_manager.py      # Conversational memory management\n├── scrape.py              # Firecrawl web scraping\n├── qdrant_ingestion.py    # Vector database ingestion\n├── requirements.txt       # Data pipeline dependencies\n└── README.md\n```\n\nCreate `.env` file in the `app/` directory (copy from `app/.env.example`):\n```env\nOPENAI_API_KEY=your_openai_api_key\nQDRANT_URI=your_qdrant_cloud_endpoint\nQDRANT_API_KEY=your_qdrant_api_key\nMONGODB_URI=your_mongodb_atlas_connection_string\nFIRECRAWL_API_KEY=your_firecrawl_api_key\n```\n\n\u003e **📁 Deployment-Ready Structure**: The `.env` file is located in the `app/` directory to enable standalone deployment. 
Root directory scripts (scrape.py, qdrant_ingestion.py) automatically load environment variables from `app/.env`, ensuring consistent configuration across the entire project while maintaining deployment flexibility for platforms like Streamlit Cloud.\n\n### Environment Variables Reference\n\n**Required for Core Functionality:**\n- `OPENAI_API_KEY`: OpenAI API key for GPT-4o classification, response generation, and query enhancement\n  - Obtain from: https://platform.openai.com/api-keys\n  - Required permissions: GPT-4o model access and sufficient credits\n  - Usage: Classification, RAG responses, and optional query enhancement\n\n- `QDRANT_URI`: Qdrant Cloud vector database endpoint URL\n  - Format: `https://your-cluster-id.europe-west3-0.gcp.cloud.qdrant.io:6333`\n  - Obtain from: Qdrant Cloud dashboard after cluster creation\n  - Usage: Vector similarity search and hybrid search operations\n\n- `QDRANT_API_KEY`: Authentication key for Qdrant Cloud instance\n  - Obtain from: Qdrant Cloud cluster settings → API Keys\n  - Usage: Secure access to vector database operations\n\n- `MONGODB_URI`: MongoDB Atlas connection string\n  - Format: `mongodb+srv://username:password@cluster.mongodb.net/database`\n  - Obtain from: MongoDB Atlas dashboard → Connect → Application\n  - Usage: Document storage, metadata persistence, and backup operations\n\n**Optional (Data Pipeline Only):**\n- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (only needed for custom data ingestion)\n  - Obtain from: https://www.firecrawl.dev/\n  - Usage: Automated web scraping of documentation sites\n  - Note: Pre-processed data is included, so this is optional for basic deployment\n\n### 2. Install Dependencies\n\n**For deployment (Streamlit app):**\n```bash\npip install -r app/requirements.txt\n```\n\n**For data pipeline (if running scraping/ingestion):**\n```bash\npip install -r requirements.txt\n```\n\n### 3. 
Data Pipeline Setup (Optional - for custom data)\n\n**Step 1: Web Scraping with Firecrawl**\n```bash\n# Basic scraping (pre-completed for Atlan docs)\npython scrape.py https://docs.atlan.com --limit 3000\npython scrape.py https://developer.atlan.com --limit 1000\n\n# Custom scraping examples\npython scrape.py https://your-docs.com --limit 500 --collection custom_docs\n```\n\n\u003e **Note**: The limits above (3000 for docs.atlan.com, 1000 for developer.atlan.com) are set higher than the actual page counts (~1078 and ~611 respectively) to ensure complete data scraping during development. You can use lower limits based on your needs - the crawler will stop when all available pages are scraped regardless of the limit setting.\n\n*All scraped content is automatically stored in MongoDB with metadata and backup files.*\n\n**Step 2: Enhanced Vector Database Ingestion**\n```bash\n# Create enhanced collection with advanced chunking\npython qdrant_ingestion.py --qdrant-collection \"atlan_docs_enhanced\" --recreate\n\n# Advanced ingestion with source filtering\npython qdrant_ingestion.py --source-url \"https://docs.atlan.com\" --qdrant-collection \"atlan_docs_enhanced\"\n\n# Incremental updates (recommended for production)\npython qdrant_ingestion.py --qdrant-collection \"atlan_docs_enhanced\"\n```\n*Enhanced chunking preserves code blocks, creates quality metrics, and generates embeddings with FastEmbed BGE-small for hybrid search.*\n\n**Note**: The application comes with pre-processed data, so this step is only needed for custom datasets or updates. 
For advanced configuration options, see the \"Advanced Pipeline Options\" section below.\n\n## 🔧 Advanced Pipeline Options\n\n### Scraping Configuration (scrape.py)\n\n**Basic Command Structure:**\n```bash\npython scrape.py \u003cURL\u003e [OPTIONS]\n```\n\n**Available Options:**\n- `--limit \u003cnumber\u003e`: Maximum pages to crawl (default: 3000)\n- `--collection \u003cname\u003e`: MongoDB collection name (default: atlan_developer_docs)\n\n**Common Scraping Scenarios:**\n```bash\n# Scrape with custom page limit\npython scrape.py https://docs.atlan.com --limit 500\n\n# Scrape to custom MongoDB collection\npython scrape.py https://docs.atlan.com --collection custom_docs\n\n# Scrape developer docs with different limits\npython scrape.py https://developer.atlan.com --limit 200 --collection dev_docs\n```\n\n### Ingestion Configuration (qdrant_ingestion.py)\n\n**Advanced Command Structure:**\n```bash\npython qdrant_ingestion.py [OPTIONS]\n```\n\n**Available Options:**\n- `--source-url \u003curl\u003e`: Filter documents by specific source URL\n- `--collection \u003cname\u003e`: MongoDB collection name (default: atlan_developer_docs)\n- `--qdrant-collection \u003cname\u003e`: Qdrant collection name (default: atlan_docs)\n- `--recreate`: Delete and recreate Qdrant collection (removes existing data)\n- `--no-incremental`: Process all documents (skip duplicate checking)\n\n**Advanced Ingestion Examples:**\n```bash\n# Process only developer documentation\npython qdrant_ingestion.py --source-url \"https://developer.atlan.com\"\n\n# Process only general documentation\npython qdrant_ingestion.py --source-url \"https://docs.atlan.com\"\n\n# Recreate collection (fresh start)\npython qdrant_ingestion.py --recreate\n\n# Process all documents without incremental checking\npython qdrant_ingestion.py --no-incremental\n\n# Process custom collection with filtering\npython qdrant_ingestion.py --collection custom_docs --source-url \"https://example.com\"\n\n# Create custom Qdrant 
collection\npython qdrant_ingestion.py --qdrant-collection \"developer_vectors\"\n\n# Process custom MongoDB collection to custom Qdrant collection\npython qdrant_ingestion.py --collection dev_docs --qdrant-collection \"dev_vectors\"\n\n# Full rebuild with specific source and custom collection\npython qdrant_ingestion.py --recreate --source-url \"https://developer.atlan.com\" --qdrant-collection \"dev_only\"\n```\n\n## 📂 Document Filtering \u0026 Collection Management\n\n### Source URL Filtering Benefits\n\n**Selective Processing:**\n- Update only specific documentation domains\n- Test pipeline with subset of data\n- Separate processing schedules for different sites\n\n**Document Type Classification:**\n- Automatic categorization: `developer.atlan.com` → \"developer\" type\n- All other sources → \"docs\" type\n- Enables filtered search and analytics\n\n**Performance Optimization:**\n- Process only changed documentation\n- Reduce vector database update time\n- Minimize embedding generation costs\n\n**Custom Qdrant Collections:**\n- Separate vector collections for different projects\n- Independent collection lifecycle management\n- Isolated testing and production environments\n- Multiple documentation versions in parallel\n\n### Collection Management Workflows\n\n**Development \u0026 Testing:**\n```bash\n# Create test collection with limited data\npython scrape.py https://docs.atlan.com --limit 50 --collection test_docs\npython qdrant_ingestion.py --collection test_docs --qdrant-collection test_vectors --recreate\n```\n\n**Production Updates:**\n```bash\n# Re-scrape updated content (overwrites existing URLs)\npython scrape.py https://docs.atlan.com --limit 3000\n\n# Incremental ingestion (only new/changed documents)\npython qdrant_ingestion.py\n```\n\n**Multi-Source Management:**\n```bash\n# Separate ingestion for different documentation types\npython qdrant_ingestion.py --source-url \"https://developer.atlan.com\"\npython qdrant_ingestion.py --source-url 
\"https://docs.atlan.com\"\n```\n\n### Incremental Processing\n\n**How It Works:**\n- Checks MongoDB document IDs already in Qdrant\n- Skips processing of existing documents\n- Only processes new or updated content\n\n**When to Use `--no-incremental`:**\n- After modifying chunking parameters\n- When reprocessing is needed due to embedding model changes\n- For debugging or validation purposes\n\n### 4. Run the Application\n\n**Run the Streamlit app:**\n```bash\ncd app\nstreamlit run main.py\n```\n\nThe application will open automatically in your browser at `http://localhost:8501`\n\n### 5. Streamlit Deployment\n\n**Deploy to Streamlit Community Cloud:**\n1. Push your repository to GitHub\n2. Visit [share.streamlit.io](https://share.streamlit.io)\n3. Connect your GitHub repository\n4. Set main file path: `app/main.py`\n5. Add environment variables in Streamlit Cloud settings\n6. Deploy your application\n\n## 📖 Usage Guide\n\n### Dashboard Page\n1. Navigate to \"📊 Dashboard\" in the sidebar\n2. Click \"Load \u0026 Classify All Tickets\"\n3. View AI-generated classifications for all 30+ sample tickets\n4. Analyze summary statistics and topic distributions\n5. Search and examine individual ticket classifications\n\n### Interactive Agent Page\n1. Navigate to \"💬 Chat Agent\" in the sidebar\n2. Enter your question in the chat interface\n3. Toggle \"Show internal analysis\" to view classification details\n4. Get intelligent responses with source citations\n5. Experience context-aware conversations with memory\n6. Use the \"Conversation Management\" sidebar to view memory stats or clear history\n7. Try sample questions or submit your own tickets\n\n### Analytics Page\n1. Navigate to \"📈 Analytics\" in the sidebar\n2. 
**Performance Metrics**: View real-time search performance statistics\n   - Response times for different search methods (vector, hybrid, keyword)\n   - Query enhancement usage and effectiveness metrics\n   - Average retrieval quality scores and user satisfaction\n3. **Usage Analytics**: Monitor system utilization patterns\n   - Daily/weekly query volume trends\n   - Most common topic classifications and routing decisions\n   - Memory usage statistics across active sessions\n4. **Search Method Distribution**: Analyze search strategy effectiveness\n   - Breakdown of vector vs. hybrid vs. keyword search usage\n   - Success rates and fallback patterns for different methods\n   - Quality metrics per search type with comparative analysis\n5. **System Health Monitoring**: Track infrastructure performance\n   - Qdrant collection performance and vector database health\n   - MongoDB connection status and query response times\n   - OpenAI API usage, rate limits, and cost optimization insights\n\n## 💬 Conversational Memory Features\n\n### Context-Aware Conversations\nThe system maintains conversation history to provide context-aware responses:\n- **Follow-up Questions**: Ask related questions without repeating context\n- **Reference Previous Answers**: The AI remembers what it told you earlier\n- **Natural Flow**: Conversations feel more natural and coherent\n\n### Memory Management\n**Conversation Management Sidebar:**\n- **Memory Statistics**: View active sessions and total message count\n- **Current Session Info**: See number of exchanges in current conversation\n- **Clear History**: Manually reset conversation memory when needed\n\n**Automatic Features:**\n- **Session Isolation**: Each browser session has its own conversation memory\n- **Message Limits**: Automatically trims to last 20 messages to prevent token overflow\n- **Auto Expiry**: Sessions expire after 60 minutes of inactivity\n- **Smart Trimming**: Removes oldest messages while preserving conversation 
pairs\n\n### Example Conversation Flow\n```\nUser: \"How do I connect Snowflake to Atlan?\"\nAI: \"To connect Snowflake to Atlan, you need to configure...\" [provides detailed steps]\n\nUser: \"What permissions do I need for this?\"\nAI: \"For the Snowflake connection we discussed, you'll need...\" [remembers previous context]\n\nUser: \"Are there any security considerations?\"\nAI: \"Yes, for your Snowflake-Atlan integration, consider...\" [builds on conversation]\n```\n\n### Technical Implementation\n- **Backend**: LangChain's `InMemoryChatMessageHistory` for pure RAM storage\n- **No Database**: Conversations stored in Python dictionaries (no external dependencies)\n- **Session Management**: UUID-based session identification with Streamlit session state\n- **Context Integration**: Previous conversation included in RAG prompts for better responses\n\n### Memory Manager Implementation (`memory_manager.py`)\nThe `ConversationMemoryManager` class provides advanced conversation memory features:\n\n**Core Features:**\n- **Session Isolation**: Each browser session maintains separate conversation history\n- **Automatic Cleanup**: Configurable auto-cleanup removes expired sessions (default: every 100 operations)\n- **Message Trimming**: Automatically limits conversations to last 20 messages per session\n- **Session Timeout**: Sessions expire after 60 minutes of inactivity\n- **Smart Trimming**: Preserves conversation pairs (human + AI messages) when trimming\n\n**Configuration Options:**\n- `max_messages_per_session`: Maximum messages per conversation (default: 20)\n- `session_timeout_minutes`: Session expiration time (default: 60 minutes)\n- `auto_cleanup_interval`: Operations between automatic cleanup (default: 100)\n\n**Memory Statistics:**\n- Active session count and total message tracking\n- Per-session message counts and last activity timestamps\n- Memory usage optimization with automatic garbage collection\n- Real-time memory health monitoring via the Analytics 
page\n\n### Settings Page\n1. Navigate to \"⚙️ Settings\" in the sidebar\n2. **Collection Management**: Select from available Qdrant collections with real-time discovery\n3. **Collection Information**: View collection points, vector size, and distance metrics\n4. Configure search parameters (TOP_K, score thresholds, configurable hybrid weights)\n5. Adjust model settings (temperature, max tokens, model selection)\n6. Toggle features (hybrid search, query enhancement)\n7. Customize UI preferences (show analysis default)\n8. Apply settings in real-time without restarting the application\n9. Export/import settings configurations as JSON files\n10. View configuration warnings for potentially problematic settings\n11. **Troubleshooting**: Built-in connection diagnostics and collection validation\n\n## 🧠 Advanced AI Pipeline Details\n\n### Enhanced Classification Logic\nThe system analyzes tickets using structured prompts to generate:\n1. **Topic Tags**: Multiple relevant categories with high accuracy\n2. **Sentiment**: Emotional tone analysis for prioritization\n3. **Priority**: Business impact assessment with context awareness\n\n### Advanced RAG Response Logic\n- **RAG Topics**: How-to, Product, Best practices, API/SDK, SSO → Generate answers using hybrid search\n- **Routing Topics**: Connector, Lineage, Glossary, Sensitive data → Route to specialized teams\n- **Query Processing**: Optional GPT-4o enhancement expands technical terms\n- **Search Strategy**: Hybrid vector + keyword search with smart reranking\n- **Response Generation**: Context-aware answers with source attribution\n\n### Advanced RAG Pipeline Components\n\n#### 1. 
Query Enhancement Pipeline (Optional)\n- **Input**: Raw user query (e.g., \"How to setup SSO?\")\n- **Processing**: GPT-4o expands technical terms and acronyms\n- **Output**: Enhanced query (e.g., \"How to configure SAML single sign-on authentication in Atlan?\")\n- **Benefits**: Better retrieval for technical documentation\n- **Toggle**: Configurable via `ENABLE_QUERY_ENHANCEMENT`\n\n#### 2. Hybrid Search System\n- **Vector Search**: Semantic similarity using FastEmbed BGE-small (384 dim)\n- **Keyword Search**: BM25 algorithm for exact term matching\n- **Fusion Strategy**: Configurable weighted combination with smart deduplication\n- **Reranking**: Boosts documents found by both methods\n- **Fallback**: Graceful degradation to vector-only if BM25 fails\n\n#### 3. Enhanced Chunking Strategy\n- **Structure Preservation**: Special handling for code blocks and headers\n- **Smart Separators**: 15+ separator types for optimal boundaries\n- **Quality Metrics**: Tracks code presence, headers, and word count\n- **Metadata Enhancement**: Chunk-level quality indicators\n- **Context Maintenance**: Preserves related content together\n\n### Enhanced Chunking Strategy\n- **Chunk Size**: 1200 tokens with 200 token overlap\n- **Method**: Advanced recursive character splitting with enhanced separators\n- **Code Preservation**: Special handling for ``` code blocks and indented code\n- **Structure Awareness**: Preserves headers, lists, procedures, and markdown formatting\n- **Quality Metrics**: Tracks code blocks, headers, word count, and chunk quality scores\n- **Smart Boundaries**: 15+ separator types for optimal semantic chunking\n\n### Hybrid Search System\n- **Vector Search**: BAAI/bge-small-en-v1.5 (384 dimensions) with cosine similarity\n- **Keyword Search**: BM25 algorithm for exact term matching\n- **Search Fusion**: Configurable weighted combination of vector and keyword results\n- **Smart Reranking**: Deduplication and relevance scoring with boost for multi-method 
matches\n- **Score Threshold**: 0.3 minimum similarity for vector results\n- **Top-K Retrieval**: 5 most relevant chunks from hybrid results\n\n### Conversational Memory System\n- **Memory Backend**: LangChain's `InMemoryChatMessageHistory` for pure RAM storage\n- **Session Management**: Unique session IDs for each browser session with automatic timeout\n- **Context Window**: Last 5 message exchanges included in RAG prompts for continuity\n- **Memory Features**:\n  - Automatic message trimming (max 20 messages per session)\n  - Session cleanup and expiration (60-minute timeout)\n  - Manual conversation clearing via UI\n  - Memory usage statistics and monitoring\n- **No External Dependencies**: Pure in-memory storage without databases\n\n## 🔧 Configuration Options\n\n### Environment Variables (app/.env)\n- `OPENAI_API_KEY`: Required for GPT-4o classification, response generation, and query enhancement\n- `QDRANT_URI`: Qdrant Cloud vector database endpoint for hybrid search\n- `QDRANT_API_KEY`: Authentication for Qdrant Cloud instance\n- `MONGODB_URI`: MongoDB Atlas connection string for document storage\n- `FIRECRAWL_API_KEY`: Firecrawl API key for web scraping (data pipeline only)\n\n### Advanced RAG Configuration (app/rag_pipeline.py)\n- `ENABLE_QUERY_ENHANCEMENT`: Toggle GPT-4o query expansion (default: False)\n- `ENABLE_HYBRID_SEARCH`: Toggle vector + BM25 hybrid search (default: True)\n- `HYBRID_VECTOR_WEIGHT`: Configurable weight for vector search results (default: 1.0)\n- `HYBRID_KEYWORD_WEIGHT`: Configurable weight for BM25 keyword results (default: 0.0)\n- `COLLECTION_NAME`: Qdrant collection name (default: \"atlan_docs_enhanced\")\n- `SCORE_THRESHOLD`: Minimum similarity threshold (default: 0.3)\n- `TOP_K`: Number of search results to retrieve (default: 5)\n- `MAX_TOKENS`: Maximum response length (default: 1000)\n- `TEMPERATURE`: Response creativity level (default: 0.3)\n- `LLM_MODEL`: OpenAI model for responses (default: \"gpt-4o\")\n\n### Dynamic 
Settings Management\n- **Real-time Updates**: All configuration changes apply immediately without restart\n- **Collection Management**: Dynamic Qdrant collection discovery and switching\n- **Settings Validation**: Built-in warnings for potentially problematic configurations\n- **Import/Export**: JSON-based settings backup and sharing capabilities\n- **UI Integration**: Settings page with tabbed interface for different parameter categories\n- **Configuration Persistence**: Settings stored in session state and applied to pipeline\n- **Connection Diagnostics**: Real-time collection validation and troubleshooting\n- **Fallback Handling**: Graceful degradation when settings cause issues\n\n### Data Pipeline Configuration\n- **Scraping Parameters**: Use `--limit` and `--collection` options in scrape.py for custom URLs and crawl limits\n- **Source Filtering**: Use `--source-url` in qdrant_ingestion.py for selective document processing\n- **Collection Management**: Use `--qdrant-collection`, `--recreate` and `--no-incremental` options for collection lifecycle\n- **Custom Collections**: Use `--qdrant-collection` to create separate vector collections for different projects\n- **Chunk Configuration**: Adjust size and overlap in qdrant_ingestion.py (default: 1200 tokens, 200 overlap)\n- **Vector Search**: Modify threshold and top-K in app/rag_pipeline.py (default: 0.3 threshold, 5 chunks)\n\n## 📊 Qdrant Collections \u0026 Chunking Strategies\n\nThis project implements two different Qdrant collections with distinct chunking strategies to optimize for different use cases and document types.\n\n### Collections Overview\n\n| Collection | Chunking Strategy | Branch Availability | Best For |\n|------------|------------------|-------------------|----------|\n| **`atlan_docs`** | Basic Chunking | Development branch | Plain text, fast processing |\n| **`atlan_docs_enhanced`** | Enhanced Chunking | Main \u0026 advanced-rag-enhancements branches | Technical docs with code |\n\n### Basic 
Chunking Strategy (`atlan_docs`)\n\n**Implementation Location**: Available in the development branch of this repository\n\n**Technical Details**:\n- Uses simple `RecursiveCharacterTextSplitter` with basic separators: `[\"\\n\\n\", \"\\n\", \" \", \"\"]`\n- Chunk size: ~1200 characters with 200 character overlap\n- Keeps separators in chunks (`keep_separator=True`)\n- Fast, straightforward processing\n\n**Metadata Structure**:\n```json\n{\n  \"text\": \"chunk content...\",\n  \"source_url\": \"https://docs.atlan.com/...\",\n  \"title\": \"Document Title\",\n  \"doc_type\": \"docs\" | \"developer\",\n  \"chunk_index\": 0,\n  \"total_chunks\": 5\n}\n```\n\n**Characteristics**:\n- ✅ **Pros**: Simple, fast, works well for plain text documents\n- ❌ **Cons**:\n  - Doesn't preserve code blocks (may split ``` blocks)\n  - Markdown structure (headers, lists) may be broken\n  - Chunks may cut through semantic boundaries → lower retrieval quality\n\n### Enhanced Chunking Strategy (`atlan_docs_enhanced`)\n\n**Implementation Location**: Current implementation in main and advanced-rag-enhancements branches\n\n**Technical Details**:\n- **Code Block Preservation**: Uses `preserve_code_blocks()` function to surround code blocks with newlines\n- **Rich Separators**: 15+ separator types for optimal semantic boundaries:\n  ```python\n  separators=[\n      \"\\n\\n\\n\",          # Major section breaks\n      \"\\n\\n\",            # Paragraph breaks\n      \"\\n```\\n\", \"```\\n\", # Code block boundaries\n      \"\\n# \", \"\\n## \", \"\\n### \", \"\\n#### \",  # Headers\n      \"\\n- \", \"\\n* \", \"\\n1. \", \"\\n2. \",      # Lists\n      \"\\n\", \". \", \"? \", \"! 
\", \"; \", \", \",     # Sentences \u0026 punctuation\n      \" \", \"\"             # Words \u0026 characters\n  ]\n  ```\n- **Quality Metrics**: Analyzes chunk content for optimization\n\n**Enhanced Metadata Structure**:\n```json\n{\n  \"text\": \"chunk content...\",\n  \"source_url\": \"https://docs.atlan.com/...\",\n  \"title\": \"Document Title\",\n  \"doc_type\": \"docs\" | \"developer\",\n  \"chunk_index\": 0,\n  \"total_chunks\": 5,\n  \"word_count\": 150,\n  \"has_code\": true,\n  \"has_headers\": true,\n  \"chunk_quality\": \"high\" | \"medium\"\n}\n```\n\n**Characteristics**:\n- ✅ **Pros**:\n  - Preserves semantic meaning better (respects headers, lists, sentences)\n  - Code blocks are chunked as whole units (maintains functional examples)\n  - Quality metadata enables downstream filtering and optimization\n  - Better context preservation for technical documentation\n- ❌ **Cons**:\n  - More complex processing (slower)\n  - Slightly larger metadata footprint\n\n### Key Differences Summary\n\n| Aspect | Basic Chunking | Enhanced Chunking |\n|--------|----------------|-------------------|\n| **Speed** | Fast ⚡ | Moderate ⏱️ |\n| **Code Preservation** | ❌ May split code blocks | ✅ Preserves complete code blocks |\n| **Markdown Awareness** | ❌ Basic line/paragraph splitting | ✅ Respects headers, lists, structure |\n| **Quality Tracking** | ❌ No quality metrics | ✅ Chunk quality indicators |\n| **Use Case** | Raw text ingestion | Developer documentation |\n| **Retrieval Quality** | Good for simple text | Superior for technical content |\n\n### Collection Selection Guidance\n\n**Choose `atlan_docs` (Basic) when**:\n- Processing large volumes of plain text\n- Speed is critical over quality\n- Documents don't contain code examples\n- Simple question-answering scenarios\n\n**Choose `atlan_docs_enhanced` (Enhanced) when**:\n- Processing technical documentation\n- Documents contain code examples and structured content\n- Quality of retrieval is more important than 
speed\n- Need chunk-level quality metrics for optimization\n\n**Switching Collections**:\n1. Navigate to \"⚙️ Settings\" in the sidebar\n2. Use the \"Collection Management\" section\n3. Select from available Qdrant collections\n4. Apply changes in real-time without restart\n\n### Performance Implications\n\n- **Basic Chunking**: ~40% faster processing, smaller storage footprint\n- **Enhanced Chunking**: Higher retrieval accuracy for technical queries, better context preservation\n\nChoose based on your specific use case: speed vs. quality trade-off.\n\n### Common Pipeline Workflows\n\n**Complete Fresh Setup:**\n```bash\n# Scrape new documentation\npython scrape.py https://new-docs.com --limit 500 --collection new_docs\n\n# Create fresh vector database\npython qdrant_ingestion.py --collection new_docs --recreate\n```\n\n**Incremental Updates (Recommended):**\n```bash\n# Re-scrape updated content (overwrites existing URLs)\npython scrape.py https://docs.atlan.com --limit 700\n\n# Incremental ingestion (only new/changed documents)\npython qdrant_ingestion.py\n```\n\n**Domain-Specific Processing:**\n```bash\n# Update only developer documentation vectors\npython qdrant_ingestion.py --source-url \"https://developer.atlan.com\"\n\n# Update only general documentation vectors\npython qdrant_ingestion.py --source-url \"https://docs.atlan.com\"\n```\n\n**Testing and Development:**\n```bash\n# Create test dataset\npython scrape.py https://docs.atlan.com --limit 20 --collection test_data\n\n# Test ingestion pipeline with custom Qdrant collection\npython qdrant_ingestion.py --collection test_data --qdrant-collection test_vectors --recreate\n```\n\n**Multiple Project Management:**\n```bash\n# Project A: Customer documentation\npython scrape.py https://customer-docs.com --collection customer_docs\npython qdrant_ingestion.py --collection customer_docs --qdrant-collection customer_vectors\n\n# Project B: Internal documentation\npython scrape.py https://internal-docs.com --collection 
internal_docs\npython qdrant_ingestion.py --collection internal_docs --qdrant-collection internal_vectors\n```\n\n### Application Customization\n- **Classification Prompts**: Edit prompts in app/rag_pipeline.py for custom categorization\n- **Response Templates**: Modify RAG and routing responses in app/main.py\n- **UI Styling**: Update custom CSS in app/main.py for branding\n\n## 📊 Performance Metrics \u0026 Advanced Features\n\n### Enhanced Data Pipeline Efficiency\n- **Smart Scraping**: Firecrawl with automated content extraction and metadata preservation\n- **Persistent Storage**: MongoDB with backup capabilities and incremental processing\n- **Advanced Vector Ingestion**: Batch processing with enhanced chunking and quality metrics\n- **Hybrid Search Performance**: Combined vector + keyword search with intelligent reranking\n\n### Advanced Response Quality Measures\n- **Multi-Method Retrieval**: Hybrid search combines semantic and keyword matching\n- **Query Enhancement**: GPT-4o expands technical terms for better retrieval (configurable)\n- **Smart Reranking**: Configurable weighted fusion of vector and BM25 results\n- **Source Attribution**: All RAG responses include original documentation URLs\n- **Relevance Scoring**: Vector similarity + BM25 scoring with threshold 0.3\n- **Context Quality**: Top-5 chunks from hybrid results for comprehensive answers\n- **Search Transparency**: Real-time indicators showing search methods used\n\n### Advanced Scalability Features\n- **Feature Toggles**: Configurable query enhancement and hybrid search\n- **Collection Management**: Separate enhanced and standard collections\n- **Incremental Processing**: Skip already processed documents for efficiency\n- **Quality Metrics**: Chunk-level quality indicators (code detection, headers, word count)\n- **Error Resilience**: Graceful fallbacks for all advanced features\n- **Performance Monitoring**: Search method tracking and optimization insights\n\n## 🚨 Troubleshooting\n\n### Data 
Pipeline Issues\n\n**1. Firecrawl Scraping Problems**\n- Verify Firecrawl API key in environment variables\n- Check rate limits and adjust scraping delays in scrape.py\n- Monitor MongoDB connection for storage issues\n\n**2. MongoDB Storage Issues**\n- Validate MongoDB Atlas connection string\n- Check database and collection permissions\n- Verify network access to MongoDB cluster\n\n**3. Vector Database Problems**\n- Verify Qdrant Cloud instance is accessible\n- Check embedding dimensions match (384 for BGE-small)\n- Validate collection exists and has correct configuration\n\n## 🔄 Backup and Recovery Procedures\n\n### MongoDB Data Backup\nThe system includes automatic backup mechanisms for data protection:\n\n**Automatic Backup Features:**\n- **Scraping Backup**: All scraped content is automatically saved as backup files during data collection\n- **Document Persistence**: MongoDB Atlas provides built-in automated backups (snapshots every 24 hours)\n- **Metadata Preservation**: Full document metadata, URLs, and timestamps stored for recovery\n\n**Manual Backup Procedures:**\n```bash\n# Export specific collection to JSON backup\n# Use MongoDB Compass or Atlas export functionality\n# Or use mongodump for command-line backup:\nmongodump --uri \"your_mongodb_uri\" --collection scraped_pages --db Cluster0 --out ./backup/\n\n# Export custom collection with date stamp\nmongodump --uri \"your_mongodb_uri\" --collection custom_docs --db Cluster0 --out ./backup/$(date +%Y%m%d)/\n```\n\n**Recovery Procedures:**\n```bash\n# Restore from MongoDB Atlas snapshot (via Atlas UI)\n# 1. Go to Atlas Dashboard → Clusters → Backup\n# 2. Select snapshot date and restore to new cluster\n# 3. 
Update MONGODB_URI in environment variables\n\n# Restore from local backup (mongodump --out ./backup/ writes to ./backup/Cluster0/)\nmongorestore --uri \"your_mongodb_uri\" --db Cluster0 ./backup/Cluster0/\n```\n\n### Vector Database Recovery\n**Qdrant Collection Backup:**\n- Vector collections can be recreated using `qdrant_ingestion.py --recreate`\n- All embedding data is regenerated from MongoDB source documents\n- Collection metadata and configuration are preserved in code\n\n**Recovery Process:**\n```bash\n# Full vector database recreation from MongoDB\npython qdrant_ingestion.py --recreate --qdrant-collection atlan_docs_enhanced\n\n# Partial recovery for specific sources\npython qdrant_ingestion.py --source-url \"https://docs.atlan.com\" --recreate\n\n# Verify collection health after recovery\npython -c \"\nfrom app.rag_pipeline import RAGPipeline\npipeline = RAGPipeline()\nprint(f'Collection status: {pipeline.qdrant_client.get_collection(\\\"atlan_docs_enhanced\\\")}')\"\n```\n\n### Disaster Recovery Checklist\n1. **Environment Variables**: Ensure `.env` files are backed up securely\n2. **MongoDB**: Verify Atlas automated backups are enabled\n3. **Source Code**: Regular git commits with configuration files\n4. **API Keys**: Secure storage of all service credentials\n5. **Documentation**: Keep setup instructions updated for recovery scenarios\n\n### Application Issues\n\n**4. Streamlit Deployment**\n- Ensure all environment variables are set in app/.env\n- Check that app/requirements.txt includes all dependencies\n- Verify OpenAI API key has sufficient credits\n\n**5. 
Classification Errors**\n- Review prompt templates in app/rag_pipeline.py\n- Check JSON parsing logic for malformed responses\n- Monitor OpenAI API rate limits\n\n### Debugging Tips\n- Check Streamlit logs for detailed error messages\n- Validate environment variable loading in app directory\n- Test individual pipeline components (MongoDB, Qdrant, OpenAI)\n- Monitor API usage and rate limits across all services\n\n## 📝 Development Notes\n\n### Advanced Project Structure Philosophy\n- **Separation of Concerns**: Clean separation between data pipeline (root) and deployment (app/)\n- **Environment Isolation**: Each tier has independent requirements and configuration\n- **Feature Modularity**: Advanced RAG features can be toggled independently\n- **Data Persistence**: MongoDB enables reprocessing and experimentation\n- **Deployment Ready**: Enhanced app/ folder with advanced search capabilities\n\n### Enhanced Architecture Decisions\n1. **Advanced Data Pipeline**: Firecrawl → MongoDB → Enhanced Qdrant → Advanced Streamlit\n2. **Hybrid Search System**: Vector + BM25 keyword search with intelligent fusion\n3. **Query Enhancement**: Optional GPT-4o query expansion for technical terms\n4. **Enhanced Chunking**: Code-aware splitting with quality metrics\n5. **Smart Configuration**: Feature toggles for different deployment scenarios\n6. **Dual Collection Strategy**: Standard vs enhanced collections for comparison\n7. 
**Performance Optimization**: Configurable search weights and thresholds\n\n### Advanced Trade-offs \u0026 Design Decisions\n- **Query Enhancement**: Optional GPT-4o expansion vs direct search (configurable)\n- **Hybrid Search**: Vector + keyword complexity vs pure vector simplicity\n- **Enhanced Chunking**: Structure preservation vs simple character splitting\n- **Feature Toggles**: Flexibility vs configuration complexity\n- **Dual Collections**: Comparison capability vs storage overhead\n- **Search Transparency**: User insight vs UI complexity\n- **Performance vs Features**: Configurable enhancement levels for different use cases\n\n### Production-Ready Enhancements\n- **Collection Management**: Enhanced vs standard collections for A/B testing\n- **Feature Flags**: Runtime configuration of advanced features\n- **Quality Metrics**: Chunk-level quality indicators for optimization\n- **Search Analytics**: Real-time method tracking and performance insights\n- **Graceful Degradation**: Fallbacks ensure system reliability\n\n### Enhanced RAG Implementation (advanced-rag-enhancements branch)\n| Feature | Implementation |\n|---------|----------------|\n| **Search Method** | Hybrid vector + BM25 keyword search |\n| **Query Processing** | Optional GPT-4o query enhancement |\n| **Chunking** | Code-aware splitting with quality metrics |\n| **Results** | Smart reranking with configurable fusion weights |\n| **UI Feedback** | Search method indicators + transparency |\n| **Collection** | `atlan_docs_enhanced` with enhanced metadata |\n| **Configurability** | Feature toggles for all enhancements |\n| **Performance** | Graceful degradation and fallbacks |\n| **Settings Management** | Dynamic configuration with real-time updates |\n| **Configuration** | Import/export, validation, and persistence |\n\n### Key Improvements\n1. **✅ Better Technical Term Handling**: Hybrid search excels at exact matches\n2. **✅ Enhanced Code Examples**: Preserved code blocks in chunking\n3. 
**✅ Query Expansion**: GPT-4o expands acronyms and technical terms\n4. **✅ Search Transparency**: Users see which methods found their answers\n5. **✅ Quality Metrics**: Chunk-level indicators for optimization\n6. **✅ Configurable Features**: Toggle enhancements based on needs\n7. **✅ Dynamic Settings Management**: Real-time configuration without restart\n8. **✅ Settings Import/Export**: JSON-based configuration sharing and backup\n9. **✅ Collection Management**: Real-time Qdrant collection discovery and switching\n10. **✅ Connection Diagnostics**: Built-in troubleshooting for collection issues\n\n## 🛠️ Developer Utilities\n\n### Database Utility Functions (`utils.py`)\nThe project includes utility functions for MongoDB operations in the data pipeline:\n\n**Core Functions:**\n- `get_mongodb_client()`: Creates authenticated MongoDB client using environment variables\n  - Returns: `MongoClient` instance configured with `MONGODB_URI`\n  - Handles connection string validation and error handling\n  - Usage: For direct database operations and debugging\n\n- `get_mongodb_collection(database_name, collection_name)`: Gets MongoDB client, database, and collection\n  - Args:\n    - `database_name`: Target database (default: \"Cluster0\")\n    - `collection_name`: Target collection (default: \"scraped_pages\")\n  - Returns: Tuple of `(client, database, collection)`\n  - Usage: For pipeline scripts that need database access\n\n- `close_mongodb_client(client)`: Safely closes MongoDB connections\n  - Args: `client` - MongoDB client instance to close\n  - Includes error handling for cleanup operations\n  - Usage: Ensures proper resource cleanup in scripts\n\n**Constants:**\n- `DEFAULT_DATABASE = \"Cluster0\"`: Default MongoDB database name\n- `DEFAULT_COLLECTION = \"scraped_pages\"`: Default collection for scraped content\n\n**Example Usage:**\n```python\nfrom utils import get_mongodb_collection, close_mongodb_client\n\n# Get database components\nclient, db, collection = 
get_mongodb_collection(\"Cluster0\", \"custom_docs\")\n\n# Perform operations\ndocuments = collection.find({\"source_url\": {\"$regex\": \"docs.atlan.com\"}})\n\n# Cleanup\nclose_mongodb_client(client)\n```\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Implement changes with tests\n4. Update documentation\n5. Submit pull request\n\n## 📄 License\n\n[MIT License](LICENSE)\n\n## 🆘 Support\n\nFor issues and questions:\n1. Check the troubleshooting section\n2. Review API documentation\n3. Create an issue with detailed reproduction steps\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkanugurajesh%2Fassistly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkanugurajesh%2Fassistly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkanugurajesh%2Fassistly/lists"}