{"id":22083372,"url":"https://github.com/aplbrain/bossdb-rag-chatbot","last_synced_at":"2025-06-11T02:37:39.371Z","repository":{"id":264874453,"uuid":"890359184","full_name":"aplbrain/bossdb-rag-chatbot","owner":"aplbrain","description":"A Language Interface to the BossDB Ecosystem","archived":false,"fork":false,"pushed_at":"2025-01-13T14:34:52.000Z","size":267,"stargazers_count":1,"open_issues_count":4,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-29T04:44:39.848Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aplbrain.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-18T12:43:07.000Z","updated_at":"2025-01-13T14:29:04.000Z","dependencies_parsed_at":"2024-11-26T15:48:42.156Z","dependency_job_id":"980cec0e-6ca9-4d04-9774-c190a06c9fe6","html_url":"https://github.com/aplbrain/bossdb-rag-chatbot","commit_stats":null,"previous_names":["aplbrain/bossdb-rag-chatbot"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aplbrain%2Fbossdb-rag-chatbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aplbrain%2Fbossdb-rag-chatbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aplbrain%2Fbossdb-rag-chatbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aplbrain%2Fbossdb-rag-chatbot/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/aplbrain","download_url":"https://codeload.github.com/aplbrain/bossdb-rag-chatbot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245170570,"owners_count":20572093,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-01T00:12:50.296Z","updated_at":"2025-03-23T21:24:43.653Z","avatar_url":"https://github.com/aplbrain.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BossDB RAG System: A Language Interface to the BossDB Ecosystem\n\nA Retrieval-Augmented Generation (RAG) system specifically designed for BossDB documentation and data queries. 
This system combines the power of large language models with contextual document retrieval (embedding models) to provide accurate, source-backed answers about BossDB, its tools, and related neuroscience data.\n\n## 🌟 Features\n\n- **Intelligent Query Processing**: Combines vector search with LLM-based response generation\n- **Conversation Memory**: Maintains context across multiple queries with one of two memory modes:\n  - Summary-based memory using Claude 3 Haiku for efficient compression\n  - Window-based memory for maintaining recent conversation history\n- **Source Attribution**: Every response includes references to the source documents used\n- **Multi-Source Knowledge Base**: Automatically ingests and indexes content from:\n  - Documentation websites\n  - GitHub repositories\n  - API specifications\n  - Academic papers\n  - Jupyter notebooks\n- **Token Management**: Intelligent handling of token limits for both queries and conversation history\n- **Conversation Tracking**: Database-backed system for tracking user interactions and chat sessions\n\n## 🔧 Installation\n\n1. Clone the repository:\n\n```bash\ngit clone https://github.com/aplbrain/bossdb-rag.git\ncd bossdb-rag\n```\n\n2. Create and activate a virtual environment:\n\n   - Requires Python 3.10 or above\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n3. Install dependencies:\n\n   - On Ubuntu, you may first need to run `sudo apt-get install build-essential python3-dev` and `pip install wheel`.\n\n```bash\npip install -r requirements.txt\n```\n\n4. Set up environment variables:\n\n```bash\nexport AWS_ACCESS_KEY_ID=\"your_aws_access_key\"\nexport AWS_SECRET_ACCESS_KEY=\"your_aws_secret_key\"\nexport AWS_REGION=\"your_aws_region\"\nexport GITHUB_TOKEN=\"your_github_token\"\n```\n\n## ⚙️ Configuration\n\nAll system configuration is managed through `config.yaml` in the project root. 
The configuration is organized into three main sections:\n\n### LLM Configuration (`llm_config`)\n\n```yaml\nllm_config:\n  default_llm: \"anthropic.claude-3-5-sonnet-20240620-v1:0\"  # Main response generation model\n  fast_llm: \"anthropic.claude-3-haiku-20240307-v1:0\"        # Memory summarization model\n  embed_model: \"cohere.embed-english-v3\"                     # Text embedding model\n  \n  # AWS Bedrock credentials (loaded from environment variables)\n  aws_region: \"OS_ENV_AWS_REGION\"\n  aws_access_key_id: \"OS_ENV_AWS_ACCESS_KEY_ID\"\n  aws_secret_access_key: \"OS_ENV_AWS_SECRET_ACCESS_KEY\"\n  \n  # GitHub authentication\n  github_token: \"OS_ENV_GITHUB_TOKEN\"\n```\n\n### Usage Limits (`limits`)\n\n```yaml\nlimits:\n  max_questions: 1000          # Maximum questions per session\n  max_words: 100000           # Maximum total words per session\n  max_message_tokens: 4096    # Maximum tokens per message\n  max_total_tokens: 8192      # Maximum conversation history tokens\n```\n\n### Data Sources (`sources`)\n\n```yaml\nsources:\n  urls:                      # List of documentation sources\n    - \"https://docs.url\"\n    - \"https://api.url\"\n  github_orgs:              # GitHub organizations to index\n    - \"org1\"\n    - \"org2\"\n```\n\n### Environment Variables\n\nValues prefixed with `OS_ENV_` in the config file are loaded from environment variables. Required variables:\n\n- `AWS_REGION`\n- `AWS_ACCESS_KEY_ID`\n- `AWS_SECRET_ACCESS_KEY`\n- `GITHUB_TOKEN`\n\n### Updating Configuration\n\n1. Modify `config.yaml` as needed\n2. Ensure required environment variables are set\n3. 
Restart the application to apply changes\n\n### Storage\n\nThe system maintains its vector store in:\n\n- Vector store files (`./storage/`)\n\nThis is automatically created on first run.\n\n## 🚀 Usage\n\n### Starting the Application\n\nRun the application using Chainlit:\n\n```bash\nchainlit run main.py\n```\n\nThis will start the web interface on `http://localhost:8000`.\n\n## 💾 Data Storage\n\nThe system uses MongoDB for storing:\n\n- User information (profiles, usage metrics)\n- Chat threads (conversation sessions)\n- Message history (individual messages within threads)\n- Usage statistics\n\n### MongoDB Setup\n\n1. Start MongoDB (using Docker):\n```bash\ndocker run -d \\\n    --name mongodb \\\n    -p 27017:27017 \\\n    -e MONGO_INITDB_ROOT_USERNAME=admin \\\n    -e MONGO_INITDB_ROOT_PASSWORD=password123 \\\n    mongodb/mongodb-community-server:latest\n```\n\n2. The system will automatically:\n   - Create required collections\n   - Set up indexes for efficient querying\n   - Handle connection management\n\n### Collection Structure\n\n- **users**: Stores user profiles and usage statistics\n  ```json\n  {\n    \"user_identifier\": \"ip_sessionid\",\n    \"question_count\": 0,\n    \"word_count\": 0,\n    \"created_at\": \"timestamp\",\n    \"last_activity\": \"timestamp\"\n  }\n  ```\n\n- **chat_threads**: Tracks conversation sessions\n  ```json\n  {\n    \"user_id\": \"ObjectId\",\n    \"start_time\": \"timestamp\",\n    \"end_time\": \"timestamp\"\n  }\n  ```\n\n- **messages**: Stores individual messages\n  ```json\n  {\n    \"chat_thread_id\": \"ObjectId\",\n    \"content\": \"message text\",\n    \"is_user\": true/false,\n    \"timestamp\": \"timestamp\"\n  }\n  ```\n\n### Viewing Database Contents\n\nThe included `view_database.py` script provides a way to inspect the MongoDB collections:\n\n1. **Usage**:\n   ```bash\n   python view_database.py\n   ```\n\n2. 
**Configuration**:\n   - Default connection: `mongodb://admin:password123@localhost:27017`\n   - Default database: `bossdb_rag`\n   - To modify, update the connection string in the script or use environment variables\n\n3. **Features**:\n   - Displays all collections in the database\n   - Shows the first 5 documents from each collection\n   - Formats complex data types for readability\n   - Displays total document count per collection\n   - Handles nested objects and timestamps\n\n4. **Example Output**:\n   ```\n   Database: bossdb_rag\n   \n   Collection: users (Total documents: 25)\n   +------------------+----------------+-------------+-------------------------+-------------------------+\n   | _id             | user_identifier | word_count  | created_at             | last_activity          |\n   +------------------+----------------+-------------+-------------------------+-------------------------+\n   | ...             | 127.0.0.1_abc  | 150         | 2024-03-15T10:30:00Z   | 2024-03-15T11:45:00Z   |\n   [Additional rows...]\n\n   Collection: chat_threads (Total documents: 30)\n   [Thread information...]\n\n   Collection: messages (Total documents: 145)\n   [Message information...]\n   ```\n\n### Storage Cleanup\n\nTo reset the database and storage:\n\n1. Drop MongoDB collections:\n   ```bash\n   mongosh \"mongodb://admin:password123@localhost:27017/bossdb_rag\" --eval \"db.dropDatabase()\"\n   ```\n\n2. Remove vector store files:\n   ```bash\n   rm -rf ./storage/*\n   ```\n\n## 🔄 RAG Pipeline\n\n1. **Document Processing**\n   - Documents are loaded from configured sources\n   - Content is split into chunks using appropriate splitters\n   - Chunks are embedded and stored in a vector index\n\n2. 
**Query Processing**\n   - User query is processed\n   - Relevant documents are retrieved using vector similarity\n   - Context is combined with conversation history\n   - Response is generated using Claude 3.5 Sonnet\n   - Sources are processed and attached to response\n\n3. **Memory Management**\n   - Conversation history is maintained using either:\n     - Summary-based memory (using Claude 3 Haiku)\n     - Window-based memory (keeping recent messages)\n   - Token limits are enforced for both individual messages and total context\n\n## 🛠️ Development\n\n### Pipeline Structure\n\n![an image of the full pipeline](assets/pipeline.png)\n\n### Adding New Data Sources\n\nNew data sources are added through the `config.yaml` file. No code changes are required; just update the configuration.\n\n1. Open `config.yaml` in the project root directory\n2. Navigate to the `sources` section:\n\n```yaml\nsources:\n  # Add new URLs to the urls list\n  urls:\n    - \"https://existing-url.com\"\n    - \"https://your-new-documentation-url.com\"     # Add new URLs here\n    - \"https://github.com/org/repo\"                # GitHub repositories are supported\n    - \"https://api-endpoint.com/openapi.json\"      # API specifications\n    \n  # Add new GitHub organizations to github_orgs list\n  github_orgs:\n    - \"existing-org\"\n    - \"your-new-org\"                              # Add new organizations here\n```\n\nThe system supports various types of sources:\n\n- Documentation websites\n- GitHub repositories (including wikis, specific files, or entire repos)\n- API specifications (OpenAPI/Swagger)\n- Academic papers\n- Jupyter notebooks\n- Markdown files\n- JSON endpoints\n\nWhen adding new sources, ensure they:\n\n- Are publicly accessible (or accessible with the provided GitHub token)\n- Contain relevant BossDB-related content\n- Have stable URLs that won't frequently change\n\nAfter updating the configuration:\n\n1. Shut down the application (if running)\n2. 
Delete the old `storage` folder\n3. Start up the application\n4. The system will automatically process and index the new sources\n5. New content will be immediately available for queries\n\n\n#### Incremental Updates\n\nThe system supports incremental updates to the knowledge base:\n\n1. **Document Processing**\n   - New documents are processed and added to the vector store\n   - Existing document hashes are compared to detect changes\n   - Only modified content is reprocessed\n\n2. **Storage Management**\n   - Vector store files in `./storage/` contain embeddings and index\n\n3. **Update Process**\n   ```python\n   # Example: Incremental update\n   app = BossDBRAGApplication(\n       urls=[\"https://docs.example.com\"],\n       orgs=[\"example-org\"],\n       incremental=True,\n       force_reload=False,\n   )\n   await app.setup()\n   ```\n\n### Customizing Document Processing\n\nModify the `Splitter` class to add new document types:\n\n```python\ndef split(self, document: Document) -\u003e List[Document]:\n    file_extension = self._get_file_extension(document)\n    if file_extension == \".new_type\":\n        return self.custom_splitter.get_nodes_from_documents([document])\n```\n\n### Stress Testing\n\nThe stress testing script (`stress_test.py`) can simulate multiple concurrent users interacting with the chatbot. 
The script uses Playwright for browser automation and provides detailed metrics about system performance.\n\n```bash\npython stress_test.py http://localhost:8000 --sessions 5 --questions \"What is BossDB?\" \"How do I download data?\"\n```\n\nFeatures:\n- Simulates multiple concurrent chat sessions\n- Measures response times, success rates, and throughput\n- Tracks concurrent session statistics\n- Generates detailed CSV reports with timing analysis\n- Provides comprehensive test results including:\n  - Total requests and success rate\n  - Maximum concurrent sessions\n  - Response time percentiles\n  - Requests per second\n  - Session-level timing analysis\n\nResults are saved to:\n- `stress_test_results_[timestamp].csv`: Detailed metrics for each request\n- `session_timings_[timestamp].csv`: Session-level timing analysis\n- `stress_test.log`: Detailed test execution log\n\n## 📊 Monitoring\n\nThis system includes comprehensive logging:\n\n- Application logs in console and `bossdb_rag.log`\n- Database viewer script `view_database.py`\n\n## ⚠️ Limitations\n\n- Requires valid AWS credentials for Bedrock access\n- GitHub token required for repository access\n- Protections against single user spamming\n- Multiple users are supported through Chainlit, but this support is limited\n- Requires some level of local compute for calculating distances for vector search\n\n## 📄 License\n\n[Apache-2.0 license](LICENSE.txt)\n\n---\n\n\u003cp align=center\u003e\u003cb\u003eMade with 💙 at \u003ca href=\"https://jhuapl.edu\"\u003e\u003cimg alt=\"JHU APL\" src=\"https://user-images.githubusercontent.com/693511/116814564-9b268800-ab27-11eb-98bb-dfddb2e405a1.png\" height=\"23px\" 
/\u003e\u003c/a\u003e\u003c/b\u003e\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faplbrain%2Fbossdb-rag-chatbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faplbrain%2Fbossdb-rag-chatbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faplbrain%2Fbossdb-rag-chatbot/lists"}