{"id":32579528,"url":"https://github.com/ryan-m-bishop/docrag","last_synced_at":"2026-05-03T10:38:46.778Z","repository":{"id":321255982,"uuid":"1085130961","full_name":"ryan-m-bishop/docrag","owner":"ryan-m-bishop","description":"AI-powered documentation RAG system with MCP server for Claude Code. Search and retrieve technical documentation on-demand with vector embeddings and smart web scraping.","archived":false,"fork":false,"pushed_at":"2026-01-13T15:18:37.000Z","size":125,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-03T10:38:23.882Z","etag":null,"topics":["ai","anthropic","claude-code","cli","documentation","embeddings","lancedb","llm","mcp","mcp-server","python","rag","vector-database"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ryan-m-bishop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":"bishopgroupholdings","tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2025-10-28T16:15:19.000Z","updated_at":"2026-02-05T02:04:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ryan-m-bishop/docrag","commit_stats":null,"previous_names":["ryan-m-bishop/docrag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ryan-m-bishop/docrag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryan-m-bishop%2Fdocrag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryan-m-bishop%2Fdocrag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryan-m-bishop%2Fdocrag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryan-m-bishop%2Fdocrag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ryan-m-bishop","download_url":"https://codeload.github.com/ryan-m-bishop/docrag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ryan-m-bishop%2Fdocrag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32566444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","anthropic","claude-code","cli","documentation","embeddings","lancedb","llm","mcp","mcp-server","python","rag","vector-database"],"created_at":"2025-10-29T15:27:10.991Z","updated_at":"2026-05-03T10:38:46.772Z","avatar_url":"https://github.com/ryan-m-bishop.png","language":"Python","funding_links":["https://ko-fi.com/bishopgroupholdings"],"categories":[],"sub_categories":[],"readme":"# DocRAG - AI Documentation RAG System\n\nA lightweight, installable Python package that provides RAG (Retrieval Augmented Generation) access to technical documentation through an MCP (Model Context Protocol) server. This enables LLMs to search and retrieve relevant documentation on-demand.\n\n## Features\n\n- 🚀 Single pip-installable package with CLI and MCP server\n- 📚 Project-based documentation collections (BrightSign, Venafi, Qumu, web frameworks)\n- 🔍 Local vector database with efficient embedding using LanceDB\n- 📥 Easy documentation ingestion from local files or scraped sources\n- 🤖 Designed for use with Claude Code via MCP\n\n## Installation\n\n### Prerequisites\n\n- Python 3.10+\n- pipx (recommended) or pip\n- git (for updates)\n\n### Recommended: Install globally with pipx\n\n```bash\n# Install globally with pipx in editable mode (keeps dependencies isolated)\npipx install -e /opt/claude-ops/doc-rag\n\n# Verify installation\ndocrag --help\n\n# Optional: Install Playwright browsers (for scraping)\npipx runpip docrag install playwright\npipx run --spec docrag playwright install chromium\n```\n\n**Note:** The `-e` flag installs in \"editable\" mode, which means changes to the source code are immediately reflected without reinstalling.\n\n### Alternative: Install from source (development)\n\n```bash\n# Clone or navigate to the project directory\ncd /opt/claude-ops/doc-rag\n\n# Create and activate virtual environment\npython3 -m venv venv\nsource venv/bin/activate\n\n# Install in development mode\npip install -e \".[dev]\"\n\n# Install Playwright browsers (for scraping)\nplaywright install chromium\n```\n\n## Updating DocRAG\n\n### Option 1: Using the Update Script (Recommended)\n\n```bash\ncd /opt/claude-ops/doc-rag\n./update.sh\n```\n\nThis script will:\n- Pull latest changes from git\n- Detect your installation method (pipx or pip)\n- Reinstall only if necessary (non-editable installs)\n- Handle editable installs automatically\n\n### Option 2: Using Make\n\n```bash\ncd /opt/claude-ops/doc-rag\nmake update\n```\n\n### Option 3: Manual Update\n\nFor **editable installs** (installed with `-e`):\n```bash\ncd /opt/claude-ops/doc-rag\ngit pull origin main\n# No reinstall needed - changes are already active!\n```\n\nFor **regular installs** (installed without `-e`):\n```bash\ncd /opt/claude-ops/doc-rag\ngit pull origin main\npipx uninstall docrag \u0026\u0026 pipx install -e .\n# or for pip: pip install -e . --force-reinstall\n```\n\n### Verifying Updates\n\n```bash\n# Check git status\ncd /opt/claude-ops/doc-rag\ngit log -1 --oneline\n\n# Test the installation\ndocrag --version\ndocrag --help\n```\n\n## Quick Start\n\n### 1. Initialize DocRAG\n\n```bash\ndocrag init\n```\n\nThis creates the configuration directory at `~/.docrag/` with the following structure:\n\n```\n~/.docrag/\n├── config.json           # Global configuration\n├── collections/          # Documentation collections\n└── vectordb/            # LanceDB storage\n```\n\n### 2. Add a Documentation Collection\n\n```bash\n# Add documentation from a local directory\ndocrag add brightsign --source /path/to/brightsign/docs --description \"BrightSign player documentation\"\n\n# Or add without source initially\ndocrag add venafi --description \"Venafi TPP API documentation\"\n```\n\n### 3. List Collections\n\n```bash\ndocrag list\n```\n\n### 4. Search Documentation (CLI Testing)\n\n```bash\n# Search across all active collections\ndocrag search \"how to initialize the player\"\n\n# Search a specific collection\ndocrag search \"authentication methods\" --collection venafi --limit 10\n```\n\n### 5. Start the MCP Server\n\n```bash\ndocrag serve\n```\n\nThe server will listen on stdio for connections from Claude Code.\n\n## CLI Commands\n\n### `docrag init`\nInitialize DocRAG configuration directory.\n\n### `docrag add \u003cname\u003e`\nAdd a new documentation collection.\n\nOptions:\n- `-s, --source PATH` - Source directory containing documentation\n- `-d, --description TEXT` - Description of the collection\n\nExample:\n```bash\ndocrag add qumu --source ~/docs/qumu --description \"Qumu video platform docs\"\n```\n\n### `docrag list`\nList all documentation collections with their status.\n\n### `docrag update \u003cname\u003e \u003csource\u003e`\nUpdate an existing collection with new documents.\n\nExample:\n```bash\ndocrag update brightsign ~/docs/brightsign/updated\n```\n\n### `docrag remove \u003cname\u003e`\nRemove a documentation collection (with confirmation).\n\n### `docrag search \u003cquery\u003e`\nSearch documentation from the CLI for testing.\n\nOptions:\n- `-c, --collection TEXT` - Specific collection to search\n- `-l, --limit INTEGER` - Number of results (default: 5)\n\nExample:\n```bash\ndocrag search \"websocket connection\" --collection brightsign\n```\n\n### `docrag serve`\nStart the MCP server for Claude Code integration.\n\n### `docrag scrape \u003curl\u003e`\nScrape documentation from websites.\n\nOptions:\n- `-o, --output PATH` - Output directory (required)\n- `--smart, --use-crawl4ai` - Use AI-powered Crawl4AI scraper (recommended)\n- `--no-llm` - Disable LLM extraction (faster, still better than basic)\n- `--llm-provider TEXT` - LLM provider (default: openai/gpt-4o-mini)\n- `--playwright` - Use Playwright for dynamic content (basic scraper)\n- `--max-pages INTEGER` - Maximum pages to scrape (default: 1000)\n\nExamples:\n```bash\n# Basic scraping\ndocrag scrape https://docs.example.com --output ./docs\n\n# Smart scraping with AI (recommended)\ndocrag scrape https://docs.example.com --output ./docs --smart\n\n# Smart scraping without LLM (faster, no API key needed)\ndocrag scrape https://docs.example.com --output ./docs --smart --no-llm\n\n# Limit pages\ndocrag scrape https://docs.example.com --output ./docs --max-pages 100\n```\n\n**Smart Scraping Features:**\n- ✨ AI-powered content extraction\n- 🎯 Automatically removes navigation and boilerplate\n- 📊 Better handling of complex layouts\n- 🧠 Semantic understanding of documentation structure\n- ⚡ Faster and more accurate than basic scraping\n\n**To enable smart scraping:**\n```bash\n# Install Crawl4AI\npipx inject docrag crawl4ai\n\n# Optional: Set OpenAI API key for LLM-powered extraction\nexport OPENAI_API_KEY='your-key-here'\n```\n\n## Using with Claude Code\n\n### 1. Configure Claude Code MCP Settings\n\nAdd DocRAG to your Claude Code MCP configuration (`~/.config/claude-code/mcp_settings.json` or similar):\n\n```json\n{\n  \"mcpServers\": {\n    \"docrag\": {\n      \"command\": \"docrag\",\n      \"args\": [\"serve\"],\n      \"env\": {}\n    }\n  }\n}\n```\n\nIf using the full path:\n```json\n{\n  \"mcpServers\": {\n    \"docrag\": {\n      \"command\": \"/home/claude-admin/.local/bin/docrag\",\n      \"args\": [\"serve\"],\n      \"env\": {}\n    }\n  }\n}\n```\n\n### 2. Restart Claude Code\n\nAfter adding the configuration, restart Claude Code to load the MCP server.\n\n### 3. Use in Claude Code\n\nOnce connected, Claude Code can use two tools:\n\n**search_docs**: Search through indexed documentation collections\n```\nQuery: \"how to handle authentication in BrightSign\"\nCollection: (optional) \"brightsign\"\nLimit: (optional) 5\n```\n\n**list_collections**: List all available documentation collections\n\nClaude will automatically use these tools when working on projects that need documentation access.\n\n## Architecture\n\n### Core Components\n\n1. **ConfigManager** (`config.py`) - Manages configuration and collection metadata\n2. **EmbeddingGenerator** (`embeddings.py`) - Generates embeddings using sentence-transformers\n3. **VectorDB** (`vectordb.py`) - LanceDB wrapper for vector storage and search\n4. **DocumentIndexer** (`indexer.py`) - Intelligent document chunking and indexing\n5. **DocRAGServer** (`server.py`) - MCP server implementation\n6. **CLI** (`cli.py`) - Command-line interface\n\n### Technical Stack\n\n- **MCP Framework**: Official Anthropic MCP package\n- **Vector Database**: LanceDB (lightweight, file-based, performant)\n- **Embeddings**: sentence-transformers with all-MiniLM-L6-v2 model (384 dims, fast, local)\n- **Text Processing**: langchain-text-splitters for intelligent chunking\n- **CLI**: Click for user-friendly commands\n- **Web Scraping**: Playwright + BeautifulSoup4 for scraping\n\n## Data Structure\n\n```\n~/.docrag/\n├── config.json                 # Global configuration\n│   └── {\n│         \"active_collections\": [\"brightsign\", \"venafi\"],\n│         \"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\",\n│         \"chunk_size\": 512,\n│         \"chunk_overlap\": 50\n│       }\n├── collections/\n│   ├── brightsign/\n│   │   ├── metadata.json       # Collection metadata\n│   │   └── source_docs/        # Original documents\n│   ├── venafi/\n│   └── qumu/\n└── vectordb/\n    └── lancedb/                # Vector storage (one table per collection)\n```\n\n## Configuration\n\nGlobal configuration is stored in `~/.docrag/config.json`:\n\n```json\n{\n  \"active_collections\": [\"brightsign\", \"venafi\"],\n  \"embedding_model\": \"sentence-transformers/all-MiniLM-L6-v2\",\n  \"chunk_size\": 512,\n  \"chunk_overlap\": 50\n}\n```\n\nCollection metadata is stored in `~/.docrag/collections/\u003cname\u003e/metadata.json`:\n\n```json\n{\n  \"name\": \"brightsign\",\n  \"source_type\": \"local\",\n  \"source_path\": \"/path/to/docs\",\n  \"created_at\": \"2025-10-28T10:00:00\",\n  \"updated_at\": \"2025-10-28T10:00:00\",\n  \"doc_count\": 150,\n  \"description\": \"BrightSign player documentation\"\n}\n```\n\n## Development\n\n### Project Structure\n\n```\ndocrag/\n├── docrag/\n│   ├── __init__.py\n│   ├── cli.py              # CLI commands\n│   ├── server.py           # MCP server\n│   ├── indexer.py          # Document indexing\n│   ├── vectordb.py         # Vector database\n│   ├── embeddings.py       # Embeddings\n│   ├── config.py           # Configuration\n│   └── scrapers/           # Web scrapers\n│       ├── __init__.py\n│       ├── base.py\n│       └── generic.py\n├── tests/\n├── pyproject.toml\n├── README.md\n└── DOCRAG_MVP_BUILD_GUIDE.md\n```\n\n### Running Tests\n\n```bash\n# Install dev dependencies\npip install -e \".[dev]\"\n\n# Run tests\npytest\n```\n\n### Code Formatting\n\n```bash\n# Format with black\nblack docrag/\n\n# Lint with ruff\nruff check docrag/\n```\n\n## Troubleshooting\n\n### \"DocRAG not initialized\"\nRun `docrag init` first to create the configuration directory.\n\n### \"No collections found\"\nAdd a collection with `docrag add \u003cname\u003e --source \u003cpath\u003e`.\n\n### \"Model download fails\"\nThe first time you run DocRAG, it will download the sentence-transformers model (~100MB). Ensure you have internet connectivity.\n\n### \"Playwright not installed\"\nIf using scrapers, run `playwright install chromium`.\n\n## Future Enhancements\n\n- [ ] Web scraper CLI commands\n- [ ] Support for more file types (PDF, HTML, RST)\n- [ ] Incremental indexing (only index changed files)\n- [ ] Collection activation/deactivation\n- [ ] Collection statistics and health checks\n- [ ] Export/import collections\n- [ ] Cloud sync for collections\n- [ ] Advanced search filters\n\n## License\n\nMIT\n\n## Author\n\nRyan - Built for homelab and Claude Code integration\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryan-m-bishop%2Fdocrag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fryan-m-bishop%2Fdocrag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fryan-m-bishop%2Fdocrag/lists"}