An open API service indexing awesome lists of open source software.

https://github.com/ryan-m-bishop/docrag

AI-powered documentation RAG system with MCP server for Claude Code. Search and retrieve technical documentation on-demand with vector embeddings and smart web scraping.
https://github.com/ryan-m-bishop/docrag

ai anthropic claude-code cli documentation embeddings lancedb llm mcp mcp-server python rag vector-database

Last synced: about 2 months ago
JSON representation

AI-powered documentation RAG system with MCP server for Claude Code. Search and retrieve technical documentation on-demand with vector embeddings and smart web scraping.

Awesome Lists containing this project

README

          

# DocRAG - AI Documentation RAG System

A lightweight, installable Python package that provides RAG (Retrieval Augmented Generation) access to technical documentation through an MCP (Model Context Protocol) server. This enables LLMs to search and retrieve relevant documentation on-demand.

## Features

- 🚀 Single pip-installable package with CLI and MCP server
- 📚 Project-based documentation collections (BrightSign, Venafi, Qumu, web frameworks)
- 🔍 Local vector database with efficient embedding using LanceDB
- 📥 Easy documentation ingestion from local files or scraped sources
- 🤖 Designed for use with Claude Code via MCP

## Installation

### Prerequisites

- Python 3.10+
- pipx (recommended) or pip
- git (for updates)

### Recommended: Install globally with pipx

```bash
# Install globally with pipx in editable mode (keeps dependencies isolated)
pipx install -e /opt/claude-ops/doc-rag

# Verify installation
docrag --help

# Optional: Install Playwright browsers (for scraping)
pipx runpip docrag install playwright
pipx run --spec docrag playwright install chromium
```

**Note:** The `-e` flag installs in "editable" mode, which means changes to the source code are immediately reflected without reinstalling.

### Alternative: Install from source (development)

```bash
# Clone or navigate to the project directory
cd /opt/claude-ops/doc-rag

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Install Playwright browsers (for scraping)
playwright install chromium
```

## Updating DocRAG

### Option 1: Using the Update Script (Recommended)

```bash
cd /opt/claude-ops/doc-rag
./update.sh
```

This script will:
- Pull latest changes from git
- Detect your installation method (pipx or pip)
- Reinstall only if necessary (non-editable installs)
- Handle editable installs automatically

### Option 2: Using Make

```bash
cd /opt/claude-ops/doc-rag
make update
```

### Option 3: Manual Update

For **editable installs** (installed with `-e`):
```bash
cd /opt/claude-ops/doc-rag
git pull origin main
# No reinstall needed - changes are already active!
```

For **regular installs** (installed without `-e`):
```bash
cd /opt/claude-ops/doc-rag
git pull origin main
pipx uninstall docrag && pipx install -e .
# or for pip: pip install -e . --force-reinstall
```

### Verifying Updates

```bash
# Check git status
cd /opt/claude-ops/doc-rag
git log -1 --oneline

# Test the installation
docrag --version
docrag --help
```

## Quick Start

### 1. Initialize DocRAG

```bash
docrag init
```

This creates the configuration directory at `~/.docrag/` with the following structure:

```
~/.docrag/
├── config.json # Global configuration
├── collections/ # Documentation collections
└── vectordb/ # LanceDB storage
```

### 2. Add a Documentation Collection

```bash
# Add documentation from a local directory
docrag add brightsign --source /path/to/brightsign/docs --description "BrightSign player documentation"

# Or add without source initially
docrag add venafi --description "Venafi TPP API documentation"
```

### 3. List Collections

```bash
docrag list
```

### 4. Search Documentation (CLI Testing)

```bash
# Search across all active collections
docrag search "how to initialize the player"

# Search a specific collection
docrag search "authentication methods" --collection venafi --limit 10
```

### 5. Start the MCP Server

```bash
docrag serve
```

The server will listen on stdio for connections from Claude Code.

## CLI Commands

### `docrag init`
Initialize DocRAG configuration directory.

### `docrag add `
Add a new documentation collection.

Options:
- `-s, --source PATH` - Source directory containing documentation
- `-d, --description TEXT` - Description of the collection

Example:
```bash
docrag add qumu --source ~/docs/qumu --description "Qumu video platform docs"
```

### `docrag list`
List all documentation collections with their status.

### `docrag update `
Update an existing collection with new documents.

Example:
```bash
docrag update brightsign ~/docs/brightsign/updated
```

### `docrag remove `
Remove a documentation collection (with confirmation).

### `docrag search `
Search documentation from the CLI for testing.

Options:
- `-c, --collection TEXT` - Specific collection to search
- `-l, --limit INTEGER` - Number of results (default: 5)

Example:
```bash
docrag search "websocket connection" --collection brightsign
```

### `docrag serve`
Start the MCP server for Claude Code integration.

### `docrag scrape `
Scrape documentation from websites.

Options:
- `-o, --output PATH` - Output directory (required)
- `--smart, --use-crawl4ai` - Use AI-powered Crawl4AI scraper (recommended)
- `--no-llm` - Disable LLM extraction (faster, still better than basic)
- `--llm-provider TEXT` - LLM provider (default: openai/gpt-4o-mini)
- `--playwright` - Use Playwright for dynamic content (basic scraper)
- `--max-pages INTEGER` - Maximum pages to scrape (default: 1000)

Examples:
```bash
# Basic scraping
docrag scrape https://docs.example.com --output ./docs

# Smart scraping with AI (recommended)
docrag scrape https://docs.example.com --output ./docs --smart

# Smart scraping without LLM (faster, no API key needed)
docrag scrape https://docs.example.com --output ./docs --smart --no-llm

# Limit pages
docrag scrape https://docs.example.com --output ./docs --max-pages 100
```

**Smart Scraping Features:**
- ✨ AI-powered content extraction
- 🎯 Automatically removes navigation and boilerplate
- 📊 Better handling of complex layouts
- 🧠 Semantic understanding of documentation structure
- ⚡ Faster and more accurate than basic scraping

**To enable smart scraping:**
```bash
# Install Crawl4AI
pipx inject docrag crawl4ai

# Optional: Set OpenAI API key for LLM-powered extraction
export OPENAI_API_KEY='your-key-here'
```

## Using with Claude Code

### 1. Configure Claude Code MCP Settings

Add DocRAG to your Claude Code MCP configuration (`~/.config/claude-code/mcp_settings.json` or similar):

```json
{
"mcpServers": {
"docrag": {
"command": "docrag",
"args": ["serve"],
"env": {}
}
}
}
```

If using the full path:
```json
{
"mcpServers": {
"docrag": {
"command": "/home/claude-admin/.local/bin/docrag",
"args": ["serve"],
"env": {}
}
}
}
```

### 2. Restart Claude Code

After adding the configuration, restart Claude Code to load the MCP server.

### 3. Use in Claude Code

Once connected, Claude Code can use two tools:

**search_docs**: Search through indexed documentation collections
```
Query: "how to handle authentication in BrightSign"
Collection: (optional) "brightsign"
Limit: (optional) 5
```

**list_collections**: List all available documentation collections

Claude will automatically use these tools when working on projects that need documentation access.

## Architecture

### Core Components

1. **ConfigManager** (`config.py`) - Manages configuration and collection metadata
2. **EmbeddingGenerator** (`embeddings.py`) - Generates embeddings using sentence-transformers
3. **VectorDB** (`vectordb.py`) - LanceDB wrapper for vector storage and search
4. **DocumentIndexer** (`indexer.py`) - Intelligent document chunking and indexing
5. **DocRAGServer** (`server.py`) - MCP server implementation
6. **CLI** (`cli.py`) - Command-line interface

### Technical Stack

- **MCP Framework**: Official Anthropic MCP package
- **Vector Database**: LanceDB (lightweight, file-based, performant)
- **Embeddings**: sentence-transformers with all-MiniLM-L6-v2 model (384 dims, fast, local)
- **Text Processing**: langchain-text-splitters for intelligent chunking
- **CLI**: Click for user-friendly commands
- **Web Scraping**: Playwright + BeautifulSoup4 for scraping

## Data Structure

```
~/.docrag/
├── config.json # Global configuration
│ └── {
│ "active_collections": ["brightsign", "venafi"],
│ "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
│ "chunk_size": 512,
│ "chunk_overlap": 50
│ }
├── collections/
│ ├── brightsign/
│ │ ├── metadata.json # Collection metadata
│ │ └── source_docs/ # Original documents
│ ├── venafi/
│ └── qumu/
└── vectordb/
└── lancedb/ # Vector storage (one table per collection)
```

## Configuration

Global configuration is stored in `~/.docrag/config.json`:

```json
{
"active_collections": ["brightsign", "venafi"],
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"chunk_size": 512,
"chunk_overlap": 50
}
```

Collection metadata is stored in `~/.docrag/collections//metadata.json`:

```json
{
"name": "brightsign",
"source_type": "local",
"source_path": "/path/to/docs",
"created_at": "2025-10-28T10:00:00",
"updated_at": "2025-10-28T10:00:00",
"doc_count": 150,
"description": "BrightSign player documentation"
}
```

## Development

### Project Structure

```
docrag/
├── docrag/
│ ├── __init__.py
│ ├── cli.py # CLI commands
│ ├── server.py # MCP server
│ ├── indexer.py # Document indexing
│ ├── vectordb.py # Vector database
│ ├── embeddings.py # Embeddings
│ ├── config.py # Configuration
│ └── scrapers/ # Web scrapers
│ ├── __init__.py
│ ├── base.py
│ └── generic.py
├── tests/
├── pyproject.toml
├── README.md
└── DOCRAG_MVP_BUILD_GUIDE.md
```

### Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest
```

### Code Formatting

```bash
# Format with black
black docrag/

# Lint with ruff
ruff check docrag/
```

## Troubleshooting

### "DocRAG not initialized"
Run `docrag init` first to create the configuration directory.

### "No collections found"
Add a collection with `docrag add --source `.

### "Model download fails"
The first time you run DocRAG, it will download the sentence-transformers model (~100MB). Ensure you have internet connectivity.

### "Playwright not installed"
If using scrapers, run `playwright install chromium`.

## Future Enhancements

- [ ] Web scraper CLI commands
- [ ] Support for more file types (PDF, HTML, RST)
- [ ] Incremental indexing (only index changed files)
- [ ] Collection activation/deactivation
- [ ] Collection statistics and health checks
- [ ] Export/import collections
- [ ] Cloud sync for collections
- [ ] Advanced search filters

## License

MIT

## Author

Ryan - Built for homelab and Claude Code integration