https://github.com/joungminsung/opendocuments

Self-hosted RAG platform for AI document search across GitHub, Notion, Google Drive, local files, and web sources with citations.
https://github.com/joungminsung/opendocuments
ai ai-search document-qa document-search embeddings enterprise-search github google-drive knowledge-base llm mcp notion ollama open-source rag retrieval-augmented-generation self-hosted semantic-search typescript vector-search
Last synced: about 1 month ago
JSON representation
Self-hosted RAG platform for AI document search across GitHub, Notion, Google Drive, local files, and web sources with citations.
Host: GitHub
URL: https://github.com/joungminsung/opendocuments
Owner: joungminsung
License: mit
Created: 2026-03-27T03:23:14.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-05-21T04:56:45.000Z (2 months ago)
Last Synced: 2026-05-21T07:54:45.027Z (2 months ago)
Topics: ai, ai-search, document-qa, document-search, embeddings, enterprise-search, github, google-drive, knowledge-base, llm, mcp, notion, ollama, open-source, rag, retrieval-augmented-generation, self-hosted, semantic-search, typescript, vector-search
Language: TypeScript
Homepage: https://joungminsung.github.io/OpenDocuments/
Size: 2.7 MB
Stars: 70
Watchers: 1
Forks: 13
Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project

README

          


  
OpenDocuments

  Self-hosted RAG platform for AI document search across GitHub, Notion, Google Drive, Confluence, S3, local files, and web sources




  

  

  

  

  

  

  





  English | 한국어





  



---

## What is OpenDocuments?

**OpenDocuments is an open source, self-hosted RAG (Retrieval-Augmented Generation) platform that turns scattered company documents into an AI-searchable knowledge base.** It connects to sources like GitHub, Notion, Google Drive, Confluence, S3, Swagger/OpenAPI, local files, and web pages, indexes them with hybrid vector + keyword search, and answers natural-language questions with cited sources.

Use OpenDocuments when you want:

- A **self-hosted alternative to enterprise AI search** and proprietary knowledge-base search tools

- **AI document search with citations** for engineering docs, product specs, policies, spreadsheets, API docs, and meeting notes

- A **local-first RAG stack** that can run with Ollama so sensitive documents stay on your own infrastructure

- A **knowledge base for AI coding assistants** through MCP, including Claude Code, Cursor, Windsurf, and other MCP clients

- A **TypeScript-first RAG platform** with a CLI, Web UI, HTTP API, SDK, plugin system, and embeddable widget

```bash

npm install -g opendocuments

opendocuments init

opendocuments start

```

Open `http://localhost:3000`, index your documents, and ask questions with source citations.

## Why OpenDocuments?

Your team's knowledge is trapped in silos:

- **Engineering docs** live in GitHub READMEs and Wiki pages

- **Product specs** are scattered across Notion databases

- **Budget reports** sit in Excel files on Google Drive

- **API docs** are auto-generated Swagger specs nobody reads

- **Meeting notes** rot in Confluence spaces

- **Onboarding guides** are buried in `.docx` files on S3

When someone asks _"How does our auth system work?"_ or _"What was the Q3 budget for the AI team?"_, they spend 15 minutes hunting through five different tools. OpenDocuments centralizes that search without forcing all of your content into a hosted vendor.

## How OpenDocuments Answers Questions

OpenDocuments **connects to your document sources**, **parses and chunks each document**, **stores metadata in SQLite and vectors in LanceDB**, then **retrieves, reranks, and generates grounded answers**. Every answer can include source citations, confidence scores, and links back to the underlying documents.

In short: **OpenDocuments is a private AI search engine for your organization's documents.**

## Key Features

| Feature | What it means |

|---------|---------------|

| **Self-hosted RAG** | Run the full document search stack on your own infrastructure |

| **Cited AI answers** | Ask natural-language questions and see exactly which documents support the answer |

| **Hybrid retrieval** | Combine vector search, FTS5 keyword search, reranking, HyDE, multi-query retrieval, and parent-document recall |

| **Broad source coverage** | Index GitHub, Notion, Google Drive, Confluence, S3/GCS, Swagger/OpenAPI, web pages, web search, uploads, and local files |

| **Many file formats** | Parse Markdown, PDF, DOCX, XLSX, CSV, HTML, Jupyter notebooks, email, code, PPTX, JSON, YAML, TOML, and more |

| **Local or cloud models** | Use Ollama locally or cloud providers such as OpenAI, Anthropic, Google, and xAI |

| **MCP server** | Let Claude Code, Cursor, Windsurf, and other MCP clients search your internal knowledge base |

| **Team mode** | Add API keys, roles, rate limits, PII redaction, audit logs, alerts, OAuth SSO, and workspace isolation |

| **Extensible plugins** | Build custom parsers, connectors, model providers, and middleware in TypeScript |

## OpenDocuments vs Alternatives

| If you are comparing... | Choose OpenDocuments when you need... |

|-------------------------|----------------------------------------|

| **OpenDocuments vs hosted enterprise search** | A self-hosted, open source AI search platform with control over infrastructure and data flow |

| **OpenDocuments vs a vector database** | A complete RAG application layer: connectors, parsers, chunking, retrieval, chat, citations, auth, CLI, Web UI, and MCP |

| **OpenDocuments vs a chatbot wrapper** | Source-grounded answers over your real document corpus, not a generic chat UI |

| **OpenDocuments vs building RAG from scratch** | A TypeScript monorepo with batteries included, while still keeping plugin-level extensibility |

| **OpenDocuments vs local-only scripts** | A production-oriented system with team mode, API access, syncable connectors, backups, and admin tooling |

### Recent Improvements

- **RAG accuracy overhaul**: Structure-preserving chunking, contextual prefixes, HyDE + multi-query retrieval, parent-document recall, proposition augmentation, reranking, and adaptive context fitting

- **Workspace-scoped team mode**: Admin/chat/document APIs stay inside the authenticated workspace, with shared conversation links plus session and API-key auth support

- **Backup & restore CLI**: Snapshot SQLite + LanceDB data and recover an instance with one command

- **Plugin hardening**: Plugin search/install routes are admin-only and use validated npm argument execution

- **One-touch Ollama setup**: `init` auto-detects Ollama, offers to pull missing models

- **`.env` auto-loading**: API keys in `.env` are loaded automatically (no manual export needed)

- **Multi-turn conversations**: Chat remembers previous context for follow-up questions

- **Degraded mode warnings**: Clear banners when models aren't configured, with fix instructions

- **Enhanced diagnostics**: `opendocuments doctor` checks Ollama connectivity, model availability, and config validity

- **Security hardening**: FTS5 injection prevention, file upload sanitization, OAuth state limits, workspace isolation

---

## Real-World Use Cases

### For Engineering Teams

> _"How do I authenticate against our internal API?"_

OpenDocuments pulls the answer from your GitHub repo's `docs/auth.md`, links to the relevant Swagger endpoint, and includes a code example from the codebase -- all in one response.

```bash

# Index your repo and API docs

opendocuments index ./docs

opendocuments connector sync github

opendocuments ask "How does JWT token refresh work in our API?"

```

### For Operations & HR Teams

> _"What's the remote work policy for the Tokyo office?"_

OpenDocuments searches across your Confluence HR space, the employee handbook on Google Drive, and the latest policy update email -- even if some documents are in Korean and others in English.

```bash

opendocuments ask "도쿄 오피스 원격 근무 정책이 뭐야?" --profile precise

# Cross-lingual search finds both Korean and English documents

```

### For Product Managers

> _"Compare the feature specs of v2.0 vs v3.0"_

OpenDocuments decomposes the question, searches both versions' specs, and presents a structured comparison table -- citing each source document.

### For AI-Assisted Development (MCP)

Use OpenDocuments as a knowledge base for **Claude Code**, **Cursor**, or any MCP-compatible AI tool:

```json

{

  "mcpServers": {

    "opendocuments": {

      "command": "opendocuments",

      "args": ["start", "--mcp-only"]

    }

  }

}

```

Now your AI coding assistant can search your organization's entire document corpus while writing code.

### For Self-Hosted Knowledge Bases

Deploy on your own infrastructure. Your data **never leaves your network** when using a local LLM via Ollama. No cloud dependency, no vendor lock-in, no subscription fees.

```bash

docker compose --profile with-ollama up -d

# Everything runs locally: LLM, embeddings, vector search, web UI

```

---

## Quick Start

This is the fastest way to run a local AI document search engine with the OpenDocuments CLI.

### 1. Install

```bash

npm install -g opendocuments

```

### 2. Initialize

```bash

opendocuments init

```

The interactive wizard will:

- Detect your hardware (CPU, RAM) and recommend the optimal LLM

- Let you choose between **local** (Ollama) or **cloud** (OpenAI, Claude, Gemini, Grok) models

- **Auto-detect Ollama** and offer to pull missing models automatically

- **Validate cloud API keys** before saving

- Select a plugin preset: `Developer`, `Enterprise`, `All`, or `Custom`

- Generate `opendocuments.config.ts` and `.env` (API keys loaded automatically)

### 3. Start

```bash

opendocuments start

```

Open **http://localhost:3000** -- you'll see a chat UI, document manager, and admin dashboard.

> **First time?** If Ollama isn't running, you'll see a clear **DEGRADED MODE** banner with step-by-step fix instructions. Run `opendocuments doctor` for full diagnostics.

### 4. Index Your Documents

```bash

# Index a local directory (recursively finds all supported files)

opendocuments index ./docs

# Watch mode: auto-reindex when files change

opendocuments index ./docs --watch

# Or drag-and-drop files in the Web UI

```

### 5. Ask Questions

```bash

opendocuments ask "What's our deployment process?"

```

---

## How It Works

OpenDocuments uses a standard RAG architecture with practical production pieces around it: source connectors, format parsers, chunking, embeddings, metadata storage, vector storage, retrieval profiles, answer generation, citations, and security controls.

```

    Your Documents                    OpenDocuments                     You

    ─────────────                    ──────────────                    ───

    GitHub repos ──┐

    Notion pages ──┤                ┌─────────────┐

    Google Drive ──┤  ── Ingest ──► │ Parse        │

    Confluence   ──┤                │ Chunk        │     "How does

    S3 buckets   ──┤                │ Embed        │      auth work?"

    Swagger specs──┤                │ Store        │          │

    Local files  ──┤                └──────┬───────┘          │

    Web pages    ──┘                       │                  ▼

                                    ┌──────┴───────┐  ┌─────────────┐

                                    │  SQLite      │  │ RAG Engine  │

                                    │  (metadata)  │◄─┤ Search      │

                                    │              │  │ Rerank      │

                                    │  LanceDB     │  │ Generate    │

                                    │  (vectors)   │  │ Cite sources│

                                    └──────────────┘  └──────┬──────┘

                                                             │

                                                             ▼

                                                      "Auth uses JWT

                                                       tokens with

                                                       refresh flow.

                                                       [Source: auth.md]"

```

### The RAG Pipeline

1. **Intent Classification** -- Understands whether you're asking about code, concepts, data, or want a comparison

2. **Query Decomposition** -- Breaks complex questions into sub-queries for better retrieval

3. **Cross-Lingual Search** -- Finds documents in both Korean and English regardless of query language

4. **Hybrid Search** -- Combines dense vector search (semantic) with FTS5 sparse search (keyword) via Reciprocal Rank Fusion

5. **Reranking** -- Scores results by keyword overlap and model-based relevance

6. **Confidence Scoring** -- Tells you honestly when it's not sure about an answer

7. **Hallucination Guard** -- Verifies each sentence is grounded in the retrieved sources

8. **3-Tier Caching** -- L1 query cache (5min), L2 embedding cache (24h), L3 web search cache (1h)

---

## Supported File Formats

| Format | Extensions | How It's Parsed |

|--------|-----------|-----------------|

| Markdown | `.md`, `.mdx` | Heading hierarchy, code block separation |

| Plain Text | `.txt` | Direct text indexing |

| PDF | `.pdf` | Page-level extraction, OCR fallback for scanned docs |

| Word | `.docx` | HTML conversion with heading detection |

| Excel / CSV | `.xlsx`, `.xls`, `.csv` | Sheet-aware table chunking (header + rows) |

| HTML | `.html`, `.htm` | Structure-preserving extraction, script/nav stripping |

| Jupyter Notebook | `.ipynb` | Markdown cells + code cells with language detection |

| Email | `.eml` | Header parsing (from/to/subject/date) + body extraction |

| Source Code | `.js`, `.ts`, `.py`, `.java`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, `.kt` + more | Function/class-level chunking with import extraction |

| PowerPoint | `.pptx` | Slide-level text extraction |

| Structured Data | `.json`, `.yaml`, `.yml`, `.toml` | Config and schema indexing |

| Archive | `.zip` | Placeholder (full extraction planned) |

**Fallback Chains**: If a parser fails, the next one tries automatically:

```typescript

parserFallbacks: {

  '.pdf': ['@opendocuments/parser-pdf', '@opendocuments/parser-ocr'],

}

```

---

## Data Sources

| Source | What It Indexes | Auth | How It Syncs |

|--------|----------------|------|-------------|

| **Local Files** | Any supported format on your filesystem | None | File watching (`--watch`) |

| **File Upload** | Drag-and-drop in Web UI | None | Instant |

| **GitHub** | README, Wiki, code files, Issues | Personal Access Token | Polling / webhook |

| **Notion** | Pages, databases, all block types | Integration Token | Polling |

| **Google Drive** | Docs, Sheets, Slides, uploaded files | OAuth / Service Account | Polling |

| **Amazon S3 / Google Cloud Storage** | Any supported format in buckets | AWS / GCP credentials | Polling |

| **Confluence** | Wiki pages across spaces | API Token + Email | Polling |

| **Swagger / OpenAPI** | API endpoints with parameters and schemas | None (public specs) | Manual |

| **Web Crawler** | Any URL you register | Optional (cookies/headers) | Periodic |

| **Web Search (Tavily)** | Real-time web results merged into answers | Tavily API Key | Query-time |

---

## Model Providers

### Cloud Providers

| Provider | Models | Embedding | Best For |

|----------|--------|-----------|----------|

| **OpenAI** | GPT-5.4, GPT-5.4-mini, GPT-4.1, o3, o4-mini | text-embedding-3-small/large | General purpose, vision, reasoning |

| **Anthropic** | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 | -- (use separate provider) | Long context (1M), coding, analysis |

| **Google** | Gemini 3.1 Pro, Gemini 3.1 Flash Lite, Gemini 3.0 Deep Think | text-embedding-005 | Multimodal, multilingual |

| **xAI** | Grok 4, Grok 4 Heavy, Grok 4.1 Fast | Grok embedding | Real-time knowledge, code |

| **DeepSeek** | DeepSeek-V3.2, DeepSeek-R1, DeepSeek-V4 (upcoming) | -- (use separate provider) | Cost-efficient reasoning, 164K context |

| **Mistral** | Mistral Small 4 (MoE), Large 2.1, Codestral, Pixtral | mistral-embed (1024) | European data residency, coding, vision |

| **OpenAI-compatible** | Any OpenAI-compatible endpoint | Depends on endpoint | vLLM, LM Studio, Together, Fireworks, Groq, DeepInfra, SiliconFlow, OpenRouter |

### Local Models (via Ollama)

| Model | Active Params | Total Params | Vision | Korean | Best For |

|-------|-------------|-------------|--------|--------|----------|

| **Qwen 3.5 27B** | 27B (dense) | 27B | Yes | Excellent | General purpose (32GB+ RAM) |

| **Qwen 3.5 9B** | 9B (dense) | 9B | Yes | Excellent | Mid-range (16GB RAM) |

| **Qwen 3.5-122B-A10B** | 10B (MoE) | 122B | Yes | Excellent | High quality, efficient |

| **Llama 4 Scout** | 17B (MoE) | 109B | Yes | Good | 10M context window |

| **Llama 4 Maverick** | 17B (MoE) | 400B | Yes | Good | Top open-source quality |

| **DeepSeek V3.2** | 37B (MoE) | 671B | No | Good | Coding, reasoning |

| **Gemma 4** | 27B / 12B / 4B / 1B | dense | Yes | Good | Latest Google open model, 128K context, 140+ languages |

| **Gemma 3 27B** | 27B | 27B | Yes | Good | Lightweight, 140+ languages |

| **Gemma 3 4B** | 4B | 4B | Yes | Good | Low-spec machines (8GB RAM) |

| **K-EXAONE** | 23B (MoE) | 236B | No | Best | Korean-specialized |

| **EXAONE Deep 32B** | 32B | 32B | No | Best | Korean reasoning |

| **Phi-4 Reasoning Vision** | 15B | 15B | Yes | Fair | Compact multimodal |

### Embedding Models

| Model | Dimensions | Korean | Multimodal | Where |

|-------|-----------|--------|-----------|-------|

| **BGE-M3** | 1024 | Excellent | No | Ollama (default) |

| **text-embedding-3-large** | 3072 | Good | No | OpenAI |

| **text-embedding-005** | 768 | Good | No | Google |

| **nomic-embed-text** | 768 | Fair | No | Ollama (lightweight) |

### Auto-Recommendation

`opendocuments init` detects your hardware and recommends the best model:

| Your Hardware | Recommended Model | Recommended Embedding |

|--------------|-------------------|----------------------|

| 32GB+ RAM, GPU | Qwen 3.5 27B or Llama 4 Scout | BGE-M3 |

| 16GB RAM | Qwen 3.5 9B | BGE-M3 |

| 8GB RAM | Gemma 3 4B | nomic-embed-text |

| Any (cloud) | Claude Sonnet 4.6 or GPT-5.4-mini | text-embedding-3-large |

---

## Three Ways to Use

### 1. Web UI

Full-featured dashboard at `http://localhost:3000`:

| Page | What You Can Do |

|------|-----------------|

| **Chat** | Ask questions with streaming answers, source citations, confidence scores, feedback buttons. Switch between fast/balanced/precise profiles. |

| **Documents** | Browse indexed documents, drag-and-drop upload, view document details, soft-delete with trash/restore. |

| **Connectors** | See connector sync status and last sync times. |

| **Plugins** | View installed plugins with health indicators. |

| **Settings** | Toggle dark/light theme, change RAG profile, view server version. |

| **Admin** | Stats dashboard, search quality metrics, paginated query logs, plugin health, connector status, audit logs. |

**Keyboard shortcuts**: `Cmd+K` opens the Command Palette. `Cmd+1-5` navigates between pages.

### 2. CLI

17 commands for power users and automation:

```bash

# Ask questions

opendocuments ask "What's the deploy process?"

opendocuments ask                              # Interactive REPL mode

opendocuments search "auth middleware" --top 10 # Vector search, no LLM

# Manage documents

opendocuments index ./docs --watch    # Index + auto-reindex on changes

opendocuments document list           # See all indexed docs

opendocuments document delete     # Soft-delete

# Manage connectors

opendocuments connector sync          # Sync all connectors

opendocuments connector status        # Check sync status

# Pipe support for scripting

cat README.md | opendocuments ask "Summarize this" --stdin

opendocuments ask "List endpoints" --json | jq '.sources[].sourcePath'

# Administration

opendocuments doctor                  # Health check (per-provider API ping)

opendocuments auth create-key --name "ci-bot" --role member

opendocuments export --output ./backup

# Model management

opendocuments model list --suggestions          # Show installed + curated models

opendocuments model install-ollama              # One-shot Ollama install (macOS/Linux)

opendocuments model pull gemma3:27b bge-m3      # Batch pull with disk-space check

opendocuments model set-key deepseek            # Prompt + save API key to .env

opendocuments model test                        # Round-trip test against configured LLM

opendocuments model switch                      # Change provider without editing config

```

### 3. MCP Server

19 tools for AI-assisted workflows. Works with Claude Code, Cursor, Windsurf, and any MCP client.

```bash

opendocuments start --mcp-only

```

Your AI assistant can then:

- Search your organization's documents while coding

- Index new files as they're created

- Check document status and connector health

- Query configuration

---

## RAG Profiles

| | `fast` | `balanced` | `precise` |

|--|--------|------------|-----------|

| **Speed** | ~1s | ~3s | ~5s+ |

| **Search depth** | 10 docs | 20 docs | 50 docs |

| **Semantic chunking** | On | On | On |

| **Reranking** | Off | On | On |

| **Cross-encoder** | Off | Off | On |

| **Cross-lingual** | Off | Korean + English | Korean + English |

| **Contextual prefix** | Off | On | On |

| **Multi-query expansion** | Off | 3x paraphrases | 5x paraphrases |

| **HyDE** | Off | Off | On |

| **Parent-document retrieval** | Off | On | On |

| **Chunk augmentation** (propositions/HQs) | Off | Off | On |

| **Query decomposition** | Off | Off | Splits complex queries |

| **Web search** | Off | Fallback when local results are weak | Always merged |

| **Hallucination guard** | Off | Checks source grounding | Strict mode (annotates unverified) |

| **Best for** | Quick lookups, 8B local models | Daily use, 14B+ models | Critical questions, cloud LLMs |

Switch anytime: CLI flag (`--profile precise`), Web UI toggle, or config file.

### Retrieval quality

OpenDocuments ships a redesigned RAG pipeline with structure-preserving chunking, contextual retrieval, HyDE + multi-query + parent-document retrieval, proposition augmentation, and a cross-encoder reranker — all profile-gated via the table above. See [`packages/core/CHANGELOG.md`](packages/core/CHANGELOG.md) for the full list of additions.

Benchmark against your own dataset with the evaluation harness:

```bash

cd packages/core && npx tsx tests/_fixtures/run-eval.ts

```

Metrics reported: hit@3, hit@5, MRR, nDCG@5 — per-intent and aggregate.

---

## Security

### Personal Mode (default)

Zero configuration. No auth. Localhost only. Just works.

### Team Mode

```typescript

// opendocuments.config.ts

export default defineConfig({ mode: 'team' })

```

| Feature | How It Works |

|---------|-------------|

| **API Keys** | `od_live_` prefix, SHA-256 hashed, never stored in plaintext. Scoped to specific operations, with optional expiration. |

| **Roles** | `admin` (everything), `member` (read + write), `viewer` (read only) |

| **Rate Limiting** | 60 req/min default, per-key override. In-memory with lazy cleanup. |

| **PII Redaction** | Automatically masks emails, phone numbers, credit cards, IPs before sending to cloud LLMs. Configurable patterns and methods (replace/hash/remove). |

| **Audit Log** | Records auth events, document access, config changes. Queryable via admin API. |

| **Security Alerts** | Detects brute-force attempts, unusual data exports, API key abuse. |

| **OAuth SSO** | Google and GitHub login with HttpOnly cookie sessions. |

| **Workspace Isolation** | Every vector search is enforced with `workspace_id` filter. Documents, conversations, and API keys are scoped to workspaces. |

---

## Configuration

```typescript

// opendocuments.config.ts

import { defineConfig } from 'opendocuments-core'

export default defineConfig({

  workspace: 'my-team',

  mode: 'personal',

  model: {

    provider: 'ollama',

    llm: 'qwen3.5:27b',

    embedding: 'bge-m3',

  },

  rag: { profile: 'balanced' },

  connectors: [

    { type: 'github', repo: 'org/repo', token: process.env.GITHUB_TOKEN },

    { type: 'notion', token: process.env.NOTION_TOKEN },

    { type: 'web-crawler', urls: ['https://docs.example.com'] },

  ],

  plugins: ['@opendocuments/parser-pdf', '@opendocuments/parser-docx'],

  security: {

    dataPolicy: {

      autoRedact: { enabled: true, patterns: ['email', 'phone', 'credit-card'] },

    },

    audit: { enabled: true },

  },

  storage: { db: 'sqlite', vectorDb: 'lancedb', dataDir: '~/.opendocuments' },

})

```

---

## Docker Deployment

```bash

# Basic (cloud LLM)

docker compose up -d

# With local LLM (Ollama)

docker compose --profile with-ollama up -d

# With .env file for API keys

docker compose --env-file .env up -d

```

The Docker image includes all packages and plugins. Data persists in a named volume. Mount your config:

```bash

docker run -v ./opendocuments.config.ts:/app/opendocuments.config.ts \

  -v opendocuments-data:/data -p 3000:3000 opendocuments

```

---

## Plugin Development

Create custom parsers, connectors, or model providers:

```bash

opendocuments plugin create my-parser --type parser

cd my-parser

npm install

npm run test

npm run dev       # Watch mode

opendocuments plugin publish  # Publish to npm

```

Four plugin types: `parser`, `connector`, `model`, `middleware`. Each has a typed interface with lifecycle hooks (`setup`, `teardown`, `healthCheck`, `metrics`).

Community plugins follow the naming convention: `opendocuments-plugin-*`

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full plugin development guide.

---

## TypeScript SDK

```typescript

import { OpenDocumentsClient } from '@opendocuments/client'

const client = new OpenDocumentsClient({

  baseUrl: 'http://localhost:3000',

  apiKey: 'od_live_...',

})

const result = await client.ask('How does auth work?')

console.log(result.answer)    // "Auth uses JWT tokens with..."

console.log(result.sources)   // [{ sourcePath: 'docs/auth.md', score: 0.92 }]

console.log(result.confidence) // { level: 'high', score: 0.87 }

```

---

## Embeddable Widget

Add a chat widget to your internal tools:

```html

  OpenDocuments.widget({

    server: 'http://localhost:3000',

    apiKey: 'od_live_...',

    workspace: 'public-docs',

  })

```

---

## Development

```bash

git clone https://github.com/joungminsung/OpenDocuments.git

cd OpenDocuments

npm run setup    # Install + build (one command)

npm run test     # 51 test suites, ~300 tests

npm run dev      # Watch mode

```

### Architecture

| Package | Role | Tests |

|---------|------|-------|

| `@opendocuments/core` | Plugin system, RAG engine, ingest pipeline, storage, auth, security | 159 |

| `@opendocuments/server` | HTTP API (Hono), MCP server, auth middleware, widget | 27 |

| `@opendocuments/cli` | 17 CLI commands (Commander.js) | 3 |

| `@opendocuments/web` | React SPA with 7 pages (Vite + Tailwind) | -- |

| `@opendocuments/client` | TypeScript SDK | 3 |

| 8 model plugins | Ollama, OpenAI, Anthropic, Google, Grok, DeepSeek, Mistral, OpenAI-compatible | 41 |

| 9 parser plugins | PDF, DOCX, XLSX, HTML, Jupyter, Email, Code, PPTX, Structured | 37 |

| 8 connector plugins | GitHub, Notion, GDrive, S3, Confluence, Swagger, WebCrawler, WebSearch | 38 |

See [CONTRIBUTING.md](CONTRIBUTING.md) for conventions, test patterns, and plugin development guide.

---

## Documentation

| Guide | Description |

|-------|-------------|

| [Quick Start](#quick-start) | Install and run in 5 minutes |

| [Architecture](docs/architecture.md) | Package structure, data flow, design decisions |

| [Plugin API: Parsers](docs-site/plugins/parser-api.md) | Create custom document parsers |

| [Plugin API: Connectors](docs-site/plugins/connector-api.md) | Connect external data sources |

| [Plugin API: Models](docs-site/plugins/model-api.md) | Add custom AI providers |

| [TypeScript SDK](docs-site/sdk/guide.md) | Programmatic API client |

| [Security Policy](SECURITY.md) | Vulnerability reporting |

| [Contributing](CONTRIBUTING.md) | Development setup, conventions, plugin guide |

---

## Frequently Asked Questions

### What is OpenDocuments used for?

OpenDocuments is used to build a private AI search engine over company documents. Teams use it to ask questions across GitHub repositories, Notion pages, Google Drive files, Confluence spaces, S3 buckets, API specs, local files, and web pages, then receive answers with citations.

### Is OpenDocuments open source?

Yes. OpenDocuments is open source and released under the [MIT License](LICENSE).

### Is OpenDocuments self-hosted?

Yes. OpenDocuments is designed for self-hosted deployment. You can run it locally during development, deploy it with Docker, or host it on your own infrastructure.

### Can OpenDocuments run without sending data to a cloud LLM?

Yes. When configured with Ollama and local embedding models, OpenDocuments can run the LLM, embeddings, vector search, metadata database, Web UI, CLI, and MCP server on your own infrastructure.

### What data sources does OpenDocuments support?

OpenDocuments supports local files, file uploads, GitHub, Notion, Google Drive, Amazon S3, Google Cloud Storage, Confluence, Swagger/OpenAPI specs, registered web pages, and Tavily-backed web search.

### What file formats can OpenDocuments index?

OpenDocuments can index Markdown, plain text, PDF, DOCX, XLSX, CSV, HTML, Jupyter notebooks, email, source code, PPTX, JSON, YAML, TOML, and other supported plugin formats.

### Does OpenDocuments work with Claude Code or Cursor?

Yes. OpenDocuments includes an MCP server, so MCP-compatible AI tools such as Claude Code, Cursor, Windsurf, and similar clients can search your indexed document corpus while assisting with development.

### What makes OpenDocuments different from a vector database?

A vector database stores embeddings. OpenDocuments provides the surrounding RAG platform: connectors, parsers, document chunking, hybrid retrieval, reranking, answer generation, citations, Web UI, CLI, HTTP API, SDK, MCP server, authentication, and plugins.

### What makes OpenDocuments different from hosted enterprise search?

OpenDocuments is open source and self-hosted. It is built for teams that want AI document search, source citations, plugin extensibility, and control over where their documents, embeddings, metadata, and model calls run.

---

## License

[MIT](LICENSE)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/joungminsung/opendocuments

Awesome Lists containing this project

README

OpenDocuments