{"id":50734111,"url":"https://github.com/joungminsung/opendocuments","last_synced_at":"2026-06-10T12:00:55.739Z","repository":{"id":347227303,"uuid":"1193264621","full_name":"joungminsung/OpenDocuments","owner":"joungminsung","description":"Self-hosted RAG platform for AI document search across GitHub, Notion, Google Drive, local files, and web sources with citations.","archived":false,"fork":false,"pushed_at":"2026-05-21T04:56:45.000Z","size":2833,"stargazers_count":70,"open_issues_count":8,"forks_count":13,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-21T07:54:45.027Z","etag":null,"topics":["ai","ai-search","document-qa","document-search","embeddings","enterprise-search","github","google-drive","knowledge-base","llm","mcp","notion","ollama","open-source","rag","retrieval-augmented-generation","self-hosted","semantic-search","typescript","vector-search"],"latest_commit_sha":null,"homepage":"https://joungminsung.github.io/OpenDocuments/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joungminsung.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"joungminsung"}},"created_at":"2026-03-27T03:23:14.000Z","updated_at":"2026-05-21T04:56:50.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/joungminsung/OpenDocuments","commit_stats":null,"previous_names":["joungminsung/opendocuments
"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/joungminsung/OpenDocuments","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joungminsung%2FOpenDocuments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joungminsung%2FOpenDocuments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joungminsung%2FOpenDocuments/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joungminsung%2FOpenDocuments/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joungminsung","download_url":"https://codeload.github.com/joungminsung/OpenDocuments/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joungminsung%2FOpenDocuments/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34151276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-search","document-qa","document-search","embeddings","enterprise-search","github","google-drive","knowledge-base","llm","mcp","notion","ollama","open-source","rag","retrieval-augmented-generation","self-hosted","semantic-search","typescript","vector-search"],"created_at":"2026-06-10T12:00:54.534Z","updat
ed_at":"2026-06-10T12:00:55.711Z","avatar_url":"https://github.com/joungminsung.png","language":"TypeScript","funding_links":["https://github.com/sponsors/joungminsung"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003eOpenDocuments\u003c/h1\u003e\n  \u003cp align=\"center\"\u003e\u003cstrong\u003eSelf-hosted RAG platform for AI document search across GitHub, Notion, Google Drive, Confluence, S3, local files, and web sources\u003c/strong\u003e\u003c/p\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/joungminsung/OpenDocuments/actions\"\u003e\u003cimg src=\"https://github.com/joungminsung/OpenDocuments/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://nodejs.org\"\u003e\u003cimg src=\"https://img.shields.io/badge/Node.js-20%2B-green.svg\" alt=\"Node.js\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.typescriptlang.org\"\u003e\u003cimg src=\"https://img.shields.io/badge/TypeScript-5.5%2B-blue.svg\" alt=\"TypeScript\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/opendocuments\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/opendocuments.svg\" alt=\"npm\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/opendocuments\"\u003e\u003cimg src=\"https://img.shields.io/npm/dm/opendocuments.svg\" alt=\"npm downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/joungminsung/OpenDocuments/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/joungminsung/OpenDocuments.svg?style=social\" alt=\"GitHub stars\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  English | \u003ca href=\"README.ko.md\"\u003e한국어\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"assets/demo.gif\" alt=\"OpenDocuments Demo\" width=\"800\"\u003e\n\u003c/p\u003e\n\n---\n\n## What is OpenDocuments?\n\n**OpenDocuments is an open source, self-hosted RAG (Retrieval-Augmented Generation) platform that turns scattered company documents into an AI-searchable knowledge base.** It connects to sources like GitHub, Notion, Google Drive, Confluence, S3, Swagger/OpenAPI, local files, and web pages, indexes them with hybrid vector + keyword search, and answers natural-language questions with cited sources.\n\nUse OpenDocuments when you want:\n\n- A **self-hosted alternative to enterprise AI search** and proprietary knowledge-base search tools\n- **AI document search with citations** for engineering docs, product specs, policies, spreadsheets, API docs, and meeting notes\n- A **local-first RAG stack** that can run with Ollama so sensitive documents stay on your own infrastructure\n- A **knowledge base for AI coding assistants** through MCP, including Claude Code, Cursor, Windsurf, and other MCP clients\n- A **TypeScript-first RAG platform** with a CLI, Web UI, HTTP API, SDK, plugin system, and embeddable widget\n\n```bash\nnpm install -g opendocuments\nopendocuments init\nopendocuments start\n```\n\nOpen `http://localhost:3000`, index your documents, and ask questions with source citations.\n\n## Why OpenDocuments?\n\nYour team's knowledge is trapped in silos:\n\n- **Engineering docs** live in GitHub READMEs and Wiki pages\n- **Product specs** are scattered across Notion databases\n- **Budget reports** sit in Excel files on Google Drive\n- **API docs** are auto-generated Swagger specs nobody reads\n- **Meeting notes** rot in Confluence spaces\n- **Onboarding guides** are buried in `.docx` files on S3\n\nWhen someone asks _\"How does our auth system work?\"_ or _\"What was the Q3 budget for the AI team?\"_, they spend 15 minutes hunting through five different tools. 
OpenDocuments centralizes that search without forcing all of your content into a hosted vendor.\n\n## How OpenDocuments Answers Questions\n\nOpenDocuments **connects to your document sources**, **parses and chunks each document**, **stores metadata in SQLite and vectors in LanceDB**, then **retrieves, reranks, and generates grounded answers**. Every answer can include source citations, confidence scores, and links back to the underlying documents.\n\nIn short: **OpenDocuments is a private AI search engine for your organization's documents.**\n\n## Key Features\n\n| Feature | What it means |\n|---------|---------------|\n| **Self-hosted RAG** | Run the full document search stack on your own infrastructure |\n| **Cited AI answers** | Ask natural-language questions and see exactly which documents support the answer |\n| **Hybrid retrieval** | Combine vector search, FTS5 keyword search, reranking, HyDE, multi-query retrieval, and parent-document recall |\n| **Broad source coverage** | Index GitHub, Notion, Google Drive, Confluence, S3/GCS, Swagger/OpenAPI, web pages, web search, uploads, and local files |\n| **Many file formats** | Parse Markdown, PDF, DOCX, XLSX, CSV, HTML, Jupyter notebooks, email, code, PPTX, JSON, YAML, TOML, and more |\n| **Local or cloud models** | Use Ollama locally or cloud providers such as OpenAI, Anthropic, Google, and xAI |\n| **MCP server** | Let Claude Code, Cursor, Windsurf, and other MCP clients search your internal knowledge base |\n| **Team mode** | Add API keys, roles, rate limits, PII redaction, audit logs, alerts, OAuth SSO, and workspace isolation |\n| **Extensible plugins** | Build custom parsers, connectors, model providers, and middleware in TypeScript |\n\n## OpenDocuments vs Alternatives\n\n| If you are comparing... | Choose OpenDocuments when you need... 
|\n|-------------------------|----------------------------------------|\n| **OpenDocuments vs hosted enterprise search** | A self-hosted, open source AI search platform with control over infrastructure and data flow |\n| **OpenDocuments vs a vector database** | A complete RAG application layer: connectors, parsers, chunking, retrieval, chat, citations, auth, CLI, Web UI, and MCP |\n| **OpenDocuments vs a chatbot wrapper** | Source-grounded answers over your real document corpus, not a generic chat UI |\n| **OpenDocuments vs building RAG from scratch** | A TypeScript monorepo with batteries included, while still keeping plugin-level extensibility |\n| **OpenDocuments vs local-only scripts** | A production-oriented system with team mode, API access, syncable connectors, backups, and admin tooling |\n\n### Recent Improvements\n- **RAG accuracy overhaul**: Structure-preserving chunking, contextual prefixes, HyDE + multi-query retrieval, parent-document recall, proposition augmentation, reranking, and adaptive context fitting\n- **Workspace-scoped team mode**: Admin/chat/document APIs stay inside the authenticated workspace, with shared conversation links plus session and API-key auth support\n- **Backup \u0026 restore CLI**: Snapshot SQLite + LanceDB data and recover an instance with one command\n- **Plugin hardening**: Plugin search/install routes are admin-only and use validated npm argument execution\n- **One-touch Ollama setup**: `init` auto-detects Ollama, offers to pull missing models\n- **`.env` auto-loading**: API keys in `.env` are loaded automatically (no manual export needed)\n- **Multi-turn conversations**: Chat remembers previous context for follow-up questions\n- **Degraded mode warnings**: Clear banners when models aren't configured, with fix instructions\n- **Enhanced diagnostics**: `opendocuments doctor` checks Ollama connectivity, model availability, and config validity\n- **Security hardening**: FTS5 injection prevention, file upload sanitization, 
OAuth state limits, workspace isolation\n\n---\n\n## Real-World Use Cases\n\n### For Engineering Teams\n\n\u003e _\"How do I authenticate against our internal API?\"_\n\nOpenDocuments pulls the answer from your GitHub repo's `docs/auth.md`, links to the relevant Swagger endpoint, and includes a code example from the codebase -- all in one response.\n\n```bash\n# Index your repo and API docs\nopendocuments index ./docs\nopendocuments connector sync github\nopendocuments ask \"How does JWT token refresh work in our API?\"\n```\n\n### For Operations \u0026 HR Teams\n\n\u003e _\"What's the remote work policy for the Tokyo office?\"_\n\nOpenDocuments searches across your Confluence HR space, the employee handbook on Google Drive, and the latest policy update email -- even if some documents are in Korean and others in English.\n\n```bash\nopendocuments ask \"도쿄 오피스 원격 근무 정책이 뭐야?\" --profile precise\n# Cross-lingual search finds both Korean and English documents\n```\n\n### For Product Managers\n\n\u003e _\"Compare the feature specs of v2.0 vs v3.0\"_\n\nOpenDocuments decomposes the question, searches both versions' specs, and presents a structured comparison table -- citing each source document.\n\n### For AI-Assisted Development (MCP)\n\nUse OpenDocuments as a knowledge base for **Claude Code**, **Cursor**, or any MCP-compatible AI tool:\n\n```json\n{\n  \"mcpServers\": {\n    \"opendocuments\": {\n      \"command\": \"opendocuments\",\n      \"args\": [\"start\", \"--mcp-only\"]\n    }\n  }\n}\n```\n\nNow your AI coding assistant can search your organization's entire document corpus while writing code.\n\n### For Self-Hosted Knowledge Bases\n\nDeploy on your own infrastructure. Your data **never leaves your network** when using a local LLM via Ollama. 
No cloud dependency, no vendor lock-in, no subscription fees.\n\n```bash\ndocker compose --profile with-ollama up -d\n# Everything runs locally: LLM, embeddings, vector search, web UI\n```\n\n---\n\n## Quick Start\n\nThis is the fastest way to run a local AI document search engine with the OpenDocuments CLI.\n\n### 1. Install\n\n```bash\nnpm install -g opendocuments\n```\n\n### 2. Initialize\n\n```bash\nopendocuments init\n```\n\nThe interactive wizard will:\n- Detect your hardware (CPU, RAM) and recommend the optimal LLM\n- Let you choose between **local** (Ollama) or **cloud** (OpenAI, Claude, Gemini, Grok) models\n- **Auto-detect Ollama** and offer to pull missing models automatically\n- **Validate cloud API keys** before saving\n- Select a plugin preset: `Developer`, `Enterprise`, `All`, or `Custom`\n- Generate `opendocuments.config.ts` and `.env` (API keys loaded automatically)\n\n### 3. Start\n\n```bash\nopendocuments start\n```\n\nOpen **http://localhost:3000** -- you'll see a chat UI, document manager, and admin dashboard.\n\n\u003e **First time?** If Ollama isn't running, you'll see a clear **DEGRADED MODE** banner with step-by-step fix instructions. Run `opendocuments doctor` for full diagnostics.\n\n### 4. Index Your Documents\n\n```bash\n# Index a local directory (recursively finds all supported files)\nopendocuments index ./docs\n\n# Watch mode: auto-reindex when files change\nopendocuments index ./docs --watch\n\n# Or drag-and-drop files in the Web UI\n```\n\n### 5. 
Ask Questions\n\n```bash\nopendocuments ask \"What's our deployment process?\"\n```\n\n---\n\n## How It Works\n\nOpenDocuments uses a standard RAG architecture with practical production pieces around it: source connectors, format parsers, chunking, embeddings, metadata storage, vector storage, retrieval profiles, answer generation, citations, and security controls.\n\n```\n    Your Documents                    OpenDocuments                     You\n    ─────────────                    ──────────────                    ───\n\n    GitHub repos ──┐\n    Notion pages ──┤                ┌─────────────┐\n    Google Drive ──┤  ── Ingest ──► │ Parse        │\n    Confluence   ──┤                │ Chunk        │     \"How does\n    S3 buckets   ──┤                │ Embed        │      auth work?\"\n    Swagger specs──┤                │ Store        │          │\n    Local files  ──┤                └──────┬───────┘          │\n    Web pages    ──┘                       │                  ▼\n                                    ┌──────┴───────┐  ┌─────────────┐\n                                    │  SQLite      │  │ RAG Engine  │\n                                    │  (metadata)  │◄─┤ Search      │\n                                    │              │  │ Rerank      │\n                                    │  LanceDB     │  │ Generate    │\n                                    │  (vectors)   │  │ Cite sources│\n                                    └──────────────┘  └──────┬──────┘\n                                                             │\n                                                             ▼\n                                                      \"Auth uses JWT\n                                                       tokens with\n                                                       refresh flow.\n                                                       [Source: auth.md]\"\n```\n\n### The RAG Pipeline\n\n1. 
**Intent Classification** -- Understands whether you're asking about code, concepts, data, or want a comparison\n2. **Query Decomposition** -- Breaks complex questions into sub-queries for better retrieval\n3. **Cross-Lingual Search** -- Finds documents in both Korean and English regardless of query language\n4. **Hybrid Search** -- Combines dense vector search (semantic) with FTS5 sparse search (keyword) via Reciprocal Rank Fusion\n5. **Reranking** -- Scores results by keyword overlap and model-based relevance\n6. **Confidence Scoring** -- Tells you honestly when it's not sure about an answer\n7. **Hallucination Guard** -- Verifies each sentence is grounded in the retrieved sources\n8. **3-Tier Caching** -- L1 query cache (5min), L2 embedding cache (24h), L3 web search cache (1h)\n\n---\n\n## Supported File Formats\n\n| Format | Extensions | How It's Parsed |\n|--------|-----------|-----------------|\n| Markdown | `.md`, `.mdx` | Heading hierarchy, code block separation |\n| Plain Text | `.txt` | Direct text indexing |\n| PDF | `.pdf` | Page-level extraction, OCR fallback for scanned docs |\n| Word | `.docx` | HTML conversion with heading detection |\n| Excel / CSV | `.xlsx`, `.xls`, `.csv` | Sheet-aware table chunking (header + rows) |\n| HTML | `.html`, `.htm` | Structure-preserving extraction, script/nav stripping |\n| Jupyter Notebook | `.ipynb` | Markdown cells + code cells with language detection |\n| Email | `.eml` | Header parsing (from/to/subject/date) + body extraction |\n| Source Code | `.js`, `.ts`, `.py`, `.java`, `.go`, `.rs`, `.rb`, `.php`, `.swift`, `.kt` + more | Function/class-level chunking with import extraction |\n| PowerPoint | `.pptx` | Slide-level text extraction |\n| Structured Data | `.json`, `.yaml`, `.yml`, `.toml` | Config and schema indexing |\n| Archive | `.zip` | Placeholder (full extraction planned) |\n\n**Fallback Chains**: If a parser fails, the next one tries automatically:\n\n```typescript\nparserFallbacks: {\n  '.pdf': 
['@opendocuments/parser-pdf', '@opendocuments/parser-ocr'],\n}\n```\n\n---\n\n## Data Sources\n\n| Source | What It Indexes | Auth | How It Syncs |\n|--------|----------------|------|-------------|\n| **Local Files** | Any supported format on your filesystem | None | File watching (`--watch`) |\n| **File Upload** | Drag-and-drop in Web UI | None | Instant |\n| **GitHub** | README, Wiki, code files, Issues | Personal Access Token | Polling / webhook |\n| **Notion** | Pages, databases, all block types | Integration Token | Polling |\n| **Google Drive** | Docs, Sheets, Slides, uploaded files | OAuth / Service Account | Polling |\n| **Amazon S3 / Google Cloud Storage** | Any supported format in buckets | AWS / GCP credentials | Polling |\n| **Confluence** | Wiki pages across spaces | API Token + Email | Polling |\n| **Swagger / OpenAPI** | API endpoints with parameters and schemas | None (public specs) | Manual |\n| **Web Crawler** | Any URL you register | Optional (cookies/headers) | Periodic |\n| **Web Search (Tavily)** | Real-time web results merged into answers | Tavily API Key | Query-time |\n\n---\n\n## Model Providers\n\n### Cloud Providers\n\n| Provider | Models | Embedding | Best For |\n|----------|--------|-----------|----------|\n| **OpenAI** | GPT-5.4, GPT-5.4-mini, GPT-4.1, o3, o4-mini | text-embedding-3-small/large | General purpose, vision, reasoning |\n| **Anthropic** | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 | -- (use separate provider) | Long context (1M), coding, analysis |\n| **Google** | Gemini 3.1 Pro, Gemini 3.1 Flash Lite, Gemini 3.0 Deep Think | text-embedding-005 | Multimodal, multilingual |\n| **xAI** | Grok 4, Grok 4 Heavy, Grok 4.1 Fast | Grok embedding | Real-time knowledge, code |\n| **DeepSeek** | DeepSeek-V3.2, DeepSeek-R1, DeepSeek-V4 (upcoming) | -- (use separate provider) | Cost-efficient reasoning, 164K context |\n| **Mistral** | Mistral Small 4 (MoE), Large 2.1, Codestral, Pixtral | mistral-embed (1024) | European data 
residency, coding, vision |\n| **OpenAI-compatible** | Any OpenAI-compatible endpoint | Depends on endpoint | vLLM, LM Studio, Together, Fireworks, Groq, DeepInfra, SiliconFlow, OpenRouter |\n\n### Local Models (via Ollama)\n\n| Model | Active Params | Total Params | Vision | Korean | Best For |\n|-------|-------------|-------------|--------|--------|----------|\n| **Qwen 3.5 27B** | 27B (dense) | 27B | Yes | Excellent | General purpose (32GB+ RAM) |\n| **Qwen 3.5 9B** | 9B (dense) | 9B | Yes | Excellent | Mid-range (16GB RAM) |\n| **Qwen 3.5-122B-A10B** | 10B (MoE) | 122B | Yes | Excellent | High quality, efficient |\n| **Llama 4 Scout** | 17B (MoE) | 109B | Yes | Good | 10M context window |\n| **Llama 4 Maverick** | 17B (MoE) | 400B | Yes | Good | Top open-source quality |\n| **DeepSeek V3.2** | 37B (MoE) | 671B | No | Good | Coding, reasoning |\n| **Gemma 4** | 27B / 12B / 4B / 1B | dense | Yes | Good | Latest Google open model, 128K context, 140+ languages |\n| **Gemma 3 27B** | 27B | 27B | Yes | Good | Lightweight, 140+ languages |\n| **Gemma 3 4B** | 4B | 4B | Yes | Good | Low-spec machines (8GB RAM) |\n| **K-EXAONE** | 23B (MoE) | 236B | No | Best | Korean-specialized |\n| **EXAONE Deep 32B** | 32B | 32B | No | Best | Korean reasoning |\n| **Phi-4 Reasoning Vision** | 15B | 15B | Yes | Fair | Compact multimodal |\n\n### Embedding Models\n\n| Model | Dimensions | Korean | Multimodal | Where |\n|-------|-----------|--------|-----------|-------|\n| **BGE-M3** | 1024 | Excellent | No | Ollama (default) |\n| **text-embedding-3-large** | 3072 | Good | No | OpenAI |\n| **text-embedding-005** | 768 | Good | No | Google |\n| **nomic-embed-text** | 768 | Fair | No | Ollama (lightweight) |\n\n### Auto-Recommendation\n\n`opendocuments init` detects your hardware and recommends the best model:\n\n| Your Hardware | Recommended Model | Recommended Embedding |\n|--------------|-------------------|----------------------|\n| 32GB+ RAM, GPU | Qwen 3.5 27B or Llama 4 Scout | 
BGE-M3 |\n| 16GB RAM | Qwen 3.5 9B | BGE-M3 |\n| 8GB RAM | Gemma 3 4B | nomic-embed-text |\n| Any (cloud) | Claude Sonnet 4.6 or GPT-5.4-mini | text-embedding-3-large |\n\n---\n\n## Three Ways to Use\n\n### 1. Web UI\n\nFull-featured dashboard at `http://localhost:3000`:\n\n| Page | What You Can Do |\n|------|-----------------|\n| **Chat** | Ask questions with streaming answers, source citations, confidence scores, feedback buttons. Switch between fast/balanced/precise profiles. |\n| **Documents** | Browse indexed documents, drag-and-drop upload, view document details, soft-delete with trash/restore. |\n| **Connectors** | See connector sync status and last sync times. |\n| **Plugins** | View installed plugins with health indicators. |\n| **Settings** | Toggle dark/light theme, change RAG profile, view server version. |\n| **Admin** | Stats dashboard, search quality metrics, paginated query logs, plugin health, connector status, audit logs. |\n\n**Keyboard shortcuts**: `Cmd+K` opens the Command Palette. `Cmd+1-5` navigates between pages.\n\n### 2. 
CLI\n\n17 commands for power users and automation:\n\n```bash\n# Ask questions\nopendocuments ask \"What's the deploy process?\"\nopendocuments ask                              # Interactive REPL mode\nopendocuments search \"auth middleware\" --top 10 # Vector search, no LLM\n\n# Manage documents\nopendocuments index ./docs --watch    # Index + auto-reindex on changes\nopendocuments document list           # See all indexed docs\nopendocuments document delete \u003cid\u003e    # Soft-delete\n\n# Manage connectors\nopendocuments connector sync          # Sync all connectors\nopendocuments connector status        # Check sync status\n\n# Pipe support for scripting\ncat README.md | opendocuments ask \"Summarize this\" --stdin\nopendocuments ask \"List endpoints\" --json | jq '.sources[].sourcePath'\n\n# Administration\nopendocuments doctor                  # Health check (per-provider API ping)\nopendocuments auth create-key --name \"ci-bot\" --role member\nopendocuments export --output ./backup\n\n# Model management\nopendocuments model list --suggestions          # Show installed + curated models\nopendocuments model install-ollama              # One-shot Ollama install (macOS/Linux)\nopendocuments model pull gemma3:27b bge-m3      # Batch pull with disk-space check\nopendocuments model set-key deepseek            # Prompt + save API key to .env\nopendocuments model test                        # Round-trip test against configured LLM\nopendocuments model switch                      # Change provider without editing config\n```\n\n### 3. MCP Server\n\n19 tools for AI-assisted workflows. 
Works with Claude Code, Cursor, Windsurf, and any MCP client.\n\n```bash\nopendocuments start --mcp-only\n```\n\nYour AI assistant can then:\n- Search your organization's documents while coding\n- Index new files as they're created\n- Check document status and connector health\n- Query configuration\n\n---\n\n## RAG Profiles\n\n| | `fast` | `balanced` | `precise` |\n|--|--------|------------|-----------|\n| **Speed** | ~1s | ~3s | ~5s+ |\n| **Search depth** | 10 docs | 20 docs | 50 docs |\n| **Semantic chunking** | On | On | On |\n| **Reranking** | Off | On | On |\n| **Cross-encoder** | Off | Off | On |\n| **Cross-lingual** | Off | Korean + English | Korean + English |\n| **Contextual prefix** | Off | On | On |\n| **Multi-query expansion** | Off | 3x paraphrases | 5x paraphrases |\n| **HyDE** | Off | Off | On |\n| **Parent-document retrieval** | Off | On | On |\n| **Chunk augmentation** (propositions/HQs) | Off | Off | On |\n| **Query decomposition** | Off | Off | Splits complex queries |\n| **Web search** | Off | Fallback when local results are weak | Always merged |\n| **Hallucination guard** | Off | Checks source grounding | Strict mode (annotates unverified) |\n| **Best for** | Quick lookups, 8B local models | Daily use, 14B+ models | Critical questions, cloud LLMs |\n\nSwitch anytime: CLI flag (`--profile precise`), Web UI toggle, or config file.\n\n### Retrieval quality\n\nOpenDocuments ships a redesigned RAG pipeline with structure-preserving chunking, contextual retrieval, HyDE + multi-query + parent-document retrieval, proposition augmentation, and a cross-encoder reranker — all profile-gated via the table above. 
See [`packages/core/CHANGELOG.md`](packages/core/CHANGELOG.md) for the full list of additions.\n\nBenchmark against your own dataset with the evaluation harness:\n\n```bash\ncd packages/core \u0026\u0026 npx tsx tests/_fixtures/run-eval.ts\n```\n\nMetrics reported: hit@3, hit@5, MRR, nDCG@5 — per-intent and aggregate.\n\n---\n\n## Security\n\n### Personal Mode (default)\n\nZero configuration. No auth. Localhost only. Just works.\n\n### Team Mode\n\n```typescript\n// opendocuments.config.ts\nexport default defineConfig({ mode: 'team' })\n```\n\n| Feature | How It Works |\n|---------|-------------|\n| **API Keys** | `od_live_` prefix, SHA-256 hashed, never stored in plaintext. Scoped to specific operations, with optional expiration. |\n| **Roles** | `admin` (everything), `member` (read + write), `viewer` (read only) |\n| **Rate Limiting** | 60 req/min default, per-key override. In-memory with lazy cleanup. |\n| **PII Redaction** | Automatically masks emails, phone numbers, credit cards, IPs before sending to cloud LLMs. Configurable patterns and methods (replace/hash/remove). |\n| **Audit Log** | Records auth events, document access, config changes. Queryable via admin API. |\n| **Security Alerts** | Detects brute-force attempts, unusual data exports, API key abuse. |\n| **OAuth SSO** | Google and GitHub login with HttpOnly cookie sessions. |\n| **Workspace Isolation** | Every vector search is enforced with `workspace_id` filter. Documents, conversations, and API keys are scoped to workspaces. 
|\n\n---\n\n## Configuration\n\n```typescript\n// opendocuments.config.ts\nimport { defineConfig } from 'opendocuments-core'\n\nexport default defineConfig({\n  workspace: 'my-team',\n  mode: 'personal',\n\n  model: {\n    provider: 'ollama',\n    llm: 'qwen3.5:27b',\n    embedding: 'bge-m3',\n  },\n\n  rag: { profile: 'balanced' },\n\n  connectors: [\n    { type: 'github', repo: 'org/repo', token: process.env.GITHUB_TOKEN },\n    { type: 'notion', token: process.env.NOTION_TOKEN },\n    { type: 'web-crawler', urls: ['https://docs.example.com'] },\n  ],\n\n  plugins: ['@opendocuments/parser-pdf', '@opendocuments/parser-docx'],\n\n  security: {\n    dataPolicy: {\n      autoRedact: { enabled: true, patterns: ['email', 'phone', 'credit-card'] },\n    },\n    audit: { enabled: true },\n  },\n\n  storage: { db: 'sqlite', vectorDb: 'lancedb', dataDir: '~/.opendocuments' },\n})\n```\n\n---\n\n## Docker Deployment\n\n```bash\n# Basic (cloud LLM)\ndocker compose up -d\n\n# With local LLM (Ollama)\ndocker compose --profile with-ollama up -d\n\n# With .env file for API keys\ndocker compose --env-file .env up -d\n```\n\nThe Docker image includes all packages and plugins. Data persists in a named volume. Mount your config:\n\n```bash\ndocker run -v ./opendocuments.config.ts:/app/opendocuments.config.ts \\\n  -v opendocuments-data:/data -p 3000:3000 opendocuments\n```\n\n---\n\n## Plugin Development\n\nCreate custom parsers, connectors, or model providers:\n\n```bash\nopendocuments plugin create my-parser --type parser\ncd my-parser\nnpm install\nnpm run test\nnpm run dev       # Watch mode\nopendocuments plugin publish  # Publish to npm\n```\n\nFour plugin types: `parser`, `connector`, `model`, `middleware`. 
Each has a typed interface with lifecycle hooks (`setup`, `teardown`, `healthCheck`, `metrics`).\n\nCommunity plugins follow the naming convention: `opendocuments-plugin-*`\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the full plugin development guide.\n\n---\n\n## TypeScript SDK\n\n```typescript\nimport { OpenDocumentsClient } from '@opendocuments/client'\n\nconst client = new OpenDocumentsClient({\n  baseUrl: 'http://localhost:3000',\n  apiKey: 'od_live_...',\n})\n\nconst result = await client.ask('How does auth work?')\nconsole.log(result.answer)    // \"Auth uses JWT tokens with...\"\nconsole.log(result.sources)   // [{ sourcePath: 'docs/auth.md', score: 0.92 }]\nconsole.log(result.confidence) // { level: 'high', score: 0.87 }\n```\n\n---\n\n## Embeddable Widget\n\nAdd a chat widget to your internal tools:\n\n```html\n\u003cscript src=\"http://localhost:3000/widget.js\"\u003e\u003c/script\u003e\n\u003cscript\u003e\n  OpenDocuments.widget({\n    server: 'http://localhost:3000',\n    apiKey: 'od_live_...',\n    workspace: 'public-docs',\n  })\n\u003c/script\u003e\n```\n\n---\n\n## Development\n\n```bash\ngit clone https://github.com/joungminsung/OpenDocuments.git\ncd OpenDocuments\nnpm run setup    # Install + build (one command)\nnpm run test     # 51 test suites, ~300 tests\nnpm run dev      # Watch mode\n```\n\n### Architecture\n\n| Package | Role | Tests |\n|---------|------|-------|\n| `@opendocuments/core` | Plugin system, RAG engine, ingest pipeline, storage, auth, security | 159 |\n| `@opendocuments/server` | HTTP API (Hono), MCP server, auth middleware, widget | 27 |\n| `@opendocuments/cli` | 17 CLI commands (Commander.js) | 3 |\n| `@opendocuments/web` | React SPA with 7 pages (Vite + Tailwind) | -- |\n| `@opendocuments/client` | TypeScript SDK | 3 |\n| 8 model plugins | Ollama, OpenAI, Anthropic, Google, Grok, DeepSeek, Mistral, OpenAI-compatible | 41 |\n| 9 parser plugins | PDF, DOCX, XLSX, HTML, Jupyter, Email, Code, PPTX, Structured | 37 |\n| 8 
connector plugins | GitHub, Notion, GDrive, S3, Confluence, Swagger, WebCrawler, WebSearch | 38 |\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for conventions, test patterns, and the plugin development guide.\n\n---\n\n## Documentation\n\n| Guide | Description |\n|-------|-------------|\n| [Quick Start](#quick-start) | Install and run in 5 minutes |\n| [Architecture](docs/architecture.md) | Package structure, data flow, design decisions |\n| [Plugin API: Parsers](docs-site/plugins/parser-api.md) | Create custom document parsers |\n| [Plugin API: Connectors](docs-site/plugins/connector-api.md) | Connect external data sources |\n| [Plugin API: Models](docs-site/plugins/model-api.md) | Add custom AI providers |\n| [TypeScript SDK](docs-site/sdk/guide.md) | Programmatic API client |\n| [Security Policy](SECURITY.md) | Vulnerability reporting |\n| [Contributing](CONTRIBUTING.md) | Development setup, conventions, plugin guide |\n\n---\n\n## Frequently Asked Questions\n\n### What is OpenDocuments used for?\n\nOpenDocuments is used to build a private AI search engine over company documents. Teams use it to ask questions across GitHub repositories, Notion pages, Google Drive files, Confluence spaces, S3 buckets, API specs, local files, and web pages, then receive answers with citations.\n\n### Is OpenDocuments open source?\n\nYes. OpenDocuments is open source and released under the [MIT License](LICENSE).\n\n### Is OpenDocuments self-hosted?\n\nYes. OpenDocuments is designed for self-hosted deployment. You can run it locally during development, deploy it with Docker, or host it on your own infrastructure.\n\n### Can OpenDocuments run without sending data to a cloud LLM?\n\nYes. 
When configured with Ollama and local embedding models, OpenDocuments can run the LLM, embeddings, vector search, metadata database, Web UI, CLI, and MCP server on your own infrastructure.\n\n### What data sources does OpenDocuments support?\n\nOpenDocuments supports local files, file uploads, GitHub, Notion, Google Drive, Amazon S3, Google Cloud Storage, Confluence, Swagger/OpenAPI specs, registered web pages, and Tavily-backed web search.\n\n### What file formats can OpenDocuments index?\n\nOpenDocuments can index Markdown, plain text, PDF, DOCX, XLSX, CSV, HTML, Jupyter notebooks, email, source code, PPTX, JSON, YAML, TOML, and other supported plugin formats.\n\n### Does OpenDocuments work with Claude Code or Cursor?\n\nYes. OpenDocuments includes an MCP server, so MCP-compatible AI tools such as Claude Code, Cursor, Windsurf, and similar clients can search your indexed document corpus while assisting with development.\n\n### What makes OpenDocuments different from a vector database?\n\nA vector database stores embeddings. OpenDocuments provides the surrounding RAG platform: connectors, parsers, document chunking, hybrid retrieval, reranking, answer generation, citations, Web UI, CLI, HTTP API, SDK, MCP server, authentication, and plugins.\n\n### What makes OpenDocuments different from hosted enterprise search?\n\nOpenDocuments is open source and self-hosted. It is built for teams that want AI document search, source citations, plugin extensibility, and control over where their documents, embeddings, metadata, and model calls run.\n\n---\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoungminsung%2Fopendocuments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoungminsung%2Fopendocuments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoungminsung%2Fopendocuments/lists"}