{"id":49793908,"url":"https://github.com/VectifyAI/OpenKB","last_synced_at":"2026-05-28T22:00:42.156Z","repository":{"id":350467432,"uuid":"1201253077","full_name":"VectifyAI/OpenKB","owner":"VectifyAI","description":"OpenKB: Open LLM Knowledge Base","archived":false,"fork":false,"pushed_at":"2026-05-24T01:58:28.000Z","size":30602,"stargazers_count":1902,"open_issues_count":15,"forks_count":205,"subscribers_count":7,"default_branch":"main","last_synced_at":"2026-05-24T03:21:04.325Z","etag":null,"topics":["agents","ai","knowledge-base","llm","rag","retrieval"],"latest_commit_sha":null,"homepage":"https://pageindex.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VectifyAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-04T12:26:04.000Z","updated_at":"2026-05-24T02:03:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/VectifyAI/OpenKB","commit_stats":null,"previous_names":["vectifyai/openkb"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/VectifyAI/OpenKB","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VectifyAI%2FOpenKB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VectifyAI%2FOpenKB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VectifyAI%2FOpenKB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VectifyAI%2FOpenKB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VectifyAI","download_url":"https://codeload.github.com/VectifyAI/OpenKB/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VectifyAI%2FOpenKB/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33627938,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","ai","knowledge-base","llm","rag","retrieval"],"created_at":"2026-05-12T08:00:35.832Z","updated_at":"2026-05-28T22:00:42.139Z","avatar_url":"https://github.com/VectifyAI.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://openkb.ai\"\u003e\n  \u003cimg src=\"https://docs.pageindex.ai/images/openkb.png\" alt=\"OpenKB (by PageIndex)\" /\u003e\n\u003c/a\u003e\n\n# OpenKB — Open LLM Knowledge Base\n\n\u003cp align=\"center\"\u003e\u003ci\u003eScale to long documents\u0026nbsp; • \u0026nbsp;Reasoning-based retrieval\u0026nbsp; • \u0026nbsp;Native multi-modality\u0026nbsp; • \u0026nbsp;No Vector DB\u003c/i\u003e\u003c/p\u003e\n\n\u003c/div\u003e\n\n---\n\n# 📑 What is OpenKB\n\n**OpenKB (Open Knowledge Base)** is an open-source system (in CLI) that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless long document retrieval.\n\nThe idea is based on a [concept](https://x.com/karpathy/status/2039805659525644595) described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.\n\n### Why not traditional RAG?\n\nTraditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.\n\nOpenKB has two layers: a **wiki foundation** that compiles and maintains your knowledge, and **generators** (query / chat / Skill Factory) that turn it into useful output. See [Usage](#️-usage) for the full command list.\n\n# 🚀 Getting Started\n\n### Install\n\n```bash\npip install openkb\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ci\u003eOther install options\u003c/i\u003e\u003c/summary\u003e\n\n- **Latest from GitHub:**\n\n  ```bash\n  pip install git+https://github.com/VectifyAI/OpenKB.git\n  ```\n\n- **Install from source** (editable, for development):\n\n  ```bash\n  git clone https://github.com/VectifyAI/OpenKB.git\n  cd OpenKB\n  pip install -e .\n  ```\n\n\u003c/details\u003e\n\n### Quick Start\n\n```bash\n# 1. Create a directory for your knowledge base\nmkdir my-kb \u0026\u0026 cd my-kb\n\n# 2. Initialize the knowledge base\nopenkb init\n\n# 3. Add documents\nopenkb add paper.pdf\nopenkb add ~/papers/                            # Add a whole directory\nopenkb add https://arxiv.org/pdf/2509.11420     # Or fetch from a URL\n\n# 4. Ask a question\nopenkb query \"What are the main findings?\"\n\n# 5. Or chat interactively\nopenkb chat\n\n# 6. Or distill your wiki into a redistributable skill\nopenkb skill new my-expert \"Reason like an expert on \u003ctopic-from-your-docs\u003e\"\n```\n\n### Set up your LLM\n\nOpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).\n\nSet your model during `openkb init`, or in [`.openkb/config.yaml`](#configuration), using `provider/model` LiteLLM format (like `anthropic/claude-sonnet-4-6`). OpenAI models can omit the prefix (like `gpt-5.4`).\n\nCreate a `.env` file with your LLM API key:\n\n```bash\nLLM_API_KEY=your_llm_api_key\n```\n\n# 🧩 How OpenKB Works\n\n### Architecture\n\n```\nraw/                              You drop files here\n │\n ├─ Short docs ──→ markitdown ──→ LLM reads full text\n │                                     │\n ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees\n │                                     │\n │                                     ▼\n │                         Wiki Compilation (using LLM)\n │                                     │\n ▼                                     ▼\nwiki/                                  │            ← the foundation\n ├── index.md            Knowledge base overview\n ├── log.md              Operations timeline\n ├── AGENTS.md           Wiki schema (LLM instructions)\n ├── sources/            Full-text conversions\n ├── summaries/          Per-document summaries\n ├── concepts/           Cross-document synthesis ← the good stuff\n ├── explorations/       Saved query results\n └── reports/            Lint reports\n                                       │\n                ┌──────────────────────┼──────────────────────┐\n                ▼                      ▼                      ▼\n            query / chat         Skill Factory          (future)\n          (LLM answers from     openkb skill new       ppt / podcast /\n            the wiki)           → output/skills/        report / …\n                                + marketplace.json\n```\n\n### Short vs. Long Document Handling\n\n| | Short documents | Long documents (PDF ≥ 20 pages) |\n|---|---|---|\n| **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |\n| **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |\n| **LLM reads** | Full text | Document trees |\n| **Result** | summary + concepts | summary + concepts |\n\nShort docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents.\n\n### Knowledge Compilation\n\nWhen you add a document, the LLM:\n\n1. Generates a **summary** page\n2. Reads existing **concept** pages\n3. Creates or updates concepts with cross-document synthesis\n4. Updates the **index** and **log**\n\nA single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.\n\n# ⚙️ Usage\n\nOpenKB commands fall into two layers: the **wiki foundation** (compile + manage your knowledge) and **generators** (turn that wiki into useful output).\n\n## 🧱 Wiki Foundation — compile and maintain\n\n| Command | Description |\n|---|---|\n| `openkb init` | Initialize a new knowledge base (interactive) |\n| \u003ccode\u003eopenkb\u0026nbsp;add\u0026nbsp;\u0026lt;file_or_dir_or_URL\u0026gt;\u003c/code\u003e | Add documents and compile to wiki. URL ingest auto-detects PDF (saved as `.pdf` → PageIndex / markitdown) vs HTML (trafilatura main-content extract → `.md`) |\n| \u003ccode\u003eopenkb\u0026nbsp;remove\u0026nbsp;\u0026lt;doc\u0026gt;\u003c/code\u003e | Remove a document and clean up its wiki pages, images, registry, and PageIndex state (use `--dry-run` to preview, `--keep-raw` / `--keep-empty-concepts` to retain artifacts) |\n| `openkb watch` | Watch `raw/` and auto-compile new files |\n| `openkb lint` | Run structural + knowledge health checks |\n| `openkb list` | List indexed documents and concepts |\n| `openkb status` | Show knowledge base stats |\n| \u003ccode\u003eopenkb\u0026nbsp;feedback\u0026nbsp;[\"msg\"]\u003c/code\u003e | File feedback by opening a prefilled GitHub issue (use `--type bug/feature/question` to tag the issue) |\n\n\u003c!-- | `openkb lint --fix` | Auto-fix what it can | --\u003e\n\n## ✨ Generators — turn the wiki into output\n\nA \"generator\" reads from the compiled wiki and produces something usable: an answer, a conversation, a skill folder. The wiki is the substrate; generators are the surfaces.\n\n| Command | Output |\n|---|---|\n| \u003ccode\u003eopenkb\u0026nbsp;query\u0026nbsp;\"question\"\u003c/code\u003e | A grounded answer with citations (use `--save` to persist to `wiki/explorations/`) |\n| `openkb chat` | Interactive multi-turn session over the wiki (use `--resume`, `--list`, `--delete` to manage sessions) |\n| \u003ccode\u003eopenkb\u0026nbsp;skill\u0026nbsp;new\u0026nbsp;\u0026lt;name\u0026gt;\u0026nbsp;\"\u0026lt;intent\u0026gt;\"\u003c/code\u003e | A redistributable Anthropic Skill at `\u003ckb\u003e/output/skills/\u003cname\u003e/` + auto-updated `marketplace.json` |\n| \u003ccode\u003eopenkb\u0026nbsp;skill\u0026nbsp;validate\u0026nbsp;[name]\u003c/code\u003e | Structural lint of compiled skills (frontmatter, file sizes, wikilinks, scripts/ stdlib check with `--strict`). Auto-runs at end of `skill new` |\n| \u003ccode\u003eopenkb\u0026nbsp;skill\u0026nbsp;eval\u0026nbsp;\u0026lt;name\u0026gt;\u003c/code\u003e | Trigger-accuracy evaluation — does the `description:` field actually fire? LLM generates eval prompts; grader LLM scores activation. `--save` persists the eval set |\n| \u003ccode\u003eopenkb\u0026nbsp;skill\u0026nbsp;history\u0026nbsp;\u0026lt;name\u0026gt;\u003c/code\u003e / \u003ccode\u003eopenkb\u0026nbsp;skill\u0026nbsp;rollback\u0026nbsp;\u0026lt;name\u0026gt;\u003c/code\u003e | Iteration workspace — every overwrite saves the previous version to `output/skills/\u003cname\u003e-workspace/iteration-N/` with a structural diff. Rollback restores any iteration |\n\n### Query \u0026 Chat — ask the wiki\n\n`openkb query \"...\"` answers a single question. `openkb chat` is interactive — each turn carries history, so you can dig into a topic without re-typing context. Both use the same underlying wiki and the same retrieval primitives (PageIndex for long docs, direct concept reads for short).\n\n```bash\nopenkb query \"What does the literature say about attention scaling?\"\n\nopenkb chat                       # start a new session\nopenkb chat --resume              # resume the most recent session\nopenkb chat --resume 20260411     # resume by id (unique prefix works)\nopenkb chat --list                # list all sessions\nopenkb chat --delete \u003cid\u003e         # delete a session\n```\n\nInside a chat, type `/` to access slash commands (Tab to complete):\n\n- `/help` — list available commands\n- `/status` — show knowledge base status\n- `/list` — list all documents\n- `/add \u003cpath\u003e` — add a document or directory without leaving the chat\n- `/skill new \u003cname\u003e \"\u003cintent\u003e\"` — compile a skill from this chat (see below)\n- `/save [name]` — export the transcript to `wiki/explorations/`\n- `/clear` — start a fresh session (the current one stays on disk)\n- `/lint` — run knowledge base lint\n- `/exit` — exit (Ctrl-D also works)\n\n### 🛠 Skill Factory — *Drop in a book. Out comes a digital expert.*\n\nThe newest generator. `openkb skill new` distills any subset of your wiki into an [Anthropic Skill](https://docs.claude.com/en/docs/build-with-claude/skills) — a portable folder that **Claude Code, Codex CLI, Gemini CLI, and Cursor** all install and load natively. Drop in a book's worth of papers; out comes a specialist that other agents can call on.\n\n```bash\nopenkb skill new karpathy-thinking \\\n  \"Reason about transformers and attention in Karpathy's style\"\n```\n\nThis produces:\n\n```\n\u003ckb\u003e/output/skills/karpathy-thinking/\n├── SKILL.md                   # YAML frontmatter + when-to-use + approach\n├── references/                # depth material the agent loads on demand\n│   ├── methodology.md\n│   └── key-quotes.md\n└── (scripts/)                 # optional, only if intent implies computation\n```\n\n…plus an auto-updated `\u003ckb\u003e/.claude-plugin/marketplace.json` so the whole KB is one-line installable.\n\n**Install locally:**\n\n```bash\ncp -r output/skills/karpathy-thinking ~/.claude/skills/\n```\n\n**Share with others** — push your KB to GitHub, then anyone runs:\n\n```bash\nnpx skills@latest add \u003cyour-org\u003e/\u003cyour-repo\u003e\n```\n\n**Iterate from chat** — compilation is one-shot, but follow-up edits aren't. Inside `openkb chat`, you can refine without re-running the whole pipeline:\n\n```\n/skill new karpathy-thinking \"Reason about transformers like Karpathy\"\n[generation streams]\n\u003e description is too generic, make it about transformer implementations specifically\n[agent edits SKILL.md frontmatter in place]\n```\n\n**Quality gates** — structural validation, trigger-accuracy + body-coverage evaluation, and full history/rollback:\n\n```bash\n# Lint structure (auto-runs at end of `skill new`)\nopenkb skill validate karpathy-thinking\nopenkb skill validate --strict          # treat warnings as failures\n\n# Does the description actually fire when it should?\nopenkb skill eval karpathy-thinking --save\n\n# History + rollback if a new iteration regresses\nopenkb skill history karpathy-thinking\nopenkb skill rollback karpathy-thinking --to 2\n```\n\n### Configuration\n\nSettings are initialized by `openkb init`, and stored in `.openkb/config.yaml`:\n\n```yaml\nmodel: gpt-5.4                   # LLM model (any LiteLLM-supported provider)\nlanguage: en                     # Wiki output language\npageindex_threshold: 20          # PDF pages threshold for PageIndex\n```\n\nModel names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/providers) (OpenAI models can omit the prefix):\n\n| Provider | Model example |\n|---|---|\n| OpenAI | `gpt-5.4` |\n| Anthropic | `anthropic/claude-sonnet-4-6` |\n| Gemini | `gemini/gemini-3.1-pro-preview` |\n\n### PageIndex Integration\n\nLong documents are challenging for LLMs due to context limits, context rot, and summarization loss.\n[PageIndex](https://github.com/VectifyAI/PageIndex) solves this with vectorless, reasoning-based retrieval — building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.\n\nPageIndex runs locally by default using the [open-source version](https://github.com/VectifyAI/PageIndex), with no external dependencies required.\n\n#### Optional: Cloud Support\n\nFor large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:\n\n- OCR support for scanned PDFs (via hosted VLM models)\n- Faster structure generation\n- Scalable indexing for large documents\n\nSet `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:\n\n```\nPAGEINDEX_API_KEY=your_pageindex_api_key\n```\n\n### AGENTS.md\n\nThe `wiki/AGENTS.md` file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.\n\nAt runtime, the LLM reads `AGENTS.md` from disk, so your edits take effect immediately.\n\n### Using with Obsidian\n\nOpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively.\n\n1. Open `wiki/` as an Obsidian vault\n2. Browse summaries, concepts, and explorations\n3. Use graph view to see knowledge connections\n4. Use Obsidian Web Clipper to add web articles to `raw/`\n\n### Using with Claude Code / Codex / Gemini CLI\n\nOpenKB ships a `SKILL.md` so any agent CLI can read your compiled wiki — no extra runtime, no MCP setup, just install the skill once.\n\n**Claude Code**:\n\n```\n/plugin marketplace add VectifyAI/OpenKB\n/plugin install openkb@vectify\n```\n\n**Gemini CLI**:\n\n```bash\ngemini skills install https://github.com/VectifyAI/OpenKB.git --path skills/openkb --consent\n```\n\n**OpenAI Codex CLI** (no marketplace command yet — manual symlink):\n\n```bash\ngit clone https://github.com/VectifyAI/OpenKB.git ~/openkb-src\nmkdir -p ~/.agents/skills\nln -s ~/openkb-src/skills/openkb ~/.agents/skills/openkb\n```\n\nThe skill is read-only — it won't run `openkb add`, `remove`, or `lint --fix` without you asking. See [`skills/openkb/SKILL.md`](skills/openkb/SKILL.md) for the full instruction set.\n\n# 🧭 Learn More\n\n### Compared to Karpathy's Approach\n\n| | Karpathy's workflow | OpenKB |\n|---|---|---|\n| Short documents | LLM reads directly | markitdown → LLM reads |\n| Long documents | Context limits, context rot | PageIndex tree index |\n| Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |\n| Wiki compilation | LLM agent | LLM agent (same) |\n| Q\u0026A | Query over wiki | Wiki + PageIndex retrieval |\n\n### The Stack\n\n- [PageIndex](https://github.com/VectifyAI/PageIndex) — Vectorless, reasoning-based document indexing and retrieval\n- [markitdown](https://github.com/microsoft/markitdown) — Universal file-to-markdown conversion\n- [OpenAI Agents SDK](https://github.com/openai/openai-agents-python) — Agent framework (supports non-OpenAI models via LiteLLM)\n- [LiteLLM](https://github.com/BerriAI/litellm) — Multi-provider LLM gateway\n- [Click](https://click.palletsprojects.com/) — CLI framework\n- [watchdog](https://github.com/gorakhargosh/watchdog) — Filesystem monitoring\n\n### Roadmap\n\n- [ ] Extend long document handling to non-PDF formats\n- [ ] Scale to large document collections with nested folder support\n- [ ] Hierarchical concept (topic) indexing for massive knowledge bases\n- [ ] Database-backed storage engine\n- [ ] Web UI for browsing and managing wikis\n\n### Contributing\n\nContributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.\n\n### License\n\nApache 2.0. See [LICENSE](LICENSE).\n\n### Support Us\n\nIf you find OpenKB useful, please give us a star 🌟 — and check out [PageIndex](https://github.com/VectifyAI/PageIndex) too!  \n\n\u003cdiv\u003e\n\n[![Twitter](https://img.shields.io/badge/Twitter-000000?style=for-the-badge\u0026logo=x\u0026logoColor=white)](https://x.com/PageIndexAI)\u0026ensp;\n[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/company/vectify-ai/)\u0026ensp;\n[![Contact Us](https://img.shields.io/badge/Contact_Us-3B82F6?style=for-the-badge\u0026logo=envelope\u0026logoColor=white)](https://ii2abc2jejf.typeform.com/to/tK3AXl8T)\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVectifyAI%2FOpenKB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FVectifyAI%2FOpenKB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FVectifyAI%2FOpenKB/lists"}