https://github.com/VectifyAI/OpenKB
OpenKB: Open LLM Knowledge Base
agents ai knowledge-base llm rag retrieval
OpenKB: Open LLM Knowledge Base
- Host: GitHub
- URL: https://github.com/VectifyAI/OpenKB
- Owner: VectifyAI
- License: apache-2.0
- Created: 2026-04-04T12:26:04.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-24T01:58:28.000Z (20 days ago)
- Last Synced: 2026-05-24T03:21:04.325Z (20 days ago)
- Topics: agents, ai, knowledge-base, llm, rag, retrieval
- Language: Python
- Homepage: https://pageindex.ai
- Size: 29.2 MB
- Stars: 1,902
- Watchers: 7
- Forks: 205
- Open Issues: 15
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# OpenKB — Open LLM Knowledge Base
Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB
---
# 📑 What is OpenKB
**OpenKB (Open Knowledge Base)** is an open-source command-line tool that uses LLMs to compile raw documents into a structured, interlinked, wiki-style knowledge base. It is powered by [**PageIndex**](https://github.com/VectifyAI/PageIndex) for vectorless retrieval over long documents.
The idea is based on a [concept](https://x.com/karpathy/status/2039805659525644595) described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.
### Why not traditional RAG?
Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.
OpenKB has two layers: a **wiki foundation** that compiles and maintains your knowledge, and **generators** (query / chat / Skill Factory) that turn it into useful output. See [Usage](#️-usage) for the full command list.
# 🚀 Getting Started
### Install
```bash
pip install openkb
```
Other install options:
- **Latest from GitHub:**
```bash
pip install git+https://github.com/VectifyAI/OpenKB.git
```
- **Install from source** (editable, for development):
```bash
git clone https://github.com/VectifyAI/OpenKB.git
cd OpenKB
pip install -e .
```
### Quick Start
```bash
# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb
# 2. Initialize the knowledge base
openkb init
# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/ # Add a whole directory
openkb add https://arxiv.org/pdf/2509.11420 # Or fetch from a URL
# 4. Ask a question
openkb query "What are the main findings?"
# 5. Or chat interactively
openkb chat
# 6. Or distill your wiki into a redistributable skill
openkb skill new my-expert "Reason like an expert on "
```
### Set up your LLM
OpenKB comes with [multi-LLM support](https://docs.litellm.ai/docs/providers) (e.g., OpenAI, Claude, Gemini) via [LiteLLM](https://github.com/BerriAI/litellm) (pinned to a [safe version](https://docs.litellm.ai/blog/security-update-march-2026)).
Set your model during `openkb init`, or in [`.openkb/config.yaml`](#configuration), using `provider/model` LiteLLM format (like `anthropic/claude-sonnet-4-6`). OpenAI models can omit the prefix (like `gpt-5.4`).
Create a `.env` file with your LLM API key:
```bash
LLM_API_KEY=your_llm_api_key
```
# 🧩 How OpenKB Works
### Architecture
```
raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                         │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                         │
 │                                         ▼
 │                           Wiki Compilation (using LLM)
 │                                         │
 ▼                                         ▼
wiki/                                      │  ← the foundation
 ├── index.md         Knowledge base overview
 ├── log.md           Operations timeline
 ├── AGENTS.md        Wiki schema (LLM instructions)
 ├── sources/         Full-text conversions
 ├── summaries/       Per-document summaries
 ├── concepts/        Cross-document synthesis ← the good stuff
 ├── explorations/    Saved query results
 └── reports/         Lint reports
                                           │
                    ┌──────────────────────┼──────────────────────┐
                    ▼                      ▼                      ▼
              query / chat           Skill Factory            (future)
            (LLM answers from      openkb skill new        ppt / podcast /
                the wiki)          → output/skills/          report / …
                                   + marketplace.json
```
### Short vs. Long Document Handling
| | Short documents | Long documents (PDF ≥ 20 pages) |
|---|---|---|
| **Convert** | markitdown → Markdown | PageIndex → tree index + summaries |
| **Images** | Extracted inline (pymupdf) | Extracted by PageIndex |
| **LLM reads** | Full text | Document trees |
| **Result** | summary + concepts | summary + concepts |
Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents.
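The routing rule in the table can be sketched in a few lines of Python. This is an illustrative sketch only, not OpenKB's actual code: the function name `choose_pipeline` and its return values are hypothetical, and the default of 20 pages is taken from the table header above.

```python
from pathlib import Path
from typing import Optional

# Assumed default, taken from the "PDF >= 20 pages" rule in the table above.
PAGEINDEX_THRESHOLD = 20


def choose_pipeline(path: str, page_count: Optional[int] = None,
                    threshold: int = PAGEINDEX_THRESHOLD) -> str:
    """Return 'pageindex' for long PDFs and 'markitdown' for everything else.

    Hypothetical helper: the name and return values are illustrative.
    """
    is_pdf = Path(path).suffix.lower() == ".pdf"
    if is_pdf and page_count is not None and page_count >= threshold:
        return "pageindex"   # long PDF: build a tree index with summaries
    return "markitdown"      # short or non-PDF document: convert to Markdown


print(choose_pipeline("paper.pdf", page_count=42))
print(choose_pipeline("notes.docx"))
```

Only PDFs at or above the threshold take the tree-index path; non-PDF files go through Markdown conversion regardless of length, which matches the roadmap item about extending long-document handling to other formats.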
### Knowledge Compilation
When you add a document, the LLM:
1. Generates a **summary** page
2. Reads existing **concept** pages
3. Creates or updates concepts with cross-document synthesis
4. Updates the **index** and **log**
A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
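The four steps above can be sketched with a toy in-memory wiki. Everything here is an assumption made for illustration: the `Wiki` class, the `fake_*` functions, and the fixed concept vocabulary stand in for LLM calls and on-disk Markdown files, and do not reflect OpenKB's internals.

```python
from dataclasses import dataclass, field

# Hypothetical concept vocabulary; a real run would ask the LLM instead.
VOCAB = ("attention", "retrieval", "transformer")


@dataclass
class Wiki:
    summaries: dict = field(default_factory=dict)  # doc name -> summary text
    concepts: dict = field(default_factory=dict)   # concept -> source docs
    index: list = field(default_factory=list)
    log: list = field(default_factory=list)


def fake_summarize(text: str) -> str:
    # Stand-in for an LLM call: keep only the first sentence.
    return text.split(".")[0].strip() + "."


def fake_extract_concepts(text: str) -> list:
    # Stand-in for an LLM call: match against the fixed vocabulary.
    lowered = text.lower()
    return [c for c in VOCAB if c in lowered]


def add_document(wiki: Wiki, name: str, text: str) -> list:
    # 1. Generate a summary page.
    wiki.summaries[name] = fake_summarize(text)
    # 2. Read existing concept pages.
    existing = set(wiki.concepts)
    # 3. Create or update concepts, recording every source that mentions them.
    touched = fake_extract_concepts(text)
    for concept in touched:
        wiki.concepts.setdefault(concept, []).append(name)
    # 4. Update the index and the log.
    wiki.index = sorted(wiki.summaries)
    created = [c for c in touched if c not in existing]
    updated = [c for c in touched if c in existing]
    wiki.log.append(f"add {name}: created={created} updated={updated}")
    return touched


wiki = Wiki()
add_document(wiki, "a.md", "Attention is used in the transformer. More text.")
add_document(wiki, "b.md", "Retrieval with attention.")
print(wiki.concepts["attention"])
```

The second document updates the existing `attention` concept rather than creating a new page, which is the accumulation behavior described above.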
# ⚙️ Usage
OpenKB commands fall into two layers: the **wiki foundation** (compile + manage your knowledge) and **generators** (turn that wiki into useful output).
## 🧱 Wiki Foundation — compile and maintain
| Command | Description |
|---|---|
| `openkb init` | Initialize a new knowledge base (interactive) |
| `openkb add <file_or_dir_or_URL>` | Add documents and compile to wiki. URL ingest auto-detects PDF (saved as `.pdf` → PageIndex / markitdown) vs HTML (trafilatura main-content extract → `.md`) |
| `openkb remove <doc>` | Remove a document and clean up its wiki pages, images, registry, and PageIndex state (use `--dry-run` to preview, `--keep-raw` / `--keep-empty-concepts` to retain artifacts) |
| `openkb watch` | Watch `raw/` and auto-compile new files |
| `openkb lint` | Run structural + knowledge health checks |
| `openkb list` | List indexed documents and concepts |
| `openkb status` | Show knowledge base stats |
| `openkb feedback ["msg"]` | File feedback by opening a prefilled GitHub issue (use `--type bug/feature/question` to tag the issue) |
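The PDF-versus-HTML detection mentioned for URL ingest could look roughly like the sketch below. The README does not say how OpenKB makes this decision, so the approach here (check the `Content-Type` header, then fall back to the `%PDF-` magic bytes) is an assumption, and `classify_download` is a hypothetical name.

```python
def classify_download(content_type: str, body: bytes) -> str:
    """Return 'pdf' or 'html' for a fetched URL.

    Assumed heuristic: trust an explicit PDF MIME type, otherwise look for
    the '%PDF-' signature that PDF files start with.
    """
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/pdf" or body.lstrip()[:5] == b"%PDF-":
        return "pdf"
    return "html"


print(classify_download("text/html; charset=utf-8", b"<html></html>"))
print(classify_download("application/octet-stream", b"%PDF-1.7\n"))
```

Checking the leading bytes as well as the header covers servers that return PDFs with a generic MIME type.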
## ✨ Generators — turn the wiki into output
A "generator" reads from the compiled wiki and produces something usable: an answer, a conversation, a skill folder. The wiki is the substrate; generators are the surfaces.
| Command | Output |
|---|---|
| `openkb query "question"` | A grounded answer with citations (use `--save` to persist to `wiki/explorations/`) |
| `openkb chat` | Interactive multi-turn session over the wiki (use `--resume`, `--list`, `--delete` to manage sessions) |
| `openkb skill new <name> "<intent>"` | A redistributable Anthropic Skill at `/output/skills//` + auto-updated `marketplace.json` |
| `openkb skill validate [name]` | Structural lint of compiled skills (frontmatter, file sizes, wikilinks, scripts/ stdlib check with `--strict`). Auto-runs at end of `skill new` |
| `openkb skill eval <name>` | Trigger-accuracy evaluation: does the `description:` field actually fire? LLM generates eval prompts; grader LLM scores activation. `--save` persists the eval set |
| `openkb skill history <name>` / `openkb skill rollback <name>` | Iteration workspace: every overwrite saves the previous version to `output/skills/-workspace/iteration-N/` with a structural diff. Rollback restores any iteration |
### Query & Chat — ask the wiki
`openkb query "..."` answers a single question. `openkb chat` is interactive: each turn carries history, so you can dig into a topic without re-typing context. Both use the same underlying wiki and the same retrieval primitives (PageIndex for long documents, direct concept reads for short ones).
```bash
openkb query "What does the literature say about attention scaling?"
openkb chat # start a new session
openkb chat --resume # resume the most recent session
openkb chat --resume 20260411 # resume by id (unique prefix works)
openkb chat --list # list all sessions
openkb chat --delete # delete a session
```
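Resuming "by id (unique prefix works)" implies a prefix lookup that succeeds only when exactly one session matches. The sketch below shows one way such a lookup could behave. The function name and the session ID format are hypothetical and not taken from OpenKB.

```python
def resolve_session(prefix: str, session_ids: list) -> str:
    """Return the single session ID starting with `prefix`, else raise."""
    matches = [s for s in session_ids if s.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no session matches {prefix!r}")
    raise LookupError(f"prefix {prefix!r} is ambiguous: {sorted(matches)}")


# Hypothetical ID format (date-time); the real format may differ.
sessions = ["20260411-093015", "20260412-101500", "20260412-173000"]
print(resolve_session("20260411", sessions))
```

Raising on an ambiguous prefix, rather than picking the newest match, avoids silently resuming the wrong conversation.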
Inside a chat, type `/` to access slash commands (Tab to complete):
- `/help` — list available commands
- `/status` — show knowledge base status
- `/list` — list all documents
- `/add ` — add a document or directory without leaving the chat
- `/skill new ""` — compile a skill from this chat (see below)
- `/save [name]` — export the transcript to `wiki/explorations/`
- `/clear` — start a fresh session (the current one stays on disk)
- `/lint` — run knowledge base lint
- `/exit` — exit (Ctrl-D also works)
### 🛠 Skill Factory — *Drop in a book. Out comes a digital expert.*
The newest generator. `openkb skill new` distills any subset of your wiki into an [Anthropic Skill](https://docs.claude.com/en/docs/build-with-claude/skills) — a portable folder that **Claude Code, Codex CLI, Gemini CLI, and Cursor** all install and load natively. Drop in a book's worth of papers; out comes a specialist that other agents can call on.
```bash
openkb skill new karpathy-thinking \
"Reason about transformers and attention in Karpathy's style"
```
This produces:
```
/output/skills/karpathy-thinking/
├── SKILL.md               # YAML frontmatter + when-to-use + approach
├── references/            # depth material the agent loads on demand
│   ├── methodology.md
│   └── key-quotes.md
└── (scripts/)             # optional, only if intent implies computation
```
…plus an auto-updated `/.claude-plugin/marketplace.json` so the whole KB is one-line installable.
**Install locally:**
```bash
cp -r output/skills/karpathy-thinking ~/.claude/skills/
```
**Share with others** — push your KB to GitHub, then anyone runs:
```bash
npx skills@latest add /
```
**Iterate from chat** — compilation is one-shot, but follow-up edits aren't. Inside `openkb chat`, you can refine without re-running the whole pipeline:
```
/skill new karpathy-thinking "Reason about transformers like Karpathy"
[generation streams]
> description is too generic, make it about transformer implementations specifically
[agent edits SKILL.md frontmatter in place]
```
**Quality gates** — structural validation, trigger-accuracy + body-coverage evaluation, and full history/rollback:
```bash
# Lint structure (auto-runs at end of `skill new`)
openkb skill validate karpathy-thinking
openkb skill validate --strict # treat warnings as failures
# Does the description actually fire when it should?
openkb skill eval karpathy-thinking --save
# History + rollback if a new iteration regresses
openkb skill history karpathy-thinking
openkb skill rollback karpathy-thinking --to 2
```
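To give a sense of what a structural frontmatter check involves, here is a minimal sketch in Python. It is not the implementation behind `openkb skill validate`. It assumes that a `SKILL.md` starts with a `---` delimited block of flat `key: value` pairs and that `name` and `description` are the required fields; the function names are hypothetical.

```python
# Assumed minimum set of frontmatter fields for a skill.
REQUIRED_FIELDS = ("name", "description")


def parse_frontmatter(text: str):
    """Parse flat `key: value` pairs between leading '---' delimiters.

    Returns a dict, or None if the frontmatter is missing or unterminated.
    Nested YAML is deliberately ignored in this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep and key.strip() and not line[0].isspace():
            fields[key.strip()] = value.strip()
    return None  # closing delimiter never found


def validate_skill_md(text: str) -> list:
    """Return a list of problems; an empty list means the check passed."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["missing or unterminated YAML frontmatter"]
    return [f"missing required field: {name}"
            for name in REQUIRED_FIELDS if not fields.get(name)]


skill = "---\nname: karpathy-thinking\ndescription: Reason about transformers\n---\n# Approach\n"
print(validate_skill_md(skill))
```

A real validator would use a full YAML parser and also check file sizes and wikilinks, as the command table describes.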
### Configuration
Settings are initialized by `openkb init`, and stored in `.openkb/config.yaml`:
```yaml
model: gpt-5.4 # LLM model (any LiteLLM-supported provider)
language: en # Wiki output language
pageindex_threshold: 20 # PDF pages threshold for PageIndex
```
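For scripts that need to read these settings, a flat file like the one above can be parsed with the standard library alone. The sketch below is an assumption-laden illustration: `load_flat_config` is a hypothetical helper, the defaults simply reuse the example values above, and it handles only single-level `key: value` lines (use a real YAML parser for anything richer).

```python
# Assumed defaults: these reuse the example values shown above.
DEFAULTS = {"model": "gpt-5.4", "language": "en", "pageindex_threshold": 20}


def load_flat_config(text: str) -> dict:
    """Parse single-level `key: value  # comment` lines over the defaults.

    Limitation of this sketch: values containing '#' are truncated.
    """
    config = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        config[key] = int(value) if value.isdigit() else value
    return config


example = "model: anthropic/claude-sonnet-4-6  # any LiteLLM model\npageindex_threshold: 30\n"
print(load_flat_config(example))
```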
Model names use `provider/model` LiteLLM [format](https://docs.litellm.ai/docs/providers) (OpenAI models can omit the prefix):
| Provider | Model example |
|---|---|
| OpenAI | `gpt-5.4` |
| Anthropic | `anthropic/claude-sonnet-4-6` |
| Gemini | `gemini/gemini-3.1-pro-preview` |
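The naming convention in the table amounts to splitting on the first `/` and treating a bare name as an OpenAI model. The sketch below mirrors that convention as described here. It is not LiteLLM's own routing logic, and `split_model_name` is a hypothetical helper.

```python
def split_model_name(model: str) -> tuple:
    """Split 'provider/model' into (provider, model).

    Per the convention above, a name without a prefix is treated as OpenAI.
    """
    provider, sep, name = model.partition("/")
    if not sep:
        return "openai", model
    return provider, name


for m in ("gpt-5.4", "anthropic/claude-sonnet-4-6", "gemini/gemini-3.1-pro-preview"):
    print(split_model_name(m))
```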
### PageIndex Integration
Long documents are challenging for LLMs due to context limits, context rot, and summarization loss.
[PageIndex](https://github.com/VectifyAI/PageIndex) solves this with vectorless, reasoning-based retrieval — building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.
PageIndex runs locally by default using the [open-source version](https://github.com/VectifyAI/PageIndex), with no external dependencies required.
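To make the idea of retrieval over a tree index concrete, here is a toy sketch in Python. It is not PageIndex's algorithm or API: the `Node` structure is hypothetical, and simple word overlap stands in for the LLM reasoning step that decides which branch to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple                      # (first_page, last_page)
    children: list = field(default_factory=list)


def overlap(question: str, node: Node) -> int:
    # Stand-in for LLM reasoning: count words shared with title + summary.
    words = set(question.lower().split())
    return len(words & set(f"{node.title} {node.summary}".lower().split()))


def descend(root: Node, question: str):
    """Walk from the root to a leaf, picking the best child at each level."""
    node, path = root, [root.title]
    while node.children:
        node = max(node.children, key=lambda child: overlap(question, child))
        path.append(node.title)
    return node, path


tree = Node("Paper", "full document", (1, 18), [
    Node("Methods", "training data and model architecture", (2, 9), [
        Node("Data", "training data sources", (2, 4)),
        Node("Model", "model architecture details", (5, 9)),
    ]),
    Node("Results", "benchmark accuracy results", (10, 18)),
])

leaf, path = descend(tree, "model architecture")
print(path, leaf.pages)
```

The point of the structure is that only node titles and summaries are inspected during the walk, so the full text of the selected page range is read only once a leaf has been chosen.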
#### Optional: Cloud Support
For large or complex PDFs, [PageIndex Cloud](https://docs.pageindex.ai/) can be used to access additional capabilities, including:
- OCR support for scanned PDFs (via hosted VLM models)
- Faster structure generation
- Scalable indexing for large documents
Set `PAGEINDEX_API_KEY` in your `.env` to enable cloud features:
```
PAGEINDEX_API_KEY=your_pageindex_api_key
```
### AGENTS.md
The `wiki/AGENTS.md` file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.
At runtime, the LLM reads `AGENTS.md` from disk, so your edits take effect immediately.
### Using with Obsidian
OpenKB's wiki is a directory of Markdown files with `[[wikilinks]]`. Obsidian renders it natively.
1. Open `wiki/` as an Obsidian vault
2. Browse summaries, concepts, and explorations
3. Use graph view to see knowledge connections
4. Use Obsidian Web Clipper to add web articles to `raw/`
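Because the wiki is plain Markdown with `[[wikilinks]]`, it is also easy to inspect with a short script. The sketch below extracts link targets and reports links that point to pages that do not exist. It assumes Obsidian-style `[[page]]`, `[[page|alias]]`, and `[[page#heading]]` forms; the page names are made up, and this is not how `openkb lint` is implemented.

```python
import re

# Captures the target page of [[page]], [[page|alias]] and [[page#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(markdown: str) -> list:
    """Return the target page names of all wikilinks in the text."""
    return [target.strip() for target in WIKILINK.findall(markdown)]


def broken_links(pages: dict) -> dict:
    """Map each page name to the link targets that are not known pages."""
    known = set(pages)
    report = {}
    for name, text in pages.items():
        missing = [t for t in extract_links(text) if t not in known]
        if missing:
            report[name] = missing
    return report


# Hypothetical page names and contents, for illustration only.
pages = {
    "concepts/attention": "See [[concepts/rag|RAG]] and [[summaries/paper#Intro]].",
    "concepts/rag": "Back to [[concepts/attention]].",
}
print(broken_links(pages))
```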
### Using with Claude Code / Codex / Gemini CLI
OpenKB ships a `SKILL.md` so any agent CLI can read your compiled wiki — no extra runtime, no MCP setup, just install the skill once.
**Claude Code**:
```
/plugin marketplace add VectifyAI/OpenKB
/plugin install openkb@vectify
```
**Gemini CLI**:
```bash
gemini skills install https://github.com/VectifyAI/OpenKB.git --path skills/openkb --consent
```
**OpenAI Codex CLI** (no marketplace command yet — manual symlink):
```bash
git clone https://github.com/VectifyAI/OpenKB.git ~/openkb-src
mkdir -p ~/.agents/skills
ln -s ~/openkb-src/skills/openkb ~/.agents/skills/openkb
```
The skill is read-only — it won't run `openkb add`, `remove`, or `lint --fix` without you asking. See [`skills/openkb/SKILL.md`](skills/openkb/SKILL.md) for the full instruction set.
# 🧭 Learn More
### Compared to Karpathy's Approach
| | Karpathy's workflow | OpenKB |
|---|---|---|
| Short documents | LLM reads directly | markitdown → LLM reads |
| Long documents | Context limits, context rot | PageIndex tree index |
| Supported formats | Web clipper → .md | PDF, Word, PPT, Excel, HTML, text, CSV, .md |
| Wiki compilation | LLM agent | LLM agent (same) |
| Q&A | Query over wiki | Wiki + PageIndex retrieval |
### The Stack
- [PageIndex](https://github.com/VectifyAI/PageIndex) — Vectorless, reasoning-based document indexing and retrieval
- [markitdown](https://github.com/microsoft/markitdown) — Universal file-to-markdown conversion
- [OpenAI Agents SDK](https://github.com/openai/openai-agents-python) — Agent framework (supports non-OpenAI models via LiteLLM)
- [LiteLLM](https://github.com/BerriAI/litellm) — Multi-provider LLM gateway
- [Click](https://click.palletsprojects.com/) — CLI framework
- [watchdog](https://github.com/gorakhargosh/watchdog) — Filesystem monitoring
### Roadmap
- [ ] Extend long document handling to non-PDF formats
- [ ] Scale to large document collections with nested folder support
- [ ] Hierarchical concept (topic) indexing for massive knowledge bases
- [ ] Database-backed storage engine
- [ ] Web UI for browsing and managing wikis
### Contributing
Contributions are welcome! Please submit a pull request, or open an [issue](https://github.com/VectifyAI/OpenKB/issues) for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.
### License
Apache 2.0. See [LICENSE](LICENSE).
### Support Us
If you find OpenKB useful, please give us a star 🌟 — and check out [PageIndex](https://github.com/VectifyAI/PageIndex) too!
- X: https://x.com/PageIndexAI
- LinkedIn: https://www.linkedin.com/company/vectify-ai/
- Typeform: https://ii2abc2jejf.typeform.com/to/tK3AXl8T