https://github.com/kkollsga/kglite-docs
Agent-first knowledge base for documents. Built on kglite + BAAI/bge-m3. Multi-format ingest, cross-checked summaries, review kanban, grounding checks, and an MCP server.
anthropic bge-m3 claude-code embeddings kglite knowledge-graph mcp pdf rag
- Host: GitHub
- URL: https://github.com/kkollsga/kglite-docs
- Owner: kkollsga
- License: mit
- Created: 2026-05-28T08:26:33.000Z (28 days ago)
- Default Branch: main
- Last Pushed: 2026-05-28T08:33:18.000Z (28 days ago)
- Last Synced: 2026-05-28T10:21:34.294Z (28 days ago)
- Topics: anthropic, bge-m3, claude-code, embeddings, kglite, knowledge-graph, mcp, pdf, rag
- Language: Python
- Size: 148 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: docs/contributing.md
- License: LICENSE
README
# kglite-docs
> **Agent-first knowledge base for documents.** Ingest PDFs, Office files, Markdown, HTML, or images; chunk + embed them with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3); cluster, tag, summarise, fact-check, translate, and review them, then serve the whole thing to AI agents over MCP.
[PyPI](https://pypi.org/project/kglite-docs/) | [Documentation](https://kglite-docs.readthedocs.io/) | [License: MIT](https://opensource.org/licenses/MIT)
Built on [`kglite`](https://github.com/kkollsga/kglite) (storage + vector search + clustering) and [`mcp-methods`](https://github.com/kkollsga/mcp-methods) (MCP framework).
---
## Why this and not generic RAG?
Most "RAG libraries" hand the agent `search(query) -> list[chunk]` and stop. kglite-docs treats the corpus as a *living* knowledge graph that records who did what, and gives the agent typed tools to act on it.
- **Multi-format ingest**: PDF, DOCX, PPTX, MD, HTML, TXT, images. All flow into the same `Document → Page → Chunk` shape.
- **Agents are first-class nodes**: their views, tags, summaries, verifications, and reviews are all queryable.
- **Cross-checked summaries**: one agent writes, a *different* agent verifies. Self-verification is rejected server-side.
- **Review kanban**: chunks move through `new → in_review → reviewed` with an immutable audit trail.
- **Grounding checks**: score how well an agent's summary aligns with its sources. Catch hallucinations before they ship.
- **Translations**: per-chunk, multi-translator, with author/reviewer provenance.
- **Agent-driven OCR**: scanned pages are handed back as rendered PNGs; the agent transcribes them and the graph absorbs the result.
- **Local & private**: parsing, embedding, and analysis all run on your machine against a local `.kgl` file. The only network call is a one-time bge-m3 model download; your documents never leave the host. See [Confidentiality](https://kglite-docs.readthedocs.io/en/latest/privacy/).
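To make the review kanban and the cross-check rule concrete, here is a minimal, self-contained sketch of the two invariants described above: a chunk only moves forward through `new → in_review → reviewed`, every move is appended to a trail that callers cannot edit, and an author cannot verify their own summary. The names `ReviewItem`, `advance`, and `check_verifier` are hypothetical and chosen for illustration; they are not the kglite-docs API, and the library's actual storage and enforcement may differ.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only; not the kglite-docs implementation.
NEXT_STATE = {"new": "in_review", "in_review": "reviewed"}


@dataclass
class ReviewItem:
    chunk_id: str
    state: str = "new"
    _trail: list = field(default_factory=list)

    def advance(self, agent_id: str) -> str:
        """Move to the next kanban column and record who did it."""
        next_state = NEXT_STATE.get(self.state)
        if next_state is None:
            raise ValueError(f"{self.chunk_id} is already in final state {self.state!r}")
        self._trail.append((self.state, next_state, agent_id))
        self.state = next_state
        return self.state

    @property
    def audit_trail(self) -> tuple:
        """Read-only snapshot: callers get a tuple, never the internal list."""
        return tuple(self._trail)


def check_verifier(author_agent_id: str, verifier_agent_id: str) -> None:
    """Reject self-verification: the verifier must differ from the author."""
    if author_agent_id == verifier_agent_id:
        raise ValueError("self-verification is rejected")


item = ReviewItem("chunk-1")
item.advance("reviewer-a")
item.advance("reviewer-b")
print(item.state, item.audit_trail)
```

Returning a tuple from `audit_trail` is one simple way to keep the history append-only from the caller's point of view; a real system would also need persistence and access control.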
## Install
```bash
pip install kglite-docs
```
## 30 seconds of Python
```python
from kglite_docs import Corpus

with Corpus.create("kb.kgl") as corpus:  # auto-saves on exit
    corpus.ingest_dir("./papers")  # PDF / DOCX / PPTX / MD / HTML / images
    hits = corpus.search("transformer attention", top_k=5, agent_id="me")
    ctx = corpus.compose_context("transformer attention", max_tokens=3000)
    # ctx["items"] is a ranked, token-budgeted bundle ready for your LLM prompt
```
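For intuition about what a "ranked, token-budgeted bundle" means, the sketch below greedily packs already-ranked chunks under a token limit. It is an assumption-laden illustration, not how `compose_context` is implemented: `pack_context` is a hypothetical helper, and token counts are approximated by whitespace-separated words rather than a real tokenizer.

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=None):
    """Keep ranked chunks, in rank order, while they fit the token budget.

    Hypothetical helper for illustration; not part of kglite-docs.
    """
    if count_tokens is None:
        # Crude proxy: one token per whitespace-separated word.
        count_tokens = lambda text: len(text.split())

    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a later, smaller chunk may fit
        packed.append(chunk)
        used += cost
    return packed, used


chunks = [
    "attention is all you need",               # 5 words
    "transformers use self attention layers",  # 5 words
    "unrelated",                               # 1 word
]
packed, used = pack_context(chunks, max_tokens=6)
print(packed, used)
```

Skipping an oversized chunk instead of stopping at it trades strict rank order for better use of the budget; stopping at the first chunk that does not fit is the other common choice.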
## 30 seconds of agent loop
Cross-checked enrichment in five lines:
```python
sid = corpus.add_summary(
    target_id=hits[0]["id"], text="DPR uses a dual BERT encoder…",
    agent_id="writer", model="opus-4.7",
)
# A different agent verifies; self-verification is rejected
corpus.verify_summary(sid, verdict="verified",
                      verifier_agent_id="reviewer", notes="checked p.5")
# Score how grounded the summary is in its source chunks
print(corpus.check_grounding(sid)["supported_fraction"])  # → 1.0
```
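One way to read a score like `supported_fraction` is "the share of summary sentences that the source chunks back up". The toy sketch below computes that with plain word overlap. It is only an assumption about the general idea: the function, the `threshold` parameter, and the overlap heuristic are all hypothetical, and the library's own grounding check may work quite differently (for example, using embeddings).

```python
import re


def supported_fraction(summary, sources, threshold=0.5):
    """Fraction of summary sentences whose words mostly appear in the sources.

    Toy lexical heuristic for illustration; not the kglite-docs algorithm.
    """
    def tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    source_tokens = set()
    for source in sources:
        source_tokens |= tokens(source)

    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not sentences:
        return 0.0

    supported = 0
    for sentence in sentences:
        words = tokens(sentence)
        if words and len(words & source_tokens) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)


sources = ["DPR uses a dual BERT encoder for questions and passages."]
summary = "DPR uses a dual BERT encoder. It was trained on Mars."
print(supported_fraction(summary, sources))  # 0.5: the second sentence has no support
```

Word overlap misses paraphrases and can be fooled by shared common words, which is why a score like this is best treated as a signal for review, not a verdict.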
## Run it as an MCP server
```bash
kglite-docs-mcp --db kb.kgl
```
Register with Claude Code:
```bash
claude mcp add kglite-docs -- kglite-docs-mcp --db /abs/path/kb.kgl
```
The agent now sees ~30 typed tools (`search`, `compose_context`, `add_summary`, `verify_summary`, `tag_chunk`, `cluster_chunks`, `claim_next_review`, …) plus `cypher_query` as an escape hatch.
## Read the docs
**Full documentation at [kglite-docs.readthedocs.io](https://kglite-docs.readthedocs.io/)**
- [Getting started](https://kglite-docs.readthedocs.io/en/latest/getting-started/): 10 minutes from `pip install` to a running agent
- [Agent workflows](https://kglite-docs.readthedocs.io/en/latest/workflows/): research, comparison, fact-checking, OCR loops, hallucination guards
- [Architecture](https://kglite-docs.readthedocs.io/en/latest/architecture/): graph model, design rationale, the 30+ typed MCP tools
- [API reference](https://kglite-docs.readthedocs.io/en/latest/api/corpus/): every method, every argument, IDE-friendly type stubs
- [Troubleshooting](https://kglite-docs.readthedocs.io/en/latest/troubleshooting/): common failure modes
- [Confidentiality](https://kglite-docs.readthedocs.io/en/latest/privacy/): everything runs locally; what the one network call is (and isn't)
- [Changelog](https://kglite-docs.readthedocs.io/en/latest/changelog/)
## License
MIT.