An open API service indexing awesome lists of open source software.

https://github.com/kkollsga/kglite-docs

Agent-first knowledge base for documents. Built on kglite + BAAI/bge-m3. Multi-format ingest, cross-checked summaries, review kanban, grounding checks, and an MCP server.
https://github.com/kkollsga/kglite-docs

anthropic bge-m3 claude-code embeddings kglite knowledge-graph mcp pdf rag

Last synced: 26 days ago
JSON representation

Agent-first knowledge base for documents. Built on kglite + BAAI/bge-m3. Multi-format ingest, cross-checked summaries, review kanban, grounding checks, and an MCP server.

Awesome Lists containing this project

README

          

# kglite-docs

> **Agent-first knowledge base for documents.** Ingest PDFs, Office files, Markdown, HTML, or images; chunk + embed them with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3); cluster, tag, summarise, fact-check, translate, and review them β€” and serve the whole thing to AI agents over MCP.

[![PyPI](https://img.shields.io/pypi/v/kglite-docs.svg)](https://pypi.org/project/kglite-docs/)
[![Python](https://img.shields.io/pypi/pyversions/kglite-docs.svg)](https://pypi.org/project/kglite-docs/)
[![Docs](https://readthedocs.org/projects/kglite-docs/badge/?version=latest)](https://kglite-docs.readthedocs.io/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Built on [`kglite`](https://github.com/kkollsga/kglite) (storage + vector search + clustering) and [`mcp-methods`](https://github.com/kkollsga/mcp-methods) (MCP framework).

---

## Why this and not generic RAG?

Most "RAG libraries" hand the agent `search(query) β†’ list[chunk]` and stop. kglite-docs treats the corpus as a *living* knowledge graph that records who did what β€” and gives the agent typed tools to act on it.

- πŸ“„ **Multi-format ingest** β€” PDF, DOCX, PPTX, MD, HTML, TXT, images. All flow into the same `Document β†’ Page β†’ Chunk` shape.
- 🀝 **Agents are first-class nodes** β€” their views, tags, summaries, verifications, and reviews are all queryable.
- βœ… **Cross-checked summaries** β€” one agent writes, a *different* agent verifies. Self-verification is rejected server-side.
- πŸ“‹ **Review kanban** β€” chunks move through `new β†’ in_review β†’ reviewed` with an immutable audit trail.
- πŸ›‘οΈ **Grounding checks** β€” score how well an agent's summary aligns with its sources. Catch hallucinations before they ship.
- 🌍 **Translations** β€” per-chunk, multi-translator, with author/reviewer provenance.
- πŸ–ΌοΈ **Agent-driven OCR** β€” scanned pages handed back as rendered PNGs; agent transcribes and the graph absorbs the result.
- πŸ”’ **Local & private** β€” parsing, embedding, and analysis all run on your machine against a local `.kgl` file. The only network call is a one-time bge-m3 model download; your documents never leave the host. See [Confidentiality](https://kglite-docs.readthedocs.io/en/latest/privacy/).

## Install

```bash
pip install kglite-docs
```

## 30 seconds of Python

```python
from kglite_docs import Corpus

with Corpus.create("kb.kgl") as corpus: # auto-saves on exit
corpus.ingest_dir("./papers") # PDF / DOCX / PPTX / MD / HTML / images
hits = corpus.search("transformer attention", top_k=5, agent_id="me")
ctx = corpus.compose_context("transformer attention", max_tokens=3000)
# ctx["items"] is a ranked, token-budgeted bundle ready for your LLM prompt
```

## 30 seconds of agent loop

Cross-checked enrichment in five lines:

```python
sid = corpus.add_summary(
target_id=hits[0]["id"], text="DPR uses a dual BERT encoder…",
agent_id="writer", model="opus-4.7",
)
# A different agent verifies β€” self-verification is rejected
corpus.verify_summary(sid, verdict="verified",
verifier_agent_id="reviewer", notes="checked p.5")
# Score how grounded the summary is in its source chunks
print(corpus.check_grounding(sid)["supported_fraction"]) # β†’ 1.0
```

## Run it as an MCP server

```bash
kglite-docs-mcp --db kb.kgl
```

Register with Claude Code:

```bash
claude mcp add kglite-docs -- kglite-docs-mcp --db /abs/path/kb.kgl
```

The agent now sees ~30 typed tools (`search`, `compose_context`, `add_summary`, `verify_summary`, `tag_chunk`, `cluster_chunks`, `claim_next_review`, …) plus `cypher_query` as an escape hatch.

## Read the docs

πŸ“– **Full documentation at [kglite-docs.readthedocs.io](https://kglite-docs.readthedocs.io/)**

- [Getting started](https://kglite-docs.readthedocs.io/en/latest/getting-started/) β€” 10 minutes from `pip install` to a running agent
- [Agent workflows](https://kglite-docs.readthedocs.io/en/latest/workflows/) β€” research, comparison, fact-checking, OCR loops, hallucination guards
- [Architecture](https://kglite-docs.readthedocs.io/en/latest/architecture/) β€” graph model, design rationale, the 30+ typed MCP tools
- [API reference](https://kglite-docs.readthedocs.io/en/latest/api/corpus/) β€” every method, every argument, IDE-friendly type stubs
- [Troubleshooting](https://kglite-docs.readthedocs.io/en/latest/troubleshooting/) β€” common failure modes
- [Confidentiality](https://kglite-docs.readthedocs.io/en/latest/privacy/) β€” everything runs local; what the one network call is (and isn't)
- [Changelog](https://kglite-docs.readthedocs.io/en/latest/changelog/)

## License

MIT.