
https://github.com/ontos-ai/knowhere

Knowhere automates the complex pipeline of extracting, parsing, and transforming messy documents into structured, high-quality data optimized for AI Agents, Agentic RAG, and traditional vector-based RAG workflows.

agent ai-agents chromadb claude claude-code cursor elasticsearch gemini gpt langchain milvus qdrant rag rag-pipeline skills


Build AI Agent Memory from Real-World Documents



License: Apache 2.0


🔗 Website | 📄 Docs | 🏠 Self-Host | 🖥️ Dashboard

## What We Are

**We're not developing the next MinerU. Instead, we're building document memory infrastructure that agents can effectively consume.**

Knowhere turns unstructured documents into persistent, navigable memory for AI agents. It handles parsing, hierarchy extraction, multi-modal structuring, and graph construction, giving your agents structured, high-quality context for *Agentic RAG*, *traditional RAG*, or any LLM workflow.

> [!TIP]
> Knowhere stands on the shoulders of giants like MinerU and PyMuPDF. We take their output, optimize it, and then build **hierarchical structure** and **multi-modal cross-document graphs** on top. The result is a persistent, citable memory layer purpose-built for agent consumption.

> [!NOTE]
> **Get started in seconds with Knowhere Cloud.**
> Avoid the complexity of self-deployment. Use our managed API at [knowhereto.ai](https://knowhereto.ai) and enjoy **$5 in free credits** upon registration.

## 📢 News

- **May 7, 2026**: 🚀 **Knowhere is now Open Source!** We have open-sourced our entire stack for document ingestion, parsing, and agentic RAG. You can now self-host the full platform using [knowhere-self-hosted](https://github.com/Ontos-AI/knowhere-self-hosted). Check out our [Contribution Guide](CONTRIBUTING.md) to get involved!
- **Apr 30, 2026**: 📦 **Version [2026.04.30.1](https://github.com/Ontos-AI/knowhere/releases/tag/2026.04.30.1) has been released.** This update includes several stability improvements and initial support for the agentic RAG layer. See the [full changelog](https://github.com/Ontos-AI/knowhere/commits/2026.04.30.1) for details.

## How it Works

Knowhere turns raw documents into a structured memory store that AI agents can navigate and cite. The process follows two steps:

### Step 1: Parse and Build Memory



Parsing, chunking, hierarchy extraction, and graph construction are unified into one outcome: a navigable memory layer for AI agents.

- **Parse**: Route PDFs, Office files, images, tables, Markdown, and text to specialized parsers.
- **Structure**: Preserve headings, section paths, multi-modal assets, and chunk relationships.
- **Build Memory**: Store chunks, navigation trees, summaries, and graph links as agent-ready context.
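
The outcome of these three steps can be pictured as chunks that keep their section paths, grouped into a navigation tree. The following is a minimal sketch under stated assumptions: `Chunk`, `SectionNode`, and `build_tree` are hypothetical names chosen for illustration and do not reflect Knowhere's actual schema or storage layer.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One unit of parsed content (hypothetical schema, not Knowhere's)."""
    chunk_id: str
    doc_id: str
    section_path: tuple[str, ...]  # heading trail, e.g. ("Methods", "Setup")
    text: str
    asset_ids: list[str] = field(default_factory=list)  # linked images/tables


@dataclass
class SectionNode:
    """A node in the navigation tree derived from section paths."""
    title: str
    children: dict[str, "SectionNode"] = field(default_factory=dict)
    chunk_ids: list[str] = field(default_factory=list)


def build_tree(chunks: list[Chunk], root_title: str = "root") -> SectionNode:
    """Group chunks under nested section nodes, keeping first-seen order."""
    root = SectionNode(root_title)
    for chunk in chunks:
        node = root
        for title in chunk.section_path:
            # Create the section node the first time its title is seen.
            node = node.children.setdefault(title, SectionNode(title))
        node.chunk_ids.append(chunk.chunk_id)
    return root


chunks = [
    Chunk("c1", "report.pdf", ("Methods",), "We sampled 200 files."),
    Chunk("c2", "report.pdf", ("Methods", "Setup"), "Hardware list.", ["table-1"]),
    Chunk("c3", "report.pdf", ("Results",), "Accuracy improved."),
]
tree = build_tree(chunks)
print(list(tree.children))  # ['Methods', 'Results']
print(tree.children["Methods"].children["Setup"].chunk_ids)  # ['c2']
```

Keeping the section path on every chunk means the tree can be rebuilt from the chunks alone, and each chunk stays citable by document and section.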

### Step 2: Agentic Retrieval



Agents retrieve by navigating memory instead of depending on a single flat vector lookup.

- **Discover**: Fuse keyword, path, content, and semantic signals for broad first-pass coverage.
- **Navigate**: Walk section trees and graph links to drill into the most relevant document regions.
- **Cite Evidence**: Return traceable results with source document, section, chunk, and linked image or table assets.
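
To make the navigate-then-cite idea concrete, here is a toy sketch. It is an assumption-heavy illustration: the nested-dict tree layout, the `overlap` word-matching score, and the `navigate` function are all hypothetical, and a simple keyword overlap stands in for the relevance judgment that Knowhere's agents would make.

```python
from typing import Optional


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(node: dict, query: str, path: tuple = ()) -> Optional[dict]:
    """Descend into the best-matching section, then cite its best chunk."""
    path = path + (node["title"],)
    children = node.get("children", [])
    if children:
        def score(child: dict) -> int:
            return overlap(query, child["title"] + " " + child.get("summary", ""))

        best_child = max(children, key=score)
        if score(best_child) > 0:
            return navigate(best_child, query, path)
    chunks = node.get("chunks", [])
    if not chunks:
        return None
    best = max(chunks, key=lambda chunk: overlap(query, chunk["text"]))
    # The citation carries everything needed to trace the evidence.
    return {
        "document": path[0],
        "section": " > ".join(path[1:]),
        "chunk_id": best["id"],
        "assets": best.get("assets", []),
    }


doc = {
    "title": "report.pdf",
    "summary": "annual report",
    "children": [
        {"title": "Methods", "summary": "sampling and hardware setup",
         "chunks": [{"id": "c1", "text": "We sampled 200 files.", "assets": []}]},
        {"title": "Results", "summary": "accuracy and recall numbers",
         "chunks": [
             {"id": "c2", "text": "Recall stayed flat.", "assets": []},
             {"id": "c3", "text": "Accuracy improved after tuning.", "assets": ["fig-2"]},
         ]},
    ],
}
citation = navigate(doc, "how much did accuracy improve")
print(citation)
```

The returned dictionary mirrors the "Cite Evidence" bullet: source document, section, chunk, and any linked image or table assets.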

## Performance Benchmark

Knowhere improves the accuracy of AI agents performing tasks (e.g., searching, modifying, and answering) on real-world data. Compared with giving agents raw documents directly, or the .md/.json files produced by other parsers, Knowhere achieves higher success rates with fewer resources.


*Figure: Benchmark performance, Agent + Knowhere vs. others.*

### Key Advantages

- **Superior Accuracy**: Knowhere drastically improves both **First-time Accuracy** (+36% over raw docs) and **Recall** (+10%), ensuring agents find the right evidence faster.
- **Enhanced Reliability**: With user feedback, agents using Knowhere hit **79% accuracy**, a significant jump compared to the ~53% ceiling of raw documents.
- **Higher Efficiency**: Agents require **fewer loops**, consume **fewer tokens**, and spend **less time** searching. By navigating a structured memory graph instead of reading monolithic texts, the token overhead drops while precision increases.

*(Data generated from internal evaluation across identical agentic RAG tasks.)*

> [!NOTE]
> **📊 Benchmarks are actively expanding.** The comparison above currently covers MinerU as the baseline parser. We are continuously adding more parsing tools and retrieval baselines; stay tuned for updated results.

## Ecosystem

| Repository | Description |
|---|---|
| [knowhere](https://github.com/Ontos-AI/knowhere) | **This repo.** Backend API and worker: document ingestion, parsing, graph construction, and retrieval. |
| 🖥️ [knowhere-dashboard](https://github.com/Ontos-AI/knowhere-dashboard) | The web UI. Connects to the API for the full product experience. |
| 🐳 [knowhere-self-hosted](https://github.com/Ontos-AI/knowhere-self-hosted) | Docker Compose stack for self-hosted deployments. Packages the API, worker, and dashboard together. |
| 🐍 [knowhere-python-sdk](https://github.com/Ontos-AI/knowhere-python-sdk) | Official Python SDK for the Knowhere Cloud API. |
| 🦕 [knowhere-node-sdk](https://github.com/Ontos-AI/knowhere-node-sdk) | Official Node.js SDK for the Knowhere Cloud API. |

## Features

- **Multi-modal Parsing**: High-fidelity extraction from PDF, Office, and images, preserving headings, tables, and hierarchical paths.
- **Lightweight Memory Graph**: Context-aware organization that links documents and chunks for better relationship understanding.
- **Agentic RAG**: A hybrid retrieval engine combining traditional search (RRF) with autonomous agent navigation.
- **Evidence-based Citations**: Every result is backed by traceable source paths, ensuring reliability for AI Agent decision-making.
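
For readers unfamiliar with the RRF mentioned above: assuming it refers to Reciprocal Rank Fusion, the sketch below shows the standard formula, where each ranked list contributes `1 / (k + rank)` to an item's score. The function name, the sample chunk IDs, and the common default `k = 60` are illustrative assumptions, not values taken from Knowhere's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists into one, best score first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Items near the top of any list get the largest contribution.
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical result lists from three retrieval signals.
keyword_hits = ["c3", "c1", "c7"]
semantic_hits = ["c1", "c3", "c9"]
path_hits = ["c1", "c4"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits, path_hits])
print([chunk_id for chunk_id, _ in fused][:2])  # ['c1', 'c3']
```

Because RRF uses only ranks, it can combine signals whose raw scores are not comparable, such as keyword scores and embedding similarities.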

## Frequently Asked Questions (FAQ)

**Q: Is MinerU strictly required for Knowhere to work?**
A: No. MinerU is currently our default parser for PDF and PPT files because it performed best in our experiments, but any tool that can convert documents to Markdown works. Knowhere's real value lies in what happens *alongside and after* the initial conversion: memory-oriented parsing optimizations (fixing real-world parser deficiencies), reconstructing the hierarchical structure, normalizing multi-modal assets, and building the cross-document navigation graph.

**Q: What are the LLM / VLM dependencies?**
A: Knowhere requires standard language models to structure the document memory. By default, it uses DeepSeek (`deepseek-chat`) for text/table summarization and hierarchy generation, and Qwen-VL (`qwen3.5-flash`) for image OCR and visual descriptions. However, it is entirely model-agnostic: you can easily configure it to use OpenAI, DashScope (Ali), Zhipu (GLM), or Volcengine (ARK) via environment variables.

**Q: How does Agentic Retrieval differ from traditional RAG?**
A: Traditional RAG relies on flat vector similarity, which often retrieves isolated, out-of-context text snippets. Knowhere's Agentic Retrieval instead uses a multi-agent workflow to actively navigate the hierarchical section tree and cross-document graph. Agents read the document structure like a human would, drilling down into relevant sections to find precise, well-contextualized evidence.

**Q: Can it handle multi-modal data like images and tables?**
A: Yes. Knowhere extracts inline images and tables, passes them through Vision-Language Models (VLMs) for summarization and feature extraction, and explicitly links them back to their original text chunks. This ensures that agents can retrieve and cite multi-modal assets accurately during inference.

## Supported Formats

**✅ Supported**

- [x] `.pdf` `.docx` `.pptx` `.xlsx` `.csv`
- [x] `.jpg` `.png`
- [x] `.md` `.txt` `.json`

**⏳ Coming Soon**

- [ ] `.epub` `.html` `.xml`
- [ ] `.mp4` `.mp3`
- [ ] `.skills.md`

Want to see a new format supported? Adding a parser is a great first contribution. Check out [CONTRIBUTING.md](CONTRIBUTING.md) to get started.

## Prerequisites

- Python 3.11+
- `uv`
- Docker with `docker compose`

## Quick Start

1. Sync the workspace dependencies:

```bash
uv sync --all-packages
```

2. Copy the environment examples:

```bash
cp apps/api/.env.example apps/api/.env
cp apps/worker/.env.example apps/worker/.env
```

3. Update the copied `.env` files with the values you need for local work:

- database and Redis connection settings
- S3-compatible storage credentials
- at least one LLM provider key: `DS_KEY`, `ALI_API_KEYS`, `GPT_API_KEY`, or `GLM_API_KEY`
- `MINERU_API_KEYS` if you need PDF parsing
- a vision-capable model provider if you need image summaries, OCR, atlas classification, or image-aware retrieval
- any optional billing or webhook providers you want to enable

Most parser and retrieval tuning values have code defaults. Start with the
required external services first, then override model names, provider URLs,
budgets, or concurrency limits only when your deployment needs different
behavior. See [docs/external-services.md](docs/external-services.md) for the
full dependency matrix.

4. Start the local infrastructure stack:

```bash
./deploy/local-dev/start-dev.sh
```

5. Start the API and worker in separate terminals:

```bash
# Terminal 1 (from the repository root): API
cd apps/api && uv run main.py
# Terminal 2 (from the repository root): worker
cd apps/worker && uv run worker.py
```

The API runs migrations during startup.

For API-only development without the dashboard, create an API-only user/key
after the API service starts:

```bash
cd apps/api
uv run scripts/init_user.py --email you@example.com
```

If you plan to use the dashboard, register through the dashboard instead of
using `scripts/init_user.py`.

The API is now running at `http://localhost:5005`. If you want the full product experience with a UI, run the [knowhere-dashboard](https://github.com/Ontos-AI/knowhere-dashboard) alongside it; it connects to this API out of the box.
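
If the API or worker fails to start, a quick look at the `.env` files from step 3 can help. The script below is a hedged sketch, not part of the repository: `parse_env` and `check_env` are hypothetical helpers that read only simple `KEY=VALUE` lines and check the key names listed in step 3. It does not validate the keys themselves or cover database, Redis, or storage settings.

```python
from pathlib import Path

# Provider key names as listed in the Quick Start; at least one is needed.
LLM_KEYS = ("DS_KEY", "ALI_API_KEYS", "GPT_API_KEY", "GLM_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return human-readable warnings for settings that look unset."""
    warnings = []
    if not any(values.get(key) for key in LLM_KEYS):
        warnings.append("no LLM provider key set; need one of " + ", ".join(LLM_KEYS))
    if not values.get("MINERU_API_KEYS"):
        warnings.append("MINERU_API_KEYS is empty; set it if you need PDF parsing")
    return warnings


if __name__ == "__main__":
    # Run from the repository root.
    for env_path in (Path("apps/api/.env"), Path("apps/worker/.env")):
        if not env_path.exists():
            print(f"{env_path}: missing, copy it from .env.example first")
            continue
        for warning in check_env(parse_env(env_path.read_text())):
            print(f"{env_path}: {warning}")
```

The script prints nothing when both files exist and the checked keys are filled in.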

## Quality Checks

Run lint checks from the repository root:

```bash
make lint
```

Apply safe Ruff fixes:

```bash
make lint-fix
```

Run type checks across the API, worker, and shared source code:

```bash
make typecheck
```

Run both lint and type checks:

```bash
make check
```

## Local Endpoints

- API: `http://localhost:5005`
- OpenAPI docs: `http://localhost:5005/docs`
- LocalStack: `http://localhost:4566`
- PostgreSQL: `localhost:5432`
- Redis: `localhost:6379`
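
To confirm that the local stack is reachable, a small TCP probe over the ports listed above is often enough. This is a minimal sketch, assuming the default ports have not been changed; `is_port_open` and `LOCAL_SERVICES` are illustrative names, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Hosts and ports copied from the endpoint list above.
LOCAL_SERVICES = {
    "API": ("localhost", 5005),
    "LocalStack": ("localhost", 4566),
    "PostgreSQL": ("localhost", 5432),
    "Redis": ("localhost", 6379),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    for name, (host, port) in LOCAL_SERVICES.items():
        status = "up" if is_port_open(host, port) else "DOWN"
        print(f"{name:<11} {host}:{port} {status}")
```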

## Additional Guides

- External dependency guide:
[docs/external-services.md](docs/external-services.md)

## Citation

If you use Knowhere in your research, please cite it as:

```bibtex
@software{knowhere2026,
author = {Ontos AI},
title = {Knowhere: Build AI Agent Memory from Real-World Documents},
year = {2026},
publisher = {GitHub},
url = {https://github.com/Ontos-AI/knowhere},
version = {2026.04.30.1},
license = {Apache-2.0}
}
```

## Communication

- [GitHub Discussions](https://github.com/Ontos-AI/knowhere/discussions) for questions, ideas, and general conversation.
- [GitHub Issues](https://github.com/Ontos-AI/knowhere/issues) for bug reports and feature requests.

## Contribution

Any contributions to Knowhere are more than welcome!

If you are new to the project, check out the [good first issues](https://github.com/Ontos-AI/knowhere/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22). They are well-defined, relatively simple, and a great way to get familiar with the codebase and the contribution workflow.

For general guidelines on branching, commit conventions, and the review process, take a look at [CONTRIBUTING.md](CONTRIBUTING.md).

Other useful references:

- [SECURITY.md](SECURITY.md): how to report vulnerabilities responsibly.
- [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md): community behavior expectations.
- [LICENSE](LICENSE) and [NOTICE](NOTICE): Apache 2.0.

## 👋 We're Hiring!

We're building the knowledge layer for the Agent era. If that sounds like work you want to do, reach out: decode the address below and drop us a line.

```bash
echo 'dGVhbUBrbm93aGVyZXRvLmFp' | base64 --decode
```