https://github.com/codelined-ag/extracto
Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.
- Host: GitHub
- URL: https://github.com/codelined-ag/extracto
- Owner: codelined-ag
- License: mit
- Created: 2026-05-03T12:46:27.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-06T00:29:02.000Z (about 2 months ago)
- Last Synced: 2026-05-07T03:37:56.812Z (about 1 month ago)
- Topics: agents, bun, claude, docker, document-processing, mcp, mcp-server, mistral, nextjs, ocr, ollama, openrouter, pdf-ocr, rag, self-hosted, vector-database, vision-models
- Language: TypeScript
- Homepage: https://extracto.help
- Size: 8.1 MB
- Stars: 4
- Watchers: 0
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Your private document brain.
PDFs in, RAG out. Self-hosted. Plug everywhere.
Quickstart · What you get · Plug everywhere · Docs · OpenAPI · Changelog
> **v1.1.0**: cloud integrations end-to-end. Connect Dropbox / Google Drive / OneDrive from the UI (paste your own OAuth client_id+secret if the operator hasn't), browse and import any file from the cloud, send any OCR result back as `md`, `docx`, `xlsx`, `obsidian`, or `zip`, and configure watched folders (cloud or local) that auto-submit new files for OCR. See the [changelog](./CHANGELOG.md).
>
> **v1.0.0**: side-by-side multi-model comparison with server-computed word-level diff, model recommendations from your own OCR history, PII auto-redaction with audit trail, form-field extraction, LaTeX equation extraction, and an E2E encryption scaffold (RSA SPKI public-key registration + AES-256-GCM envelope).
---
## Why
Most document-to-AI tools are SaaS. They charge per page, they see your documents, and they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), and output that goes wherever you want it. Browser, code, agent, vector store. You pick.
---
## What you get
A complete pipeline from raw document to retrievable knowledge, in one container:
1. **Ingest** any PDF, image, watched local folder, or watched Dropbox / Google Drive / OneDrive folder.
2. **Extract** with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).
3. **Post-process** with a second LLM pass (clean to markdown or strict JSON, with your own instruction).
4. **Chunk + embed + store** into Chroma, Qdrant, Weaviate, Milvus, OpenSearch, Pinecone, or Typesense.
5. **Retrieve** through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server, a typed CLI, or the browser UI.
6. **Push** any result back to Dropbox / Google Drive / OneDrive, S3/MinIO, or download as `md`, `json`, `docx`, `rtf`, `csv`, `xlsx`, `obsidian`, or per-page `zip`.
Everything else (per-user accounts, scoped API keys, rate limits, signed webhooks, S3/MinIO offload, Prometheus metrics, multi-language UI, per-user OAuth credentials when the operator hasn't preconfigured them) is documented at [extracto.help](https://extracto.help).
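To make the chunking part of step 4 concrete, here is a minimal sketch of fixed-size chunking with overlap, the kind of splitting that usually happens before text is embedded. It is an illustration only and not Extracto's actual chunker: the `chunkText` function, the `Chunk` type, and the size and overlap values are all hypothetical.

```typescript
// Hypothetical sketch: NOT Extracto's implementation, only an illustration
// of splitting extracted text into overlapping windows before embedding.
interface Chunk {
  index: number; // position of the chunk in the sequence
  start: number; // character offset of the chunk in the source text
  text: string;
}

function chunkText(text: string, maxChars: number, overlap: number): Chunk[] {
  if (maxChars <= 0) {
    throw new RangeError("maxChars must be positive");
  }
  if (overlap < 0 || overlap >= maxChars) {
    throw new RangeError("overlap must be in [0, maxChars)");
  }
  const chunks: Chunk[] = [];
  const step = maxChars - overlap; // how far each window advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      index: chunks.length,
      start,
      text: text.slice(start, start + maxChars),
    });
    // Stop once a window has reached the end of the text, so the tail
    // is not emitted a second time as a tiny trailing chunk.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}

// Example: split OCR output into 800-character windows sharing 100 characters.
const ocrMarkdown = "# Invoice\n\nTotal due: 120.00 EUR\n".repeat(50);
const chunks = chunkText(ocrMarkdown, 800, 100);
console.log(`${chunks.length} chunks, first starts at offset ${chunks[0].start}`);
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some text twice. A real pipeline would more likely split on headings or paragraphs than on raw character counts.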
---
## Quickstart
You need Docker. That's it.
```bash
curl -fsSL https://extracto.help/install.sh | bash
```
The script pulls the prebuilt multi-arch image, runs a single container with an auto-generated `AUTH_SECRET` and a persistent SQLite volume, waits for the healthcheck, and prints the URL. Open the printed URL, sign up, and follow the tour.
For the full install (compose stack, Docker + Ollama provisioning, `extracto` CLI on PATH, Windows path), see [extracto.help/install](https://extracto.help/install).
---
## Plug everywhere
Same backend, five surfaces. Pick what fits.
| Surface | Use it when | Read |
|---|---|---|
| **Browser UI** | You're a human with a stack of PDFs | [How it works](https://extracto.help/how-it-works) |
| **REST API** (`/api/v1/*`) | You're building a document-intake pipeline | [API reference](https://extracto.help/api/overview) |
| **MCP server** | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | [Agents](https://extracto.help/agents/overview) |
| **CLI + [`SKILL.md`](./SKILL.md)** | Your agent only has a shell tool | [Skill file](./SKILL.md) |
| **OpenAI-Chat adapter** | You already have OpenAI-SDK code; point it at Extracto | [OpenAI compat](https://extracto.help/api/openai-compat) |
OpenAPI 3.1 spec at [`openapi.yaml`](./openapi.yaml). Live Scalar reference at `/api/v1/docs` on every running instance.
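As a rough illustration of the OpenAI-Chat adapter row, the sketch below builds a Chat Completions style request without sending it. Everything specific in it is an assumption: the base URL, API key, and model name are placeholders, `buildChatRequest` is a hypothetical helper, and it assumes the adapter accepts the bearer-token header that OpenAI SDKs send. Take the real values from your own instance and the OpenAI compat page.

```typescript
// Hypothetical sketch: base URL, key, and model are placeholders, not
// values documented by Extracto.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  // OpenAI-style clients append /chat/completions to the configured base URL.
  const url = `${baseUrl.replace(/\/+$/, "")}/chat/completions`;
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

const request = buildChatRequest(
  "https://extracto.example/base-path/", // placeholder, not a real endpoint
  "YOUR_API_KEY",
  "example-model",
  [{ role: "user", content: "What is the payment term in the uploaded contract?" }],
);
console.log(request.url);
// To send it: const response = await fetch(request.url, request.init);
```

Keeping request construction separate from the network call makes the shape easy to inspect and test before pointing existing OpenAI-SDK code at a running instance.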
---
## License
[MIT](./LICENSE) © codelined