An open API service indexing awesome lists of open source software.

https://github.com/codelined-ag/extracto

Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.
https://github.com/codelined-ag/extracto

agents bun claude docker document-processing mcp mcp-server mistral nextjs ocr ollama openrouter pdf-ocr rag self-hosted vector-database vision-models

Last synced: about 1 month ago
JSON representation

Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.

Awesome Lists containing this project

README

          


Extracto


Your private document brain.

PDFs in, RAG out. Self-hosted. Plug everywhere.


Quickstart ·
What you get ·
Plug everywhere ·
Docs ·
OpenAPI ·
Changelog


CI
License
GHCR
Stars




Extracto workspace

> **v1.1.0**: cloud integrations end-to-end. Connect Dropbox / Google Drive / OneDrive from the UI (paste your own OAuth client_id+secret if the operator hasn't), browse and import any file from the cloud, send any OCR result back as `md`, `docx`, `xlsx`, `obsidian`, or `zip`, and configure watched folders (cloud or local) that auto-submit new files for OCR. See the [changelog](./CHANGELOG.md).
>
> **v1.0.0**: side-by-side multi-model comparison with server-computed word-level diff, model recommendations from your own OCR history, PII auto-redaction with audit trail, form-field extraction, LaTeX equation extraction, and an E2E encryption scaffold (RSA SPKI public-key registration + AES-256-GCM envelope).

---

## Why

Most document-to-AI tools are SaaS. They cost per page, they see your documents, they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), output goes wherever you want it. Browser, code, agent, vector store. You pick.

---

## What you get

A complete pipeline from raw document to retrievable knowledge, in one container:

1. **Ingest** any PDF, image, watched local folder, or watched Dropbox / Google Drive / OneDrive folder.
2. **Extract** with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).
3. **Post-process** with a second LLM pass (clean to markdown or strict JSON, with your own instruction).
4. **Chunk + embed + store** into Chroma, Qdrant, Weaviate, Milvus, OpenSearch, Pinecone, or Typesense.
5. **Retrieve** through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server, a typed CLI, or the browser UI.
6. **Push** any result back to Dropbox / Google Drive / OneDrive, S3/MinIO, or download as `md`, `json`, `docx`, `rtf`, `csv`, `xlsx`, `obsidian`, or per-page `zip`.

Everything else (per-user accounts, scoped API keys, rate limits, signed webhooks, S3/MinIO offload, Prometheus metrics, multi-language UI, per-user OAuth credentials when the operator hasn't preconfigured them) is documented at [extracto.help](https://extracto.help).

---

## Quickstart

You need Docker. That's it.

```bash
curl -fsSL https://extracto.help/install.sh | bash
```

Pulls the prebuilt multi-arch image, runs a single container with an auto-generated `AUTH_SECRET` and a persistent SQLite volume, waits for the healthcheck, and prints the URL. Open , sign up, follow the tour.

For the full install (compose stack, Docker + Ollama provisioning, `extracto` CLI on PATH, Windows path), see [extracto.help/install](https://extracto.help/install).

---

## Plug everywhere

Same backend, five surfaces. Pick what fits.

| Surface | Use it when | Read |
|---|---|---|
| **Browser UI** | You're a human with a stack of PDFs | [How it works](https://extracto.help/how-it-works) |
| **REST API** (`/api/v1/*`) | You're building a document-intake pipeline | [API reference](https://extracto.help/api/overview) |
| **MCP server** | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | [Agents](https://extracto.help/agents/overview) |
| **CLI + [`SKILL.md`](./SKILL.md)** | Your agent only has a shell tool | [Skill file](./SKILL.md) |
| **OpenAI-Chat adapter** | You already have OpenAI-SDK code; point it at Extracto | [OpenAI compat](https://extracto.help/api/openai-compat) |

OpenAPI 3.1 spec at [`openapi.yaml`](./openapi.yaml). Live Scalar reference at `/api/v1/docs` on every running instance.

---

## Star history


Star History

---

## License

[MIT](./LICENSE) © codelined