https://github.com/codelined-ag/extracto

Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.

agents bun claude docker document-processing mcp mcp-server mistral nextjs ocr ollama openrouter pdf-ocr rag self-hosted vector-database vision-models


Host: GitHub
URL: https://github.com/codelined-ag/extracto
Owner: codelined-ag
License: mit
Created: 2026-05-03T12:46:27.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-06T00:29:02.000Z (2 months ago)
Last Synced: 2026-05-07T03:37:56.812Z (2 months ago)
Topics: agents, bun, claude, docker, document-processing, mcp, mcp-server, mistral, nextjs, ocr, ollama, openrouter, pdf-ocr, rag, self-hosted, vector-database, vision-models
Language: TypeScript
Homepage: https://extracto.help
Size: 8.1 MB
Stars: 4
Watchers: 0
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md


> **v1.1.0**: cloud integrations end-to-end. Connect Dropbox / Google Drive / OneDrive from the UI (paste your own OAuth `client_id` and secret if the operator hasn't preconfigured them), browse and import any file from the cloud, send any OCR result back as `md`, `docx`, `xlsx`, `obsidian`, or `zip`, and configure watched folders (cloud or local) that auto-submit new files for OCR. See the [changelog](./CHANGELOG.md).
>
> **v1.0.0**: side-by-side multi-model comparison with server-computed word-level diff, model recommendations from your own OCR history, PII auto-redaction with audit trail, form-field extraction, LaTeX equation extraction, and an E2E encryption scaffold (RSA SPKI public-key registration + AES-256-GCM envelope).

---

## Why

Most document-to-AI tools are SaaS. They charge per page, they see your documents, and they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), and output that goes wherever you want it. Browser, code, agent, vector store. You pick.

---

## What you get

A complete pipeline from raw document to retrievable knowledge, in one container:

1. **Ingest** any PDF, image, watched local folder, or watched Dropbox / Google Drive / OneDrive folder.

2. **Extract** with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).

3. **Post-process** with a second LLM pass (clean to markdown or strict JSON, with your own instruction).

4. **Chunk + embed + store** into Chroma, Qdrant, Weaviate, Milvus, OpenSearch, Pinecone, or Typesense.

5. **Retrieve** through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server, a typed CLI, or the browser UI.

6. **Push** any result back to Dropbox / Google Drive / OneDrive, S3/MinIO, or download as `md`, `json`, `docx`, `rtf`, `csv`, `xlsx`, `obsidian`, or per-page `zip`.

Everything else (per-user accounts, scoped API keys, rate limits, signed webhooks, S3/MinIO offload, Prometheus metrics, multi-language UI, per-user OAuth credentials when the operator hasn't preconfigured them) is documented at [extracto.help](https://extracto.help).
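To make step 4 concrete, here is a minimal sketch of one common chunking strategy: fixed-size windows of words with a small overlap, so text near a boundary appears in both neighbouring chunks. This is an illustration only, not Extracto's actual chunker; `chunkWords`, `size`, and `overlap` are hypothetical names, and the real chunking parameters are whatever the docs at extracto.help describe.

```typescript
// Illustrative only: fixed-size chunking with overlap, measured in words.
// `Chunk`, `chunkWords`, `size`, and `overlap` are hypothetical, not Extracto's API.
interface Chunk {
  index: number; // position of the chunk within the document
  text: string; // chunk content, words joined by single spaces
}

function chunkWords(text: string, size: number, overlap: number): Chunk[] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= size) {
    throw new RangeError("overlap must satisfy 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = size - overlap; // how far each window advances
  const chunks: Chunk[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push({
      index: chunks.length,
      text: words.slice(start, start + size).join(" "),
    });
    if (start + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}
```

Each chunk's `text` would then go to an embedding model and the resulting vector to the store of your choice. The overlap trades a little storage for better recall on sentences that straddle a chunk boundary.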

---

## Quickstart

You need Docker. That's it.

```bash
curl -fsSL https://extracto.help/install.sh | bash
```

The script pulls the prebuilt multi-arch image, runs a single container with an auto-generated `AUTH_SECRET` and a persistent SQLite volume, waits for the healthcheck, and prints the URL. Open the printed URL, sign up, and follow the tour.

For the full install (compose stack, Docker + Ollama provisioning, `extracto` CLI on PATH, Windows path), see [extracto.help/install](https://extracto.help/install).

---

## Plug everywhere

Same backend, five surfaces. Pick what fits.

| Surface | Use it when | Read |
|---|---|---|
| **Browser UI** | You're a human with a stack of PDFs | [How it works](https://extracto.help/how-it-works) |
| **REST API** (`/api/v1/*`) | You're building a document-intake pipeline | [API reference](https://extracto.help/api/overview) |
| **MCP server** | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | [Agents](https://extracto.help/agents/overview) |
| **CLI + [`SKILL.md`](./SKILL.md)** | Your agent only has a shell tool | [Skill file](./SKILL.md) |
| **OpenAI-Chat adapter** | You already have OpenAI-SDK code; point it at Extracto | [OpenAI compat](https://extracto.help/api/openai-compat) |

OpenAPI 3.1 spec at [`openapi.yaml`](./openapi.yaml). Live Scalar reference at `/api/v1/docs` on every running instance.
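As a rough picture of what "point it at Extracto" means for the OpenAI-Chat adapter, the sketch below builds (but does not send) the kind of request an OpenAI-style client issues: a `POST` to `<baseURL>/chat/completions` with a bearer token and a JSON body. Everything specific here is an assumption: `buildChatRequest` is a hypothetical helper, and the base URL, model id, and auth scheme your instance expects are whatever the OpenAI compat page and `/api/v1/docs` specify.

```typescript
// Illustrative only: prepares an OpenAI-style chat completion request.
// `buildChatRequest` is hypothetical; the base URL, model id, and bearer-token
// auth are assumptions to be checked against your instance's documentation.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): PreparedRequest {
  const trimmed = baseUrl.replace(/\/+$/, ""); // avoid a double slash when joining
  return {
    url: `${trimmed}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

The result can be handed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. With an existing OpenAI SDK client, the equivalent change is usually just setting its base URL option to the adapter's address.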

---

## License

[MIT](./LICENSE) © codelined
