{"id":49619826,"url":"https://github.com/codelined-ag/extracto","last_synced_at":"2026-05-10T06:00:44.228Z","repository":{"id":355418353,"uuid":"1227986369","full_name":"codelined-ag/Extracto","owner":"codelined-ag","description":"Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.","archived":false,"fork":false,"pushed_at":"2026-05-06T00:29:02.000Z","size":8490,"stargazers_count":4,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-07T03:37:56.812Z","etag":null,"topics":["agents","bun","claude","docker","document-processing","mcp","mcp-server","mistral","nextjs","ocr","ollama","openrouter","pdf-ocr","rag","self-hosted","vector-database","vision-models"],"latest_commit_sha":null,"homepage":"https://extracto.help","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codelined-ag.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-03T12:46:27.000Z","updated_at":"2026-05-06T00:28:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"df661fb3-3cf7-4904-8ea3-ae779cb61568","html_url":"https://github.com/codelined-ag/Extracto","commit_stats":null,"previous_names":["codelined-ag/extracto"],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/codelined-ag/Extracto","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelined-ag%2FExtracto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelined-ag%2FExtracto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelined-ag%2FExtracto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelined-ag%2FExtracto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codelined-ag","download_url":"https://codeload.github.com/codelined-ag/Extracto/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codelined-ag%2FExtracto/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32807861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","bun","claude","docker","document-processing","mcp","mcp-server","mistral","nextjs","ocr","ollama","openrouter","pdf-ocr","rag","self-hosted","vector-database","vision-models"],"created_at":"2026-05-05T01:03:39.175Z","updated_at":"2026-05-09T05:01:52.956Z","avatar_url":"https://github.com/codelined-ag.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"extracto-banner.png\" alt=\"Extracto\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eYour private document brain.\u003c/strong\u003e\u003cbr/\u003e\n  PDFs in, RAG out. Self-hosted. Plug everywhere.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#quickstart\"\u003eQuickstart\u003c/a\u003e ·\n  \u003ca href=\"#what-you-get\"\u003eWhat you get\u003c/a\u003e ·\n  \u003ca href=\"#plug-everywhere\"\u003ePlug everywhere\u003c/a\u003e ·\n  \u003ca href=\"https://extracto.help\"\u003eDocs\u003c/a\u003e ·\n  \u003ca href=\"./openapi.yaml\"\u003eOpenAPI\u003c/a\u003e ·\n  \u003ca href=\"./CHANGELOG.md\"\u003eChangelog\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/codelined-ag/Extracto/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/codelined-ag/Extracto/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"./LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/github/license/codelined-ag/Extracto?color=brightgreen\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/codelined-ag/Extracto/pkgs/container/extracto\"\u003e\u003cimg src=\"https://img.shields.io/badge/ghcr.io-extracto-blue?logo=docker\" alt=\"GHCR\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/codelined-ag/Extracto/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/codelined-ag/Extracto?style=flat\u0026color=ffb000\" alt=\"Stars\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"docs/screenshots/main-dark.png\"\u003e\n    \u003cimg src=\"docs/screenshots/main-light.png\" alt=\"Extracto workspace\" width=\"100%\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003e **v1.1.0**: cloud integrations end-to-end. Connect Dropbox / Google Drive / OneDrive from the UI (paste your own OAuth client_id+secret if the operator hasn't), browse and import any file from the cloud, send any OCR result back as `md`, `docx`, `xlsx`, `obsidian`, or `zip`, and configure watched folders (cloud or local) that auto-submit new files for OCR. See the [changelog](./CHANGELOG.md).\n\u003e\n\u003e **v1.0.0**: side-by-side multi-model comparison with server-computed word-level diff, model recommendations from your own OCR history, PII auto-redaction with audit trail, form-field extraction, LaTeX equation extraction, and an E2E encryption scaffold (RSA SPKI public-key registration + AES-256-GCM envelope).\n\n---\n\n## Why\n\nMost document-to-AI tools are SaaS. They cost per page, they see your documents, they lock you into one provider. Extracto is the opposite: one Docker container, your machine, any vision model (local or hosted), output goes wherever you want it. Browser, code, agent, vector store. You pick.\n\n---\n\n## What you get\n\nA complete pipeline from raw document to retrievable knowledge, in one container:\n\n1. **Ingest** any PDF, image, watched local folder, or watched Dropbox / Google Drive / OneDrive folder.\n2. **Extract** with the vision model of your choice (Ollama, Mistral OCR, OpenRouter, any OpenAI-compatible endpoint).\n3. **Post-process** with a second LLM pass (clean to markdown or strict JSON, with your own instruction).\n4. **Chunk + embed + store** into Chroma, Qdrant, Weaviate, Milvus, OpenSearch, Pinecone, or Typesense.\n5. **Retrieve** through a stable v1 REST API, an OpenAI-Chat-Completions adapter, an MCP server, a typed CLI, or the browser UI.\n6. **Push** any result back to Dropbox / Google Drive / OneDrive, S3/MinIO, or download as `md`, `json`, `docx`, `rtf`, `csv`, `xlsx`, `obsidian`, or per-page `zip`.\n\nEverything else (per-user accounts, scoped API keys, rate limits, signed webhooks, S3/MinIO offload, Prometheus metrics, multi-language UI, per-user OAuth credentials when the operator hasn't preconfigured them) is documented at [extracto.help](https://extracto.help).\n\n---\n\n## Quickstart\n\nYou need Docker. That's it.\n\n```bash\ncurl -fsSL https://extracto.help/install.sh | bash\n```\n\nPulls the prebuilt multi-arch image, runs a single container with an auto-generated `AUTH_SECRET` and a persistent SQLite volume, waits for the healthcheck, and prints the URL. Open \u003chttp://localhost:3000\u003e, sign up, follow the tour.\n\nFor the full install (compose stack, Docker + Ollama provisioning, `extracto` CLI on PATH, Windows path), see [extracto.help/install](https://extracto.help/install).\n\n---\n\n## Plug everywhere\n\nSame backend, five surfaces. Pick what fits.\n\n| Surface | Use it when | Read |\n|---|---|---|\n| **Browser UI** | You're a human with a stack of PDFs | [How it works](https://extracto.help/how-it-works) |\n| **REST API** (`/api/v1/*`) | You're building a document-intake pipeline | [API reference](https://extracto.help/api/overview) |\n| **MCP server** | Your agent speaks MCP (Claude Desktop, Cursor, Codex, OpenClaw, Hermes) | [Agents](https://extracto.help/agents/overview) |\n| **CLI + [`SKILL.md`](./SKILL.md)** | Your agent only has a shell tool | [Skill file](./SKILL.md) |\n| **OpenAI-Chat adapter** | You already have OpenAI-SDK code; point it at Extracto | [OpenAI compat](https://extracto.help/api/openai-compat) |\n\nOpenAPI 3.1 spec at [`openapi.yaml`](./openapi.yaml). Live Scalar reference at `/api/v1/docs` on every running instance.\n\n---\n\n## Star history\n\n\u003ca href=\"https://star-history.com/#codelined-ag/Extracto\u0026Date\"\u003e\n  \u003cimg src=\"https://api.star-history.com/svg?repos=codelined-ag/Extracto\u0026type=Date\" alt=\"Star History\" width=\"600\"/\u003e\n\u003c/a\u003e\n\n---\n\n## License\n\n[MIT](./LICENSE) © codelined\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelined-ag%2Fextracto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodelined-ag%2Fextracto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodelined-ag%2Fextracto/lists"}