{"id":39603782,"url":"https://github.com/rithulkamesh/docproc","last_synced_at":"2026-04-02T18:50:25.359Z","repository":{"id":274982428,"uuid":"924572940","full_name":"rithulkamesh/docproc","owner":"rithulkamesh","description":"Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.","archived":false,"fork":false,"pushed_at":"2026-03-30T03:54:11.000Z","size":1142,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-30T06:04:47.573Z","etag":null,"topics":["content-extraction","data-extraction","document-analysis","document-parsing","equation-detection","layout-analysis","machine-learning","mathematical-symbols","ocr","pdf-processing","pdf-text-extraction","python","region-detection","text-classification","text-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rithulkamesh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["rithulkamesh"]}},"created_at":"2025-01-30T09:08:57.000Z","updated_at":"2026-03-30T03:54:14.000Z","dependencies_parsed_at":"2025-02-16T13:31:53.100Z","dependency_job_id":"36254cb1-043a-4a79-90c7-841301ec9821","html_url":"https://github.com/rithulkamesh/docproc","commit_stats":null,"previous_names":["rithulkamesh/docproc"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/rithulkamesh/docproc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rithulkamesh%2Fdocproc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rithulkamesh%2Fdocproc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rithulkamesh%2Fdocproc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rithulkamesh%2Fdocproc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rithulkamesh","download_url":"https://codeload.github.com/rithulkamesh/docproc/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rithulkamesh%2Fdocproc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31313433,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["content-extraction","data-extraction","document-analysis","document-parsing","equation-detection","layout-analysis","machine-learning","mathematical-symbols","ocr","pdf-processing","pdf-text-extraction","python","region-detection","text-classification","text-extraction"],"created_at":"2026-01-18T07:56:26.980Z","updated_at":"2026-04-02T18:50:25.351Z","avatar_url":"https://github.com/rithulkamesh.png","language":"Python","funding_links":["https://github.com/sponsors/rithulkamesh"],"categories":[],"sub_categories":[],"readme":"# docproc\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/logo.svg\" width=\"160\" alt=\"docproc logo\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003edocproc\u003c/b\u003e\u003cbr\u003e\n  Turn messy documents into clean markdown for AI pipelines.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  Document → Markdown → AI\n\u003c/p\u003e\n\n---\n\ndocproc is a document-to-markdown extraction engine. It converts PDFs, DOCX, PPTX, and XLSX into clean structured markdown while preserving equations, figures, and embedded images. It is designed to power LLM pipelines, RAG systems, and document processing workflows.\n\n## Features\n\n- **PDF → Markdown** — Native text extraction plus vision-based handling of embedded images\n- **DOCX → Markdown** — Full document structure and formatting\n- **PPTX → Markdown** — Slides to structured content\n- **XLSX → Markdown** — Spreadsheets to readable tables\n- **Equation preservation** — LaTeX and math kept intact (with optional LLM refinement)\n- **Figure extraction** — Every image, diagram, and label described by a vision model\n- **Clean structured output** — Ready for LLMs, RAG, and downstream pipelines\n\n## Example\n\n**Before:** A PDF with mixed text, equations, and diagrams.\n\n**After:** A single `.md` file with extracted text, LaTeX math blocks, and every figure explained by the vision model—ready to embed, chunk, or feed into an LLM.\n\n```bash\ndocproc --file paper.pdf -o paper.md\n```\n\n## Installation\n\n```bash\npip install git+https://github.com/rithulkamesh/docproc.git\n```\n\nOr with [uv](https://github.com/astral-sh/uv):\n\n```bash\nuv tool install git+https://github.com/rithulkamesh/docproc.git\n```\n\nFrom source:\n\n```bash\ngit clone https://github.com/rithulkamesh/docproc.git \u0026\u0026 cd docproc\nuv sync --python 3.12\n```\n\n## Usage\n\nOne-time config (generates `docproc.yaml` from your `.env`):\n\n```bash\ndocproc init-config --env .env\n```\n\nExtract a document to markdown:\n\n```bash\ndocproc --file input.pdf -o output.md\n```\n\nOptional: `--config path`, `-v` for verbose output. Shell completions: `docproc completions bash` or `docproc completions zsh`.\n\n### Python library\n\nInstall the package, then use the `Docproc` facade with instance-scoped config (PEP 561 typing via `py.typed`):\n\n```python\nfrom docproc import Docproc\n\nDocproc.from_config_path(\"docproc.yaml\").extract_to_file(\"input.pdf\", \"output.md\")\n\n# Or minimal OpenAI in code (uses OPENAI_API_KEY):\nDocproc.with_openai().extract_to_file(\"input.pdf\", \"output.md\")\n\n# String output for RAG / LLM pipelines:\nmd = Docproc.from_env().extract(\"paper.pdf\")\n```\n\nLower-level API: `extract_document_to_text`, `parse_config`, `docprocConfig`. Runnable samples: [examples/](examples/).\n\n## Why docproc?\n\nNaive PDF parsers often drop equations, misread layouts, and leave images as black boxes. docproc uses native extractors where possible (PyMuPDF, python-docx, etc.) and runs a vision model on every embedded image—so diagrams, charts, and equations become text or LaTeX that your AI stack can actually use. Optional LLM refinement cleans markdown and normalizes math. The result is document content that fits cleanly into RAG pipelines and LLM context windows instead of noisy, incomplete text.\n\n## Architecture\n\ndocproc ships as a **CLI** and an **importable Python library**; there is no bundled server or database for extraction. The pipeline is:\n\n1. **Load** — Read the file (PDF/DOCX/PPTX/XLSX) and extract full text from the native layer.\n2. **Vision** — For PDFs, run a vision model on every embedded image; get descriptions, LaTeX, or structured captions.\n3. **Refine** (optional) — LLM pass to tidy markdown, normalize LaTeX, and strip boilerplate.\n4. **Sanitize** — Dedupe and clean; write a single `.md` file.\n\nConfiguration lives in `docproc.yaml` (or generated via `docproc init-config --env .env`). AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) and [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for details.\n\n## Demo (docproc // edu)\n\nThe [demo/](demo/) is a full study workspace: upload docs, chat over them, generate notes and flashcards, create and take assessments. It’s a separate Go + React app that calls this CLI when a document is uploaded. See [demo/README.md](demo/README.md).\n\n## Docs\n\n| Doc | Description |\n|-----|-------------|\n| [docs/README.md](docs/README.md) | Index |\n| [docs/CONFIGURATION.md](docs/CONFIGURATION.md) | Config schema, providers, ingest, RAG |\n| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Pipeline, CLI, Python library |\n| [docs/AZURE_SETUP.md](docs/AZURE_SETUP.md) | Azure OpenAI and Vision setup |\n| [docs/ASSESSMENTS_AI.md](docs/ASSESSMENTS_AI.md) | Assessments and grading in the demo |\n\n**Environment:** `DOCPROC_CONFIG` for config path (default: `docproc.yaml`). Provider keys: `OPENAI_API_KEY`, `AZURE_OPENAI_*`, `ANTHROPIC_API_KEY`, etc. See [.env.example](.env.example).\n\n## Contributing\n\nPull requests welcome. Run the tests before sending.\n\n## License\n\nMIT. See [LICENSE.md](LICENSE.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frithulkamesh%2Fdocproc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frithulkamesh%2Fdocproc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frithulkamesh%2Fdocproc/lists"}