{"id":49722791,"url":"https://github.com/hparreao/doclingconverter","last_synced_at":"2026-05-09T02:46:50.638Z","repository":{"id":262188083,"uuid":"886474140","full_name":"hparreao/doclingconverter","owner":"hparreao","description":"Quick way to convert files (PDF, DOCX, HTML, PPTX, Images) to (MD, JSON, YAML) using Docling and Streamlit","archived":false,"fork":false,"pushed_at":"2025-07-04T10:25:53.000Z","size":12,"stargazers_count":8,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-04T11:36:53.146Z","etag":null,"topics":["image-to-markdown","markdown-converter","markdown-to-json","pdf-converter","pdf-to-json","pdf-to-markdown","streamlit","yaml-convertor"],"latest_commit_sha":null,"homepage":"https://doclingconvert.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hparreao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-11T03:20:16.000Z","updated_at":"2025-07-04T10:25:57.000Z","dependencies_parsed_at":"2024-11-11T04:31:24.209Z","dependency_job_id":null,"html_url":"https://github.com/hparreao/doclingconverter","commit_stats":null,"previous_names":["hparreao/doclingconverter"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hparreao/doclingconverter","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hparreao%2Fdoclingconverter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hparreao%2Fdoclingconverter/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/hparreao%2Fdoclingconverter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hparreao%2Fdoclingconverter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hparreao","download_url":"https://codeload.github.com/hparreao/doclingconverter/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hparreao%2Fdoclingconverter/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32805513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-to-markdown","markdown-converter","markdown-to-json","pdf-converter","pdf-to-json","pdf-to-markdown","streamlit","yaml-convertor"],"created_at":"2026-05-09T02:46:49.995Z","updated_at":"2026-05-09T02:46:50.621Z","avatar_url":"https://github.com/hparreao.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Docling Converter \n\n## Table of Contents\n\n1. Introduction  \n2. Features  \n3. Quick Start  \n4. Installation  \n5. Running the Application  \n6. Usage Guide  \n7. Application Architecture  \n8. Programmatic API Reference  \n9. Extending the Converter  \n10. Deployment  \n11. Contributing  \n12. License  \n\n---\n\n## 1. 
Introduction\n\n**Docling Converter** is a [Streamlit](https://streamlit.io/) web application that leverages the powerful **[Docling](https://github.com/axiros/docling)** library to convert a variety of document formats into Markdown, JSON, or YAML. It supports PDF (with optional OCR), Word, HTML, PowerPoint, images, AsciiDoc, and Markdown sources.\n\nThe demo is available at: \u003chttps://doclingconvert.streamlit.app/\u003e.\n\n---\n\n## 2. Features\n\n• **Multi-format input** – PDF, DOCX, HTML, PPTX, images, AsciiDoc, and Markdown.  \n• **Flexible output** – choose between Markdown, JSON, or YAML.  \n• **OCR support** – extract text from scanned PDFs/images with one click.  \n• **Adjustable image resolution** – fine-tune the DPI multiplier (1.0-4.0).  \n• **Streamlit UI** – modern, reactive interface with instant previews \u0026 downloads.  \n\n---\n\n## 3. Quick Start\n\n```bash\n# clone the repository\n$ git clone https://github.com/hparreao/doclingconverter.git\n$ cd doclingconverter\n\n# create virtual environment (optional but recommended)\n$ python -m venv venv \u0026\u0026 source venv/bin/activate\n\n# install dependencies\n$ pip install -r requirements.txt\n\n# run the Streamlit server\n$ streamlit run app.py\n```\n\nNavigate to \u003chttp://localhost:8501\u003e in your browser and start converting documents.\n\n---\n\n## 4. Installation\n\nThe project has only two runtime dependencies:\n\n```text\nDocling  # heavy-lifting document conversion engine\nStreamlit # frontend/UI framework\n```\n\nBoth are automatically installed via `requirements.txt`.\n\nFor **development** you might also want:\n\n```bash\npip install black isort flake8 pre-commit\n```\n\n---\n\n## 5. Running the Application\n\n| Command | Description |\n|---------|-------------|\n| `streamlit run app.py` | Launch the local development server. |\n| `streamlit run app.py --server.headless true` | Run headless (useful for remote/Docker deployments). 
|\n\nEnvironment variables (all optional):\n\n| Variable | Purpose | Default |\n|----------|---------|---------|\n| `DOC_CONVERTER_MAX_PAGES` | Override `AppConfig.MAX_PAGES` | `100` |\n| `DOC_CONVERTER_MAX_FILE_SIZE` | Override `AppConfig.MAX_FILE_SIZE` (bytes) | `20971520` |\n\n---\n\n## 6. Usage Guide\n\n1. Select the **document type** from the left sidebar.  \n2. **Upload** the file (max 20 MB; max 100 pages).  \n3. Pick the desired **output format** (Markdown / JSON / YAML).  \n4. Toggle **OCR** and adjust **image resolution** (if available).  \n5. Hit **Start Conversion**.  \n6. Download the generated file or inspect the preview inline.\n\n---\n\n## 7. Application Architecture\n\n```\napp.py        # Streamlit entry-point \u0026 main module\n│\n├── AppConfig                    # Centralised runtime configuration\n├── DocumentConverterUI          # UI/UX helpers (layout \u0026 widgets)\n├── DocumentProcessor            # Wrapper around Docling's DocumentConverter\n└── handle_conversion_output()   # Result post-processing \u0026 download link\n```\n\nThe heavy lifting is delegated to `docling.DocumentConverter`. This repository only configures the converter (pipelines, OCR, page limits) and provides a sleek Streamlit interface.\n\n---\n\n## 8. Programmatic API Reference\n\nBelow is a high-level overview of the public classes/functions you may import in your own scripts.\n\n### 8.1 `AppConfig`\n\n```python\nfrom dataclasses import dataclass\nfrom typing import Dict, List\n\n\n@dataclass\nclass AppConfig:\n    SUPPORTED_TYPES: Dict[str, List[str]]\n    OUTPUT_FORMATS: List[str]\n    MAX_PAGES: int\n    MAX_FILE_SIZE: int\n    DEFAULT_IMAGE_SCALE: float\n```\n\nConfiguration defaults controlling allowed extensions, limits and UI presets. Feel free to instantiate your own subclass or override attributes at runtime.\n\n### 8.2 `DocumentConverterUI`\n\nResponsible for setting the Streamlit page parameters and rendering the widget tree.\n\nKey methods:\n\n• `setup_page()` – initialises page metadata and header.  
\n• `render_main_content()` – returns a **settings** dictionary capturing all user-selected options (file type, OCR flag, resolution, etc.).\n\n### 8.3 `DocumentProcessor`\n\n```python\nclass DocumentProcessor:\n    @staticmethod\n    @st.cache_resource\n    def get_converter(use_ocr: bool = True) -\u003e DocumentConverter: ...\n\n    @staticmethod\n    def process_document(file, settings: dict, config: AppConfig): ...\n```\n\n1. **`get_converter()`** – creates (and caches) a `docling.DocumentConverter` with customised **pipelines**:\n   * PDFs ➜ `StandardPdfPipeline` + `PyPdfiumDocumentBackend`\n   * DOCX/HTML/PPTX ➜ `SimplePipeline`\n\n2. **`process_document()`** – orchestrates the conversion, enforces `MAX_PAGES` and `MAX_FILE_SIZE`, and returns a `docling.DocumentConversionResult`.\n\n### 8.4 `handle_conversion_output(result, settings, file)`\n\nFormats the conversion result into the chosen output representation, injects a one-click download link, and renders an inline preview using Streamlit utilities.\n\n---\n\n## 9. Extending the Converter\n\nWant to add new formats or tweak pipelines? Follow these steps:\n\n1. Import the corresponding `InputFormat` and `FormatOption` implementation from **Docling**.\n2. Update `DocumentProcessor.get_converter()` by appending a new element to `allowed_formats` and its mapping in `format_options`.\n3. (Optional) Extend `AppConfig.SUPPORTED_TYPES` to expose the new extension in the UI dropdown.\n\nExample for adding **EPUB** support:\n\n```python\nfrom docling.datamodel.base_models import InputFormat\nfrom docling.document_converter import EpubFormatOption\n\n# inside get_converter()\nallowed_formats=[..., InputFormat.EPUB]\nformat_options={\n    ...,\n    InputFormat.EPUB: EpubFormatOption(),\n}\n```\n\n---\n\n## 10. Deployment\n\n### Docker (recommended)\n\n```dockerfile\n# Dockerfile\nFROM python:3.11-slim\n\nWORKDIR /app\nCOPY . 
.\nRUN pip install --no-cache-dir -r requirements.txt\n\nEXPOSE 8501\nCMD [\"streamlit\", \"run\", \"app.py\", \"--server.headless\", \"true\"]\n```\n\nThen build \u0026 run:\n\n```bash\ndocker build -t docling-converter .\ndocker run -p 8501:8501 docling-converter\n```\n\n### Streamlit Community Cloud\n\n1. Push your fork to GitHub.  \n2. Create a new Streamlit app, select the repo and `app.py` as the entry point.  \n3. Add `requirements.txt`.  \n4. Deploy – that's all!\n\n---\n\n## 11. Contributing\n\nPull requests and issues are welcome! Please open a discussion if you plan major changes.\n\nDevelopment guidelines:\n\n```bash\n# lint \u0026 format\n$ black . \u0026\u0026 isort . \u0026\u0026 flake8\n\n# run app with hot-reload\n$ streamlit run app.py\n```\n\n---\n\n## 12. License\n\nThis project is licensed under the terms of the **MIT License** – see the `LICENSE` file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhparreao%2Fdoclingconverter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhparreao%2Fdoclingconverter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhparreao%2Fdoclingconverter/lists"}