{"id":28722375,"url":"https://github.com/docling-project/docling-parse","last_synced_at":"2026-02-23T11:40:09.159Z","repository":{"id":251888693,"uuid":"838719171","full_name":"docling-project/docling-parse","owner":"docling-project","description":"Simple package to extract text with coordinates from programmatic PDFs","archived":false,"fork":false,"pushed_at":"2026-02-20T06:56:17.000Z","size":194220,"stargazers_count":239,"open_issues_count":53,"forks_count":53,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-02-20T11:24:16.287Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/docling-project.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":"MAINTAINERS.md","copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-08-06T07:55:41.000Z","updated_at":"2026-02-20T06:56:19.000Z","dependencies_parsed_at":"2024-08-20T14:53:18.561Z","dependency_job_id":"d7ce819d-d0fb-4000-88e0-448890d01d72","html_url":"https://github.com/docling-project/docling-parse","commit_stats":null,"previous_names":["ds4sd/docling-parse","docling-project/docling-parse"],"tags_count":65,"template":false,"template_full_name":null,"purl":"pkg:github/docling-project/docling-parse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-parse/tag
s","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-parse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/docling-project","download_url":"https://codeload.github.com/docling-project/docling-parse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-parse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29741689,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-23T07:44:07.782Z","status":"ssl_error","status_checked_at":"2026-02-23T07:44:07.432Z","response_time":90,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-15T08:09:08.934Z","updated_at":"2026-02-23T11:40:09.097Z","avatar_url":"https://github.com/docling-project.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# Docling Parse\n\n[![PyPI version](https://img.shields.io/pypi/v/docling-parse)](https://pypi.org/project/docling-parse/)\n[![PyPI - Python 
Version](https://img.shields.io/pypi/pyversions/docling-parse)](https://pypi.org/project/docling-parse/)\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n[![Pybind11](https://img.shields.io/badge/build-pybind11-blue)](https://github.com/pybind/pybind11/)\n[![Platforms](https://img.shields.io/badge/platform-macos%20|%20linux%20|%20windows-blue)](https://github.com/docling-project/docling-parse/)\n[![License MIT](https://img.shields.io/github/license/docling-project/docling-parse)](https://opensource.org/licenses/MIT)\n\nSimple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used by [Docling](https://github.com/docling-project/docling) for PDF conversion. Below are a few outputs of the latest parser, with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.\n\nTo produce the visualizations yourself, run the following (change `word` to `char` or `line`):\n\n```sh\nuv run python ./docling_parse/visualize.py -i \u003cpath-to-pdf-file\u003e -c word --interactive\n```\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eoriginal\u003c/th\u003e\n    \u003cth\u003echar\u003c/th\u003e\n    \u003cth\u003eword\u003c/th\u003e\n    \u003cth\u003eline\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_1.orig.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_1.char.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_1.word.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_1.line.png\" alt=\"screenshot\" 
width=\"170\"/\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_3.orig.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_3.char.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_3.word.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_3.line.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_4.orig.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_4.char.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_4.word.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/ligatures_01.pdf.page_4.line.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_1.orig.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_1.char.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_1.word.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_1.line.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    
\u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_4.orig.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_4.char.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_4.word.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cimg src=\"./docs/visualisations/table_of_contents_01.pdf.page_4.line.png\" alt=\"screenshot\" width=\"170\"/\u003e\u003c/td\u003e\n  \u003c/tr\u003e  \n\u003c/table\u003e\n\n## Quick start\n\nInstall the package from PyPI:\n\n```sh\npip install docling-parse\n```\n\nConvert a PDF (see [visualize.py](docling_parse/visualize.py) for a more detailed example):\n\n```python\nfrom docling_core.types.doc.page import TextCellUnit\nfrom docling_parse.pdf_parser import DoclingPdfParser, PdfDocument\n\nparser = DoclingPdfParser()\n\npdf_doc: PdfDocument = parser.load(\n    path_or_stream=\"\u003cpath-to-pdf\u003e\"\n)\n\n# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.\nfor page_no, pred_page in pdf_doc.iterate_pages():\n\n    # iterate over the word-cells\n    for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):\n        print(word.rect, \": \", word.text)\n\n    # create a PIL image with the char cells\n    img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)\n    img.show()\n```\n\nUse the CLI:\n\n```sh\n$ docling-parse -h\nusage: docling-parse [-h] -p PDF\n\nProcess a PDF file.\n\noptions:\n  -h, --help         show this help message and exit\n  -p PDF, --pdf PDF  Path to the PDF file\n```\n\n## Performance Benchmarks\n\n*Coming soon - benchmarks will be updated for the current parser version.*\n\nFor historical V1 vs V2 benchmarks, see [legacy_performance_benchmarks.md](./docs/legacy_performance_benchmarks.md).\n\n## 
Development\n\n### CXX\n\nTo build the parser, simply run the following command in the root folder,\n\n```sh\nrm -rf build; cmake -B ./build; cd build; make\n```\n\nYou can run the parser from your build folder:\n\n```sh\n% ./parse.exe -h\nprogram to process PDF files or configuration files\nUsage:\n  PDFProcessor [OPTION...]\n\n  -i, --input arg          Input PDF file\n  -c, --config arg         Config file\n      --create-config arg  Create config file\n  -p, --page arg           Pages to process (default: -1 for all) (default:\n                           -1)\n      --password arg       Password for accessing encrypted, password-protected files\n  -o, --output arg         Output file\n  -l, --loglevel arg       loglevel [error;warning;success;info]\n  -h, --help               Print usage\n```\n\nIf you don't have an input file, a template input file will be printed on the terminal.\n\n\n### Python\n\nTo build the package, simply run (make sure [uv](https://docs.astral.sh/uv/) is [installed](https://docs.astral.sh/uv/getting-started/installation)),\n\n```sh\nuv sync\n```\n\nThe latter will only work after a clean `git clone`. 
If you are developing and updating C++ code, please use,\n\n```sh\n# uv pip install --force-reinstall --no-deps -e .\nrm -rf .venv; uv venv; uv pip install --force-reinstall --no-deps -e \".[perf-tools]\"\n```\n\nTo test the package, run:\n\n```sh\nuv run pytest ./tests -v -s\n```\n\n## Contributing\n\nPlease read [Contributing to Docling Parse](https://github.com/docling-project/docling-parse/blob/main/CONTRIBUTING.md) for details.\n\n## References\n\nIf you use Docling in your projects, please consider citing the following:\n\n```bib\n@techreport{Docling,\n  author = {Docling Team},\n  month = {8},\n  title = {Docling Technical Report},\n  url = {https://arxiv.org/abs/2408.09869},\n  eprint = {2408.09869},\n  doi = {10.48550/arXiv.2408.09869},\n  version = {1.0.0},\n  year = {2024}\n}\n```\n\n## License\n\nThe Docling Parse codebase is released under the MIT license.\nFor individual model usage, please refer to the model licenses found in the original packages.\n\n## LF AI \u0026 Data\n\nDocling (and also docling-parse) is hosted as a project in the [LF AI \u0026 Data Foundation](https://lfaidata.foundation/projects/).\n\n### IBM ❤️ Open Source AI\n\nThe project was started by the AI for knowledge team at IBM Research Zurich.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocling-project%2Fdocling-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdocling-project%2Fdocling-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocling-project%2Fdocling-parse/lists"}