{"id":13551598,"url":"https://github.com/pd3f/pd3f","last_synced_at":"2026-04-08T15:36:42.039Z","repository":{"id":37501339,"uuid":"266394847","full_name":"pd3f/pd3f","owner":"pd3f","description":"🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based ","archived":false,"fork":false,"pushed_at":"2023-10-13T17:44:27.000Z","size":952,"stargazers_count":329,"open_issues_count":19,"forks_count":40,"subscribers_count":7,"default_branch":"master","last_synced_at":"2026-03-05T18:24:38.383Z","etag":null,"topics":["extract-text","language-model","machine-learning","ocr","parsr","pd3f","pdf","pdf-to-text","pipeline","python","text-extraction"],"latest_commit_sha":null,"homepage":"https://pd3f.com","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pd3f.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-05-23T18:21:11.000Z","updated_at":"2026-02-22T08:07:52.000Z","dependencies_parsed_at":"2024-01-14T04:44:56.701Z","dependency_job_id":"6d801053-7929-46e2-9d9f-221d271aa83c","html_url":"https://github.com/pd3f/pd3f","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/pd3f/pd3f","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fpd3f","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fpd3f/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fpd3f/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fpd3f/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/pd3f","download_url":"https://codeload.github.com/pd3f/pd3f/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pd3f%2Fpd3f/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31562695,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extract-text","language-model","machine-learning","ocr","parsr","pd3f","pdf","pdf-to-text","pipeline","python","text-extraction"],"created_at":"2024-08-01T12:01:50.989Z","updated_at":"2026-04-08T15:36:42.013Z","avatar_url":"https://github.com/pd3f.png","language":"HTML","funding_links":[],"categories":["HTML","python","machine-learning"],"sub_categories":[],"readme":"![](imgs/flow.jpg)\n\n# `pd3f`\n\n*Experimental, use with care.*\n\n`pd3f` is a PDF **text extraction** pipeline that is self-hosted, local-first and Docker-based.\nIt **reconstructs** the original **continuous text** with the help of **machine learning**.\n\n`pd3f` can OCR scanned PDFs with [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) (Tesseract) and extracts tables with [Camelot](https://github.com/camelot-dev/camelot) and 
[Tabula](https://github.com/tabulapdf/tabula).\nIt's built upon the output of [Parsr](https://github.com/axa-group/Parsr).\nParsr detects hierarchies of text and splits the text into words, lines and paragraphs.\n\nEven though Parsr brings some structure to the PDF, the text is still scrambled, e.g., due to hyphens.\nThe underlying Python package [pd3f-core](https://github.com/pd3f/pd3f-core) tries to reconstruct the original continuous text by removing hyphens, new lines and/or spaces.\nIt uses [language models](https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/) to guess what the original text looked like.\n\n`pd3f` is especially useful for languages with long words such as German.\nIt was mainly developed to parse German letters and official documents.\nBesides German, `pd3f` supports English, Spanish, French and Italian.\nMore languages will be added at a later stage.\n\n`pd3f` includes a Web-based GUI and a [Flask](https://flask.palletsprojects.com/)-based microservice (API).\nYou can find a demo at [demo.pd3f.com](https://demo.pd3f.com).\n\n## Documentation\n\nCheck out the full documentation at: \u003chttps://pd3f.com/docs/\u003e\n\n## Future Work / TODO\n\nPDFs are hard to process and it's hard to extract information.\nSo the results of this tool may not satisfy you.\nThere will be more work to improve this software but altogether, it's unlikely that it will successfully extract all the information anytime soon.\n\nHere are some things that will be improved.\n\n### statistics about how long processing (per page) took in the past\n\n- calculate runtime based on `job.started_at` and `job.ended_at`\n- get average runtime of jobs and store data in a Redis list\n\n### more information about PDF\n\n- NER\n- entity linking\n- extract keywords\n- use [textacy](https://github.com/chartbeat-labs/textacy)\n\n### add more languages\n\n- check if flair has a model\n- what to do if there is no fast model?\n\n\n### Python client\n\n- simple 
client based on request\n- send whole folders\n\n### Markdown / HTML export\n\n- go beyond text\n\n### use pdf-scripts / allow more processing\n\n- reduce size\n- repair PDF\n- detect if scanned\n- force to OCR again\n\n### improve logs / get better feedback\n\n- show uncertainty of ML model\n- allow different log levels\n\n## Related Work\n\n- https://github.com/axa-group/Parsr\n- https://github.com/jzillmann/pdf-to-markdown\n- some PDF processing tools in [my blog post](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/)\n\n## Development\n\nInstall and use [poetry](https://python-poetry.org/).\n\nInitially run:\n\n```bash\n./dev.sh --build\n```\n\nOmit `--build` if the Docker images do not need to be built.\nRight now, Docker + poetry cannot cache the installs, so rebuilding the image every time is uncool.\n\n## Contributing\n\nIf you have a **question**, found a **bug** or want to propose a new **feature**, have a look at the [issues page](https://github.com/pd3f/pd3f/issues).\n\n**Pull requests** are especially welcome when they fix bugs or improve the code quality.\n\n\n## License\n\nAffero General Public License 3.0\n\n![](imgs/logo.jpg)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpd3f%2Fpd3f","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpd3f%2Fpd3f","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpd3f%2Fpd3f/lists"}