{"id":21702373,"url":"https://github.com/aphp/edspdf","last_synced_at":"2025-10-24T03:03:55.812Z","repository":{"id":49501682,"uuid":"517726737","full_name":"aphp/edspdf","owner":"aphp","description":"EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule- or machine-learning-based approaches to classify text blocs between body and meta-data.","archived":false,"fork":false,"pushed_at":"2025-02-12T14:12:06.000Z","size":9359,"stargazers_count":46,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-30T14:11:15.447Z","etag":null,"topics":["extraction","machine-learning","pdf"],"latest_commit_sha":null,"homepage":"https://aphp.github.io/edspdf/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aphp.png","metadata":{"files":{"readme":"README.md","changelog":"changelog.md","contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-25T15:47:09.000Z","updated_at":"2025-03-17T10:55:41.000Z","dependencies_parsed_at":"2024-02-07T05:45:17.330Z","dependency_job_id":"4c9da124-9356-4fa9-9c75-b3bd8706f386","html_url":"https://github.com/aphp/edspdf","commit_stats":{"total_commits":293,"total_committers":6,"mean_commits":"48.833333333333336","dds":0.4300341296928327,"last_synced_commit":"5778c0e1ec01c9a0835832938cc3b49b5d051ffc"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Fedspdf","tags_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/aphp%2Fedspdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Fedspdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aphp%2Fedspdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aphp","download_url":"https://codeload.github.com/aphp/edspdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247500468,"owners_count":20948880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extraction","machine-learning","pdf"],"created_at":"2024-11-25T21:15:08.052Z","updated_at":"2025-10-24T03:03:55.720Z","avatar_url":"https://github.com/aphp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Tests](https://img.shields.io/github/actions/workflow/status/aphp/edspdf/tests.yml?branch=main\u0026label=tests\u0026style=flat-square)\n[![Documentation](https://img.shields.io/github/actions/workflow/status/aphp/edspdf/documentation.yml?branch=main\u0026label=docs\u0026style=flat-square)](https://aphp.github.io/edspdf/latest/)\n[![PyPI](https://img.shields.io/pypi/v/edspdf?color=blue\u0026style=flat-square)](https://pypi.org/project/edspdf/)\n[![Coverage](https://raw.githubusercontent.com/aphp/edspdf/coverage/coverage.svg)](https://raw.githubusercontent.com/aphp/edspdf/coverage/coverage.txt)\n[![DOI](https://zenodo.org/badge/517726737.svg)](https://zenodo.org/badge/latestdoi/517726737)\n\n# EDS-PDF\n\nEDS-PDF provides a modular framework to extract text 
information from PDF documents.\n\nYou can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:\n- 📄 [Extractors](https://aphp.github.io/edspdf/latest/pipes/extractors) to parse PDFs (based on [pdfminer](https://github.com/euske/pdfminer), [mupdf](https://github.com/aphp/edspdf-mupdf) or [poppler](https://github.com/aphp/edspdf-poppler))\n- 🎯 [Classifiers](https://aphp.github.io/edspdf/latest/pipes/box-classifiers) to perform text box classification, in order to segment PDFs\n- 🧩 [Aggregators](https://aphp.github.io/edspdf/latest/pipes/aggregators) to produce an aggregated output from the detected text boxes\n- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., [embedding](https://aphp.github.io/edspdf/latest/pipes/embeddings) building blocks or a [trainable classifier](https://aphp.github.io/edspdf/latest/pipes/box-classifiers/trainable/))\n\nVisit the [:book: documentation](https://aphp.github.io/edspdf/) for more information!\n\n## Getting started\n\n### Installation\n\nInstall the library with pip:\n\n```bash\npip install edspdf\n```\n\n### Extracting text\n\nLet's build a simple PDF extractor that uses a rule-based classifier. 
There are two\nways to do this: with the [configuration system](#configuration) or with\nthe pipeline API.\n\nCreate a configuration file:\n\n\u003ch5 a\u003e\u003cstrong\u003e\u003ccode\u003econfig.cfg\u003c/code\u003e\u003c/strong\u003e\u003c/h5\u003e\n\n```ini\n[pipeline]\npipeline = [\"extractor\", \"classifier\", \"aggregator\"]\n\n[components.extractor]\n@factory = \"pdfminer-extractor\"\n\n[components.classifier]\n@factory = \"mask-classifier\"\nx0 = 0.2\nx1 = 0.9\ny0 = 0.3\ny1 = 0.6\nthreshold = 0.1\n\n[components.aggregator]\n@factory = \"simple-aggregator\"\n```\n\nand load it from Python:\n\n```python\nimport edspdf\nfrom pathlib import Path\n\nmodel = edspdf.load(\"config.cfg\")\n```\n\nOr create a pipeline directly from Python:\n\n```python\nfrom edspdf import Pipeline\n\nmodel = Pipeline()\nmodel.add_pipe(\"pdfminer-extractor\")\nmodel.add_pipe(\n    \"mask-classifier\",\n    config=dict(\n        x0=0.2,\n        x1=0.9,\n        y0=0.3,\n        y1=0.6,\n        threshold=0.1,\n    ),\n)\nmodel.add_pipe(\"simple-aggregator\")\n```\n\nThis pipeline can then be applied (for instance with this [PDF](https://github.com/aphp/edspdf/raw/main/tests/resources/letter.pdf)):\n\n```python\nfrom pathlib import Path\n\n# Read the PDF as bytes (path relative to the repository root)\npdf = Path(\"tests/resources/letter.pdf\").read_bytes()\npdf = model(pdf)\n\nbody = pdf.aggregated_texts[\"body\"]\n\ntext, style = body.text, body.properties\n```\n\nSee the [rule-based recipe](https://aphp.github.io/edspdf/latest/recipes/rule-based) for a step-by-step explanation of what is happening.\n\n## Citation\n\nIf you use EDS-PDF, please cite us as below.\n\n```bibtex\n@software{edspdf,\n  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},\n  doi     = {10.5281/zenodo.6902977},\n  license = {BSD-3-Clause},\n  title   = {{EDS-PDF: Smart text extraction from PDF documents}},\n  url     = {https://github.com/aphp/edspdf}\n}\n```\n\n## 
Acknowledgement\n\nWe would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/) and\n[AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faphp%2Fedspdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faphp%2Fedspdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faphp%2Fedspdf/lists"}