https://github.com/aphp/edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as body or metadata.
extraction machine-learning pdf
- Host: GitHub
- URL: https://github.com/aphp/edspdf
- Owner: aphp
- License: bsd-3-clause
- Created: 2022-07-25T15:47:09.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2025-02-12T14:12:06.000Z (over 1 year ago)
- Last Synced: 2025-03-30T14:11:15.447Z (about 1 year ago)
- Topics: extraction, machine-learning, pdf
- Language: Python
- Homepage: https://aphp.github.io/edspdf/
- Size: 8.93 MB
- Stars: 46
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: changelog.md
- Contributing: contributing.md
- License: LICENSE
- Citation: CITATION.cff
- Roadmap: docs/roadmap.md

# EDS-PDF
EDS-PDF provides a modular framework to extract text information from PDF documents.
You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:
- 📄 [Extractors](https://aphp.github.io/edspdf/latest/pipes/extractors) to parse PDFs (based on [pdfminer](https://github.com/euske/pdfminer), [mupdf](https://github.com/aphp/edspdf-mupdf) or [poppler](https://github.com/aphp/edspdf-poppler))
- 🎯 [Classifiers](https://aphp.github.io/edspdf/latest/pipes/box-classifiers) to perform text box classification, in order to segment PDFs
- 🧩 [Aggregators](https://aphp.github.io/edspdf/latest/pipes/aggregators) to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., [embedding](https://aphp.github.io/edspdf/latest/pipes/embeddings) building blocks or a [trainable classifier](https://aphp.github.io/edspdf/latest/pipes/box-classifiers/trainable/))
Visit the [📖 documentation](https://aphp.github.io/edspdf/) for more information!
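As a rough mental model of how these components fit together, the sketch below chains three plain Python functions that mimic the extractor, classifier and aggregator roles. It is not EDS-PDF's API: `TextBox`, `toy_extract`, `toy_classify`, `toy_aggregate` and `toy_pipeline` are hypothetical names, the labels `"body"` and `"other"` are chosen for illustration, and the "extractor" simply returns hard-coded boxes instead of parsing a real PDF.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextBox:
    """Hypothetical stand-in for a text box found on a page."""

    text: str
    y0: float  # top edge, relative to page height (0 = top, 1 = bottom)
    label: str = ""


def toy_extract(_pdf: bytes) -> List[TextBox]:
    # A real extractor would parse the PDF; here the boxes are hard-coded.
    return [
        TextBox("Hospital header", y0=0.05),
        TextBox("Dear colleague,", y0=0.35),
        TextBox("The patient is doing well.", y0=0.45),
        TextBox("Page 1/1", y0=0.95),
    ]


def toy_classify(boxes: List[TextBox]) -> List[TextBox]:
    # Illustrative rule: boxes in the middle band of the page are "body".
    for box in boxes:
        box.label = "body" if 0.2 <= box.y0 <= 0.9 else "other"
    return boxes


def toy_aggregate(boxes: List[TextBox]) -> Dict[str, str]:
    # Join the text of the boxes, grouped by label, from top to bottom.
    grouped: Dict[str, List[str]] = {}
    for box in sorted(boxes, key=lambda b: b.y0):
        grouped.setdefault(box.label, []).append(box.text)
    return {label: "\n".join(texts) for label, texts in grouped.items()}


def toy_pipeline(pdf: bytes) -> Dict[str, str]:
    # Extract, then classify, then aggregate.
    return toy_aggregate(toy_classify(toy_extract(pdf)))


texts = toy_pipeline(b"")
print(texts["body"])
```

Each stage only depends on the output of the previous one, which is what makes it possible to swap one extractor or classifier for another without touching the rest.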
## Getting started
### Installation
Install the library with pip:
```bash
pip install edspdf
```
### Extracting text
Let's build a simple PDF extractor that uses a rule-based classifier. There are two
ways to do this, either by using the [configuration system](#configuration) or by using
the pipeline API.
Create a configuration file named `config.cfg`:
```ini
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]
[components.extractor]
@factory = "pdfminer-extractor"
[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1
[components.aggregator]
@factory = "simple-aggregator"
```
and load it from Python:
```python
import edspdf
from pathlib import Path
model = edspdf.load("config.cfg")
```
Or create a pipeline directly from Python:
```python
from edspdf import Pipeline
model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
```
This pipeline can then be applied (for instance with this [PDF](https://github.com/aphp/edspdf/raw/main/tests/resources/letter.pdf)):
```python
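from pathlib import Path

# Read the PDF as bytes (adjust the path to wherever you saved the file)
pdf = Path("letter.pdf").read_bytes()
# Apply the pipeline to the document
pdf = model(pdf)
# Retrieve the aggregated body text and its style properties
body = pdf.aggregated_texts["body"]
text, style = body.text, body.properties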
```
See the [rule-based recipe](https://aphp.github.io/edspdf/latest/recipes/rule-based) for a step-by-step explanation of what is happening.
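To build some intuition for the `x0`, `x1`, `y0`, `y1` and `threshold` parameters used above, here is a standalone sketch of one plausible mask rule: a box is labelled "body" when the fraction of its area that falls inside the mask rectangle exceeds the threshold. This is an assumption about the general idea, not a copy of the library's implementation. `Box`, `overlap_ratio` and `classify` are hypothetical names, the `"other"` fallback label is illustrative, and coordinates are assumed to be relative to the page size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in relative page coordinates (assumed 0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Clip negative extents so that empty intersections have zero area.
        return max(self.x1 - self.x0, 0.0) * max(self.y1 - self.y0, 0.0)


def overlap_ratio(box: Box, mask: Box) -> float:
    """Fraction of the box's area that lies inside the mask."""
    inter = Box(
        max(box.x0, mask.x0),
        max(box.y0, mask.y0),
        min(box.x1, mask.x1),
        min(box.y1, mask.y1),
    )
    return inter.area / box.area if box.area > 0 else 0.0


def classify(box: Box, mask: Box, threshold: float = 0.1) -> str:
    # Label the box "body" when enough of it falls inside the mask.
    return "body" if overlap_ratio(box, mask) > threshold else "other"


# Same mask values as in the configuration above.
mask = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)

inside = Box(0.25, 0.35, 0.8, 0.4)    # fully inside the mask
header = Box(0.1, 0.02, 0.9, 0.08)    # above the mask, no overlap
partial = Box(0.0, 0.3, 0.4, 0.4)     # half of its width is inside the mask

for name, box in [("inside", inside), ("header", header), ("partial", partial)]:
    print(name, classify(box, mask))
```

Under this rule, a low threshold such as `0.1` keeps boxes that only partly overlap the mask, while a threshold close to `1` would keep only boxes that lie almost entirely inside it.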
## Citation
If you use EDS-PDF, please cite us as below.
```bibtex
@software{edspdf,
author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
doi = {10.5281/zenodo.6902977},
license = {BSD-3-Clause},
title = {{EDS-PDF: Smart text extraction from PDF documents}},
url = {https://github.com/aphp/edspdf}
}
```
## Acknowledgement
We would like to thank [Assistance Publique – Hôpitaux de Paris](https://www.aphp.fr/) and
[AP-HP Foundation](https://fondationrechercheaphp.fr/) for funding this project.