https://github.com/dnlbauer/pdfact-service
Analyze pdf files with pdfact using a simple web API
- Host: GitHub
- URL: https://github.com/dnlbauer/pdfact-service
- Owner: dnlbauer
- License: cc0-1.0
- Created: 2022-11-27T11:15:23.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-11-18T19:11:02.000Z (over 1 year ago)
- Last Synced: 2025-12-07T23:59:29.564Z (6 months ago)
- Language: Python
- Size: 37.1 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
# pdfact-service
[License](./LICENSE) | [Docker image tags](https://hub.docker.com/r/dnlbauer/pdfact-service/tags)

A web service that analyzes the content of PDF documents through an HTTP API. It is a simple HTTP wrapper around [ad-freiburg/pdfact](https://github.com/ad-freiburg/pdfact), and the container image builds pdfact directly from its source.
## Usage
Start the service:
```bash
> docker run -p 80:80 dnlbauer/pdfact-service
[2022-11-27 12:36:38 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2022-11-27 12:36:38 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
[2022-11-27 12:36:38 +0000] [1] [INFO] Using worker: gthread
[2022-11-27 12:36:38 +0000] [7] [INFO] Booting worker with pid: 7
```
PDFs can be `POST`ed to `/analyze` as a multipart file request. The response contains the output of `pdfact`. To choose the response format, set the `Accept` header to the matching MIME type; `pdfact` can produce JSON, XML, and plain text.
```bash
> curl -H "Accept: application/json" -F file=@testfile.pdf localhost:80/analyze
{"paragraphs": [
{"paragraph": {
"role": "page-header",
"positions": [{
"minY": 642.6,
"minX": 210.3,
"maxY": 652.6,
"maxX": 401.6,
"page": 1
}],
"text": "This is a test PDF document."
}},
{"paragraph": {
"role": "body",
"positions": [{
"minY": 628.5,
"minX": 91.2,
"maxY": 636.6,
"maxX": 520.8,
"page": 1
}],
"text": "If you can read this, you are lucky."
}}
]}
```
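The same request can be made from Python. The following is a minimal sketch using only the standard library, assuming the service is reachable at `http://localhost:80` and returns JSON shaped like the example above. The helper names (`encode_multipart`, `analyze_pdf`, `texts_by_role`) are hypothetical and not part of pdfact-service; only the `file` field name, the `/analyze` path, and the `Accept` header come from the curl call.

```python
import os
import urllib.request
import uuid


def encode_multipart(field, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def analyze_pdf(path, base_url="http://localhost:80", accept="application/json"):
    """POST a PDF to /analyze and return the raw response text.

    base_url is an assumption matching the `docker run -p 80:80` example.
    """
    with open(path, "rb") as fh:
        # "file" is the form field name used by the curl example (-F file=@...).
        body, content_type = encode_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        f"{base_url}/analyze",
        data=body,
        headers={"Content-Type": content_type, "Accept": accept},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def texts_by_role(result, role):
    """Collect the text of every paragraph with the given role from a parsed JSON response."""
    return [
        item["paragraph"]["text"]
        for item in result.get("paragraphs", [])
        if item["paragraph"].get("role") == role
    ]
```

With the service running, `texts_by_role(json.loads(analyze_pdf("testfile.pdf")), "body")` (after `import json`) would return the body paragraphs of the document as a list of strings.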
The supported `pdfact` CLI arguments (`--units`, `--roles`) can be supplied as HTTP query parameters. Example:
```bash
> curl ... "localhost:80/analyze?roles=body&units=words"
```
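When calling the service from code, it is usually safer to let a library build the query string than to concatenate it by hand. A small sketch, assuming the same `http://localhost:80` address as above; `build_analyze_url` is a hypothetical helper, and only `roles` and `units` are parameters named in this README.

```python
from urllib.parse import urlencode


def build_analyze_url(base_url="http://localhost:80", **params):
    """Return the /analyze URL, appending any given options as a query string."""
    # Drop options that were not set so they are not sent as the string "None".
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{base_url}/analyze" + (f"?{query}" if query else "")


print(build_analyze_url(roles="body", units="words"))
# → http://localhost:80/analyze?roles=body&units=words
```

`urlencode` also percent-encodes characters that are not safe in a URL, which avoids the shell-quoting pitfalls of a hand-written query string.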
## Thanks
All credit goes to the Algorithms and Data Structures Group at the
University of Freiburg for [ad-freiburg/pdfact](https://github.com/ad-freiburg/pdfact).
## License
Published under CC0. Do whatever you want :-)
*(but also check the license of pdfact if you are going to use the image as is).*