https://github.com/harryr/pandapdf
PDF to images with content / link extraction
- Host: GitHub
- URL: https://github.com/harryr/pandapdf
- Owner: HarryR
- License: gpl-2.0
- Created: 2015-09-07T20:51:56.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-12-10T02:03:51.000Z (over 10 years ago)
- Last Synced: 2025-01-24T23:27:28.415Z (about 1 year ago)
- Topics: ghostscript, pdf, pdf-converter, photoshop, poppler, webp
- Language: C++
- Homepage:
- Size: 75.2 KB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: COPYING
```
____ _ ____ ____ _____
| _ \ __ _ _ __ __| | __ _| _ \| _ \| ___|
| |_) / _` | '_ \ / _` |/ _` | |_) | | | | |_
| __/ (_| | | | | (_| | (_| | __/| |_| | _|
|_| \__,_|_| |_|\__,_|\__,_|_| |____/|_|
```
PandaPDF was developed as one of the components used to fully automate the
workflow of a digital magazine publishing company. The aim was to convert PDF
files into images so they could be distributed without requiring full PDF
reading software on the client. This was necessary because delivering
high-DPI, print-quality PDF files to portable devices wasn't feasible.
This software converts a PDF file into individual image files at different
resolutions, which can be streamed to clients on an as-needed basis. When using
WEBP or lower-quality JPEG, the aim is for the total download size to be lower
than if the PDF were retrieved in full.
*In almost all situations the output of PandaPDF, when used with the Poppler and
Ghostscript backends, will be visually identical to that of Photoshop, even with
very complex PDFs.*
Super-sampling is used in an attempt to mitigate the aliasing and float-rounding
problems that occur with highly complex graphical artwork. When this type of PDF
is rendered by Photoshop there are no visual artefacts, but when it is rendered
with Acrobat, Ghostscript or Poppler, small white gaps appear between elements.
The super-sampling technique renders the PDF at a higher resolution than
required, then down-samples it to produce the final images.
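The render-then-shrink step described above can be sketched as a size calculation. This is a minimal sketch, not PandaPDF's actual code: `RenderPlan` and `plan_render` are hypothetical names, and it assumes that a profile's maximum dimension applies to the longer side of the page box and that the `-supersample` factor scales both axes equally.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper type: sizes for one page and one output profile.
struct RenderPlan {
    int final_w, final_h;    // size of the image written to disk, in pixels
    int render_w, render_h;  // size rasterized before down-sampling, in pixels
};

// page_w_pt / page_h_pt: size of the chosen PDF box in points.
// max_dim: largest allowed output dimension in pixels (assumed: longer side).
// supersample: render scale relative to the final size, e.g. 1.5.
RenderPlan plan_render(double page_w_pt, double page_h_pt,
                       int max_dim, double supersample) {
    // Scale so that the longer side of the page lands exactly on max_dim.
    const double scale = max_dim / std::max(page_w_pt, page_h_pt);

    RenderPlan plan;
    plan.final_w = static_cast<int>(std::lround(page_w_pt * scale));
    plan.final_h = static_cast<int>(std::lround(page_h_pt * scale));

    // Rasterize larger than needed; the backend output is then shrunk
    // to final_w x final_h, which averages away hairline gaps.
    plan.render_w = static_cast<int>(std::lround(plan.final_w * supersample));
    plan.render_h = static_cast<int>(std::lround(plan.final_h * supersample));
    return plan;
}
```

For a US Letter page (612 x 792 points) with a maximum dimension of 1500 and a factor of 1.5, this plans a 1739 x 2250 render that is down-sampled to 1159 x 1500.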
For best bandwidth efficiency *we highly recommend using WEBP output*. On average
it is half the size of JPEG output at 80% quality, which results in significant
bandwidth savings.
PandaPDF software provides the following features:
* High-Quality PDF rasterization
* Thumbnail Generation
* Content extraction
* Interactive Region extraction
* Manifest creator
* Output to PNG, JPEG and WEBP
v3.0 Goals
----------
* Multiple PDF rasterizers, Photoshop and Ghostscript
* Remove usage of boost dependencies.
* Remove GPLv2 code from project, open-source it
* Decouple output format from rendering component.
* Document problems with coordinate transforms etc.
* Improved build system, builds on Win32, OSX and Linux
* Fully modular architecture with cleaner code
* Include 'poppler-data' package in distribution. (FAIL)
* Regression tests. (FAIL)
* Unit tests. (FAIL)
Version History
---------------
* 3.0 - Photoshop & Ghostscript support
* 2.x - Refactoring, Modular code and backends
* 2.0 - Improved Quality, Cairo backend
* 1.x - Multi-threaded, Text Extraction
* 1.0 - Production run, cloud enabled
* 0.x - C++/Poppler Prototype
* -1 - Java Prototype
Commandline
-----------
```
Usage: pandapdf <-options> [ ...]
Options:
-debug : Enable debug logging
-quiet : Log only warnings & errors
-pdf : PDF File
-out : Output Directory
-box : Which PDF box to use [crop,trim,media]
-no-images : Disable Image Output
-json-regions : Enable JSON regions
-json-words : Enable JSON words
-text-words : Enable text words
-opw : Owner/Modify Password
-upw : User/Open Password
-first : First Page
-last : Last Page
-image-backend : Which backend to use
-supersample : Supersample at N times resolution [default 1.5]
Image Backends, set with '-image-backend':
- poppler-cairo (default)
- poppler-splash
- ghostscript
Examples:
$ pandapdf -pdf test.pdf -out dir -json-regions large,jpeg@80,1500 medium,webp@90,720
Versions:
PandaPDF : 3.0.1 (c) 2009-2015 PixelMags Inc., H Roberts
Poppler : 0.38.0
Cairo : 1.12.18
Freetype : 0e0fdc5dc89e5079898c5da67b56f994c439fee1
FontConfig : 2.11.94
Pixman : 0.32.8
libJBIG : 2.1
libPNG : 1.2.54
libJPEG : 8d
libTIFF : 4.0.6
libWEBP : 0.4.4
```
Example Usage:
pandapdf -pdf test.pdf -out ~/test/ large,jpeg@80,800 medium,webp@80,800
This would extract images of all pages from `test.pdf` into the `~/test/` directory. The output files are named according to the format:
`page_%04d_%s.%s`, e.g. `page_0023_large.jpg`
The different profiles specify the maximum dimension and output quality for
each image.
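To make the profile arguments and the naming scheme concrete, here is a minimal sketch of how a profile such as `large,jpeg@80,1500` could be split up and turned into a file name. It is not taken from the PandaPDF source: `Profile`, `parse_profile` and `output_name` are hypothetical names, the `name,format@quality,max-dimension` layout is inferred from the examples above, and the `jpeg` to `.jpg` extension mapping is an assumption based on `page_0023_large.jpg`. Profiles without an `@quality` part are not handled.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical representation of one output profile argument.
struct Profile {
    std::string name;    // e.g. "large", used in the output file name
    std::string format;  // e.g. "jpeg" or "webp"
    int quality;         // e.g. 80
    int max_dim;         // maximum output dimension in pixels, e.g. 1500
};

// Parses "name,format@quality,max_dim" (layout assumed from the examples).
Profile parse_profile(const std::string& spec) {
    const auto npos = std::string::npos;
    const auto c1 = spec.find(',');
    if (c1 == npos) throw std::invalid_argument("bad profile: " + spec);
    const auto at = spec.find('@', c1 + 1);
    if (at == npos) throw std::invalid_argument("bad profile: " + spec);
    const auto c2 = spec.find(',', at + 1);
    if (c2 == npos) throw std::invalid_argument("bad profile: " + spec);

    Profile p;
    p.name = spec.substr(0, c1);
    p.format = spec.substr(c1 + 1, at - c1 - 1);
    p.quality = std::stoi(spec.substr(at + 1, c2 - at - 1));
    p.max_dim = std::stoi(spec.substr(c2 + 1));
    return p;
}

// Builds a file name following the documented pattern page_%04d_%s.%s.
std::string output_name(int page, const Profile& p) {
    // Assumed mapping: the "jpeg" format is written with a ".jpg" extension.
    const std::string ext = (p.format == "jpeg") ? "jpg" : p.format;
    char buf[256];
    std::snprintf(buf, sizeof buf, "page_%04d_%s.%s",
                  page, p.name.c_str(), ext.c_str());
    return std::string(buf);
}
```

With these definitions, `output_name(23, parse_profile("large,jpeg@80,1500"))` yields `page_0023_large.jpg`, matching the example above.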
Building
--------
The software can be built using system dependencies, but also includes a build
system that compiles all required libraries from scratch to produce a mostly
static executable.
The `pandapdf` target produces a dynamic executable by default. If
`make dependencies` is run first, the build compiles all required libraries and
statically links them into a single executable, leaving only the system libc
dynamically linked. If `make dependencies` is not run, all required dependencies
are dynamically linked from the system.
```
apt-get install autoconf automake make g++ gcc wget git libtool pkg-config xz-utils libexpat1-dev ghostscript upx cmake libbz2-dev
make dependencies
make
```
On OSX, Windows and Linux the executable is about 9 MB.
After compression with UPX the executable is a little over 2 MB.
Licensing
---------
The software is inextricably tied to the Poppler library and includes one
source file (parseargs) from that project. As Poppler is released under the
GPLv2 license, this program must comply with its licensing restrictions:
PandaPDF is considered a derivative work, so its binaries cannot be distributed
without also releasing the source code. The source code has therefore been
released under the terms of the GPLv2 (see the `COPYING` file) so that it can be
further developed, improved and distributed.