https://github.com/ungerik/pdfzig
A fast, cross-platform PDF utility tool written in Zig, powered by PDFium
https://github.com/ungerik/pdfzig
pdf pdf-processing pdf-rendering pdfium zig
Last synced: 21 days ago
JSON representation
A fast, cross-platform PDF utility tool written in Zig, powered by PDFium
- Host: GitHub
- URL: https://github.com/ungerik/pdfzig
- Owner: ungerik
- License: mit
- Created: 2026-01-04T09:26:53.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-01-10T20:08:29.000Z (about 1 month ago)
- Last Synced: 2026-01-11T04:19:36.728Z (about 1 month ago)
- Topics: pdf, pdf-processing, pdf-rendering, pdfium, zig
- Language: Zig
- Homepage:
- Size: 1.21 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# pdfzig
A fast, cross-platform PDF utility tool written in Zig, powered by PDFium.
## Features
- **Render** PDF pages to PNG or JPEG images at any DPI
- **Extract text** content from PDFs
- **Extract images** embedded in PDF pages
- **Extract attachments** embedded in PDFs with glob pattern filtering
- **Visual diff** - compare two PDFs visually, pixel by pixel
- **Display info** including metadata, page count, encryption status, PDF/A conformance, attachments, and PDF version
- **Rotate** pages by 90, 180, or 270 degrees
- **Mirror** pages horizontally or vertically
- **Delete** pages from PDFs
- **Add** new pages with optional image or text content
- **Create** new PDFs from multiple sources (PDFs, images, text files)
- **Attach** files as PDF attachments
- **Detach** (remove) attachments from PDFs
- Support for **password-protected** PDFs
- **Multi-resolution output** - generate multiple image sizes in one pass
- **Runtime PDFium linking** - download or link PDFium libraries dynamically
## Installation
### Requirements
- Zig 0.15.1 or later
### Build
```bash
git clone https://github.com/ungerik/pdfzig.git
cd pdfzig
zig build
```
### Download PDFium
After building, download the PDFium library for your platform:
```bash
# Download latest PDFium build
./zig-out/bin/pdfzig download_pdfium
# Or download a specific Chromium build version
./zig-out/bin/pdfzig download_pdfium 7606
```
PDFium is automatically downloaded on first use if not already present. The library is installed next to the pdfzig executable with the naming pattern `libpdfium_v{BUILD}.dylib` (macOS), `libpdfium_v{BUILD}.so` (Linux), or `pdfium_v{BUILD}.dll` (Windows).
Downloads are verified using SHA256 checksums from the GitHub release API to ensure authenticity and integrity.
The executable will be in `zig-out/bin/`.
## Usage
### Render PDF to Images
```bash
# Render all pages to PNG at 150 DPI (default)
pdfzig render document.pdf
# Render to a specific directory
pdfzig render document.pdf ./output
# Render at 300 DPI
pdfzig render -O 300:png:0:page_{num}.png document.pdf
# Render specific pages
pdfzig render -p 1-5,10 document.pdf
# Multi-resolution: full-size PNG + JPEG thumbnail
pdfzig render -O 300:png:0:{basename}_{num0}.png -O 72:jpeg:85:thumb_{num}.jpg document.pdf
```
Output specification format: `DPI:FORMAT:QUALITY:TEMPLATE`
- **DPI**: Resolution (e.g., 72, 150, 300)
- **FORMAT**: `png` or `jpeg`/`jpg`
- **QUALITY**: JPEG quality 1-100 (ignored for PNG, use 0)
- **TEMPLATE**: Filename with variables `{num}`, `{num0}`, `{basename}`, `{ext}`
### Extract Text
Extract text content from PDF pages in plain text or structured JSON format. By default, pages are separated by a single newline. Add custom separators with template variables and escape sequences using `--page-separator`.
```bash
# Print text to stdout
pdfzig extract_text document.pdf
# Save to file
pdfzig extract_text -o output.txt document.pdf
# Extract from specific pages
pdfzig extract_text -p 1-10 document.pdf
# Extract as JSON with formatting information
pdfzig extract_text --format json document.pdf
# Save JSON to file
pdfzig extract_text -f json -o output.json document.pdf
# Add page separator with page number
pdfzig extract_text --page-separator "--- Page {{PAGE_NO}} ---" document.pdf
# Separator with escape sequences (multiple newlines)
pdfzig extract_text --page-separator '\n\n=== Page {{PAGE_NO}} ===\n\n' document.pdf
```
#### JSON Output Format
When using `--format json`, the output contains structured text blocks with formatting information:
```json
{
"pages": [
{
"page": 1,
"width": 595.28,
"height": 841.89,
"blocks": [
{
"text": "Hello World",
"bbox": {"left": 72.0, "top": 750.0, "right": 150.0, "bottom": 738.0},
"font": "Helvetica",
"size": 12.0,
"weight": 400,
"italic": false,
"color": "#000000"
}
]
}
]
}
```
Each text block contains:
- **text**: The text content
- **bbox**: Bounding box with `left`, `top`, `right`, `bottom` coordinates (in PDF points)
- **font**: Font name (or `null` if unavailable)
- **size**: Font size in points
- **weight**: Font weight (400 = normal, 700 = bold, -1 = unavailable)
- **italic**: Whether the text is italic
- **color**: CSS-compatible hex color (`#rrggbb` or `#rrggbbaa` with alpha)
### Extract Embedded Images
```bash
# Extract all images as PNG
pdfzig extract_images document.pdf
# Extract to specific directory as JPEG
pdfzig extract_images -f jpeg -Q 90 document.pdf ./images
# Extract from specific pages
pdfzig extract_images -p 1-5 document.pdf
```
### Extract Attachments
```bash
# Extract all attachments
pdfzig extract_attachments document.pdf
# Extract only XML files using glob pattern
pdfzig extract_attachments document.pdf "*.xml"
# Extract to specific directory
pdfzig extract_attachments document.pdf "*.xml" ./xml-output
# List all attachments without extracting
pdfzig extract_attachments -l document.pdf
# List only JSON files
pdfzig extract_attachments -l document.pdf "*.json"
```
Pattern syntax: `*` matches any characters, `?` matches a single character
### Visual Diff
Compare two PDFs visually by rendering and comparing pixels:
```bash
# Compare two PDFs (exit code 0 = identical, 1 = different)
pdfzig visual_diff original.pdf modified.pdf
# Compare at higher resolution (default: 150 DPI)
pdfzig visual_diff -r 300 doc1.pdf doc2.pdf
# Generate diff images showing differences
pdfzig visual_diff -o ./diffs doc1.pdf doc2.pdf
# Use RGB mode for per-channel color diffs
pdfzig visual_diff -o ./diffs --colors rgb doc1.pdf doc2.pdf
# Use gray mode (average diff) instead of contrast mode
pdfzig visual_diff -o ./diffs --colors gray doc1.pdf doc2.pdf
# Invert colors (white = identical, black = max difference)
pdfzig visual_diff -o ./diffs --invert doc1.pdf doc2.pdf
# Compare encrypted PDFs
pdfzig visual_diff -P secret1 -P secret2 enc1.pdf enc2.pdf
```
When `-o` is specified, diff images are created showing pixel differences.
Output files are named `diff_page1.png`, `diff_page2.png`, etc.
Color modes (`--colors`):
- **contrast** (default): Grayscale where values are scaled so the maximum difference appears as white
- **gray**: Grayscale where each pixel's brightness is the average RGB difference
- **rgb**: Per-channel RGB differences shown as color values
Use `--invert` to flip colors so identical pixels are white and differences are black.
### Display PDF Information
```bash
pdfzig info document.pdf
```
Output example:
```
File: document.pdf
Pages: 10
PDF Version: 1.7
Encrypted: No
Metadata:
Title: My Document
Author: John Doe
Creator: LaTeX
Producer: pdfTeX-1.40.23
Creation Date: D:20240101120000+00'00'
PDF/A: PDF/A-1b
Attachments: 2
invoice.xml [XML]
data.json
XML files: 1 (use 'extract_attachments "*.xml"' to extract)
```
The `PDF/A` field is automatically detected from XMP metadata and displays the conformance level (e.g., PDF/A-1b, PDF/A-2u, PDF/A-3a). This field is only shown for PDF/A-compliant documents.
Use `--json` for machine-readable output with per-page dimensions:
```bash
pdfzig info --json document.pdf
```
### Rotate Pages
```bash
# Rotate all pages 90 degrees clockwise
pdfzig rotate document.pdf 90
# Rotate specific pages 180 degrees
pdfzig rotate -p 1,3,5 document.pdf 180
# Rotate and save to a different file
pdfzig rotate -o rotated.pdf document.pdf 270
# Use aliases: right (90°) and left (-90°)
pdfzig rotate document.pdf right
pdfzig rotate document.pdf left
```
Supported angles: `90`, `180`, `270` (clockwise), or `left`/`right`
### Mirror Pages
```bash
# Mirror all pages horizontally (left-right flip)
pdfzig mirror document.pdf
# Mirror vertically (up-down flip)
pdfzig mirror --updown document.pdf
# Mirror specific pages
pdfzig mirror -p 1,3,5 document.pdf
# Mirror both horizontally and vertically
pdfzig mirror --leftright --updown document.pdf
# Save to a different file
pdfzig mirror -o mirrored.pdf document.pdf
```
### Delete Pages
```bash
# Delete page 5
pdfzig delete document.pdf 5
# Delete pages 1, 3, and 5-10
pdfzig delete document.pdf 1,3,5-10
# Delete and save to a different file
pdfzig delete -o trimmed.pdf document.pdf 1-3
# Delete all pages (replaces with one empty page of same size as first page)
pdfzig delete document.pdf
```
### Add Pages
```bash
# Add an empty page at the end
pdfzig add document.pdf
# Add an empty page at position 3
pdfzig add -p 3 document.pdf
# Add a page with an image (scaled to fit)
pdfzig add document.pdf image.png
# Add a page with text content
pdfzig add document.pdf notes.txt
# Add a page with formatted text from JSON (same format as extract_text --format json)
pdfzig add document.pdf content.json
# Specify page size using standard names
pdfzig add -s A4 document.pdf
pdfzig add -s Letter document.pdf
# Use landscape orientation (append 'L')
pdfzig add -s A4L document.pdf
# Specify size with units
pdfzig add -s 210x297mm document.pdf
pdfzig add -s 8.5x11in document.pdf
```
Supported page sizes: A0-A8, B0-B6, C4-C6, Letter, Legal, Tabloid, Ledger, Executive, Folio, Quarto, Statement
Supported units: `mm`, `cm`, `in`/`inch`, `pt` (points, default)
### Create PDF
Create a new PDF from multiple sources (PDFs, images, text files):
```bash
# Merge two PDFs
pdfzig create -o combined.pdf doc1.pdf doc2.pdf
# Import only specific pages from a PDF
pdfzig create -o excerpt.pdf -p 1-5,10 document.pdf
# Create from mixed sources: image cover + PDF content
pdfzig create -o book.pdf cover.png content.pdf
# Insert blank pages using :blank
pdfzig create -o padded.pdf :blank document.pdf :blank
# Combine everything: blank + image + PDF pages + text
pdfzig create -o report.pdf :blank logo.png -p 1-3 intro.pdf notes.txt
# Specify page size for images and text (default: A4)
pdfzig create -s Letter -o output.pdf image.png notes.txt
```
Source types:
- **PDF files**: Import pages (all pages or use `-p` for specific pages)
- **Images**: PNG, JPEG, BMP - creates a page with the image scaled to fit
- **Text files**: Creates a page with the text content
- **JSON files**: Creates pages with formatted text blocks (same format as `extract_text --format json`)
- **`:blank`**: Inserts a blank page
#### Standard PDF Fonts
When using JSON files for text input, fonts are mapped to the PDF Standard Base 14 Fonts which are guaranteed to be available in all PDF viewers:
| Font Family | Variants |
|--------------|----------------------------------------|
| Helvetica | Regular, Bold, Oblique, Bold-Oblique |
| Courier | Regular, Bold, Oblique, Bold-Oblique |
| Times | Roman, Bold, Italic, Bold-Italic |
| Symbol | Regular |
| ZapfDingbats | Regular |
Font names in JSON are matched using these rules:
- Names containing "Courier" or "Mono" → Courier
- Names containing "Times" or "Serif" → Times
- All other names → Helvetica (default sans-serif)
- Bold/Italic variants are selected based on `weight` (≥600) and `italic` fields
### Attach Files
```bash
# Attach a file to the PDF
pdfzig attach document.pdf invoice.xml
# Attach multiple files
pdfzig attach document.pdf file1.json file2.xml
# Attach files matching a glob pattern
pdfzig attach -g "*.xml" document.pdf
# Save to a different file
pdfzig attach -o with_attachments.pdf document.pdf data.json
```
### Detach (Remove) Attachments
```bash
# Remove attachment by index
pdfzig detach -i 0 document.pdf
# Remove attachments matching a glob pattern
pdfzig detach -g "*.xml" document.pdf
# Save to a different file
pdfzig detach -o clean.pdf -g "*.tmp" document.pdf
```
### Use a Specific PDFium Library
The `--link` global option loads PDFium from a specific path instead of the default location:
```bash
# Use a specific PDFium library for a command
pdfzig --link /path/to/libpdfium.dylib info document.pdf
# Works with any command
pdfzig --link /usr/local/lib/libpdfium.dylib render document.pdf
```
The version number is automatically parsed from filenames matching the pattern `libpdfium_v{VERSION}.dylib` (or `.so`/`.dll` on other platforms).
### Password-Protected PDFs
All commands support the `-P` flag for encrypted PDFs:
```bash
pdfzig info -P mypassword encrypted.pdf
pdfzig render -P mypassword encrypted.pdf
pdfzig extract_text -P mypassword encrypted.pdf
```
### Page Selection
Many commands support the `-p` option to select specific pages. If not specified, the command operates on all pages.
Page selection syntax:
- Single page: `-p 5`
- Multiple pages: `-p 1,3,5`
- Page range: `-p 1-10`
- Combined: `-p 1-5,8,10-12`
Commands supporting `-p`: `render`, `extract_text`, `extract_images`, `rotate`, `mirror`, `create`
## Supported Platforms
| Platform | Architecture |
|----------|--------------------|
| macOS | arm64, x86_64 |
| Linux | x86_64, arm64, arm |
| Windows | x86_64, x86, arm64 |
## Dependencies
- [PDFium](https://pdfium.googlesource.com/pdfium/) - PDF rendering engine (dynamically loaded at runtime from [pdfium-binaries](https://github.com/bblanchon/pdfium-binaries))
- [zigimg](https://github.com/zigimg/zigimg) - PNG encoding
- [zstbi](https://github.com/zig-gamedev/zstbi) - JPEG encoding
## Development
```bash
# Build
zig build
# Run tests
zig build test --summary all
# Run directly
zig build run -- info document.pdf
# Check source code formatting
zig build fmt
# Fix source code formatting
zig build fmt-fix
# Remove build artifacts and caches (zig-out/, .zig-cache/, test-cache/)
zig build clean
# Build for all supported platforms (outputs to zig-out//)
zig build all
```
### Cross-Compilation Targets
The `zig build all` command builds for all supported platforms:
| Platform | Architecture | Output Directory |
|----------|--------------|--------------------------------|
| macOS | x86_64 | `zig-out/x86_64-macos-none/` |
| macOS | arm64 | `zig-out/aarch64-macos-none/` |
| Linux | x86_64 | `zig-out/x86_64-linux-gnu/` |
| Linux | arm64 | `zig-out/aarch64-linux-gnu/` |
| Linux | arm | `zig-out/arm-linux-gnueabihf/` |
| Windows | x86_64 | `zig-out/x86_64-windows-gnu/` |
| Windows | x86 | `zig-out/x86-windows-gnu/` |
| Windows | arm64 | `zig-out/aarch64-windows-gnu/` |
### Build Options
| Option | Description |
|---------------------------|----------------------------------------------------------------|
| `-Ddownload-pdfium` | Download PDFium library for target platform(s) |
| `-Doptimize=ReleaseFast` | Build with optimizations for speed |
| `-Doptimize=ReleaseSmall` | Build with optimizations for size |
| `-Doptimize=ReleaseSafe` | Build with optimizations and runtime safety checks |
| `-Dtarget=` | Cross-compile for a specific target (e.g., `x86_64-linux-gnu`) |
Examples:
```bash
# Build with PDFium library included
zig build -Ddownload-pdfium
# Build optimized release with PDFium
zig build -Doptimize=ReleaseFast -Ddownload-pdfium
# Build for all platforms with PDFium libraries included
zig build all -Ddownload-pdfium
# Cross-compile for Linux with PDFium
zig build -Dtarget=x86_64-linux-gnu -Ddownload-pdfium
```
### Testing
Run the test suite:
```bash
# Run all tests except network-dependent tests
zig build test --summary all
# Run all tests including network-dependent tests (downloads test PDFs)
RUN_NETWORK_TESTS=1 zig build test --summary all
```
#### Test Types
**Unit Tests** (always run):
- CLI argument parsing
- Color parsing
- Font mapping
- PDF/A conformance parsing
- Page range parsing
**PDFium Tests** (always run):
- Golden file tests (render, rotate, mirror operations)
- PDF roundtrip tests (create PDF → extract text)
- Requires: PDFium library (auto-downloads if missing) + local test files in `test-files/input/`
**Network Tests** (require `RUN_NETWORK_TESTS=1`):
- Integration tests using sample PDFs from [py-pdf/sample-files](https://github.com/py-pdf/sample-files)
- Integration tests using ZUGFeRD invoices from [ZUGFeRD/corpus](https://github.com/ZUGFeRD/corpus)
- Requires: PDFium library + network access
- Test PDFs are cached in `test-cache/` directory (gitignored)
#### Golden File Testing
Visual regression tests compare PDF operations against reference "golden files":
```bash
# Generate/regenerate golden files
zig build generate-golden-files
# Delete and regenerate all golden files
zig build generate-golden-files -Dclean
# Run tests (golden file tests always run)
zig build test
```
Golden files are stored in `test-files/expected/` and checked into git. Tests use pixel-level PNG comparison (not byte-level) to handle minor rendering variations across platforms.
## License
MIT License - see [LICENSE](LICENSE)
### Third-Party Licenses
This software uses PDFium (BSD-3-Clause/Apache-2.0), zigimg (MIT), and stb libraries (MIT/Public Domain). See [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md) for details.