https://github.com/eraydin/semantic-pdf-diff
semantic-pdf-diff is a Rust CLI and library for producing stable, evidence-preserving semantic diffs for digitally generated PDFs.
- Host: GitHub
- URL: https://github.com/eraydin/semantic-pdf-diff
- Owner: eraydin
- Created: 2026-05-09T17:34:55.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-22T22:52:47.000Z (about 1 month ago)
- Last Synced: 2026-05-23T00:16:27.256Z (about 1 month ago)
- Topics: cli, comparison, content, diff, document, extract, layout, parser, pdf, report, semantic, text
- Language: Rust
- Homepage: http://eraydin.net/semantic-pdf-diff/
- Size: 939 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md
README
# semantic-pdf-diff
`semantic-pdf-diff` is a Rust CLI and library for producing stable,
evidence-preserving semantic diffs for digitally generated PDFs.
The CLI binary is `spdfdiff`.
Full documentation lives in [`docs/`](docs/) and is published at
[eraydin.github.io/semantic-pdf-diff](https://eraydin.github.io/semantic-pdf-diff/).
## Status
Project status: `compatibility-gate`.
The project is useful for controlled digitally generated PDFs and committed
sample scenarios. It does not yet claim broad public-alpha compatibility with
arbitrary real-world PDFs. Unsupported or degraded PDF features should appear as
stable diagnostics rather than silent success.
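The "diagnostics rather than silent success" idea can be illustrated with a small, self-contained Rust sketch. Everything in it is hypothetical and for illustration only: `Diagnostic`, `Extraction`, `extract`, and the `UNSUPPORTED_FEATURE` code are not types, functions, or codes from this project's crates. It only shows the general shape of returning partial results together with diagnostics instead of dropping unsupported input quietly.

```rust
/// Hypothetical diagnostic record; NOT a type from this project's crates.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    message: String,
}

/// Hypothetical outcome: the text that could be recovered, plus one
/// diagnostic for every item that could not be handled.
#[derive(Debug, Default)]
struct Extraction {
    blocks: Vec<String>,
    diagnostics: Vec<Diagnostic>,
}

impl Extraction {
    /// True when at least one item was skipped or degraded.
    fn is_degraded(&self) -> bool {
        !self.diagnostics.is_empty()
    }
}

/// Toy extractor: each input is `(text, supported)`. Supported items become
/// blocks; unsupported items become diagnostics instead of vanishing.
fn extract(items: &[(&str, bool)]) -> Extraction {
    let mut out = Extraction::default();
    for (index, (text, supported)) in items.iter().enumerate() {
        if *supported {
            out.blocks.push((*text).to_string());
        } else {
            out.diagnostics.push(Diagnostic {
                // Placeholder code invented for this sketch.
                code: "UNSUPPORTED_FEATURE",
                message: format!("item {index} was skipped"),
            });
        }
    }
    out
}

fn main() {
    let result = extract(&[("Invoice total", true), ("<unsupported font>", false)]);
    println!("blocks: {:?}", result.blocks);
    println!("degraded: {}", result.is_degraded());
    for diagnostic in &result.diagnostics {
        println!("{}: {}", diagnostic.code, diagnostic.message);
    }
}
```

A caller can still use the recovered blocks, while the diagnostics make the gap visible to whatever consumes the result.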
## What It Supports Today
- Semantic text diffing from extracted positioned text blocks.
- Text extraction prefers `/ToUnicode` and includes a conservative Base14 Latin
fallback for safe Helvetica, Times, and Courier-family simple-font text.
- Stable JSON, AI review JSON, Markdown, and self-contained HTML reports.
- Report evidence carries semantic roles for extracted blocks, including
candidate headers, footers, page templates, tables, lists, and headings.
- Parser-backed diagnostics and partial results for unsupported or degraded PDF
surfaces.
- CI checks with configured PDF pairs, thresholds, baseline suppression, and
deterministic artifacts.
- Selected document-surface comparisons for images, vector/style signatures,
annotations, links, form fields, outlines, name trees, metadata/XMP, and
embedded-file surfaces.
- Simple tagged-PDF summaries with `/RoleMap`, parent-tree, and MCID-backed
text mapping for controlled cases.
- Optional external OCR through `SPDFDIFF_OCR_COMMAND` or `tesseract` for
supported image-only samples.
Renderer-grade visual diffing, arbitrary table reconstruction, broad tagged-PDF
coverage, and corpus-backed public-alpha compatibility remain incremental work.
## Quickstart
Build the workspace:
```powershell
cargo build --workspace
```
Compare two PDFs:
```powershell
.\target\debug\spdfdiff.exe diff .\old.pdf .\new.pdf --format json --output .\diff.json
```
Run through Cargo without using the built binary directly:
```powershell
cargo run -p spdfdiff_cli -- diff .\old.pdf .\new.pdf --format md
```
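When a comparison needs to be driven from another Rust program, one option is to shell out to the CLI. The following is a minimal sketch and not part of this project: `diff_args` is a hypothetical helper that assembles only the `diff` arguments shown above (`--format` and `--output`), and the example assumes an `spdfdiff` binary is available on `PATH`.

```rust
use std::ffi::OsString;
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: builds the argument list for `spdfdiff diff`
/// using only the flags shown in the Quickstart.
fn diff_args(old: &Path, new: &Path, format: &str, output: Option<&Path>) -> Vec<OsString> {
    let mut args = vec![
        OsString::from("diff"),
        old.as_os_str().to_os_string(),
        new.as_os_str().to_os_string(),
        OsString::from("--format"),
        OsString::from(format),
    ];
    // `--output` is only added when a destination path is given.
    if let Some(path) = output {
        args.push(OsString::from("--output"));
        args.push(path.as_os_str().to_os_string());
    }
    args
}

fn main() {
    let args = diff_args(
        Path::new("old.pdf"),
        Path::new("new.pdf"),
        "json",
        Some(Path::new("diff.json")),
    );

    // Assumption: `spdfdiff` is on PATH. A missing binary is reported
    // instead of causing a panic.
    match Command::new("spdfdiff").args(&args).status() {
        Ok(status) => println!("spdfdiff exited with {status}"),
        Err(err) => eprintln!("could not start spdfdiff: {err}"),
    }
}
```

The sketch only reports the exit status; how individual exit codes should be interpreted is not covered here, so consult the command reference before branching on them.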
## CLI Essentials
```powershell
# Compare two PDFs.
.\target\debug\spdfdiff.exe diff .\old.pdf .\new.pdf --format json
# Run configured CI checks.
.\target\debug\spdfdiff.exe check --config .\.spdfdiff.toml
# Evaluate the committed sample corpus gate.
.\target\debug\spdfdiff.exe corpus .\samples --manifest .\samples\compatibility_corpus_manifest.json --fail-on-gate
# Run the synthetic benchmark smoke gate.
.\target\debug\spdfdiff.exe benchmark --pages 50 --output .\benchmark.json
```
Other CLI commands include `inspect`, `extract`, and `review`. See the
[documentation site](https://eraydin.github.io/semantic-pdf-diff/#cli) for the
full command reference.
## CI
Use `spdfdiff check` with a repository config:
```powershell
.\target\debug\spdfdiff.exe check --config .\.spdfdiff.toml
```
In GitHub Actions:
```yaml
- uses: eraydin/semantic-pdf-diff@main
  with:
    config: .spdfdiff.toml
```
The composite action uses an existing `spdfdiff` on `PATH` when available;
otherwise it installs the CLI from the checked-out action source.
## Reports And Schemas
Report formats:
- `json`: stable machine-readable diff report.
- `ai-json`: compact deterministic review artifact for agent workflows.
- `md`: Markdown summary for code review.
- `html`: self-contained evidence report.
Machine-readable schemas live in [`schemas/`](schemas/), with schema history in
[`schemas/CHANGELOG.md`](schemas/CHANGELOG.md).
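Tooling that wraps the CLI may want to validate a format name before passing it to `--format`. The sketch below is one way to model the four format values listed above in Rust. `ReportFormat` and its methods are hypothetical names for a wrapper-side type and are not taken from this project's crates.

```rust
use std::str::FromStr;

/// Hypothetical wrapper-side enum covering the four documented formats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReportFormat {
    Json,
    AiJson,
    Markdown,
    Html,
}

impl ReportFormat {
    /// All documented formats, in the order the README lists them.
    const ALL: [ReportFormat; 4] = [
        ReportFormat::Json,
        ReportFormat::AiJson,
        ReportFormat::Markdown,
        ReportFormat::Html,
    ];

    /// The string passed to `--format`.
    fn flag_value(self) -> &'static str {
        match self {
            ReportFormat::Json => "json",
            ReportFormat::AiJson => "ai-json",
            ReportFormat::Markdown => "md",
            ReportFormat::Html => "html",
        }
    }
}

impl FromStr for ReportFormat {
    type Err = String;

    /// Accepts exactly the documented values and rejects anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        ReportFormat::ALL
            .into_iter()
            .find(|format| format.flag_value() == s)
            .ok_or_else(|| format!("unknown report format: {s}"))
    }
}

fn main() {
    for format in ReportFormat::ALL {
        println!("--format {}", format.flag_value());
    }
    println!("{:?}", "ai-json".parse::<ReportFormat>());
    println!("{:?}", "pdf".parse::<ReportFormat>());
}
```

Keeping the accepted strings in one place means a typo such as `markdown` is rejected by the wrapper before the CLI is invoked.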
## Crate Map
- `spdfdiff_types`: shared IDs, geometry, provenance, diagnostics, limits, and
report-facing IR.
- `pdf_core`: low-level parser, object graph, streams, xref handling, and parser
diagnostics.
- `pdf_content`, `pdf_text`, `pdf_semantic`: content interpretation, positioned
text extraction, and semantic blocks.
- `diff_core`, `diff_report`, `spdfdiff_cli`: matching, report rendering, and the
public CLI.
Core crates do not use third-party PDF parser or renderer libraries.
## Development Gates
Before considering a code change complete, run:
```powershell
cargo fmt --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
```
For fuzzing-target changes, also run:
```powershell
cargo check --manifest-path fuzz/Cargo.toml --bins
```