{"id":50462769,"url":"https://github.com/eraydin/semantic-pdf-diff","last_synced_at":"2026-06-01T05:30:53.410Z","repository":{"id":359680303,"uuid":"1234055733","full_name":"eraydin/semantic-pdf-diff","owner":"eraydin","description":"semantic-pdf-diff is a Rust CLI and library for producing stable, evidence-preserving semantic diffs for digitally generated PDFs.","archived":false,"fork":false,"pushed_at":"2026-05-22T22:52:47.000Z","size":962,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T00:16:27.256Z","etag":null,"topics":["cli","comparison","content","diff","document","extract","layout","parser","pdf","report","semantic","text"],"latest_commit_sha":null,"homepage":"http://eraydin.net/semantic-pdf-diff/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eraydin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-09T17:34:55.000Z","updated_at":"2026-05-22T22:44:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/eraydin/semantic-pdf-diff","commit_stats":null,"previous_names":["eraydin/semantic-pdf-diff"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/eraydin/semantic-pdf-diff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eraydin%2Fsemantic-pdf-diff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eraydin%2Fsemantic-pdf-diff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eraydin%2Fsemantic-pdf-diff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eraydin%2Fsemantic-pdf-diff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eraydin","download_url":"https://codeload.github.com/eraydin/semantic-pdf-diff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eraydin%2Fsemantic-pdf-diff/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33762215,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","comparison","content","diff","document","extract","layout","parser","pdf","report","semantic","text"],"created_at":"2026-06-01T05:30:48.614Z","updated_at":"2026-06-01T05:30:53.405Z","avatar_url":"https://github.com/eraydin.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# semantic-pdf-diff\n\n`semantic-pdf-diff` is a Rust CLI and library for producing stable,\nevidence-preserving semantic diffs for digitally generated PDFs.\n\nThe CLI binary is `spdfdiff`.\n\nFull documentation lives in [`docs/`](docs/) and is published at\n[eraydin.github.io/semantic-pdf-diff](https://eraydin.github.io/semantic-pdf-diff/).\n\n## Status\n\nProject status: `compatibility-gate`.\n\nThe project is useful for controlled digitally generated PDFs and committed\nsample scenarios. It is not yet a broad public-alpha compatibility claim for\narbitrary real-world PDFs. Unsupported or degraded PDF features should appear as\nstable diagnostics rather than silent success.\n\n## What It Supports Today\n\n- Semantic text diffing from extracted positioned text blocks.\n- Text extraction prefers `/ToUnicode` and includes a conservative Base14 Latin\n  fallback for safe Helvetica, Times, and Courier-family simple-font text.\n- Stable JSON, AI review JSON, Markdown, and self-contained HTML reports.\n- Report evidence carries semantic roles for extracted blocks, including\n  candidate headers, footers, page templates, tables, lists, and headings.\n- Parser-backed diagnostics and partial results for unsupported or degraded PDF\n  surfaces.\n- CI checks with configured PDF pairs, thresholds, baseline suppression, and\n  deterministic artifacts.\n- Selected document-surface comparisons for images, vector/style signatures,\n  annotations, links, form fields, outlines, name trees, metadata/XMP, and\n  embedded-file surfaces.\n- Simple tagged-PDF summaries with `/RoleMap`, parent-tree, and MCID-backed\n  text mapping for controlled cases.\n- Optional external OCR through `SPDFDIFF_OCR_COMMAND` or `tesseract` for\n  supported image-only samples.\n\nRenderer-grade visual diffing, arbitrary table reconstruction, broad tagged-PDF\ncoverage, and corpus-backed public-alpha compatibility remain incremental work.\n\n## Quickstart\n\nBuild the workspace:\n\n```powershell\ncargo build --workspace\n```\n\nCompare two PDFs:\n\n```powershell\n.\\target\\debug\\spdfdiff.exe diff .\\old.pdf .\\new.pdf --format json --output .\\diff.json\n```\n\nRun through Cargo without using the built binary directly:\n\n```powershell\ncargo run -p spdfdiff_cli -- diff .\\old.pdf .\\new.pdf --format md\n```\n\n## CLI Essentials\n\n```powershell\n# Compare two PDFs.\n.\\target\\debug\\spdfdiff.exe diff .\\old.pdf .\\new.pdf --format json\n\n# Run configured CI checks.\n.\\target\\debug\\spdfdiff.exe check --config .\\.spdfdiff.toml\n\n# Evaluate the committed sample corpus gate.\n.\\target\\debug\\spdfdiff.exe corpus .\\samples --manifest .\\samples\\compatibility_corpus_manifest.json --fail-on-gate\n\n# Run the synthetic benchmark smoke gate.\n.\\target\\debug\\spdfdiff.exe benchmark --pages 50 --output .\\benchmark.json\n```\n\nOther CLI commands include `inspect`, `extract`, and `review`. See the\n[documentation site](https://eraydin.github.io/semantic-pdf-diff/#cli) for the\nfull command reference.\n\n## CI\n\nUse `spdfdiff check` with a repository config:\n\n```powershell\n.\\target\\debug\\spdfdiff.exe check --config .\\.spdfdiff.toml\n```\n\nIn GitHub Actions:\n\n```yaml\n- uses: eraydin/semantic-pdf-diff@main\n  with:\n    config: .spdfdiff.toml\n```\n\nThe composite action uses an existing `spdfdiff` on `PATH` when available;\notherwise it installs the CLI from the checked-out action source.\n\n## Reports And Schemas\n\nReport formats:\n\n- `json`: stable machine-readable diff report.\n- `ai-json`: compact deterministic review artifact for agent workflows.\n- `md`: Markdown summary for code review.\n- `html`: self-contained evidence report.\n\nMachine-readable schemas live in [`schemas/`](schemas/), with schema history in\n[`schemas/CHANGELOG.md`](schemas/CHANGELOG.md).\n\n## Crate Map\n\n- `spdfdiff_types`: shared IDs, geometry, provenance, diagnostics, limits, and\n  report-facing IR.\n- `pdf_core`: low-level parser, object graph, streams, xref handling, and parser\n  diagnostics.\n- `pdf_content`, `pdf_text`, `pdf_semantic`: content interpretation, positioned\n  text extraction, and semantic blocks.\n- `diff_core`, `diff_report`, `spdfdiff_cli`: matching, report rendering, and the\n  public CLI.\n\nCore crates do not use third-party PDF parser or renderer libraries.\n\n## Development Gates\n\nBefore considering a code change complete, run:\n\n```powershell\ncargo fmt --check\ncargo clippy --workspace --all-targets -- -D warnings\ncargo test --workspace\n```\n\nFor fuzzing-target changes, also run:\n\n```powershell\ncargo check --manifest-path fuzz/Cargo.toml --bins\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feraydin%2Fsemantic-pdf-diff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feraydin%2Fsemantic-pdf-diff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feraydin%2Fsemantic-pdf-diff/lists"}