{"id":50425472,"url":"https://github.com/pmarreck/deflate_fingerprint","last_synced_at":"2026-05-31T10:03:49.060Z","repository":{"id":359343428,"uuid":"1245262954","full_name":"pmarreck/deflate_fingerprint","owner":"pmarreck","description":"Identify which DEFLATE encoder produced a given compressed byte stream by byte-exact reproduction. MIT, pure Zig core + C FFI.","archived":false,"fork":false,"pushed_at":"2026-05-21T12:27:12.000Z","size":63,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"yolo","last_synced_at":"2026-05-21T20:52:15.430Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pmarreck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-21T04:09:53.000Z","updated_at":"2026-05-21T12:27:16.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pmarreck/deflate_fingerprint","commit_stats":null,"previous_names":["pmarreck/deflate_fingerprint"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pmarreck/deflate_fingerprint","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdeflate_fingerprint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdeflate_fingerprint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdeflate_fingerprint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdeflate_fingerprint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pmarreck","download_url":"https://codeload.github.com/pmarreck/deflate_fingerprint/tar.gz/refs/heads/yolo","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fdeflate_fingerprint/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33726722,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-31T10:03:48.047Z","updated_at":"2026-05-31T10:03:49.054Z","avatar_url":"https://github.com/pmarreck.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# deflate_fingerprint\n\nIdentify and reproduce the DEFLATE encoder configuration that produced a\ncompressed byte stream.\n\nDEFLATE decoding is deterministic; DEFLATE encoding is not. Different encoders,\nlevels, strategies, memory settings, block split heuristics, Huffman builders,\nflush patterns, and wrapper choices can all produce valid but byte-distinct\nstreams for the same uncompressed input. `deflate_fingerprint` is a\nparameterized DEFLATE implementation whose job is to recover those choices and\nreproduce the exact bytes.\n\n## Status\n\nActive reverse-engineering / implementation project.\n\n- 32 fingerprints registered: zlib level 0, zlib levels 1-9 across default /\n  fixed / filtered strategies, collapsed zlib HUFFMAN_ONLY and RLE fingerprints,\n  zlib L1 explicit `SYNC_FLUSH` + empty finish handling, and zlib L6\n  memLevel 6/7/9 pending-buffer variants plus Info-ZIP-style 4096-symbol\n  profitability flushes observed in ZIP-family streams.\n- `./test` currently passes 176 tests: Zig unit tests, CLI integration tests,\n  real-zlib fidelity checks, and the internal corpus hit-rate sweep.\n- Internal project-file corpus: 100% hit rate across 500 generated raw-DEFLATE\n  streams.\n- Real-world ZIP-family probing is underway. Current checkpoint hit rates are\n  recorded in [SESSION_RESUME.md](SESSION_RESUME.md).\n- The first 200 mixed local ZIP-family DEFLATE streams currently reproduce at\n  100.0% exact coverage after adding zlib L6 memLevel=9 and generic\n  Info-ZIP-style 4096-symbol profitability flush profiles.\n- Active research target: large ZIP-family XML streams with non-default flush\n  topology. The core now models these as abstract `DeflateReproductionConfig`\n  values with per-offset `FlushEvent` counts derived from the target stream,\n  parse mode, explicit block token-count plans, raw-end block plans, per-block\n  type choices, and empty fixed marker sequences; `fingerprintConfigured` can\n  recover a byte-exact generic config for observed flush/finish and explicit\n  block-plan streams, while named producer details such as Excel worksheet\n  structure live in probes/tests rather than the main encoder.\n- Planned corpus coverage explicitly includes broader Office/iWork documents,\n  EPUB, PDF FlateDecode streams, PNG IDAT streams, gzip, and outputs generated\n  by zlib, libdeflate, 7-Zip, miniz, Go, .NET, Java, and Apple/CoreFoundation.\n- ZIP-container coverage is intentionally broad: `.zip`, `.jar`, `.war`,\n  `.ear`, `.apk`, `.ipa`, `.whl`, `.xpi`, `.crx`, `.vsix`, `.odt`, `.ods`,\n  `.odp`, `.epub`, `.cbz`, and OOXML files reuse the same generic\n  method=8 entry walker before any format-specific metadata is layered on.\n\n## Why it exists\n\nThe downstream consumer is `../blar`, a deterministic archive format and tool\nthat transparently expands containers such as ZIP, Office Open XML, EPUB, PDF,\nPNG, gzip, and tar before recompressing their meaningful payloads with stronger\ncompression.\n\nFor ZIP-family files, simply inflating each entry and later deflating it again\nis content-preserving but not byte-preserving. That is not good enough for\nusers who need exact restoration of original `.docx`, `.xlsx`, `.epub`, `.jar`,\nor `.zip` files.\n\nThe same byte-identity requirement applies to PNG IDAT data, PDF FlateDecode\nstreams, gzip payloads, iWork packages, and any other embedded RFC 1951 stream.\nContainer syntax, PNG filters, PDF object layout, and wrapper bytes are handled\nby format adapters/tests or upstream consumers; the core responsibility here is\nto identify and reproduce the embedded DEFLATE bytes exactly.\n\nThe intended archive workflow is:\n\n1. During blar archive creation, inflate an embedded DEFLATE stream.\n2. Run `deflate_fingerprint` against `(raw bytes, original compressed bytes)`.\n3. Store the raw bytes under blar's stronger compression, plus the compact\n   fingerprint/config that best reproduces the original stream.\n4. On extract, call this encoder with that fingerprint to regenerate the\n   DEFLATE stream.\n5. If the best reproduction is close but not exact, store a compact\n   DEFLATE-aware correction stream and apply it during restore. This residual\n   must be described at the token/block/Huffman-decision level, not as a\n   naive byte diff of the final packed DEFLATE bytes, because one wrong\n   bit-aligned decision shifts every downstream byte.\n6. For each stream, choose the smaller representation: recompressed raw data\n   plus fingerprint/config/correction, or the original compressed bytes stored\n   as-is.\n\nThe practical goal is to recover storage space from already-compressed formats\nwithout giving up byte-identical reconstruction for the archive audiences that\ncare about it.\n\n## What it does\n\n- Encodes raw input using known DEFLATE behavior profiles.\n- Identifies the first registered profile that reproduces a target stream\n  byte-for-byte.\n- Exposes a Zig API and C FFI suitable for blar and other consumers, including\n  explicit config-driven compression via `dfp_encode_configured`.\n- Provides a CLI for current raw-stream attribution experiments.\n- Includes a ZIP-family corpus probe that extracts raw DEFLATE entries from\n  `.zip`, `.docx`, `.xlsx`, `.pptx`, `.epub`, `.jar`, `.apk`, `.whl`, `.xpi`,\n  `.odt`, `.ods`, `.cbz`, and similar files.\n- Includes a PNG IDAT probe that concatenates IDAT chunks, preserves zlib\n  wrapper metadata, strips to the raw RFC 1951 body, inflates to PNG-filtered\n  bytes, runs the same fingerprint/config path, and reports sanitized\n  aggregate miss features such as 4096-token dynamic blocks and empty fixed\n  markers.\n  A private 25-PNG checkpoint improved from 76.0% to 92.0% exact coverage after\n  adding target-derived block-token plans, explicit parse-mode configs, and a\n  generic small-window filtered candidate. The two remaining row-like misses\n  are now narrowed to parser-choice or correction data: both target and zlib\n  choices are legal and hash-chain-visible under row partial-flush replay.\n- Will add corpus probes for PDF, iWork, gzip, and generator-oracle outputs from\n  external DEFLATE implementations. Those tools may be development-only\n  dependencies supplied by `flake.nix`.\n\nLong-term, this should become a general, highly configurable DEFLATE\nimplementation:\n\n- \"Extract\"/identify path: return the best-guess fingerprint/config for an\n  observed compressed stream, plus confidence and optional correction data.\n- Compress/reproduce path: accept an explicit fingerprint/config and emit the\n  corresponding DEFLATE bytes, applying correction data when exact\n  reproduction cannot be expressed by config alone.\n- Default path: provide a sensible default encoder config, but keep\n  reproduction driven by explicit configuration.\n\nThe main research loop is corpus-driven: produce outputs from known encoder\nimplementations, harvest embedded DEFLATE streams from real files, classify\ntheir observed block/flush/token behavior, then promote only byte-exact generic\nreproductions into the fingerprint/config registry.\n\nFor zlib-like and Info-ZIP-like streams, the intended correction payload is\nusually empty: the fingerprint/config alone should reproduce the target. For\noptimal or combinatorial encoders such as zopfli, kzip, and some 7-Zip modes,\nthe parse can be structurally different from any zlib-heuristic prediction. In\nthose cases, exact restoration is still possible, but economics must be decided\nper stream by comparing the corrected representation against storing the\noriginal DEFLATE blob.\n\nCorpus data is split into committed public fixtures and gitignored\nlocal/private corpora. See `docs/CORPUS_WORKFLOW.md` before sampling from a NAS\nor promoting a reproduction fixture.\n\n## Current CLI\n\n```bash\ndeflate-fingerprint identify --raw RAW --target TARGET\ndeflate-fingerprint identify --json --raw RAW --target TARGET\ndeflate-fingerprint --about\ndeflate-fingerprint --help\n```\n\nPlanned CLI surface:\n\n```bash\ndeflate-fingerprint reproduce ID --raw RAW [--out FILE]\ndeflate-fingerprint list\n```\n\n## Build\n\nRequires [Nix](https://nixos.org/) with flakes enabled.\n\nUse the top-level scripts:\n\n```bash\n./build          # ReleaseFast build via nix build\n./build --debug  # debug build\n./test           # full test suite\n./bm             # benchmarks, once implemented\n```\n\nOn this project, native Zig builds should go through the top-level scripts.\nThe scripts avoid host macOS / Zig libSystem stub mismatches documented in\n`AGENTS.md`.\n\n## Key documents\n\n- [SESSION_RESUME.md](SESSION_RESUME.md) - live checkpoint for the current\n  investigation\n- [GOALS.md](GOALS.md) - mission, scope, success criteria, audiences\n- [DESIGN.md](DESIGN.md) - architectural intent, algorithm, module breakdown\n- [PLAN.md](PLAN.md) - phased roadmap and current work items\n- [docs/PRIOR_ART.md](docs/PRIOR_ART.md) - prior-art notes for\n  precomp/preflate/preflate-rs/grittibanzli/reflate and how their ideas map to\n  this project\n- [docs/ENCODER_NOTES.md](docs/ENCODER_NOTES.md) - empirical encoder findings\n- [docs/DEFLATE_DIALS.md](docs/DEFLATE_DIALS.md) - enumerated DEFLATE choices\n- [docs/V0.1_STATUS.md](docs/V0.1_STATUS.md) - current v0.1 implementation status\n\n## License\n\nMIT. Foundational, fully open infrastructure, like BLIP and blar. The\ncommercial value is downstream; this library benefits the broader archive,\nforensics, and reproducible-build ecosystems.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fdeflate_fingerprint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmarreck%2Fdeflate_fingerprint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fdeflate_fingerprint/lists"}