{"id":47247904,"url":"https://github.com/franchoy/coldkeep","last_synced_at":"2026-06-06T08:01:52.933Z","repository":{"id":341998287,"uuid":"1158711822","full_name":"franchoy/coldkeep","owner":"franchoy","description":"coldkeep is an experimental local-first content-addressed file storage engine with verifiable integrity written in Go. Files are split into content-addressed chunks, packed into container files on disk, and tracked through PostgreSQL metadata to attempt deduplication","archived":false,"fork":false,"pushed_at":"2026-03-29T20:01:08.000Z","size":14659,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-29T20:23:53.853Z","etag":null,"topics":["backup","cold-storage","content-defined-chunking","deduplicate","go","research-project","storage-engines"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/franchoy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-15T19:58:15.000Z","updated_at":"2026-03-28T10:29:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/franchoy/coldkeep","commit_stats":null,"previous_names":["franchoy/coldkeep"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/franchoy/coldkeep","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/franchoy%2Fcoldkeep","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/franchoy%2Fcoldkeep/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/franchoy%2Fcoldkeep/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/franchoy%2Fcoldkeep/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/franchoy","download_url":"https://codeload.github.com/franchoy/coldkeep/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/franchoy%2Fcoldkeep/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31290943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T13:12:26.723Z","status":"ssl_error","status_checked_at":"2026-04-01T13:12:25.102Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["backup","cold-storage","content-defined-chunking","deduplicate","go","research-project","storage-engines"],"created_at":"2026-03-14T09:27:27.024Z","updated_at":"2026-06-06T08:01:52.919Z","avatar_url":"https://github.com/franchoy.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# coldkeep\n\n![Coldkeep Logo](assets/logo/coldkeep-logo.png)\n\nCorrectness-first cold storage engine\n\n• Content-addressed • Built-in deduplication • Deterministic restore\n\n• Verifiable 
integrity • Crash-safe • GC-safe\n\n## Branding\n\nColdkeep uses a visual identity based on an ice cube vault:\n\n- 🧊 cold storage (ice cube)\n- 🔒 secure data (vault door)\n- 🗄️ structured containers (internal shelves)\n\n## Project Status\n\n![CI](https://github.com/franchoy/coldkeep/actions/workflows/ci.yml/badge.svg)\n![Go Version](https://img.shields.io/badge/go-1.25+-blue)\n![License](https://img.shields.io/badge/license-Apache%202.0-blue)\n![Status](https://img.shields.io/badge/status-v1.12%20planning%20%2F%20engine--catalog%20migration-blue)\n![Release](https://img.shields.io/github/v/release/franchoy/coldkeep?include_prereleases)\n\n\u003e Status: v1.9 formalizes transform-based storage semantics (logical/compressed/physical layers) with block-level compression and explicit staged verification, while preserving deterministic restore, GC safety, snapshot semantics, and mixed-repository compatibility.\n\u003e Migration note (v1.9): existing v1.7/v1.8 payloads remain readable through compatibility paths with no forced rewrite or recompression. Missing PostgreSQL schema requires manual schema application or `COLDKEEP_DB_AUTO_BOOTSTRAP=true`. Existing older schemas are auto-upgraded to the required v15 schema at startup.\n\n## Current release state\n\nColdkeep v1.11 introduced the behavior-preserving engine facade.\n\nThe current development focus is v1.12:\n\n- migrate business orchestration into engine entry points;\n- introduce catalog/metadata facade boundaries;\n- prepare the SQLite-first local catalog direction;\n- preserve PostgreSQL compatibility;\n- keep CLI behavior, JSON output, exit codes, and repository/storage formats stable.\n\nThe v1.12 migration must preserve existing behavior first and prove parity before lifting logic. 
No\ncommand is routed through the engine unless its request/result contract can represent the existing\ncommand behavior.\n\ncoldkeep is a local-first content-addressed storage engine focused on deterministic restore,\nexplicit integrity verification, and safe lifecycle behavior under failure scenarios.\n\nNow with snapshot lineage, diff summaries, and safe deletion insights.\n\n## Why coldkeep?\n\ncoldkeep is designed for correctness-first cold storage.\n\nUnlike traditional backup tools, it emphasizes:\n\n- deterministic, byte-identical restore\n- content-addressed deduplication\n- explicit, test-backed integrity checks\n- safe recovery and reference-safe garbage collection\n- machine-readable CLI behavior suitable for automation\n\nThe goal is confidence and recoverability over maximum throughput.\n\nv1.7 performance work followed the existing execution model: bounded worker-based commands under explicit safety constraints, without turning coldkeep into a fully concurrent daemon or changing on-disk format, chunk layout, or operator-visible schema compatibility. v1.8 introduced packed multi-chunk storage blocks and completed AES-GCM packed-block integration. 
v1.9 builds on that foundation with formal transform-aware storage semantics, block-level compression, and explicit verification stages while preserving restore determinism, snapshot semantics, and GC safety.\n\n## Features\n\n- Snapshot lineage (`--from`)\n- Snapshot diff summaries\n- Snapshot tree visualization\n- Safe deletion preview (`--dry-run`)\n- Read-only observability (`stats`, `inspect`)\n- Exact GC simulation with trace support\n- Built-in deduplication\n- Deterministic restore\n\n## Status\n\nColdkeep has ten explicit correctness layers:\n\n- v1.0: storage correctness (restore determinism, integrity, recovery, GC safety)\n- v1.1: interaction correctness (CLI orchestration, machine-readable contracts, batch semantics)\n- v1.2: physical-file graph coherence, explicit repair semantics, audited GC refusal, and invariant-aware batch maintenance reporting\n- v1.3: snapshot-based retention as a correctness layer (immutable point-in-time captures, snapshot-protected GC, reachability audits)\n- v1.4: snapshot clarity and lifecycle hardening (explicit lineage semantics, safer dry-run wording, stricter pre-release verification guidance)\n- v1.5: chunker-evolution compatibility contract clarity (mixed-version repositories, explicit new-writes-only chunker policy)\n- v1.6: observability and simulation contract hardening (read-only introspection, exact GC simulation parity, trace channel behavior)\n- v1.7: controlled-execution performance validation (benchmarking, deterministic comparison, and release-readiness safety proof without storage-format or schema-breaking change)\n- v1.8: packed block abstraction and AES-GCM packed-block integration (multi-chunk storage blocks, dual-compat read path, locked block-size defaults, configurable operator override, release hardening)\n- v1.9: transform-based storage architecture freeze (block-level compression, logical/compressed/physical hash semantics, metadata-driven read path, and explicit staged verification)\n\nGuarantees 
are enforced through automated validation and CI gates; see [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md) for guarantee-to-evidence mapping.\n\nIf you are new to the project, start here, then continue to [ARCHITECTURE.md](ARCHITECTURE.md) for the internal model and [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md) for the guarantee-to-evidence map.\n\n## v1.9 Storage Contract\n\n- v1.9 keeps packed storage blocks as the default write path for new data.\n- The default packed block size is 1 MiB.\n- `COLDKEEP_BLOCK_TARGET_SIZE_MB` exists as an advanced operator tuning override for new writes only. Valid values for v1.9: `1`, `2`, `3` (MiB). Other values log a warning and use the locked default. This override is retained for benchmarking and specialized operator tuning; production deployments should use the default.\n- `COLDKEEP_PACKED_BLOCK_SIZE_MIB` is a legacy fallback environment variable checked only if `COLDKEEP_BLOCK_TARGET_SIZE_MB` is not set. It is accepted for backward compatibility; new configurations should use `COLDKEEP_BLOCK_TARGET_SIZE_MB`.\n- v1.9 reads existing v1.7/v1.8 repositories without rewriting historical data.\n- v1.9 writes packed blocks for new data through `storage_blocks` and `chunk_block_refs`.\n- Mixed repositories containing legacy v1.7/v1.8 data and new v1.9 compressed/encrypted blocks are valid steady-state.\n- v1.7 is not guaranteed to read repositories that contain v1.8/v1.9 packed-block data.\n- Both `plain` and `aes-gcm` codec settings work end-to-end with packed writes. When `COLDKEEP_CODEC=aes-gcm`, the full encoded block is AES-GCM encrypted and `storage_blocks.codec` is set to `\"aes-gcm\"`; stored bytes are a 12-byte nonce prefix followed by the ciphertext. When `COLDKEEP_CODEC=plain`, `storage_blocks.codec` is `\"none\"` and stored bytes are the plaintext encoded block. 
The read path (`StorageBlockReader`) handles both layouts transparently using per-block metadata.\n- Compression settings (`none` / `zstd`) affect future writes only and never rewrite historical blocks.\n\n## Compression and Integrity Contract (Pre-v1.10 Freeze)\n\nCompression behavior:\n\n- Compression is block-level.\n- Compression happens before encryption.\n- Compression configuration affects only newly written blocks.\n- Existing blocks are never recompressed automatically.\n- Reads and verify use per-block metadata, so mixed repositories (legacy + new transform metadata) are valid steady-state.\n- Compression is store-if-smaller: some zstd-configured blocks are intentionally stored uncompressed when compression would expand payload size.\n- Compression does not change dedup identity; dedup remains anchored to logical block content.\n\nIntegrity checkpoints:\n\n- `logical_hash` (`block_hash`) verifies decoded logical block content.\n- `payload_hash` is a deprecated lowercase-hex mirror of `block_hash` retained for compatibility/observability only.\n- `compressed_hash` verifies pre-encryption compressed payload.\n- `physical_hash` verifies exact persisted bytes in container storage.\n\n## Core Guarantees\n\n### Summary\n\n- deterministic, byte-identical restore\n- no exposure of partially written or inconsistent data\n- GC is reference-safe: no reachable chunk is ever deleted\n- Atomic restore replacement (within single-node local filesystem semantics)\n- Safe in-process concurrent storage operations\n\n### Core invariants\n\nGuarantee IDs are stable and tracked in [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md):\n\n- G1: deterministic, byte-identical restore\n- G2: repeat store does not drift chunk graph\n- G3: no exposure of partially written or inconsistent data\n- G4: GC is reference-safe (no reachable chunk is deleted)\n- G5: atomic restore replacement (single-node local filesystem semantics)\n- G6: safe in-process concurrent storage operations\n- G7: deep 
corruption detection (payload/offset/tail)\n- G8: corrective health gate contract stability\n- G9: deterministic batch CLI orchestration and automation-safe contract behavior\n- G10: current-state physical mapping graph coherence is audited in standard verify\n- G11: GC executes only on an audited coherent physical-root graph\n- G12: invariant failures expose stable machine-readable classification and operator guidance\n- G13: batch maintenance commands expose deterministic execution semantics and invariant-aware per-item reporting\n- G14: snapshot-retained content is GC-safe and protected by liveness union (current + snapshot roots)\n- G15: snapshot deletion only changes metadata and future GC eligibility (content preserved)\n- G16: stats expose snapshot-retention pressure to operators (retained-only-by-current, retained-only-by-snapshot, shared)\n- G17: verify and doctor audit persisted snapshot reachability integrity and report retention context\n\nDefinitions and evidence mapping for G1-G17 are tracked in [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md).\n\nDocumentation is split into:\n\n- [README.md](README.md) for overview, quickstart, and CLI usage\n- [ARCHITECTURE.md](ARCHITECTURE.md) for the internal model, invariants, lifecycle, and trust boundary\n- [COMPATIBILITY.md](COMPATIBILITY.md) for version-compatibility, chunker-evolution contract, and explicit non-guarantees\n- [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md) for guarantee-to-evidence mapping\n- [CONTRIBUTING.md](CONTRIBUTING.md) for contributor workflow, local CI guidance, and stats benchmark commands for observability-sensitive changes\n- [PRE_RELEASE_CHECKLIST.md](PRE_RELEASE_CHECKLIST.md) for release-gate execution\n- [SECURITY.md](SECURITY.md) for the threat model and security limits\n- [docs/internal/storage_compatibility_matrix.md](docs/internal/storage_compatibility_matrix.md) for the formal storage compatibility matrix and benchmark scope split\n- [docs/PATH_IDENTITY.md](docs/PATH_IDENTITY.md) 
for current-state path identity policy\n- [CHANGELOG.md](CHANGELOG.md) for milestone history\n\nFor the deeper model (invariants, lifecycle, validity, recovery, trust boundary), see [ARCHITECTURE.md](ARCHITECTURE.md).\n\n## Chunking at a Glance\n\ncoldkeep uses content-defined chunking (CDC).\n\n- chunk boundaries depend on data patterns (not fixed-size windows),\n- different chunker versions can choose different boundary strategies,\n- stored state is a chunked reconstruction recipe (`file_chunk -\u003e chunk -\u003e blocks`), not a raw whole-file blob.\n\nExample:\n\n```text\nFile A (v1):\n  [chunk1][chunk2][chunk3]\n\nFile B (v2):\n  [chunk4][chunk5]\n```\n\nEven with overlapping content, layout can differ across chunker versions.\n\n## Chunker Versions\n\n- each committed logical file stores `chunker_version` metadata,\n- one repository can contain multiple chunker versions,\n- chunker version is selected at store time,\n- fresh v1.5+ repositories default new writes to `v2-fastcdc`,\n- upgraded repositories preserve prior write default (`v1-simple-rolling` unless explicitly changed),\n- chunks may be reused across chunker versions if their content is identical,\n- cross-version reuse is opportunistic and not guaranteed for efficiency ratios,\n- `chunker_version` on chunk rows is origin metadata, not a reuse constraint,\n- restore is recipe-driven and does not depend on the active write chunker.\n\nConfigure repository write default:\n\n```bash\ncoldkeep config set default-chunker \u003cversion\u003e\n```\n\nThis affects new writes only and does not rewrite existing data.\n\n## Safety Guarantees (High-Level)\n\n- restore correctness: stored files restore byte-identically,\n- snapshot stability: snapshots remain valid across upgrades,\n- non-destructive evolution: no automatic background re-chunking or silent rewrite,\n- forward-compatible metadata: unknown but well-formed future chunker labels do not block restore.\n\nFor full guarantees, non-guarantees, and 
upgrade behavior details:\n\n- [COMPATIBILITY.md](COMPATIBILITY.md)\n- [ARCHITECTURE.md](ARCHITECTURE.md)\n\nLegacy compatibility contract (v1.9):\n\n- mandatory: old repositories remain readable/restorable\n- not guaranteed: automatic rewrite, recompression, or eager migration of historical data\n\n## When to use coldkeep\n\nGood fit:\n\n- cold/backup storage where correctness matters more than speed\n- environments needing explicit integrity verification\n- deduplication + deterministic restore use cases\n\nNot a fit (v1.x scope):\n\n- hot-path high-throughput storage\n- distributed/multi-node coordination\n\n## Quickstart\n\nA small samples directory is included for local testing.\n\nIf you only want the fastest successful first run, use the Local (no Docker)\npath below, then come back to the later sections as needed.\n\n### Local (no Docker)\n\n```bash\n# 1) Initialize key material (.env)\ncoldkeep init\n\n# 2) Load environment\nexport $(cat .env | xargs)\n\n# 3) Configure local PostgreSQL connection (required for local mode)\nexport DB_HOST=127.0.0.1\nexport DB_PORT=5432\nexport DB_USER=coldkeep\nexport DB_PASSWORD=coldkeep\nexport DB_NAME=coldkeep\nexport DB_SSLMODE=disable\nexport COLDKEEP_DB_AUTO_BOOTSTRAP=true\n\n# 4) Store and inspect\ncoldkeep store samples/hello.txt\ncoldkeep stats\n\n# 5) Restore + verify\n# restore expects file ID(s), not source filename\ncoldkeep restore 1 ./restored\ncoldkeep verify system --standard\n```\n\nSecurity note: if the encryption key is lost, encrypted data cannot be recovered.\n\nCommand form tips:\n\n- `restore` expects logical file IDs (`coldkeep restore \u003cfileID\u003e \u003coutputDir\u003e`); use `--stored-path` if you want path-based restore.\n- `verify` expects a target: `coldkeep verify system ...` or `coldkeep verify file \u003cfileID\u003e ...`.\n\n### Docker\n\n```bash\n# 1) Start services\ndocker compose up -d --build\n\n# 2) Initialize key material on host-mounted workspace\ndocker compose run --rm -v 
\"$PWD:/app\" coldkeep init\n\n# 3) Store a sample file\ndocker compose run --rm \\\n  --env-file .env \\\n  -v \"$PWD/samples:/samples\" \\\n  coldkeep store /samples/hello.txt\n```\n\n## Smoke Validation (Two Approaches)\n\nIf you are preparing a PR, run the smoke gate (`scripts/smoke.sh`) with either\nworkflow below. Both are valid and both are used by contributors.\n\nPR author tip: use the PR template at [`.github/pull_request_template.md`](.github/pull_request_template.md)\nto summarize invariants and lifecycle-semantics impact for reviewers.\nFor a contributor-oriented local CI path before that, see [CONTRIBUTING.md](CONTRIBUTING.md).\nIf your change touches `coldkeep stats` or stats query shape, the same guide also includes a short stats benchmarking section with small/medium/large benchmark commands.\n\n### Approach A: Docker runner\n\nUse the `coldkeep` service container to run the smoke script.\n\n```bash\n# 1) Ensure PostgreSQL service is up\ndocker compose up -d coldkeep_postgres\n\n# 2) Load encryption env from .env generated by coldkeep init\nset -a\nsource .env\nset +a\n\n# 3) Run smoke inside the coldkeep container\ndocker compose run --rm \\\n  -e COLDKEEP_KEY=\"$COLDKEEP_KEY\" \\\n  -e COLDKEEP_CODEC=\"$COLDKEEP_CODEC\" \\\n  -v \"$PWD/samples:/samples:ro\" \\\n  --entrypoint sh coldkeep \\\n  -lc 'apk add --no-cache jq \u003e/dev/null \u0026\u0026 COLDKEEP_SAMPLES_DIR=/samples scripts/smoke.sh'\n```\n\n### Approach B: Host runner\n\nRun the smoke script on host with a local binary, pointing to Docker PostgreSQL.\n\n```bash\n# 1) Ensure PostgreSQL service is up\ndocker compose up -d coldkeep_postgres\n\n# 2) Build coldkeep locally and load encryption env\ngo build -o coldkeep ./cmd/coldkeep\nset -a\nsource .env\nset +a\n\n# 3) Run smoke from host against Docker PostgreSQL\nDB_HOST=127.0.0.1 \\\nDB_PORT=5432 \\\nDB_USER=coldkeep \\\nDB_PASSWORD=coldkeep \\\nDB_NAME=coldkeep \\\nDB_SSLMODE=disable \\\nPATH=\"$PWD:$PATH\" \\\n./scripts/smoke.sh\n\n# 
4) Optional cleanup of local binary\nrm -f coldkeep\n```\n\nNotes:\n\n- `scripts/smoke.sh` requires `jq` and `coldkeep` on PATH in the execution environment.\n- Containerized simulate checks may print a non-fatal warning about sqlite/cgo stubs; smoke continues unless `COLDKEEP_SMOKE_STRICT_SIMULATE=1` is set.\n\n## CLI Basics\n\nTypical flows:\n\n```bash\ncoldkeep store file.txt\ncoldkeep store-folder ./data\ncoldkeep restore 12 ./out\ncoldkeep restore --stored-path docs/report.txt --destination ./out/report.txt --mode override\ncoldkeep remove 12\ncoldkeep gc\ncoldkeep stats\ncoldkeep list\ncoldkeep search report\ncoldkeep verify system --standard\ncoldkeep doctor\n```\n\nSimulation (no physical writes):\n\n```bash\ncoldkeep simulate store-folder ./data\ncoldkeep simulate store file.txt --output json\n```\n\nObservability and GC simulation (read-only):\n\n```bash\ncoldkeep stats\ncoldkeep stats --json\n\ncoldkeep inspect \u003centity\u003e \u003cid\u003e\ncoldkeep inspect \u003centity\u003e \u003cid\u003e --relations\ncoldkeep inspect \u003centity\u003e \u003cid\u003e --reverse\ncoldkeep inspect \u003centity\u003e \u003cid\u003e --deep --limit N\n\ncoldkeep simulate gc\ncoldkeep simulate gc --delete-snapshot \u003cid\u003e\ncoldkeep simulate gc --containers\n\n# trace diagnostics are emitted on stderr\ncoldkeep stats --trace\ncoldkeep inspect chunk \u003cid\u003e --trace-json\ncoldkeep simulate gc --trace-json\n```\n\nSupported inspect entities currently include: `file` (alias: `logical-file`), `chunk`, `container`, and `snapshot`.\n\nObservability command guarantees (v1.6):\n\n- `stats`, `inspect`, and `simulate gc` are read-only command surfaces.\n- `simulate gc` is an exact simulation of GC reclaimability under the same integrity gates.\n- `simulate gc` previews exact GC reclaimability using the shared GC planning layer (`gc.BuildPlan`), including fully-dead active containers; it is not legacy `gc --dry-run` behavior.\n- GC simulation does not mutate repository 
state (no database writes and no filesystem writes).\n- JSON output is intended for tooling/automation contracts.\n- `meta.version` is the CLI JSON contract version. It remains `v1.7` for additive, backward-compatible fields (including v1.8/v1.9 `stats.block_layout` additions) and only bumps on breaking JSON contract changes.\n- Deep inspect output can be large; use `--limit N` to bound traversal output for operators and CI.\n- `--trace` and `--trace-json` are diagnostics channels; traces are emitted to stderr so stdout data remains stable for piping.\n- v1.8/v1.9 `stats` includes block-layout observability for packed storage: `storage_blocks_count`, `chunk_block_refs_count`, `avg_chunks_per_block`, `avg_block_plaintext_size`, `avg_block_stored_size`, `avg_block_fill_ratio`, `legacy_block_count`, `packed_block_count`, and `codec_distribution` when packed blocks are present.\n\nOperator-facing v1.9 delta for common commands:\n\n- `coldkeep store`, `restore`, `verify system --standard`, `gc --dry-run`, `gc`, `stats --json`, and `inspect` keep their existing invocation shape; v1.9 does not add new required flags to these commands.\n- `stats` may include packed-block metrics in human and JSON output.\n- `verify` may surface packed-block integrity categories such as packed block hash or metadata corruption.\n- Block abstraction is documented, but remains a compatibility-layer change rather than a new mandatory operator workflow.\n\nChunker benchmark and interpretation:\n\n```bash\ncoldkeep benchmark chunkers --output json\ncoldkeep benchmark run --dataset small --repeat 1 --output json\nscripts/run_phase8_blocksize_matrix.sh --list-missing\n```\n\nv1.9 supports both CLI and scripted benchmark workflows.\n\n- Use `coldkeep benchmark chunkers` and `coldkeep benchmark run` for operator-facing repeatable local measurements.\n- Use `scripts/run_phase8_*.sh` and `scripts/compare_phase8_*.py` for release matrix orchestration and historical comparison workflows.\n\nTypical 
outcomes to expect (informational ranges):\n\n- Small modifications:\n  v1: ~92-96% reuse\n  v2: ~94-98% reuse\n- Shifted data:\n  v1: ~5-20% reuse\n  v2: ~25-50% reuse\n\nInterpretation note: the shifted-data reuse gap is the main justification signal for v2 FastCDC boundary stability improvements.\nCritical insight: this indicates FastCDC improves not only dedup ratio, but dedup stability over time under boundary-shifting changes.\n\nCommon mistakes to avoid:\n\n- Do not assert exact chunk counts; implementations can vary slightly while preserving correctness.\n- Do not use non-deterministic input data; keep all generated data seed-driven for CI reliability.\n- Do not ignore shifted-data comparisons; this is the most important stability signal.\n- Do not overcomplicate metrics; keep interpretation focused on reuse percentage, chunk count, and coverage invariants.\n\n## Batch Operations (v1.2)\n\nBatch restore/remove/repair extends the automation contract with deterministic orchestration and invariant-aware reporting.\n\n```bash\ncoldkeep restore 12 18 24 ./out\ncoldkeep remove 12 18 24\ncoldkeep remove --input ids.txt\ncoldkeep remove --stored-paths /data/a.txt /data/b.txt --input paths.txt\ncoldkeep repair ref-counts --batch\ncoldkeep repair --batch --input repair_targets.txt\ncoldkeep restore 12 18 ./out --dry-run\n```\n\nCurrent `repair --batch` scope is target-oriented, not item-oriented:\n\n- today the only supported target is `ref-counts`\n- input files for `repair --batch --input \u003cfile\u003e` currently contain repeated target names such as `ref-counts`\n- they do not contain file IDs or stored paths\n\nSemantics (summary):\n\n- per-item isolation by default\n- optional fail-fast for execution failures\n- duplicate target skipping\n- deterministic per-item report ordering\n- JSON status values are intentionally two-layered:\n  - overall payload status: ok, partial_failure, error\n  - per-item result status: success, failed, skipped, planned\n- JSON 
execution mode is explicit: `continue_on_error` (default) or `fail_fast`\n- process exit is automation-friendly:\n  - 0 when no item fails\n  - 1 when one or more items fail\n  - 2 for pre-execution validation/usage failures (including empty effective target sets after parsing input)\n\nExample JSON payload:\n\n```json\n{\n  \"status\": \"partial_failure\",\n  \"operation\": \"repair\",\n  \"dry_run\": false,\n  \"execution_mode\": \"continue_on_error\",\n  \"summary\": {\n    \"total\": 2,\n    \"succeeded\": 1,\n    \"failed\": 1,\n    \"skipped\": 0\n  },\n  \"results\": [\n    {\n      \"id\": \"ref-counts\",\n      \"status\": \"success\",\n      \"message\": \"logical_file ref_count values repaired\"\n    },\n    {\n      \"id\": \"ref-counts\",\n      \"status\": \"failed\",\n      \"message\": \"repair refused: orphan physical_file rows detected\",\n      \"invariant_code\": \"REPAIR_REFUSED_ORPHAN_ROWS\",\n      \"recommended_action\": \"Remove or correct orphan physical_file rows before retrying repair.\"\n    }\n  ]\n}\n```\n\nFor full batch contract details and examples, see [ARCHITECTURE.md](ARCHITECTURE.md) and [PRE_RELEASE_CHECKLIST.md](PRE_RELEASE_CHECKLIST.md).\n\n## Snapshot Layer (v1.4)\n\ncoldkeep snapshots capture an immutable, point-in-time view of your stored files.\n\nSnapshots capture a complete, immutable view of the current system state.\nEven when using `--from`, snapshots are always fully self-contained and do not depend on their parent.\n\nCritical clarity:\n\n- Snapshots are always self-contained.\n- `--from` records lineage metadata for analysis only.\n- `--from` does not create dependencies.\n- A child snapshot restore never requires reading parent snapshot content.\n\n### Creating snapshots\n\nv1.4 flow example:\n\n```bash\n# Create initial snapshot\ncoldkeep snapshot create --id day1\n\n# Modify files...\n\n# Create snapshot with lineage\ncoldkeep snapshot create --id day2 --from day1\n\n# Understand changes\ncoldkeep snapshot 
diff day1 day2 --summary\n\n# Inspect snapshot reuse\ncoldkeep snapshot stats day2\n\n# Visualize history\ncoldkeep snapshot list --tree\n\n# Preview deletion\ncoldkeep snapshot delete day1 --dry-run\n```\n\n```bash\n# Full snapshot (all physical_file entries)\ncoldkeep snapshot create\n\n# Full snapshot with lineage metadata\ncoldkeep snapshot create --id day2 --from day1\n\n# Partial snapshot (exact paths and/or directory prefixes)\ncoldkeep snapshot create docs/ report.txt --label release-2026-04\n```\n\n- `--id \u003csnapshotID\u003e`: snapshot_id system identifier. This is the command target for `show`, `restore`, `stats`, `diff`, and `delete`.\n- `--label \u003cstring\u003e`: optional user-facing metadata only. It is not an identifier and is never used for command targeting.\n- `--from \u003csnapshotID\u003e`: optional parent snapshot lineage metadata on create. This is informational only and does not create any parent-content dependency during create or restore.\n\n`--from \u003csnapshotID\u003e` behavior:\n\n- snapshot recorded as derived from parent\n- does not create a dependency\n- snapshot content is still built from current system state\n- parent relationship is used for comparison and visualization only\n\nCurrent lineage scope policy:\n\n- `--from` is currently supported only for full snapshots.\n- Parent snapshot referenced by `--from` must also be full.\n- Filtered parent/child lineage for partial snapshots is intentionally rejected in this phase.\n\nSnapshot command targeting contract:\n\n- There is no `--snapshot` selector flag for snapshot subcommands.\n- Pass snapshot_id positionally (for example: `coldkeep snapshot restore \u003csnapshotID\u003e`).\n\n### Listing and inspecting\n\n```bash\ncoldkeep snapshot list\ncoldkeep snapshot list --type full --limit 10 --since 2026-01-01\ncoldkeep snapshot list --tree\ncoldkeep snapshot show snap-abc123\ncoldkeep snapshot show snap-abc123 --limit 50\ncoldkeep snapshot show snap-abc123 --prefix 
docs/\ncoldkeep snapshot show snap-abc123 --pattern \"docs/*.txt\" --min-size 1024\ncoldkeep snapshot stats\ncoldkeep snapshot stats snap-abc123\n```\n\n`snapshot list --tree` renders a lineage view from snapshot metadata (`id`, `parent_id`, `created_at`).\nIf a parent snapshot was deleted, affected snapshots are still shown as roots; snapshot usability is unchanged.\nLineage visualization is not a dependency graph for restore execution.\nThe snapshot tree represents lineage metadata, not dependency.\n\nConceptual lineage example:\n\n```text\nday1\n └── day2\n  └── day3\n```\n\nEach snapshot is independent despite this structure.\n\n`snapshot list --tree`:\n\n- displays snapshots as a lineage tree based on parent relationships\n- reflects metadata lineage only (not restore dependency)\n\n`snapshot stats` lineage context:\n\n- when a parent snapshot is available, stats include reused files, new files, and reuse ratio\n- if the parent snapshot is missing, stats fall back gracefully with explanatory output\n\nSnapshot file queries are reusable across `snapshot show`, `snapshot restore`, and `snapshot diff`.\n\nSupported query flags:\n\n- `--path \u003cexact\u003e`: exact normalized snapshot path match; repeatable\n- `--prefix \u003cdir/\u003e`: normalized directory prefix match; repeatable and must end with `/`\n- `--pattern \u003cglob\u003e`: slash-path glob (`path.Match`) against the normalized snapshot path\n- `--regex \u003cre\u003e`: regular expression against the snapshot path\n- `--min-size \u003cbytes\u003e` / `--max-size \u003cbytes\u003e`: inclusive logical size range\n- `--modified-after \u003cRFC3339|YYYY-MM-DD\u003e` / `--modified-before \u003cRFC3339|YYYY-MM-DD\u003e`: inclusive mtime window\n\nAll active criteria are ANDed together. 
Path and prefix inputs are normalized before matching, and result ordering remains deterministic.\n\n### Restoring from a snapshot\n\n```bash\n# Restore all files to their original paths\ncoldkeep snapshot restore snap-abc123\n\n# Restore a subdirectory under a new prefix\ncoldkeep snapshot restore snap-abc123 docs/ --mode prefix --destination ./restored\n\n# Restore a single file to an explicit destination\ncoldkeep snapshot restore snap-abc123 docs/report.txt --mode override --destination ./out/report.txt\n\n# Restore only matching files from the snapshot query layer\ncoldkeep snapshot restore snap-abc123 --prefix docs/ --pattern \"docs/*.txt\" --mode prefix --destination ./restored\n```\n\n### Diffing two snapshots\n\n`snapshot diff` compares two snapshots by path and logical file identity, classifying each change as `added`, `removed`, or `modified`.\nWhen query filters include size or mtime constraints, diff evaluates `added` and `modified` entries against target-snapshot metadata, and `removed` entries against base-snapshot metadata.\nA file is considered modified if its content changes, even when the path stays the same.\n\n```bash\n# Show all changes between two snapshots\ncoldkeep snapshot diff snap-1 snap-2\n\n# Show only added files\ncoldkeep snapshot diff snap-1 snap-2 --filter added\n\n# Restrict the diff view to a path subset\ncoldkeep snapshot diff snap-1 snap-2 --prefix docs/\n\n# Return summary counts only (no per-entry list)\ncoldkeep snapshot diff snap-1 snap-2 --summary\n\n# Combine diff classification with snapshot query filters\ncoldkeep snapshot diff snap-1 snap-2 --filter modified --regex \"\\\\.yaml$\"\n\n# Machine-readable JSON output\ncoldkeep snapshot diff snap-1 snap-2 --output json\n```\n\nText output example:\n\n```text\n[SNAPSHOT DIFF]\n\nBase:    snap-1\nTarget:  snap-2\n\n+ docs/new.txt\n- docs/old.txt\n~ docs/config.yaml\n\nSummary:\n  added: 1\n  removed: 1\n  modified: 1\n```\n\nJSON output schema:\n\n```json\n{\n  \"status\": 
\"ok\",\n  \"command\": \"snapshot diff\",\n  \"data\": {\n    \"base\": \"snap-1\",\n    \"target\": \"snap-2\",\n    \"summary\": { \"added\": 1, \"removed\": 1, \"modified\": 1 },\n    \"entries\": [\n      { \"path\": \"docs/new.txt\",    \"type\": \"added\",    \"base_logical_id\": null, \"target_logical_id\": 2 },\n      { \"path\": \"docs/old.txt\",    \"type\": \"removed\",  \"base_logical_id\": 1,    \"target_logical_id\": null },\n      { \"path\": \"docs/config.yaml\",\"type\": \"modified\", \"base_logical_id\": 3,    \"target_logical_id\": 4 }\n    ],\n    \"duration_ms\": 12\n  }\n}\n```\n\n`--filter` limits output to one change type (`added`, `removed`, or `modified`). Summary counts reflect the filtered set.\n`--summary` returns counts only and skips detailed `entries` output.\n\n`snapshot diff --summary`:\n\n- displays a summary of changes\n- includes added, removed, and modified counts\n\nThe JSON contract for snapshot commands is unchanged. Query flags only reduce the returned `files` or `entries` collections and the derived counts; field names and envelope structure remain stable.\n\n### Deleting a snapshot\n\n```bash\ncoldkeep snapshot delete snap-abc123 --force\ncoldkeep snapshot delete snap-abc123 --dry-run\n```\n\nDeletes only the snapshot row and its `snapshot_file` entries. The underlying logical files and blocks are not affected.\nDeleting a snapshot removes metadata only. 
Data is retained while it is still referenced by other snapshots or by the current state.\n\n`--dry-run` is read-only and reports impact details (lineage preview and file-count breakdown) without applying changes.\nDry-run impact describes metadata/reference effects and does not guarantee disk-space reclamation.\nWhen both `--force` and `--dry-run` are passed, `--dry-run` takes precedence and the command remains read-only.\n\n`snapshot delete --dry-run` preview includes:\n\n- number of files referenced by the snapshot\n- files unique to this snapshot\n- files shared with other snapshots\n- lineage impact\n\nNo data is modified in dry-run mode.\n\n### Safe lineage workflow (v1.4)\n\nUse this sequence when operating on parent/child snapshots:\n\n```bash\n# 1) Create baseline and child lineage metadata\ncoldkeep snapshot create --id day1\ncoldkeep snapshot create --id day2 --from day1\n\n# 2) Review lineage and impact before delete\ncoldkeep snapshot list --tree\ncoldkeep snapshot delete day1 --dry-run\n\n# 3) If approved, delete parent metadata\ncoldkeep snapshot delete day1 --force\n\n# 4) Verify child remains independently restorable\ncoldkeep snapshot restore day2\n```\n\nExpected behavior:\n\n- Deleting `day1` changes lineage metadata and future GC eligibility only.\n- `day2` remains restorable because snapshots are self-contained.\n- `snapshot list --tree` may re-root children after parent delete; restore behavior is unchanged.\n\n### Snapshot release gate (operator quick checklist)\n\nBefore tagging a release, run the dedicated snapshot/retention contract gate in [PRE_RELEASE_CHECKLIST.md](PRE_RELEASE_CHECKLIST.md).\n\nFor the focused automated snapshot gate, run:\n\n```bash\nscripts/run_snapshot_release_gate.sh --count 1\n```\n\nRun the checklist step-by-step and in order. 
For the manual snapshot lifecycle gate, use a stable snapshot identifier (for example via `snapshot create --id pre-gc-gate`) and pass snapshot IDs positionally in `snapshot restore`, `snapshot diff`, and `snapshot delete`.\n\nManual lifecycle expected in the release gate:\n\n- create snapshot\n- remove current mapping\n- confirm GC dry-run reports snapshot-retained logical files\n- restore from snapshot\n- diff two snapshots\n- delete snapshot\n- confirm GC eligibility changes only after delete\n\nFor the full release criteria, use the snapshot sign-off sections in [PRE_RELEASE_CHECKLIST.md](PRE_RELEASE_CHECKLIST.md):\n\n- `15) Snapshot sign-off checklist (Phases 1-7)`\n- `C. Test surface checklist`\n- `D. Documentation / release checklist`\n- `15) Verify snapshot / retention contract (manual gate)`\n- `16) Final global sign-off`\n\nWhen opening the release PR, use [`.github/pull_request_template.md`](.github/pull_request_template.md)\nto keep impact and validation context explicit.\n\n### Future Hardening Backlog (non-blocking)\n\n- Add fuzz coverage for snapshot query combinations (`--regex`, `--pattern`, `--prefix`) to further harden parser+matcher edge cases.\n- This is a future hardening task and is not part of the current release gate.\n\n## Doctor (recommended health gate)\n\n`coldkeep doctor` is the operator health gate:\n\n- runs recovery first (corrective)\n- then schema/version sanity checks\n- then verification (standard by default; full/deep optional)\n\nDoctor is intentionally corrective, not read-only.\n\n```bash\ncoldkeep doctor\ncoldkeep doctor --full\ncoldkeep doctor --deep --output json\n```\n\n### Recovery strict mode\n\nStartup recovery runs in strict mode by default.\n\n```bash\nCOLDKEEP_STRICT_RECOVERY=true   # default; fail-closed on ambiguous state\nCOLDKEEP_STRICT_RECOVERY=false  # operator escape hatch for recovery investigation\n```\n\nStrict mode is the recommended production setting. 
Disabling it should be treated as a temporary operator escape hatch only, not a normal operating mode.\n\n## Verification\n\nVerification levels:\n\n- standard: metadata integrity\n- full: structural/container integrity\n- deep: full content read + hash validation\n\n```bash\ncoldkeep verify system --standard\ncoldkeep verify system --full\ncoldkeep verify system --deep\n```\n\nVerification checks are observational. In CLI flows, startup recovery may run before verification.\n\n## Documentation Map\n\n- Architecture and internals: [ARCHITECTURE.md](ARCHITECTURE.md)\n- Guarantee mapping and evidence: [VALIDATION_MATRIX.md](VALIDATION_MATRIX.md)\n- Contribution workflow: [CONTRIBUTING.md](CONTRIBUTING.md)\n- Release readiness flow: [PRE_RELEASE_CHECKLIST.md](PRE_RELEASE_CHECKLIST.md)\n- Security reporting and threat guidance: [SECURITY.md](SECURITY.md)\n- Current-state path identity policy: [docs/PATH_IDENTITY.md](docs/PATH_IDENTITY.md)\n- Benchmark infrastructure and baseline policy: [docs/benchmarking.md](docs/benchmarking.md)\n- Frozen v1.9 benchmark baseline contract: [docs/internal/benchmark_baselines_v1_9.md](docs/internal/benchmark_baselines_v1_9.md)\n- Milestone history: [CHANGELOG.md](CHANGELOG.md)\n\n## Roadmap note (post-v1.9)\n\nCurrent status:\n\n- v1.2 physical mapping/repair and audited GC root gates are complete.\n- v1.3/v1.4 snapshot-retention correctness and lifecycle clarity are complete.\n- v1.5 chunker-evolution compatibility contract is complete.\n- v1.6 read-only observability and exact GC simulation tooling are complete.\n- v1.7 controlled-execution performance validation and release-readiness hardening are complete.\n- v1.8 packed block abstraction, AES-GCM packed-block integration, and release hardening are complete.\n- v1.9 transform-based storage semantics, block-level compression, and staged verification are complete.\n\nNext focus is v1.10: architecture extraction on top of frozen v1.9 storage semantics.\n\n## 
Contributing\n\nContributions and discussions are welcome.\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\nApache-2.0. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffranchoy%2Fcoldkeep","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffranchoy%2Fcoldkeep","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffranchoy%2Fcoldkeep/lists"}