{"id":44647653,"url":"https://github.com/yaronkoresh/pakem","last_synced_at":"2026-04-07T18:01:02.459Z","repository":{"id":338485108,"uuid":"1158036089","full_name":"YaronKoresh/pakem","owner":"YaronKoresh","description":"pakem is a repository packaging system designed to convert source trees into portable artifacts for analysis, indexing, sharing, and restoration workflows.","archived":false,"fork":false,"pushed_at":"2026-03-18T13:10:07.000Z","size":220,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-18T15:12:09.811Z","etag":null,"topics":["cli","code-analysis","codebase","context-window","devtools","python","token-counter","xml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YaronKoresh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-14T17:38:40.000Z","updated_at":"2026-03-18T13:10:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"f38b02e0-d7b1-4e6c-8ba3-5a53fdaca637","html_url":"https://github.com/YaronKoresh/pakem","commit_stats":null,"previous_names":["yaronkoresh/pakem"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/YaronKoresh/pakem","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaronKoresh%2Fpakem","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaronKoresh%2Fpakem/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaronKoresh%2Fpakem/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaronKoresh%2Fpakem/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YaronKoresh","download_url":"https://codeload.github.com/YaronKoresh/pakem/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaronKoresh%2Fpakem/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31522574,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"ssl_error","status_checked_at":"2026-04-07T16:28:06.951Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","code-analysis","codebase","context-window","devtools","python","token-counter","xml"],"created_at":"2026-02-14T20:16:03.202Z","updated_at":"2026-04-07T18:01:02.450Z","avatar_url":"https://github.com/YaronKoresh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pakem\n\n`pakem` is a repository packaging system designed to convert source trees into portable artifacts for analysis, indexing, sharing, and restoration workflows.\n\nIt supports document-oriented outputs (`xml`, `json`, `proto`) and a binary archive format 
(`pakem`) with optional compression, reversible encryption, split output, incremental state tracking, and delta reporting.\n\n---\n\n## Table of Contents\n\n1. [Mission and Scope](#mission-and-scope)\n2. [Capability Matrix](#capability-matrix)\n3. [System Architecture](#system-architecture)\n4. [Data Flow and Control Flow](#data-flow-and-control-flow)\n5. [Installation and Environment](#installation-and-environment)\n6. [CLI Reference](#cli-reference)\n7. [Output Formats and File Specifications](#output-formats-and-file-specifications)\n8. [State, Delta, and Restore Semantics](#state-delta-and-restore-semantics)\n9. [Security and Trust Model](#security-and-trust-model)\n10. [Performance and Scalability Tuning](#performance-and-scalability-tuning)\n11. [Validation and Quality Gates](#validation-and-quality-gates)\n12. [Python API Usage](#python-api-usage)\n13. [Operational Playbooks](#operational-playbooks)\n14. [Troubleshooting](#troubleshooting)\n15. [Module-by-Module Internals](#module-by-module-internals)\n16. [Contributing and Release Workflow](#contributing-and-release-workflow)\n17. [Glossary](#glossary)\n18. 
[License](#license)\n\n---\n\n## Mission and Scope\n\n`pakem` exists to solve one problem well:\n\n- Scan a repository deterministically.\n- Analyze textual source files.\n- Serialize normalized metadata and content.\n- Optionally track historical state for delta-oriented operations.\n- Optionally emit a binary artifact that can be restored later.\n\n### Primary Use Cases\n\n| Use Case | Input | Output | Typical Consumer |\n|---|---|---|---|\n| LLM context packaging | Source repository | XML/JSON/Proto/LLM Prompt | Prompt pipelines, RAG preprocessors |\n| Incremental repository snapshots | Source + state file | XML/JSON/Proto/Pakem + updated state | CI and scheduled jobs |\n| Change-only artifact generation | Source + previous state + `--delta` | Delta subset + diff manifest | Review and sync automation |\n| Archive and restore workflow | Source | `.pakem` (single or split) | Backup, transfer, migration |\n| Archive-to-archive change analysis | Two artifacts | Added/modified/removed report | Regression triage and release validation |\n| Cloud artifact transport | Local source + cloud URI (`s3://`, `gs://`, `az://`) | Artifact read/write over object storage | Remote backup and pipeline automation |\n| Archive inspection and reporting | Existing artifact(s) | TUI/plain explorer and HTML diff report | Release engineering and audits |\n| Ecosystem loaders | Existing artifact(s) | LangChain documents / LlamaIndex nodes | RAG and indexing services |\n| Ignore-rule diagnostics | Source + ignore patterns/files | Printed ignored list | Build engineering and DevEx |\n\n---\n\n## Capability Matrix\n\n| Capability | xml | json | proto | pakem | llm-prompt |\n|---|---:|---:|---:|---:|---:|---:|\n| Full repository metadata | Yes | Yes | Yes | Yes | Yes |\n| File line-level content | Yes | Yes | Yes | Packed payload | Structured prompt blocks |\n| Incremental state tracking (`--state`) | Yes | Yes | Yes | Yes | Yes |\n| Delta mode (`--delta`) | Yes | Yes | Yes | Yes | Yes |\n| Diff manifest output (`diff --diff-out`) | Yes | Yes | Yes | Yes | Yes |\n| HTML diff report (`--html-diff-out` / `archive-diff --html-out`) | Yes | Yes | Yes | Yes | Yes |\n| Optional compression (`--compress zlib/zstd/lz4`) | No | No | No | Yes | No |\n| Optional reversible encryption (`--encrypt-key`) | No | No | No | Yes | No |\n| Optional split output (`--split-size`) | No | No | No | Yes | No |\n| Chunk-level dedup (`--dedup-chunks`) | No | No | No | Yes | No |\n| Restore support (`restore`) | No | No | No | Yes | No |\n| Semantic chunking (`--semantic-chunking`) | Yes | Yes | Yes | Yes | Yes |\n| Git tracked-files mode (`--tracked-files`) | Yes | Yes | Yes | Yes | Yes |\n| Distributed shard filtering (`--distributed-shards` + `--distributed-index`) | Yes | Yes | Yes | Yes | Yes |\n| Analysis cache mode (`--cache-mode`) | Yes | Yes | Yes | Yes | Yes |\n| Cloud URI output/input (`s3://`, `gs://`, `az://`) | Yes | Yes | Yes | Yes | Yes |\n\n---\n\n## System Architecture\n\n### High-Level Module Topology\n\n```mermaid\nflowchart TB\n    CLI[cli.py] --\u003e CMD[commands.py]\n    CMD --\u003e PACKER[packer.py]\n    PACKER --\u003e FS[fs.py]\n    PACKER --\u003e ANALYZE[analyze.py]\n    PACKER --\u003e STATE[state.py]\n    PACKER --\u003e SERIALIZE[serialize.py]\n    PACKER --\u003e VALIDATE[validation.py]\n    ANALYZE --\u003e TOKEN[tokenizer.py]\n    FS --\u003e IGNORE[ignore.py]\n```\n\n### Command Dispatch Model\n\n```mermaid\nflowchart LR\n    A[argv] --\u003e B{normalize argv}\n    B --\u003e|no subcommand| C[prepend pack]\n    B --\u003e|has subcommand| D[keep argv]\n    C --\u003e E[argparse subparsers]\n    D --\u003e E[argparse subparsers]\n    E --\u003e F{command}\n    F --\u003e|pack| G[PackCommand.execute]\n    F --\u003e|diff| H[DiffCommand.execute]\n    F --\u003e|restore| I[RestoreCommand.execute]\n    F --\u003e|archive-diff| J[ArchiveDiffCommand.execute]\n    F --\u003e|explore| K[ExploreCommand.execute]\n    F --\u003e|setup-precommit| L[SetupPrecommitCommand.execute]\n```\n\n### Packing 
Execution Pipeline\n\n```mermaid\nflowchart TD\n    START[RepoPacker.pack] --\u003e SR[start_repository]\n    SR --\u003e WALK[walk file tree]\n    WALK --\u003e FILTER[ignore + binary filter]\n    FILTER --\u003e ANALYSIS[parallel analyze_entry]\n    ANALYSIS --\u003e STATEUPD[update current state]\n    STATEUPD --\u003e SERIAL[serializer.add_file]\n    SERIAL --\u003e PAYLOAD[optional pakem payload transforms]\n    PAYLOAD --\u003e ENDREP[end_repository]\n    ENDREP --\u003e TOTALS[update totals]\n    TOTALS --\u003e WRITE[serializer.write_to]\n    WRITE --\u003e SAVE[state save if configured]\n    SAVE --\u003e DONE[exit code 0]\n```\n\n---\n\n## Data Flow and Control Flow\n\n### Pack Command Data Contract\n\n| Stage | Input | Output | Invariants |\n|---|---|---|---|\n| Argument resolution | CLI options | `Namespace` | Subcommand is one of `pack`, `diff`, `restore`, `archive-diff`, `explore`, `setup-precommit` |\n| Output path resolution | `--out`, `--format` | concrete file path or cloud URI | If `--out` has no suffix, suffix is inferred by format |\n| File walk | root + ignore rules | `FileEntry` stream | Relative paths normalized to `/` |\n| Text analysis | file path | `FileMetadata` + content lines | Binary files are skipped |\n| State update | previous state + file hash | current state | Each processed text file gets deterministic state entry |\n| Serialization | metadata + lines | format artifact | Repository totals updated at end |\n\n### Restore Command Data Contract\n\n| Stage | Input | Output | Invariants |\n|---|---|---|---|\n| Artifact read | `.pakem` or split parts | bytes | Header must start with `PAKM` |\n| Header parse | artifact bytes | metadata JSON + payload bytes | Version byte must be supported by negotiation metadata |\n| Chunk reconstruction | payload stream + per-file lengths | file bytes | Transform reversal is inverse of transform order |\n| Write stage | `target_dir` + relative path | restored files | Path traversal outside target is 
rejected |\n\n---\n\n## Installation and Environment\n\n### Minimal Installation\n\n```bash\npip install pakem\n```\n\n### Development Installation\n\n```bash\ngit clone https://github.com/YaronKoresh/pakem.git\ncd pakem\npip install -e \".[dev]\"\n```\n\n### Optional Extras\n\nInstall optional extras when needed:\n\n```bash\npip install -e \".[extra]\"\n```\n\nExtras currently include:\n\n| Package | Enables |\n|---|---|\n| `pathspec` | Advanced gitignore-style pattern matching |\n| `tiktoken` | Model-aware token counting |\n| `protobuf` | Proto serializer support |\n| `dulwich` | Git-native tracked-files and metadata enrichment |\n| `zstandard` | zstd compression profile |\n| `lz4` | lz4 compression profile |\n| `fsspec` + cloud FS plugins (`s3fs`, `gcsfs`, `adlfs`) | Cloud URI read/write for artifacts and reports |\n| `langchain` | LangChain loader adapter |\n| `llama-index` | LlamaIndex reader adapter |\n\n### Runtime Requirements\n\n| Requirement | Value |\n|---|---|\n| Python | `\u003e=3.10` |\n| Project version | `2.0.0` |\n| Entry point | `pakem = pakem.cli:main` |\n\n---\n\n## CLI Reference\n\n## Global Invocation Forms\n\n```bash\npakem \u003csubcommand\u003e [options]\npython -m pakem \u003csubcommand\u003e [options]\n```\n\nIf no subcommand is supplied, `pack` is implicitly used.\n\n## `pack` Command\n\n```bash\npakem pack [--path PATH] [--out OUT] [--format {xml,json,proto,pakem,llm-prompt}]\n```\n\n### `pack` Options Table\n\n| Option | Type | Default | Description |\n|---|---|---|---|\n| `--path` | string | `.` | Root directory to process |\n| `--out` | string | `repo` | Output path or base name |\n| `--ignore` | list[string] | none | Additional ignore patterns |\n| `--include` | list[string] | none | Allowlist patterns; only matching paths are considered |\n| `--tracked-files` | flag | `false` | Only include files tracked in git index |\n| `--git-metadata` | flag | `false` | Enrich file metadata with commit hash/author/date |\n| 
`--semantic-chunking` | flag | `false` | Preserve class/function boundaries when rendering file content |\n| `--summary-mode` | enum | `off` | Optional low-priority summarization mode |\n| `--plugin` | list[string] | none | Optional plugin module paths loaded before execution |\n| `--cache-mode` | enum | `off` | Analysis cache mode (off/local/memory) |\n| `--dedup-chunks` | flag | `false` | Enable chunk-level deduplication for pakem payloads |\n| `--distributed-shards` | int | none | Total number of shards for distributed packing |\n| `--distributed-index` | int | none | Zero-based shard index for this run |\n| `--ignore-file` | string | none | Path to extra ignore file |\n| `--state` | string | none | JSON state file path |\n| `--delta` | flag | `false` | Include only changed files |\n| `--max-file-size` | size | none | Skip files larger than this threshold (`512KB`, `10MB`, `1GB`) |\n| `--max-total-tokens` | int | none | Cap packaged token total across selected files |\n| `--dry-run` | flag | `false` | Analyze and report without writing package/state/report files |\n| `--focus-ranking` | enum | `basic` | Ranking strategy used when token budget is constrained |\n| `--list-ignored` | flag | `false` | Print ignored entries and exit |\n| `--model` | string | none | Tokenization model hint |\n| `--workers` | int | auto | Analysis worker count (positive integer) |\n| `--format` | enum | `xml` | Output format |\n| `--emit-schema` | string | none | Schema output path |\n| `--schema-format` | enum | `xml` | Schema format |\n| `--compress` | enum | `none` | pakem payload compression |\n| `--encrypt-key` | string | none | pakem reversible key |\n| `--cipher` | enum | `aes-gcm` | Encryption profile for pakem payload encryption |\n| `--sign-key` | string | none | Optional provenance signature key (`hmac-sha256`) |\n| `--split-size` | size | none | pakem split threshold (`1MB`, `512KB`, `2GB`) |\n| `--sensitive-data-policy` | enum | `off` | Sensitive data handling mode 
(off/warn/redact/block) |\n| `--secret-scanner` | enum | `builtin` | Secret scanner integration mode (builtin/gitleaks/trufflehog/auto/off) |\n| `--sensitive-report-out` | string | none | Optional JSON report output for sensitive-data findings |\n| `--selection-report-out` | string | none | Optional JSON report with selected and skipped paths |\n\n### `pack` Examples\n\n```bash\n# Default pack in current directory (implicit .xml suffix)\npakem pack\n\n# Explicit format with auto extension\npakem pack --path ./repo --format json --out snapshot\n\n# Delta pack using state file\npakem pack --path ./repo --state .pakem-state.json --delta --out delta-report\n\n# Focused pack with include allowlist and token budget\npakem pack --path ./repo --include \"src/**\" --max-total-tokens 20000 --focus-ranking basic --out focused\n\n# Analysis-only pass with no artifact writes\npakem pack --path ./repo --dry-run --max-file-size 1048576\n\n# pakem archive with payload transforms and splitting\npakem pack --path ./repo --format pakem --compress zlib --encrypt-key key123 --split-size 1048576 --out archive\n\n# LLM prompt profile with semantic chunking\npakem pack --path ./repo --format llm-prompt --semantic-chunking --summary-mode basic --out prompt\n```\n\n## `diff` Command\n\n```bash\npakem diff --state STATE [--path PATH] [--out OUT] [--diff-out FILE]\n```\n\n### `diff` Options Table\n\n| Option | Type | Required | Description |\n|---|---|---:|---|\n| `--state` | string | Yes | Existing state file to compare against |\n| `--path` | string | No | Root directory (default `.`) |\n| `--out` | string | No | Artifact output base/path |\n| `--diff-out` | string | No | JSON diff manifest output path |\n| `--html-diff-out` | string | No | HTML diff report output path |\n| `--ignore` | list[string] | No | Additional ignore patterns |\n| `--include` | list[string] | No | Allowlist patterns; only matching paths are considered |\n| `--tracked-files` | flag | No | Only include files tracked in 
git index |\n| `--git-metadata` | flag | No | Enrich file metadata with commit hash/author/date |\n| `--semantic-chunking` | flag | No | Preserve class/function boundaries when rendering file content |\n| `--summary-mode` | enum | No | Optional low-priority summarization mode |\n| `--plugin` | list[string] | No | Optional plugin module paths loaded before execution |\n| `--cache-mode` | enum | No | Analysis cache mode (off/local/memory) |\n| `--dedup-chunks` | flag | No | Enable chunk-level deduplication for pakem payloads |\n| `--distributed-shards` | int | No | Total number of shards for distributed packing |\n| `--distributed-index` | int | No | Zero-based shard index for this run |\n| `--ignore-file` | string | No | Extra ignore file |\n| `--format` | enum | No | Output format |\n| `--max-file-size` | size | No | Skip files larger than this threshold (`512KB`, `10MB`, `1GB`) |\n| `--max-total-tokens` | int | No | Cap packaged token total across selected files |\n| `--dry-run` | flag | No | Analyze without writing package or diff output files |\n| `--focus-ranking` | enum | No | Ranking strategy used when token budget is constrained |\n| `--selection-report-out` | string | No | Optional JSON report with selected and skipped paths |\n| `--secret-scanner` | enum | No | Secret scanner integration mode (builtin/gitleaks/trufflehog/auto/off) |\n\nConstraint semantics:\n\n- `--max-file-size` and `--max-total-tokens` define the selected scope.\n- Selected scope is used consistently for artifact content, persisted state entries, and `diff` output.\n- Runtime stats include skip counters for max-file-size and token-budget exclusions.\n- `--selection-report-out` emits selected paths and skip reasons for automation and audits.\n- Size arguments accept case-insensitive suffixes: `B`, `KB`, `MB`, `GB`, `TB`.\n\nState backend semantics:\n\n- File backend (default): pass a normal path to `--state`.\n- Memory backend: `--state memory://\u003ckey\u003e`.\n- SQLite backend: 
`--state sqlite:///path/to/state.db?key=\u003cstate-key\u003e`.\n\nArchive negotiation and provenance:\n\n- pakem archives include `min_reader_version` and `max_reader_version` metadata.\n- Optional signatures use `--sign-key` during pack and `--verify-signature-key` during restore.\n- Restore rejects signature mismatches and incompatible negotiation ranges.\n\nPack option compatibility:\n\n- `--compress`, `--encrypt-key`, and `--split-size` are valid only with `--format pakem`.\n- `--cipher` customization is valid only with `--format pakem`.\n- `--cipher none` cannot be combined with `--encrypt-key`.\n\n### `diff` Output Schema\n\nIf `--diff-out` is provided, JSON shape is:\n\n```json\n{\n  \"added\": [\"...\"],\n  \"modified\": [\"...\"],\n  \"removed\": [\"...\"]\n}\n```\n\n## `restore` Command\n\n```bash\npakem restore --in ARCHIVE --target TARGET [--format pakem] [--compress {none,zlib,zstd,lz4}] [--encrypt-key KEY]\n```\n\n### `restore` Notes\n\n- Supports `.pakem` artifacts and split sequences (`.pakem.part001`, `.part002`, ...).\n- Uses metadata `payload_length` values to reconstruct file payload boundaries.\n- Rejects writes that resolve outside `--target`.\n\n## `archive-diff` Command\n\n```bash\npakem archive-diff --left OLD --right NEW [--left-format FMT] [--right-format FMT] [--out OUT] [--html-out REPORT.html]\n```\n\nProduces deterministic added/modified/removed results without scanning a live repository.\n\n## `explore` Command\n\n```bash\npakem explore --in ARCHIVE [--tui]\n```\n\nInspects archive entries either with plain terminal output or a curses-based TUI (`--tui`).\n\n## `setup-precommit` Command\n\n```bash\npakem setup-precommit [--path PATH] [--force]\n```\n\nGenerates `.pre-commit-config.yaml` with `ruff`, `ruff-format`, and a local `poe check` hook.\n\n---\n\n## Output Formats and File Specifications\n\n## Extension Auto-Selection\n\nWhen `--out` has no suffix:\n\n| Format | Applied Suffix |\n|---|---|\n| `xml` | `.xml` |\n| `json` | 
`.json` |\n| `proto` | `.pb` |\n| `pakem` | `.pakem` |\n| `llm-prompt` | `.prompt.md` |\n\nIf `--out` already has a suffix, it is preserved.\n\n## XML and JSON\n\nBoth represent repository metadata, directory records, and file records. XML uses nested elements, JSON uses structured objects.\n\n## Protobuf\n\nProtobuf uses dynamic message generation from `pakem.proto` descriptor construction at runtime.\n\n## pakem Binary Container\n\n### Binary Layout\n\n| Segment | Size | Description |\n|---|---:|---|\n| Magic | 4 bytes | ASCII `PAKM` |\n| Version | 1 byte | Current value: `2` |\n| Header length | 4 bytes (big-endian) | Byte length of metadata JSON |\n| Metadata | variable | UTF-8 JSON with repository/file descriptors |\n| Payload | variable | Concatenated file payload chunks |\n\n### Metadata Core Keys\n\n| Key | Type | Description |\n|---|---|---|\n| `repository` | object | Root metadata, totals, timestamp |\n| `directories` | array | Optional directory entries |\n| `files` | array | File descriptors including `payload_length` |\n| `payload_size` | int | Total payload bytes |\n\n### Split Output Behavior\n\nIf final blob size exceeds `--split-size`, writer emits:\n\n- `name.pakem.part001`\n- `name.pakem.part002`\n- ...\n\nNo root `.pakem` file is emitted in split mode.\n\n---\n\n## State, Delta, and Restore Semantics\n\n## State Model\n\nState file stores:\n\n| Field | Type | Description |\n|---|---|---|\n| `version` | int | Schema version, current default is `2` |\n| `files` | object map | `rel_path -\u003e {mtime, size, sha256}` |\n\nLegacy state without `version` loads as version `1`.\n\n## Delta Computation\n\n`RepoState.diff_paths(new_state)` returns sorted path lists for:\n\n- `added`\n- `modified`\n- `removed`\n\nIn delta mode:\n\n- unchanged files are not serialized into output payload\n- serializer repository metadata can include a `delta` block\n\n## Restore Semantics\n\nRestore requires `pakem` format and follows this inversion logic:\n\n1. 
Parse header and metadata.\n2. Slice payload by `payload_length` per file.\n3. Reverse encryption (if key supplied).\n4. Reverse compression (`zlib`, `zstd`, or `lz4`) if enabled.\n5. Validate target path safety.\n6. Write file bytes.\n\n---\n\n## Security and Trust Model\n\n## Current Security Controls\n\n| Control | Status | Description |\n|---|---|---|\n| Path traversal prevention | Enabled | Restore checks target path boundaries via realpath/commonpath logic |\n| Format sanity checks | Enabled | `validate_pakem` checks magic/version/header consistency |\n| Binary detection | Enabled | Binary files skipped during source pack stage |\n\n## Important Cryptography Note\n\nThe default encryption profiles (`aes-gcm` and `chacha20-poly1305`) provide authenticated encryption with metadata-bound authentication.\n\nA legacy reversible XOR mode is available only when explicitly enabled, to keep older archives restorable.\n\n---\n\n## Performance and Scalability Tuning\n\n## Worker Strategy\n\nBy default, the worker count is derived from the CPU count (`cpu_count * 4`, with a minimum of 1).\n\nGuidance:\n\n| Repository profile | Suggested `--workers` |\n|---|---:|\n| Small (\u003c5k files) | auto |\n| Medium (5k-50k files) | 8-16 |\n| Large monorepo | 16-32 (validate host IO limits) |\n\n## Throughput Considerations\n\n| Factor | Impact |\n|---|---|\n| Binary file prevalence | More binaries mean a faster total run due to skip behavior |\n| Tokenizer backend | `tiktoken` can improve model parity; the regex fallback avoids the dependency |\n| State availability | Existing state can reduce expensive hash recompute paths |\n| Analysis cache mode | `local` and `memory` cache modes reduce repeated analysis work |\n| mmap hashing | Large-file hashing uses mmap when available for lower memory churn |\n| Pakem transforms | Compression and encryption increase CPU load |\n\n## Flowchart: Performance Path\n\n```mermaid\nflowchart TD\n    A[File discovered] --\u003e B{binary?}\n    B --\u003e|yes| C[skip]\n    B --\u003e|no| D[analyze + 
hash]\n    D --\u003e E{delta + unchanged?}\n    E --\u003e|yes| F[exclude from artifact]\n    E --\u003e|no| G[serialize]\n    G --\u003e H{format pakem?}\n    H --\u003e|no| I[write structured output]\n    H --\u003e|yes| J[compress/encrypt/split]\n```\n\n---\n\n## Validation and Quality Gates\n\n## Built-in Validators\n\n| Validator | Target |\n|---|---|\n| `validate_xml(path)` | XML artifacts |\n| `validate_json(path)` | JSON artifacts |\n| `validate_proto(path)` | Proto artifacts |\n| `validate_pakem(path)` | pakem binary artifacts |\n| `validate(path, format=None)` | Format inference + dispatch |\n\n## Project Quality Commands\n\n| Goal | Command |\n|---|---|\n| Run tests | `pytest -q` |\n| Run linter | `ruff check .` |\n| Compile check | `python -m compileall -q .` |\n| All checks (poe) | `poe check` |\n\n---\n\n## Python API Usage\n\n## Basic Pack\n\n```python\nfrom pakem import IgnoreRules, RepoPacker\nfrom pakem.fs import FileWalker\n\nroot = \"/path/to/repo\"\nout = \"repo.xml\"\n\nrules = IgnoreRules.from_defaults(root, extra_patterns=[\"*.tmp\"])\nwalker = FileWalker(root, rules, output_path=out)\n\npacker = RepoPacker(\n    root_dir=root,\n    output_file=out,\n    ignore_rules=rules,\n    walker=walker,\n    output_format=\"xml\",\n)\n\nexit_code = packer.pack()\nprint(exit_code)\n```\n\n## Advanced Pack (pakem)\n\n```python\nfrom pakem import IgnoreRules, RepoPacker\nfrom pakem.fs import FileWalker\n\nroot = \"/path/to/repo\"\nout = \"archive.pakem\"\n\nrules = IgnoreRules.from_defaults(root)\nwalker = FileWalker(root, rules, output_path=out)\n\npacker = RepoPacker(\n    root_dir=root,\n    output_file=out,\n    ignore_rules=rules,\n    walker=walker,\n    state_path=\".pakem-state.json\",\n    delta=True,\n    output_format=\"pakem\",\n    compression=\"zlib\",\n    encryption_key=\"demo-key\",\n    split_size=2_000_000,\n)\n\npacker.pack()\n```\n\n## Restore via API\n\n```python\nfrom pakem import IgnoreRules, RepoPacker\nfrom pakem.fs import 
FileWalker\n\ntarget = \"./restored\"\nrules = IgnoreRules.from_defaults(target)\n\npacker = RepoPacker(\n    root_dir=target,\n    output_file=\"archive.pakem\",\n    ignore_rules=rules,\n    walker=FileWalker(target, rules),\n    output_format=\"pakem\",\n    compression=\"zlib\",\n    encryption_key=\"demo-key\",\n)\n\npacker.restore(\"archive.pakem\", target)\n```\n\n---\n\n## Operational Playbooks\n\n## Playbook A: Daily Incremental Snapshot\n\n```mermaid\nflowchart LR\n    A[Load previous state] --\u003e B[Run pack --delta]\n    B --\u003e C[Publish artifact]\n    C --\u003e D[Store new state]\n    D --\u003e E[Run validate]\n```\n\nSteps:\n\n1. Keep a persistent state file per repository.\n2. Run `pakem pack --state state.json --delta`.\n3. Store resulting artifact and updated state file together.\n4. Optionally run `pakem diff --state state.json --diff-out diff.json` for change reports.\n\n## Playbook B: Transfer as Split Binary\n\n1. `pakem pack --format pakem --split-size 1048576 --out archive`\n2. Transfer all `.partNNN` files.\n3. Restore with:\n   `pakem restore --in archive.pakem --target ./target`\n\n## Playbook C: Ignore Rules Debug Session\n\n1. Add new patterns via `--ignore` or `--ignore-file`.\n2. Execute `pakem pack --list-ignored --path ...`.\n3. 
Validate expected paths appear in output list.\n\n---\n\n## Troubleshooting\n\n| Symptom | Likely Cause | Corrective Action |\n|---|---|---|\n| `restore` returns `1` with no files | Wrong format or invalid magic/header | Verify archive starts with `PAKM`, run `validate_pakem` |\n| Missing expected files in output | Ignore rules filtering them | Use `--list-ignored` and adjust patterns |\n| Output extension not what you expected | `--out` had explicit suffix | Remove suffix from `--out` to use auto-extension |\n| Token counts seem generic | `tiktoken` not installed or model unsupported | Install extras and pass `--model` |\n| Delta output too large | State file missing or stale | Keep state persisted and scoped per repository |\n| Split archive not restoring | Part files missing/out of order | Ensure contiguous `.partNNN` files are present |\n\n### Diagnostic Commands\n\n```bash\n# Check lints and imports\nruff check .\n\n# Verify runtime behavior\npytest -q\n\n# Validate artifact (Python snippet)\npython -c \"from pakem.validation import validate; validate('archive.pakem', 'pakem')\"\n```\n\n---\n\n## Module-by-Module Internals\n\n| Module | Responsibility | Key Types/Functions |\n|---|---|---|\n| `pakem/cli.py` | Argument parsing and subcommand routing | `main`, `resolve_output_path` |\n| `pakem/commands.py` | Command execution adapters | `PackCommand`, `DiffCommand`, `RestoreCommand`, `ArchiveDiffCommand`, `ExploreCommand`, `SetupPrecommitCommand` |\n| `pakem/packer.py` | Core orchestration pipeline | `RepoPacker.pack`, `RepoPacker.diff`, `RepoPacker.restore` |\n| `pakem/fs.py` | Deterministic file traversal | `FileWalker`, `FileEntry` |\n| `pakem/ignore.py` | Ignore pattern loading and matching | `IgnoreRules` |\n| `pakem/analyze.py` | Metadata extraction and token/line counting | `FileMetadata`, `analyze_text` |\n| `pakem/tokenizer.py` | Token counting backends | `RegexTokenCounter`, `TiktokenTokenCounter` |\n| `pakem/state.py` | Incremental file state and 
diffs | `RepoState`, `FileState`, `diff_paths` |\n| `pakem/serialize.py` | XML/JSON/Proto/Pakem serializers | `XmlSerializer`, `JsonSerializer`, `ProtoSerializer`, `PakemSerializer` |\n| `pakem/cloud_io.py` | Local/cloud read-write abstraction | `read_bytes`, `write_text`, `write_bytes` |\n| `pakem/cache.py` | Analysis cache backends | `AnalysisCache`, `create_cache` |\n| `pakem/plugins.py` | Runtime plugin loading | `load_plugins`, `register_analyzer` |\n| `pakem/reports.py` | HTML reporting | `render_html_diff_report` |\n| `pakem/loaders.py` | Ecosystem data loaders | `PakemLangChainLoader`, `PakemLlamaIndexReader` |\n| `pakem/tui.py` | Archive exploration UI | `explore_archive` |\n| `pakem/validation.py` | Artifact validation and path safety | `validate`, `validate_pakem`, `is_path_safe` |\n| `pakem/proto.py` | Dynamic protobuf schema descriptor | `get_repository_message_class` |\n\n---\n\n## Contributing and Release Workflow\n\n## Development Workflow\n\n```bash\npip install -e \".[dev,extra]\"\nruff check .\npytest -q\n```\n\n## Suggested Pull Request Checklist\n\n| Check | Status |\n|---|---|\n| New behavior covered by tests | Required |\n| Lint passes (`ruff check .`) | Required |\n| README updated for CLI/API changes | Required |\n| Backward compatibility considered | Recommended |\n| State/format migrations documented | Recommended |\n\n## Packaging Tasks\n\n| Task | Command |\n|---|---|\n| Build source + wheel | `poe build` |\n| Build wheel only | `poe build-wheel` |\n| Install pre-commit hooks | `poe hook` |\n\n---\n\n## Glossary\n\n| Term | Meaning |\n|---|---|\n| Artifact | Final output file(s) produced by a run |\n| Delta mode | Serialization of only changed files relative to previous state |\n| State file | JSON map of file path to mtime/size/hash used for incremental processing |\n| Payload length | Byte length of one file's serialized bytes inside pakem payload stream |\n| Split archive | Multi-part output generated when blob exceeds `--split-size` 
|\n\n---\n\n## License\n\nThis project is licensed under the GNU General Public License v3.0 or later.\n\nSee [LICENSE](LICENSE) for full terms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyaronkoresh%2Fpakem","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyaronkoresh%2Fpakem","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyaronkoresh%2Fpakem/lists"}