{"id":51157348,"url":"https://github.com/ngpepin/lshash","last_synced_at":"2026-06-26T11:30:23.266Z","repository":{"id":354299683,"uuid":"1222909350","full_name":"ngpepin/lshash","owner":"ngpepin","description":"A corpus-hygiene utility for RAG data pipelines that identifies duplicate content risk, quantifies duplication with actionable statistics, and supports controlled remediation before indexing. It enables staged audit-then-cull workflows that improve retrieval quality, reduce embedding/indexing cost, and strengthen governance in knowledge curation.","archived":false,"fork":false,"pushed_at":"2026-05-05T20:13:01.000Z","size":39484,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-05T20:24:50.954Z","etag":null,"topics":["bash","corpus-hygiene","data-curation","data-governance","data-quality","document-deduplication","dotnet","file-deduplication","knowledge-management","rag","retrieval-augmented-generation"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ngpepin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-27T20:34:46.000Z","updated_at":"2026-05-05T20:13:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ngpepin/lshash","commit_stats":null,"previous_names":["ngpepin/lshash"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ngpepin/lshash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngpepin%2Flshash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngpepin%2Flshash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngpepin%2Flshash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngpepin%2Flshash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ngpepin","download_url":"https://codeload.github.com/ngpepin/lshash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ngpepin%2Flshash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34815669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-26T02:00:06.560Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","corpus-hygiene","data-curation","data-governance","data-quality","document-deduplication","dotnet","file-deduplication","knowledge-management","rag","retrieval-augmented-generation"],"created_at":"2026-06-26T11:30:21.135Z","updated_at":"2026-06-26T11:30:23.245Z","avatar_url":"https://github.com/ngpepin.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lshash\n\nA corpus-hygiene utility for RAG data pipelines that identifies duplicate content risk, quantifies duplication with actionable statistics, and supports controlled remediation before indexing. It enables staged audit-then-cull workflows that improve retrieval quality, reduce embedding/indexing cost, and strengthen governance in knowledge curation operations.\n\nTopic tags: rag, retrieval-augmented-generation, data-curation, data-governance, corpus-hygiene, document-deduplication, file-deduplication, knowledge-management, data-quality, bash, dotnet\n\n## Documentation map\n\n- `README.md`: quick-start reference and command/flag summary\n- `USERGUIDE.md`: step-by-step tutorial with practical workflows\n- `ARCHITECTURE.md`: internal design and implementation architecture (Bash + .NET)\n\n## Features\n\n- Sorts files alphabetically.\n- Aligns the hash column based on the longest displayed file name.\n- Supports multiple hash algorithms.\n- Defaults to BLAKE3.\n- Can recurse into subdirectories.\n- Supports built-in exclusions plus user-defined exclusion patterns.\n- Ignores `.dups/` directories by default.\n- In recursive mode, processes and prints results directory-by-directory as traversal encounters them.\n- Continues processing on per-file access errors and emits warnings instead of halting.\n- Highlights adjacent matching hashes in green.\n- Optional dedupe mode to keep one file and move duplicates into hidden `.dups/` directories.\n- Prints a completion summary with duplicate counts and percentages.\n- Supports macOS Catalina-compatible traversal behavior (no GNU `find -printf` / `sort -z` dependency).\n- Built-in exclusions include common VCS/editor/temp artifacts and `*.lshash.json` sidecars.\n\n## Upfront use-case perspective\n\nThis tool was developed as a corpus-hygiene control for RAG pipelines.\n\nIn production RAG systems, duplicate files can create duplicate chunks, increase embedding/indexing spend, and over-weight repeated content during retrieval. That can reduce answer quality and make retrieval behavior less predictable.\n\nThe intended workflow is a staged curation process:\n\n- Phase 1 (audit, no mutation): run without `-d` to profile duplication as part of pre-ingestion assessment. Use the completion statistics to quantify duplicate-file rate before chunking and embedding.\n- Phase 2 (remediation, optional): run with `-d` (and optionally `--directory` for full-directory grouping) to quarantine duplicates into `.dups/`, reducing corpus redundancy before indexing.\n- Phase 3 (post-curation validation): re-run audit and compare summary metrics to confirm that curation improved corpus quality.\n\nThis separation of discovery and action supports safer change control, clearer governance, and repeatable RAG data-preparation practice.\n\n## Script\n\n- `lshash.sh`\n\n## Implementations\n\n- Bash implementation:\n  - Script: `lshash.sh`\n  - Supports contiguous dedupe and `--directory` dedupe (with `--all-directory` as a compatibility alias)\n- .NET implementation:\n  - Project: `dotnet/`\n  - Supports the same runtime options and dedupe variants as Bash\n\n## Requirements\n\n- Bash 3.2+\n- Standard Unix tools: `find`, `sort`, `awk`, `stat`, `mv`\n- Hash command for selected algorithm:\n  - `b3sum` for `blake3`\n  - `sha256sum` for `sha256`\n  - `sha512sum` for `sha512`\n  - `sha1sum` for `sha1`\n  - `md5sum` for `md5`\n  - `b2sum` for `blake2`\n\n### macOS note\n\n- The script now runs on macOS Catalina or later shell/tooling for traversal and sorting behavior.\n- Hash command requirements still apply by algorithm choice. On macOS, `blake3` is typically the easiest path because `b3sum` can be auto-installed when package tooling is available.\n- For non-BLAKE3 algorithms on macOS, the script prefers GNU `*sum` tools when installed, but automatically falls back to native commands where possible (`shasum` for `sha256`/`sha512`/`sha1`, and `md5` for `md5`).\n- On legacy Bash (for example macOS system Bash 3.2), the script relaxes `nounset` (`set +u`) internally to avoid known empty-array expansion failures while preserving other strict-mode protections.\n\n### BLAKE3 auto-install behavior\n\nIf `blake3` is selected and `b3sum` is missing, the script attempts an automatic install using a detected package manager.\n\n- Uses non-interactive elevation (`sudo -n`) when needed.\n- Uses a timeout for install attempts.\n- Timeout defaults to 20 seconds and can be overridden:\n\n```bash\nLSHASH_INSTALL_TIMEOUT=10 ./lshash.sh\n```\n\nIf installation cannot be done automatically, the script exits with guidance.\n\n## .NET 10 implementation\n\nThis repository also includes a .NET 10 C# implementation with behavior parity to the Bash script.\n\n### Build a self-contained single-file executable\n\n```bash\ncd dotnet\n./build.sh\n```\n\nOptional runtime identifier argument:\n\n```bash\ncd dotnet\n./build.sh linux-x64\n```\n\nOutput executable:\n\n- `dotnet/dist/linux-x64/lshash`\n\nThe publish configuration is self-contained and single-file, so no .NET runtime is required on the target host.\nThe .NET build also enables invariant globalization, so `libicu` is not required on minimal Linux containers.\n\n### Build native macOS self-contained binaries\n\n```bash\ncd dotnet\n./build-macos.sh\n```\n\nBy default, `build-macos.sh` publishes `net6.0` binaries for better macOS Catalina compatibility.\n\nOptional target selection:\n\n```bash\ncd dotnet\n./build-macos.sh osx-arm64\n./build-macos.sh osx-x64\n./build-macos.sh --framework net10.0 osx-arm64\n```\n\nOutput executables:\n\n- `dotnet/dist/osx-arm64/lshash`\n- `dotnet/dist/osx-x64/lshash`\n\n### macOS deployment for .NET implementation\n\nIf you prefer a containerized execution path, use the Docker deployment bundle:\n\n```bash\ncd dotnet/deploy/macos\n./deploy.sh build\n./deploy.sh audit /path/to/scan\n./deploy.sh cull /path/to/scan\n```\n\nThe deployment wrapper is documented in `dotnet/deploy/macos/README.md`.\n\n### Run from source\n\n```bash\ncd dotnet\ndotnet run -c Release -- --help\n```\n\n### .NET options\n\nThe .NET implementation supports the same options as Bash (`--algorithm`, `-r/--recursive`, `-e/--exclude`, `-d/--dedupe`, `--directory` (alias `--all-directory`), `--global`, `--prompt-delete`, `--move-dups`, `-q/--quiet`, optional `DIRECTORY`):\n\n- `--directory` (alias: `--all-directory`)\n  - With `-d/--dedupe`, dedupe by hash across all files in each directory, ignoring filename adjacency\n  - Without `-d/--dedupe`, this flag is a no-op\n- `--global`\n  - With `-d/--dedupe` and `-r/--recursive`, dedupe by hash across the entire recursive tree\n  - With `-d/--dedupe` without `-r/--recursive`, behaves like `--directory` on the selected directory\n  - Sidecar metadata files `\u003cmoved-file\u003e.lshash.json` are created only in recursive global mode (`-r -d --global`)\n  - In dedupe mode, any directory containing `.lshash-exclude` is skipped with descendants\n  - Without `-d/--dedupe`, this flag is a no-op\n- `--prompt-delete`\n  - With `-d/--dedupe`, after listing `.dups` directories, prompts `y/N` to delete them\n  - Used alone (or with only `DIRECTORY`), recursively gathers existing `.dups` directories, lists them, and prompts `y/N` to delete them\n  - When combined with other non-dedupe options, this flag is a no-op\n- `--move-dups PATH` / `--move-dups=PATH`\n  - Standalone mode (optionally scoped by `DIRECTORY`) that recursively finds existing `.dups` directories and moves files from them under `PATH` using original relative paths\n  - Copying the resulting archive tree back onto the source tree restores duplicates (plus sidecars, when present)\n\n### .NET BLAKE3 backend selection\n\n- Default backend is CPU.\n- Override backend with environment variable `LSHASH_BLAKE3_BACKEND`:\n  - `cpu` (default)\n  - `gpu`\n- If GPU backend initialization or hashing fails at runtime, the process falls back to CPU BLAKE3 for the remainder of that run.\n- Optional GPU chunk budget override:\n  - `LSHASH_BLAKE3_GPU_MAX_CHUNKS` (positive integer)\n  - Default: `1048576` (`1 \u003c\u003c 20`)\n\n### .NET performance tuning environment variables\n\n- `LSHASH_DIAGNOSTICS=1` enables tuning diagnostics output.\n- Network filesystems (for example `cifs`, `smb3`, `nfs`) auto-enable diagnostics even without `LSHASH_DIAGNOSTICS`.\n- `LSHASH_HASH_WORKERS=\u003cN\u003e` pins a fixed worker count (disables adaptive worker tuning).\n- `LSHASH_READ_BUFFER_KB=\u003cN\u003e` sets read buffer size for sequential hashing.\n\n### .NET examples\n\n```bash\ndotnet/dist/linux-x64/lshash -q\ndotnet/dist/linux-x64/lshash -rq /path/to/scan\ndotnet/dist/linux-x64/lshash -r -d shorter -q\ndotnet/dist/linux-x64/lshash --directory                # no-op without -d\ndotnet/dist/linux-x64/lshash -d shorter --directory\ndotnet/dist/linux-x64/lshash -d shorter --global\ndotnet/dist/linux-x64/lshash -r -d shorter --global\ndotnet/dist/linux-x64/lshash -d shorter --prompt-delete\ndotnet/dist/linux-x64/lshash --prompt-delete\ndotnet/dist/linux-x64/lshash --prompt-delete /path/to/scan\ndotnet/dist/linux-x64/lshash --move-dups /path/to/archive\ndotnet/dist/linux-x64/lshash --move-dups=/path/to/archive /path/to/scan\n```\n\n## Usage\n\n```bash\n./lshash.sh [--algorithm NAME] [-r|--recursive] [-e PATTERN] [--exclude PATTERN] [-d [MODE]] [--directory] [--global] [--prompt-delete] [--move-dups PATH] [-q|--quiet] [DIRECTORY]\n```\n\n## macOS execution quick guide\n\n### Bash implementation (native, including Catalina)\n\n```bash\ncd /path/to/lshash\nchmod +x ./lshash.sh\n./lshash.sh --algorithm sha256 -r /path/to/scan\n```\n\n### .NET implementation on modern macOS (native)\n\n```bash\ncd dotnet\n./build-macos.sh\n./dist/osx-arm64/lshash --help   # Apple Silicon\n./dist/osx-x64/lshash --help     # Intel\n```\n\n### .NET implementation on macOS Catalina (Docker Desktop)\n\n```bash\ncd dotnet/deploy/macos\n./deploy.sh build\n./deploy.sh audit /path/to/scan\n./deploy.sh cull /path/to/scan\n```\n\n## Options\n\n- `--algorithm NAME`\n  - Hash algorithm: `blake3`, `sha256`, `sha512`, `sha1`, `md5`, `blake2`\n- `-r`, `--recursive`\n  - Include files in subdirectories\n  - Hidden `.dups/` directories are skipped by default\n  - Output is emitted progressively per directory encountered during traversal\n- `-e PATTERN`\n- `--exclude PATTERN`\n- `--exclude=PATTERN`\n  - Exclude files matching glob pattern (repeatable)\n  - Built-in exclusions are always active (for example `.dups` traversal skip, `.lshash-exclude`, `.git/.hg/.svn`, `.gitignore`, `.mdexplore-*.json`, `*.lshash.json`, and common temp/editor files)\n- `-d [MODE]`, `--dedupe [MODE]`, `--dedup [MODE]`\n- `-d=MODE`, `--dedupe=MODE`, `--dedup=MODE`\n  - Dedupe files with identical hash in the same directory\n  - Valid `MODE` values: `newer`, `older`, `shorter`, `longer`\n  - Default mode when omitted: `shorter`\n  - `shorter` / `longer` compare full root-relative path length (directory path + basename), not basename-only length\n- `--directory` (alias: `--all-directory`)\n  - With `-d/--dedupe`, uses full-directory hash grouping instead of contiguous-neighbor grouping\n  - Without `-d/--dedupe`, no-op\n- `--global`\n  - With `-d/--dedupe` and `-r/--recursive`, dedupes by hash across all scanned files in the recursive tree (not per-directory)\n  - With `-d/--dedupe` without `-r/--recursive`, behaves like `--directory` for the selected directory\n  - In recursive global mode (`-r -d --global`), each moved duplicate gets a sidecar metadata JSON file `\u003cmoved-file\u003e.lshash.json` in `.dups/` describing duplicate peers (full paths) and statuses (`kept`/`moved`)\n  - In dedupe mode, any directory containing `.lshash-exclude` is skipped with descendants\n  - Without `-d/--dedupe`, no-op\n- `--prompt-delete`\n  - With `-d/--dedupe`, after printing `.dups` directory paths, prompts `y/N` to delete them\n  - Used alone (or with only `DIRECTORY`), recursively gathers existing `.dups` directories, lists them, and prompts `y/N` to delete them\n  - When combined with other non-dedupe options, no-op\n- `--move-dups PATH` / `--move-dups=PATH`\n  - Standalone mode (optionally scoped by `DIRECTORY`) that recursively finds existing `.dups` directories and moves files from them under `PATH` using original relative paths\n  - Copying that archive tree back over the source tree restores duplicates (plus sidecars, when present)\n- `-q`, `--quiet`\n  - Only print duplicate lines (the lines that would be highlighted green in normal output)\n  - Works with and without dedupe, and with and without recursive mode\n- `DIRECTORY` (optional positional argument)\n  - Scan this directory instead of the current working directory\n  - Output paths remain relative to the selected directory root\n- One-letter short switches are stackable in any order (for example `-rd`, `-dr`, `-rq`, `-re '*.log'`).\n\n## Output formatting\n\n- Hash values are left-justified in a single aligned column.\n- If the previous listed file has the same hash, the current hash is shown in green.\n- When dedupe moves a file, the file name is italicized and annotated:\n  - `(moved to .dups/)`\n- Completion summary reports duplicate count and percentage of scanned files.\n- With `-r/--recursive`, summary also reports directories traversed.\n- With `-d/--dedupe`, summary wording changes to duplicates \"found and moved\".\n\n## Dedupe behavior\n\nWhen dedupe is enabled:\n\n- Primary use case: remove copy/restore/merge artifacts where duplicate files usually sort next to each other (for example names containing `(copy)`, version suffixes, or sync-conflict tags).\n- Duplicate groups are determined by contiguous same-hash blocks in alphabetical listing order within each directory.\n- Files that cannot be hashed are skipped for block matching, so they do not break a contiguous duplicate block among hashable neighbors.\n- Genuine executable program files are excluded from dedupe matching and never moved (requires execute permission plus program/script detection, for example MIME types such as `application/x-pie-executable` or `text/x-shellscript`; shebang scripts are also treated as executable programs even if MIME resolves to `text/plain`; many file managers show these as `Program`).\n- One file is kept in place based on selected mode.\n- All other duplicates in that directory are moved to that directory's `.dups/` subdirectory.\n- In recursive mode, dedupe is still per directory encountered during traversal.\n- Tie-breaking rule: first file in sorted listing order is kept.\n- If a destination name already exists in `.dups/`, a `.dupN` suffix is added.\n- `--directory` provides a more thorough filename-blind mode that checks duplicates across the full directory. It only takes effect when used with `-d/--dedupe`.\n- `--global` extends dedupe scope across the full recursive tree when combined with `-d` and `-r`, and writes provenance JSON sidecars (`\u003cmoved-file\u003e.lshash.json`) for moved files.\n\n### Dedupe scope matrix\n\n| Flags | Duplicate scope | Grouping method | Moved file destination | Sidecar metadata |\n| --- | --- | --- | --- | --- |\n| `-d` | Per directory | Contiguous same-hash runs in sorted filename order | Same directory `.dups/` | No |\n| `-d --directory` | Per directory | Full-directory hash grouping (filename adjacency ignored) | Same directory `.dups/` | No |\n| `-d --global` | Selected directory only | Full-directory hash grouping (same as `--directory`) | Same directory `.dups/` | No |\n| `-d -r --global` | Full recursive tree | Whole-tree hash grouping across directories | Each file's own source directory `.dups/` | Yes (`.lshash.json`) |\n\n### Global mode metadata (`\u003cmoved-file\u003e.lshash.json`)\n\nIn recursive `--global` mode (`-r -d --global`), every moved duplicate gets a sidecar metadata file next to it in `.dups/`:\n\n- Name: `\u003cmoved-file\u003e.lshash.json`\n- Location: same `.dups/` directory as the moved file\n- Purpose: explain the duplicate set peers and which file was kept vs moved\n\nJSON structure:\n\n```json\n{\n  \"hash\": \"\u003chash\u003e\",\n  \"dedupeMode\": \"shorter\",\n  \"subject\": {\n    \"path\": \"/abs/path/to/dir/.dups/file.ext\",\n    \"status\": \"moved\"\n  },\n  \"others\": [\n    {\n      \"path\": \"/abs/path/to/kept/file.ext\",\n      \"status\": \"kept\"\n    },\n    {\n      \"path\": \"/abs/path/to/another/dir/.dups/file2.ext\",\n      \"status\": \"moved\"\n    }\n  ]\n}\n```\n\n## Dedupe flow diagrams\n\nTechnical flow diagrams are maintained in `ARCHITECTURE.md`.\n\n### Strategy summary\n\n- Default (`-d`): optimized for copy/restore/merge artifacts where duplicate names are often alphabetically adjacent.\n- `--directory` with `-d`: more thorough and filename-blind dedupe across the entire directory.\n- `--directory` without `-d`: no-op (normal non-dedupe listing behavior).\n- `--global` with `-d -r`: cross-directory, whole-tree hash dedupe with per-moved-file metadata JSON.\n- `--global` with `-d` (no `-r`): equivalent dedupe scope to `--directory` on the selected directory and does not emit sidecar JSON.\n\n## Examples\n\n### Basic listing (default BLAKE3)\n\n```bash\n./lshash.sh\n```\n\n### Use SHA-256\n\n```bash\n./lshash.sh --algorithm sha256\n```\n\n### Recursive listing\n\n```bash\n./lshash.sh -r\n```\n\n### Exclude multiple patterns\n\n```bash\n./lshash.sh -r -e '*.log' --exclude '*.tmp' --exclude='build/*'\n```\n\n### Dedupe with default mode (`shorter`)\n\n```bash\n./lshash.sh -d\n```\n\n### Dedupe and keep newest file\n\n```bash\n./lshash.sh -r --dedupe newer\n```\n\n### Dedupe and keep longest full relative path\n\n```bash\n./lshash.sh --dedupe=longer\n```\n\n### Global dedupe in one directory (non-recursive)\n\n```bash\n./lshash.sh -d shorter --global /path/to/scan\n```\n\nThis uses full-directory hash grouping for that single directory (same scope behavior as `--directory`) and does not write sidecar metadata.\n\n### Global dedupe across full recursive tree\n\n```bash\n./lshash.sh -r -d shorter --global /path/to/scan\n```\n\nThis compares hashable files across all directories in the tree, moves losers to each file's local `.dups/`, and writes `\u003cmoved-file\u003e.lshash.json` sidecars.\n\n### Global dedupe with a different keep policy\n\n```bash\n./lshash.sh -r --dedupe newer --global /path/to/scan\n```\n\nIn each duplicate set, the newest file is kept in place and all others are moved to their source-directory `.dups/` folders.\n\n### Inspect generated sidecar metadata\n\n```bash\nfind /path/to/scan -path '*/.dups/*.lshash.json' -maxdepth 6 -print\ncat /path/to/scan/some/dir/.dups/example.txt.lshash.json\n```\n\nIf `jq` is available:\n\n```bash\njq . /path/to/scan/some/dir/.dups/example.txt.lshash.json\n```\n\n### Only show duplicate lines\n\n```bash\n./lshash.sh -q\n./lshash.sh -rq /path/to/scan\n```\n\n### Prompt-delete garbage collection mode\n\n```bash\n./lshash.sh --prompt-delete\n./lshash.sh --prompt-delete /path/to/scan\n```\n\n### Rehydrate duplicates from existing `.dups` into an archive tree\n\n```bash\n./lshash.sh --move-dups /path/to/archive\n./lshash.sh --move-dups=/path/to/archive /path/to/scan\n```\n\n### Summary message examples (hypothetical)\n\nThese examples use made-up file sets to show how the completion summary text changes by mode.\n\n#### 1. Audit pass (no `-d`): duplicates found\n\nHypothetical files in one directory:\n\n```text\na.txt         (content: same)\nb.txt         (content: same)\nc.txt         (content: different)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256\n```\n\nExpected output shape:\n\n```text\na.txt  \u003chash-A\u003e\nb.txt  \u003chash-A\u003e\nc.txt  \u003chash-C\u003e\nSummary: scanned 3 file(s); 1 duplicate file(s) were found (33.33% of scanned files).\n```\n\n#### 2. Recursive audit (`-r`, no `-d`): adds traversed directories\n\nHypothetical tree:\n\n```text\n./a.txt             (content: same)\n./b.txt             (content: same)\n./sub/c.txt         (content: unique)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 -r\n```\n\nExpected output shape:\n\n```text\na.txt      \u003chash-A\u003e\nb.txt      \u003chash-A\u003e\nsub/c.txt  \u003chash-C\u003e\nSummary: scanned 3 file(s); 1 duplicate file(s) were found (33.33% of scanned files); 2 directories were traversed.\n```\n\n#### 3. Cull pass (`-d`): duplicates found and moved\n\nHypothetical files in one directory:\n\n```text\na.txt         (content: same)\naa.txt        (content: same)\naaa.txt       (content: same)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 -d shorter\n```\n\nExpected output shape:\n\n```text\na.txt                         \u003chash-A\u003e\naa.txt (moved to .dups/)      \u003chash-A\u003e\naaa.txt (moved to .dups/)     \u003chash-A\u003e\nSummary: scanned 3 file(s); 2 duplicate file(s) were found and moved (66.66% of scanned files).\n```\n\nExpected result on disk:\n\n```text\n.dups/aa.txt\n.dups/aaa.txt\n```\n\n#### 4. Audit pass with no duplicates: zero percentage\n\nHypothetical files in one directory:\n\n```text\na.txt         (content: alpha)\nb.txt         (content: bravo)\nc.txt         (content: charlie)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256\n```\n\nExpected output shape:\n\n```text\na.txt  \u003chash-A\u003e\nb.txt  \u003chash-B\u003e\nc.txt  \u003chash-C\u003e\nSummary: scanned 3 file(s); 0 duplicate file(s) were found (0.00% of scanned files).\n```\n\n#### 5. Recursive cull (`-r -d`): moved count plus traversed directories\n\nHypothetical tree:\n\n```text\n./a.txt            (content: same)\n./aa.txt           (content: same)\n./sub/p.txt        (content: same)\n./sub/pp.txt       (content: same)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 -r -d shorter\n```\n\nExpected output shape:\n\n```text\na.txt                         \u003chash-A\u003e\naa.txt (moved to .dups/)      \u003chash-A\u003e\nsub/p.txt                     \u003chash-P\u003e\nsub/pp.txt (moved to .dups/)  \u003chash-P\u003e\nSummary: scanned 4 file(s); 2 duplicate file(s) were found and moved (50.00% of scanned files); 2 directories were traversed.\n```\n\n#### 6. `--directory` without `-d`: modifier no-op\n\nHypothetical files in one directory (non-adjacent duplicate content):\n\n```text\na-copy.txt         (content: same)\nm-middle.txt       (content: unique)\nz-sync.txt         (content: same)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 --directory\n```\n\nExpected output shape:\n\n```text\na-copy.txt  \u003chash-S\u003e\nm-middle.txt  \u003chash-M\u003e\nz-sync.txt  \u003chash-S\u003e\nSummary: scanned 3 file(s); 0 duplicate file(s) were found (0.00% of scanned files).\n```\n\n#### 7. `--directory` with `-d`: non-adjacent duplicates moved\n\nUse the same hypothetical files as example 6.\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 -d shorter --directory\n```\n\nExpected output shape:\n\n```text\na-copy.txt                      \u003chash-S\u003e\nm-middle.txt                    \u003chash-M\u003e\nz-sync.txt (moved to .dups/)    \u003chash-S\u003e\nSummary: scanned 3 file(s); 1 duplicate file(s) were found and moved (33.33% of scanned files).\n```\n\n#### 8. Quiet mode (`-q`) still prints summary\n\nHypothetical files in one directory:\n\n```text\na.txt         (content: same)\nb.txt         (content: same)\nc.txt         (content: unique)\n```\n\nCommand:\n\n```bash\n./lshash.sh --algorithm sha256 -q\n```\n\nExpected output shape:\n\n```text\nb.txt  \u003chash-A\u003e\nSummary: scanned 3 file(s); 1 duplicate file(s) were found (33.33% of scanned files).\n```\n\n## Notes\n\n- Dedupe moves files; it does not delete them.\n- Review output carefully before running dedupe on important directories.\n\n## Troubleshooting\n\n### Default run seems slow or pauses\n\n- First run with `blake3` may try to auto-install `b3sum` if missing.\n- Use another algorithm immediately:\n\n```bash\n./lshash.sh --algorithm sha256\n```\n\n- Reduce install wait time:\n\n```bash\nLSHASH_INSTALL_TIMEOUT=5 ./lshash.sh\n```\n\n### `b3sum` not found\n\n- Install it manually, or use another algorithm.\n- Example fallback:\n\n```bash\n./lshash.sh --algorithm sha512\n```\n\n### Permission or file access errors\n\n- If a file cannot be read (for hash or metadata), the tool prints a warning and continues.\n- Output for those files shows `\u003chash unavailable\u003e`.\n- In dedupe mode, inaccessible files are ignored for contiguous block matching; hashable neighbors can still form a duplicate block across them.\n\n### Permission issues during auto-install\n\n- Auto-install uses non-interactive sudo (`sudo -n`) and will fail fast if credentials are not already available.\n- Fix by installing `b3sum` manually or run with a different algorithm.\n\n### Dedupe did not move files as expected\n\n- Dedupe only groups contiguous same-hash neighbors (in alphabetical listing order) within the same directory.\n- With `-r`, grouping is still per directory, not across the entire tree.\n- For cross-directory dedupe across the full tree, use `-r -d --global`.\n- Confirm mode selection:\n  - `newer` keeps newest\n  - `older` keeps oldest\n  - `shorter` keeps shortest full root-relative path (default)\n  - `longer` keeps longest full root-relative path\n\n### Quiet mode printed nothing\n\n- `-q/--quiet` only prints duplicate lines (green lines in normal mode).\n- If no adjacent duplicate hashes are encountered in listing order, quiet output will be empty.\n\n### Unexpected shell warnings about current directory\n\n- If your shell says it cannot access the current directory (`getcwd` warnings), your working directory may have been deleted.\n- Change into a valid directory before running again:\n\n```bash\ncd /home/npepin/Projects/lshash\n```\n\n### macOS error: `cannot make pipe for process substitution: Too many open files`\n\n- Recent versions of the script avoid high-frequency process substitutions in recursive traversal/sorting paths to prevent file-descriptor exhaustion on legacy macOS Bash.\n- If you still see this, ensure you are running the latest script revision from this repository.\n\n## FAQ\n\n### How do I run a simple hash listing in the current directory?\n\n```bash\n./lshash.sh\n```\n\n### How do I scan a different directory?\n\n```bash\n./lshash.sh /path/to/scan\n./lshash.sh -rq /path/to/scan\n```\n\n### How do I recurse but skip common noise directories and file types?\n\n```bash\n./lshash.sh -r -e '.git/*' -e '.dups/*' -e 'node_modules/*' -e '*.log' -e '*.tmp'\n```\n\n### How do I use a non-BLAKE3 algorithm quickly?\n\n```bash\n./lshash.sh --algorithm sha256\n```\n\n### How do I dedupe recursively and keep the newest file in each duplicate set?\n\n```bash\n./lshash.sh -r --dedupe newer\n```\n\n### How do I dedupe across the entire recursive tree (not per-directory)?\n\n```bash\n./lshash.sh -r -d shorter --global /path/to/scan\n```\n\nFor each moved duplicate in recursive global mode (`-r -d --global`), a sidecar file `\u003cmoved-file\u003e.lshash.json` is created in `.dups/` with peer paths and `kept`/`moved` status.\n\n### How do I dedupe but keep the shortest full relative path instead?\n\n```bash\n./lshash.sh -d\n```\n\nThis uses `shorter`, which compares full root-relative path length.\n\n### How do I archive existing `.dups` content without re-running dedupe?\n\n```bash\n./lshash.sh --move-dups /path/to/archive\n./lshash.sh --move-dups=/path/to/archive /path/to/scan\n```\n\n`--move-dups` writes files under the archive using original relative paths, so copying that archive tree back over the source tree restores duplicates (plus sidecars, when present).\n\n### Where do moved duplicates go?\n\n- Duplicates are moved into a hidden `.dups/` subdirectory under the same directory where the duplicate was found.\n\n### What if I want dedupe aliases?\n\n- All of these are accepted:\n  - `--dedupe`\n  - `--dedup`\n  - `-d`\n\n## Regression tests\n\nRun the parity/regression checks (Bash + .NET):\n\n```bash\nchmod +x tests/regression.sh\n./tests/regression.sh\n```\n\nFor hash-algorithm rationale and comparison notes, see the BLAKE3 appendix in `ARCHITECTURE.md`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fngpepin%2Flshash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fngpepin%2Flshash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fngpepin%2Flshash/lists"}