{"id":13558373,"url":"https://github.com/kornelski/dupe-krill","last_synced_at":"2025-04-12T15:38:18.417Z","repository":{"id":49142099,"uuid":"86804562","full_name":"kornelski/dupe-krill","owner":"kornelski","description":"A fast file deduplicator","archived":false,"fork":false,"pushed_at":"2023-09-04T15:14:24.000Z","size":118,"stargazers_count":185,"open_issues_count":3,"forks_count":10,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-10-10T20:19:13.541Z","etag":null,"topics":["dedupe","dupes","file-deduplication","hardlinks","macos","rust-library"],"latest_commit_sha":null,"homepage":"https://lib.rs/dupe-krill","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kornelski.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-31T09:52:49.000Z","updated_at":"2024-09-03T19:55:25.000Z","dependencies_parsed_at":"2024-01-12T23:43:06.628Z","dependency_job_id":"b230ae88-7245-4e85-b1fa-42afde4e4b35","html_url":"https://github.com/kornelski/dupe-krill","commit_stats":{"total_commits":86,"total_committers":5,"mean_commits":17.2,"dds":0.08139534883720934,"last_synced_commit":"a0ff2939ea4110d5d22fb034207e2bbf5b492f9e"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kornelski%2Fdupe-krill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kornelski%2Fdupe-krill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kornelski%2Fdupe-krill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/kornelski%2Fdupe-krill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kornelski","download_url":"https://codeload.github.com/kornelski/dupe-krill/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248590570,"owners_count":21129850,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dedupe","dupes","file-deduplication","hardlinks","macos","rust-library"],"created_at":"2024-08-01T12:04:55.067Z","updated_at":"2025-04-12T15:38:18.390Z","avatar_url":"https://github.com/kornelski.png","language":"Rust","funding_links":[],"categories":["Rust","macos","Other"],"sub_categories":[],"readme":"# Dupe k*r*ill — a fast file deduplicator\n\nReplaces files that have identical content with hardlinks, so that file data of all copies is stored only once, saving disk space. Useful for reducing sizes of multiple backups, messy collections of photos and music, countless copies of `node_modules`, macOS app bundles, and anything else that's usually immutable (since all hardlinked copies of a file will change when any one of them is changed).\n\n## Features\n\n* It's very fast and reasonably memory-efficient.\n* Deduplicates incrementally as soon as duplicates are found.\n* Replaces files atomically and it's safe to interrupt at any time.\n* Proven to be reliable. 
Used for years without an issue.\n* It's aware of existing hardlinks and supports merging of multiple groups of hardlinks.\n* Gracefully handles symlinks and special files.\n\n## Usage\n\n[Download binaries from the releases page](https://github.com/kornelski/dupe-krill/releases).\n\nWorks on macOS and Linux. Windows is not supported.\n\nIf you have the [latest stable Rust](https://www.rust-lang.org/) (1.42+), build the program either with `cargo install dupe-krill` or by cloning this repo and running `cargo build --release`.\n\n```sh\ndupe-krill -d \u003cfiles or directories\u003e # find dupes without doing anything\ndupe-krill \u003cfiles or directories\u003e # find and replace with hardlinks\n```\n\nSee `dupe-krill -h` for details.\n\n### Output\n\nIt prints one duplicate per line. It prints *both* paths on the same line with the difference between them highlighted as `{first =\u003e second}`.\n\nProgress shows:\n\n\u003e `\u003cnumber unique file bodies\u003e`+`\u003cnumber of hardlinks\u003e` dupes. `\u003cfiles checked\u003e`+`\u003cfiles skipped\u003e` files scanned.\n\nSymlinks, special device files, and 0-sized files are always skipped.\n\nDon't try to parse the program's usual output. Add the `--json` option if you want machine-readable output. You can also use this program as a Rust library for seamless integration.\n\n## How does hardlinking work?\n\nFiles are deduplicated by making a hardlink. They're not deleted. Instead, literally the same file will exist in two or more directories at once. Unlike symlinks, the hardlinks behave like real files. Deleting one of the hardlinks leaves the other hardlinks unchanged. Editing a hardlinked file edits it in all places at once (except in some applications that delete \u0026 create a new file, instead of overwriting existing files). 
Hardlinking will make all duplicates of a file have the same file permissions.\n\nThis program will only deduplicate files larger than a single disk block (4KB, usually), because in many filesystems hardlinking tiny files may not actually save space. You can add the `-s` flag to dedupe small files, too.\n\n### Nerding out about the fast deduplication algorithm\n\nIn short: it uses Rust's standard library `BTreeMap` for deduplication, but with a twist that allows it to compare files lazily, reading only as little file content as necessary.\n\n----\n\nTheoretically, you could find all duplicate files by putting them in a giant hash table aggregating file paths and using file content as the key:\n\n```rust\nHashMap\u003cVec\u003cu8\u003e, Vec\u003cPath\u003e\u003e\n```\n\nbut of course that would use ludicrous amounts of memory. You can fix it by using hashes of the content instead of the content itself.\n\n\u003e BTW, I can't stress enough how mind-bogglingly improbable accidental cryptographic hash collisions are. It's not just \"you're probably safe if you're lucky\". It's \"creating this many files would take more energy than our civilisation has ever produced in all of its history\".\n\n```rust\nHashMap\u003c[u8; 16], Vec\u003cPath\u003e\u003e\n```\n\nbut that's still pretty slow, since you still read the entire content of all the files. You can save some work by comparing file sizes first:\n\n```rust\nHashMap\u003cu64, HashMap\u003c[u8; 20], Vec\u003cPath\u003e\u003e\u003e\n```\n\nbut it helps only a little, since files with identical sizes are surprisingly common. 
You can eliminate a few more near-duplicates by comparing only the beginnings of the files first:\n\n```rust\nHashMap\u003cu64, HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], Vec\u003cPath\u003e\u003e\u003e\u003e\n```\n\nand then maybe compare only the ends, and maybe a few more fragments in the middle, etc.:\n\n```rust\nHashMap\u003cu64, HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], Vec\u003cPath\u003e\u003e\u003e\u003e\u003e\nHashMap\u003cu64, HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], HashMap\u003c[u8; 20], …\u003e\u003e\u003e\u003e\u003e\u003e\n```\n\nThese endlessly nested hashmaps can be generalized. `BTreeMap` doesn't need to see the whole key at once. It only compares keys with each other, and the comparison can be done incrementally, by only reading enough of the file to show that its key is unique, without even knowing the full key.\n\n```rust\nBTreeMap\u003cLazilyHashing\u003cFile\u003e, Vec\u003cPath\u003e\u003e\n```\n\nAnd that's what this program does (and a bit of wrangling with inodes).\n\nAll the heavy lifting of deduplication is done by Rust's standard library `BTreeMap` and overloaded `\u003c`/`\u003e` operators that incrementally hash the files (yes, operator overloading that does file I/O is a brilliant idea. I couldn't use `\u003c\u003c`, unfortunately).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkornelski%2Fdupe-krill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkornelski%2Fdupe-krill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkornelski%2Fdupe-krill/lists"}