{"id":17219315,"url":"https://github.com/eminence/deduprs","last_synced_at":"2025-03-25T14:41:57.391Z","repository":{"id":29644892,"uuid":"33186389","full_name":"eminence/deduprs","owner":"eminence","description":"Hardlink deduplication tool for Linux","archived":false,"fork":false,"pushed_at":"2022-06-17T15:52:44.000Z","size":15,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-30T13:26:12.608Z","etag":null,"topics":["dedup","deduplication","hard-link","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eminence.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-31T13:17:01.000Z","updated_at":"2023-02-25T20:01:45.000Z","dependencies_parsed_at":"2022-09-03T18:01:21.172Z","dependency_job_id":null,"html_url":"https://github.com/eminence/deduprs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eminence%2Fdeduprs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eminence%2Fdeduprs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eminence%2Fdeduprs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eminence%2Fdeduprs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eminence","download_url":"https://codeload.github.com/eminence/deduprs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2
45484599,"owners_count":20623115,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dedup","deduplication","hard-link","rust"],"created_at":"2024-10-15T03:49:36.685Z","updated_at":"2025-03-25T14:41:57.366Z","avatar_url":"https://github.com/eminence.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Deduplication tool\n\nThis tool can deduplicate multiple directories that share the same directory structure.  Deduplication is done by hard-linking.\n(While hard-linking is supported on Windows, this tool has only been tested on Linux.)\n\n\n## How it works\n\nImagine you had the following directory structure:\n\n```\n./foo\n|-- one/\n|   |-- a.txt\n|   `-- b.txt\n`-- c.txt\n```\n\nwhich was replicated in several different directories, for example `testA/`, `testB/` and `testC/`.  The directories passed in on the\ncommand line are called \"root directories\" or \"roots\".  \n\n\nIf you run `./dedup test*`, the first directory will be used as the primary tree to be walked.  For each file in it,\ndedup.rs will check to see if it exists in the other roots.  For example, do `testB/foo/one/a.txt` and `testC/foo/one/a.txt` exist?\n\nEach file that exists in multiple roots will be compared by content.  Any files that are identical will be hard-linked together.  
\n\nTo create the links, first a new link is created as a temporary file, and then the temporary file is moved on top of the real file.\n\n## Important note\n\nWhen using hard-linking for deduplication, it's important to remember that editing a file changes its contents under *every* path that links to\nit.  This can be very surprising in some situations.  Thus it is strongly recommended that once a folder is deduplicated, it be marked\nas read-only, and never written to.  \n\n\n## Details\n\nTo be considered for deduplication, a file must exist in at least 2 of the roots.  It need not exist in every root.\n\nGiven a set of files with the same name, the set is first partitioned based on content.  As an example, if you had 5 files total, and files 1 and 2 were the same, and files 3 and 4 were the same, and file 5 was different from everything else, then files 1 and 2 would be hard-linked and files 3 and 4 would be hard-linked.  \n\nAs an optimization, files whose mtimes or file sizes differ are never considered the same.  If both match, the files are then\nhashed with [xxHash](https://github.com/Cyan4973/xxHash).  \n\nGiven a set of files that have all been confirmed to be the same, the file with the greatest number of hard links is considered to be\nthe \"master\".  All other files are then linked to point to this master file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feminence%2Fdeduprs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feminence%2Fdeduprs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feminence%2Fdeduprs/lists"}