{"id":17526837,"url":"https://github.com/erikreed/pydupes","last_synced_at":"2025-03-06T06:30:55.452Z","repository":{"id":43685840,"uuid":"430306211","full_name":"erikreed/pydupes","owner":"erikreed","description":"A duplicate file finder like rdfind/fdupes et al that may be faster in environments with millions of files and terabytes of data or over high latency filesystems (e.g. NFS).","archived":false,"fork":false,"pushed_at":"2023-09-21T19:54:12.000Z","size":44,"stargazers_count":3,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-04-23T19:38:49.343Z","etag":null,"topics":["duplicate-detection","duplication","files"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erikreed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-21T07:57:25.000Z","updated_at":"2023-09-08T18:28:51.000Z","dependencies_parsed_at":"2023-09-22T02:44:46.850Z","dependency_job_id":"9fadb4c8-7b8f-45f2-a94b-6ca67372332c","html_url":"https://github.com/erikreed/pydupes","commit_stats":{"total_commits":22,"total_committers":1,"mean_commits":22.0,"dds":0.0,"last_synced_commit":"c153774a6f0bac6d616e6781e9229865286ddc24"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikreed%2Fpydupes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikreed%2Fpydupes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/erikreed%2Fpydupes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erikreed%2Fpydupes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erikreed","download_url":"https://codeload.github.com/erikreed/pydupes/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242161426,"owners_count":20081869,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duplicate-detection","duplication","files"],"created_at":"2024-10-20T15:02:33.521Z","updated_at":"2025-03-06T06:30:55.441Z","avatar_url":"https://github.com/erikreed.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"`pydupes` is yet another duplicate file finder like rdfind/fdupes et al\nthat may be faster in environments with millions of files and terabytes\nof data or over high latency filesystems (e.g. 
NFS).\n\n[![PyPI version](https://badge.fury.io/py/pydupes.svg)](https://badge.fury.io/py/pydupes)\n\n-------------------\n\nThe algorithm is similar to [rdfind](https://github.com/pauldreik/rdfind) with threading and consolidation of\nfiltering logic (instead of separate passes).\n- traverse the input paths, collecting the inodes and file sizes\n- for each set of files with the same size:\n  - further split by matching 4KB on beginning/ends of files\n  - for each non-unique (by size, boundaries) candidate set, compute the sha256 and emit pairs with matching hash\n\nConstraints:\n- traversals do not span multiple devices\n- symlink following not implemented\n- concurrent modification of a traversed directory could produce false duplicate pairs \n(modification after hash computation)\n\n## Setup\n```bash\n# via pip\npip3 install --user --upgrade pydupes\n\n# or simply if pipx installed:\npipx run pydupes --help\n```\n\n## Usage\n\n```bash\n# Collect counts and stage the duplicate files, null-delimited source-target pairs:\npydupes /path1 /path2 --progress --output dupes.txt\n\n# Sanity check a hardlinking of all matches:\nxargs -0 -n2 echo ln --force --verbose \u003c dupes.txt\n```\n\n## Benchmarks\nHardware is a 6 spinning disk RAID5 ext4 with\n250GB memory, Ubuntu 18.04. 
Peak memory and runtimes via:\n```/usr/bin/time -v```.\n\n### Dataset 1:\n- Directories: ~33k\n- Files: ~14 million, 1 million duplicate\n- Total size: ~11TB, 300GB duplicate\n\n#### pydupes\n- Elapsed (wall clock) time (h:mm:ss or m:ss): 39:04.73\n- Maximum resident set size (kbytes): 3356936 (~3GB)\n```\nINFO:pydupes:Traversing input paths: ['/raid/erik']\nINFO:pydupes:Traversal time: 209.6s\nINFO:pydupes:Cursory file count: 14416742 (10.9TiB), excluding symlinks and dupe inodes\nINFO:pydupes:Directory count: 33376\nINFO:pydupes:Number of candidate groups: 720263\nINFO:pydupes:Size filter reduced file count to: 14114518 (7.3TiB)\nINFO:pydupes:Comparison time: 2134.6s\nINFO:pydupes:Total time elapsed: 2344.2s\nINFO:pydupes:Number of duplicate files: 936948\nINFO:pydupes:Size of duplicate content: 304.1GiB\n```\n\n#### rdfind\n- Elapsed (wall clock) time (h:mm:ss or m:ss): 1:57:20\n- Maximum resident set size (kbytes): 3636396 (~3GB)\n```\nNow scanning \"/raid/erik\", found 14419182 files.\nNow have 14419182 files in total.\nRemoved 44 files due to nonunique device and inode.\nNow removing files with zero size from list...removed 2396 files\nTotal size is 11961280180699 bytes or 11 TiB\nNow sorting on size:removed 301978 files due to unique sizes from list.14114764 files left.\nNow eliminating candidates based on first bytes:removed 8678999 files from list.5435765 files left.\nNow eliminating candidates based on last bytes:removed 3633992 files from list.1801773 files left.\nNow eliminating candidates based on md5 checksum:removed 158638 files from list.1643135 files left.\nIt seems like you have 1643135 files that are not unique\nTotally, 304 GiB can be reduced.\n```\n\n#### fdupes\nNote that this isn't a fair comparison since fdupes additionally performs a byte-by-byte comparison on\nMD5 match. 
Invoked with \"fdupes --size --summarize --recurse --quiet\".\n- Elapsed (wall clock) time (h:mm:ss or m:ss): 2:58:32\n- Maximum resident set size (kbytes): 3649420 (~3GB)\n```\n939588 duplicate files (in 705943 sets), occupying 326547.7 megabytes\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikreed%2Fpydupes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferikreed%2Fpydupes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferikreed%2Fpydupes/lists"}