{"id":21861607,"url":"https://github.com/elcorto/findsame","last_synced_at":"2025-09-06T06:37:13.356Z","repository":{"id":57429608,"uuid":"77504222","full_name":"elcorto/findsame","owner":"elcorto","description":"Find duplicate files and directories based on file hashes.","archived":false,"fork":false,"pushed_at":"2021-12-26T12:38:54.000Z","size":255,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T08:03:03.454Z","etag":null,"topics":["duplicate-detection","duplicate-files","duplicatefilefinder","file-hashing","merkletree","multiprocessing","multithreading","python"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/findsame","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elcorto.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-12-28T05:01:06.000Z","updated_at":"2023-12-19T01:37:32.000Z","dependencies_parsed_at":"2022-08-26T02:41:24.556Z","dependency_job_id":null,"html_url":"https://github.com/elcorto/findsame","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcorto%2Ffindsame","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcorto%2Ffindsame/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcorto%2Ffindsame/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcorto%2Ffindsame/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elcorto","download_url":"
https://codeload.github.com/elcorto/findsame/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248948111,"owners_count":21187809,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duplicate-detection","duplicate-files","duplicatefilefinder","file-hashing","merkletree","multiprocessing","multithreading","python"],"created_at":"2024-11-28T03:12:11.155Z","updated_at":"2025-04-14T19:40:43.774Z","avatar_url":"https://github.com/elcorto.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"About\n=====\n\nFind duplicate files and directories.\n\nLike other tools, we use file hashes. Additionally, we report duplicate\ndirectories, using a Merkle tree for directory hash calculation.\n\nTo increase performance, we use\n\n* parallel hash calculation (`-t/--nthreads` option), see Benchmarks below\n* optional limits on data to be hashed (`-l/--limit` option)\n\nInstall\n=======\n\nFrom PyPI:\n\n```sh\n    $ pip install findsame\n```\n\nDev install of this repo:\n\n```sh\n    $ git clone ...\n    $ cd findsame\n    $ pip install -e .\n```\n\nThe core part (package `findsame` and the CLI `bin/findsame`) has no\nexternal dependencies. 
If you want to run the benchmarks (see\n\"Benchmarks\" below), install:\n\n```sh\n    $ pip install -r requirements_benchmark.txt\n```\n\nUsage\n=====\n\n    usage: findsame [-h] [-b BLOCKSIZE] [-l LIMIT] [-p NPROCS] [-t NTHREADS]\n                    [-o OUTMODE] [-v]\n                    file/dir [file/dir ...]\n\n    Find same files and dirs based on file hashes.\n\n    positional arguments:\n      file/dir              files and/or dirs to compare\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -b BLOCKSIZE, --blocksize BLOCKSIZE\n                            blocksize in hash calculation, use units K,M,G as in\n                            100M, 256K or just 1024 (bytes), if LIMIT is used and\n                            BLOCKSIZE \u003c LIMIT then we require mod(LIMIT,\n                            BLOCKSIZE) = 0 else we set BLOCKSIZE = LIMIT [default:\n                            256.0K]\n      -l LIMIT, --limit LIMIT\n                            read limit (bytes, see also BLOCKSIZE), calculate hash\n                            only over the first LIMIT bytes, makes things go\n                            faster for may large files, try 512K [default: None]\n      -p NPROCS, --nprocs NPROCS\n                            number of parallel processes [default: 1]\n      -t NTHREADS, --nthreads NTHREADS\n                            threads per process [default: 4]\n      -o OUTMODE, --outmode OUTMODE\n                            1: list of dicts (values of dict from mode 2), one\n                            dict per hash, 2: dict of dicts (full result), keys\n                            are hashes, 3: compact, sort by type (file, dir)\n                            [default: 3]\n      -v, --verbose         enable verbose/debugging output\n\nThe output format is json, see `-o/--outmode`, default is `-o 3`. 
An\nexample using the test suite data:\n\n```sh\n    $ cd findsame/tests\n    $ findsame data | jq .\n    {\n      \"dir:empty\": [\n        [\n          \"data/dir2/empty_dir\",\n          \"data/dir2/empty_dir_copy\",\n          \"data/empty_dir\",\n          \"data/empty_dir_copy\"\n        ]\n      ],\n      \"dir\": [\n        [\n          \"data/dir1\",\n          \"data/dir1_copy\"\n        ]\n      ],\n      \"file:empty\": [\n        [\n          \"data/dir2/empty_dir/empty_file\",\n          \"data/dir2/empty_dir_copy/empty_file\",\n          \"data/empty_dir/empty_file\",\n          \"data/empty_dir_copy/empty_file\",\n          \"data/empty_file\",\n          \"data/empty_file_copy\"\n        ]\n      ],\n      \"file\": [\n        [\n          \"data/dir1/file2\",\n          \"data/dir1/file2_copy\",\n          \"data/dir1_copy/file2\",\n          \"data/dir1_copy/file2_copy\",\n          \"data/file2\"\n        ],\n        [\n          \"data/lena.png\",\n          \"data/lena_copy.png\"\n        ],\n        [\n          \"data/file1\",\n          \"data/file1_copy\"\n        ]\n      ]\n    }\n```\n\nThis returns a dict whose keys are the path type (file, dir). Values are\nnested lists. Each sub-list contains paths having the same hash. Note that we\nalso report empty files and dirs.\n\nUse [jq](https://stedolan.github.io/jq) for pretty-printing, e.g.\n\n```sh\n    $ findsame data | jq .\n\n    # keep colors in less(1)\n    $ findsame data | jq . -C | less -R\n```\n\nTo check out large amounts of data (as in GiB) for the first time, use the\n`-l/--limit` option for speed and use `less -n` as well (don't wait for input\nto load)\n\n```sh\n    $ findsame -l512K data | jq . 
-C | less -nR\n```\nPost-processing is only limited by your ability to process json (using\n`jq`, Python, ...).\n\nNote that the order of key-value entries in the output from both\n`findsame` and `jq` is random.\n\nNote that currently, we skip symlinks.\n\nPerformance\n===========\n\nParallel hash calculation\n-------------------------\n\nBy default, we use `--nthreads` equal to the number of cores. See\n\"Benchmarks\" below.\n\nLimit data to be hashed\n-----------------------\n\nApart from parallelization, by far the most speed is gained by using\n`--limit`. Note that this may lead to false positives if files are\nidentical in the first `LIMIT` bytes but differ afterwards. Finding a good\nenough value can be done by trial and error. Try 512K. This is still quite\nfast and seems to cover most real-world data.\n\nTests\n=====\n\nRun `nosetests`, `pytest` or any other test runner with test discovery.\n\nBenchmarks\n==========\n\nYou may run the benchmark script to find the best blocksize and number of\nthreads and/or processes for hash calculations on your machine.\n\n```sh\n    $ cd findsame/benchmark\n    $ ./clean.sh\n    $ ./benchmark.py\n    $ ./plot.py\n```\n\nThis writes test files of various sizes to `benchmark/files` and runs a\ncouple of benchmarks (runtime \\~10 min for all benchmarks). Make sure to\navoid running any other IO-intensive tasks while the benchmarks run.\n\n**The default value of \"maxsize\" in benchmark.py (in the `__main__`\npart) is only some MiB to allow quick testing. This needs to be changed\nto, say, 1 GiB in order to have meaningful benchmarks.**\n\nObservations:\n\n* blocksizes below 512 KiB (`-b/--blocksize 512K`) work best for all file\n  sizes on most systems, even though the difference to the worst timings is\n  at most a factor of 1.25 (e.g. 1 vs. 
1.25 seconds)\n* multithreading (`-t/--nthreads`): up to 2x speedup on dual-core box\n  -- very efficient, use NTHREADS = number of cores for good baseline\n  performance (problem is mostly IO-bound)\n* multiprocessing (`-p/--nprocs`): less efficient speedup, but on some\n  systems NPROCS + NTHREADS is even a bit faster than NTHREADS alone,\n  testing is mandatory\n* we have a linear increase of runtime with filesize, of course\n\nOutput modes\n============\n\nDefault (`-o3`)\n---------------\n\nThe default output format is `-o3` (same as the initial example above).\n\n```sh\n    $ findsame -o3 data | jq .\n    {\n      \"dir:empty\": [\n        [\n          \"data/dir2/empty_dir\",\n          \"data/dir2/empty_dir_copy\",\n          \"data/empty_dir\",\n          \"data/empty_dir_copy\"\n        ]\n      ],\n      \"dir\": [\n        [\n          \"data/dir1\",\n          \"data/dir1_copy\"\n        ]\n      ],\n      \"file:empty\": [\n        [\n          \"data/dir2/empty_dir/empty_file\",\n          \"data/dir2/empty_dir_copy/empty_file\",\n          \"data/empty_dir/empty_file\",\n          \"data/empty_dir_copy/empty_file\",\n          \"data/empty_file\",\n          \"data/empty_file_copy\"\n        ]\n      ],\n      \"file\": [\n        [\n          \"data/dir1/file2\",\n          \"data/dir1/file2_copy\",\n          \"data/dir1_copy/file2\",\n          \"data/dir1_copy/file2_copy\",\n          \"data/file2\"\n        ],\n        [\n          \"data/lena.png\",\n          \"data/lena_copy.png\"\n        ],\n        [\n          \"data/file1\",\n          \"data/file1_copy\"\n        ]\n      ]\n    }\n```\n\nOutput with hashes (`-o2`)\n--------------------------\n\n```sh\n    $ findsame -o2 data | jq .\n    {\n      \"da39a3ee5e6b4b0d3255bfef95601890afd80709\": {\n        \"dir:empty\": [\n          \"data/dir2/empty_dir\",\n          \"data/dir2/empty_dir_copy\",\n          \"data/empty_dir\",\n          \"data/empty_dir_copy\"\n        ],\n        
\"file:empty\": [\n          \"data/dir2/empty_dir/empty_file\",\n          \"data/dir2/empty_dir_copy/empty_file\",\n          \"data/empty_dir/empty_file\",\n          \"data/empty_dir_copy/empty_file\",\n          \"data/empty_file\",\n          \"data/empty_file_copy\"\n        ]\n      },\n      \"55341fe74a3497b53438f9b724b3e8cdaf728edc\": {\n        \"dir\": [\n          \"data/dir1\",\n          \"data/dir1_copy\"\n        ]\n      },\n      \"9619a9b308cdebee40f6cef018fef0f4d0de2939\": {\n        \"file\": [\n          \"data/dir1/file2\",\n          \"data/dir1/file2_copy\",\n          \"data/dir1_copy/file2\",\n          \"data/dir1_copy/file2_copy\",\n          \"data/file2\"\n        ]\n      },\n      \"0a96c2e755258bd46abdde729f8ee97d234dd04e\": {\n        \"file\": [\n          \"data/lena.png\",\n          \"data/lena_copy.png\"\n        ]\n      },\n      \"312382290f4f71e7fb7f00449fb529fce3b8ec95\": {\n        \"file\": [\n          \"data/file1\",\n          \"data/file1_copy\"\n        ]\n      }\n    }\n```\n\nThe output is one dict (json object) where all same-hash files/dirs are\nfound at the same key (hash).\n\nDict values (`-o1`)\n-------------------\n\nThe format `-o1` lists only the dict values from `-o2`, i.e. 
a list of\ndicts.\n\n```sh\n    $ findsame -o1 data | jq .\n    [\n      {\n        \"dir:empty\": [\n          \"data/dir2/empty_dir\",\n          \"data/dir2/empty_dir_copy\",\n          \"data/empty_dir\",\n          \"data/empty_dir_copy\"\n        ],\n        \"file:empty\": [\n          \"data/dir2/empty_dir/empty_file\",\n          \"data/dir2/empty_dir_copy/empty_file\",\n          \"data/empty_dir/empty_file\",\n          \"data/empty_dir_copy/empty_file\",\n          \"data/empty_file\",\n          \"data/empty_file_copy\"\n        ]\n      },\n      {\n        \"dir\": [\n          \"data/dir1\",\n          \"data/dir1_copy\"\n        ]\n      },\n      {\n        \"file\": [\n          \"data/file1\",\n          \"data/file1_copy\"\n        ]\n      },\n      {\n        \"file\": [\n          \"data/dir1/file2\",\n          \"data/dir1/file2_copy\",\n          \"data/dir1_copy/file2\",\n          \"data/dir1_copy/file2_copy\",\n          \"data/file2\"\n        ]\n      },\n      {\n        \"file\": [\n          \"data/lena.png\",\n          \"data/lena_copy.png\"\n        ]\n      }\n    ]\n```\n\nMore usage examples\n===================\n\nHere we show examples of common post-processing tasks using `jq`. 
When\nthe `jq` command works for all three output modes, we don't specify the\n`-o` option.\n\nCount the total number of all equals:\n\n```sh\n    $ findsame data | jq '.[]|.[]|.[]' | wc -l\n```\n\nFind only groups of equal dirs:\n\n```sh\n    $ findsame -o1 data | jq '.[]|select(.dir)|.dir'\n    $ findsame -o2 data | jq '.[]|select(.dir)|.dir'\n    $ findsame -o3 data | jq '.dir|.[]'\n    [\n      \"data/dir1\",\n      \"data/dir1_copy\"\n    ]\n```\n\nGroups of equal files:\n\n```sh\n    $ findsame -o1 data | jq '.[]|select(.file)|.file'\n    $ findsame -o2 data | jq '.[]|select(.file)|.file'\n    $ findsame -o3 data | jq '.file|.[]'\n    [\n      \"data/dir1/file2\",\n      \"data/dir1/file2_copy\",\n      \"data/dir1_copy/file2\",\n      \"data/dir1_copy/file2_copy\",\n      \"data/file2\"\n    ]\n    [\n      \"data/lena.png\",\n      \"data/lena_copy.png\"\n    ]\n    [\n      \"data/file1\",\n      \"data/file1_copy\"\n    ]\n```\n\nFind the first element in a group of equal items (file or dir):\n\n```sh\n    $ findsame data | jq '.[]|.[]|[.[0]]'\n    [\n      \"data/lena.png\"\n    ]\n    [\n      \"data/dir2/empty_dir\"\n    ]\n    [\n      \"data/dir2/empty_dir/empty_file\"\n    ]\n    [\n      \"data/dir1/file2\"\n    ]\n    [\n      \"data/file1\"\n    ]\n    [\n      \"data/dir1\"\n    ]\n```\n\nor more compact w/o the length-1 list:\n\n```sh\n    $ findsame data | jq '.[]|.[]|.[0]'\n    \"data/dir2/empty_dir\"\n    \"data/dir2/empty_dir/empty_file\"\n    \"data/dir1/file2\"\n    \"data/lena.png\"\n    \"data/file1\"\n    \"data/dir1\"\n```\n\nFind *all but the first* element in a group of equal items (file or\ndir):\n\n```sh\n    $ findsame data | jq '.[]|.[]|.[1:]'\n    [\n      \"data/dir1_copy\"\n    ]\n    [\n      \"data/lena_copy.png\"\n    ]\n    [\n      \"data/dir1/file2_copy\",\n      \"data/dir1_copy/file2\",\n      \"data/dir1_copy/file2_copy\",\n      \"data/file2\"\n    ]\n    [\n      \"data/dir2/empty_dir_copy/empty_file\",\n      
\"data/empty_dir/empty_file\",\n      \"data/empty_dir_copy/empty_file\",\n      \"data/empty_file\",\n      \"data/empty_file_copy\"\n    ]\n    [\n      \"data/dir2/empty_dir_copy\",\n      \"data/empty_dir\",\n      \"data/empty_dir_copy\"\n    ]\n    [\n      \"data/file1_copy\"\n    ]\n```\n\nAnd more compact:\n\n```sh\n    $ findsame data | jq '.[]|.[]|.[1:]|.[]'\n    \"data/file1_copy\"\n    \"data/dir1/file2_copy\"\n    \"data/dir1_copy/file2\"\n    \"data/dir1_copy/file2_copy\"\n    \"data/file2\"\n    \"data/lena_copy.png\"\n    \"data/dir2/empty_dir_copy/empty_file\"\n    \"data/empty_dir/empty_file\"\n    \"data/empty_dir_copy/empty_file\"\n    \"data/empty_file\"\n    \"data/empty_file_copy\"\n    \"data/dir2/empty_dir_copy\"\n    \"data/empty_dir\"\n    \"data/empty_dir_copy\"\n    \"data/dir1_copy\"\n```\n\nThe last one can be used to remove all but the first in a group of equal\nfiles/dirs:\n\n```sh\n    $ findsame data | jq '.[]|.[]|.[1:]|.[]' | xargs cp -rvt duplicates/\n```\n\nOther tools\n===========\n\n`fdupes`, `jdupes`, `duff`, `rdfind`, `rmlint`, `findup` (from `fslint`)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felcorto%2Ffindsame","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felcorto%2Ffindsame","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felcorto%2Ffindsame/lists"}