
About
=====

Find duplicate files and directories.

Like other tools, we use file hashes, but we additionally report duplicate
directories as well, using a Merkle tree for the directory hash calculation.
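
To illustrate the idea, here is a minimal sketch of such a Merkle-style
directory hash (illustration only, not `findsame`'s actual code; the function
names, the choice of SHA1 and the sorting scheme are assumptions):

```python
# Sketch: a file's hash covers its content, a directory's hash covers the
# sorted hashes of everything below it, so equal trees get equal hashes.
import hashlib
import os

def file_hash(path, blocksize=256 * 1024):
    h = hashlib.sha1()
    with open(path, "rb") as fd:
        for block in iter(lambda: fd.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def dir_hash(path):
    hashes = []
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            hashes.append(dir_hash(full))
        elif os.path.isfile(full):
            hashes.append(file_hash(full))
    # combine the child hashes into one directory hash
    return hashlib.sha1("".join(sorted(hashes)).encode()).hexdigest()
```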

To increase performance, we use

* parallel hash calculation (`-t/--nthreads` option), see Benchmarks below
* optional limits on data to be hashed (`-l/--limit` option)

Install
=======

From PyPI:

```sh
$ pip install findsame
```

Dev install of this repo:

```sh
$ git clone ...
$ cd findsame
$ pip install -e .
```

The core part (package `findsame` and the CLI `bin/findsame`) has no
external dependencies. If you want to run the benchmarks (see
"Benchmarks" below), install:

```sh
$ pip install -r requirements_benchmark.txt
```

Usage
=====

```
usage: findsame [-h] [-b BLOCKSIZE] [-l LIMIT] [-p NPROCS] [-t NTHREADS]
                [-o OUTMODE] [-v]
                file/dir [file/dir ...]

Find same files and dirs based on file hashes.

positional arguments:
  file/dir              files and/or dirs to compare

optional arguments:
  -h, --help            show this help message and exit
  -b BLOCKSIZE, --blocksize BLOCKSIZE
                        blocksize in hash calculation, use units K,M,G as in
                        100M, 256K or just 1024 (bytes), if LIMIT is used and
                        BLOCKSIZE < LIMIT then we require mod(LIMIT,
                        BLOCKSIZE) = 0 else we set BLOCKSIZE = LIMIT
                        [default: 256.0K]
  -l LIMIT, --limit LIMIT
                        read limit (bytes, see also BLOCKSIZE), calculate
                        hash only over the first LIMIT bytes, makes things go
                        faster for many large files, try 512K
                        [default: None]
  -p NPROCS, --nprocs NPROCS
                        number of parallel processes [default: 1]
  -t NTHREADS, --nthreads NTHREADS
                        threads per process [default: 4]
  -o OUTMODE, --outmode OUTMODE
                        1: list of dicts (values of dict from mode 2), one
                        dict per hash, 2: dict of dicts (full result), keys
                        are hashes, 3: compact, sort by type (file, dir)
                        [default: 3]
  -v, --verbose         enable verbose/debugging output
```
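
For example, the interplay of `BLOCKSIZE` and `LIMIT` described above (values
chosen purely for illustration):

```sh
# 512K mod 128K = 0, so this combination is accepted as-is
$ findsame -l 512K -b 128K data
# BLOCKSIZE (1M) > LIMIT (512K), so the effective blocksize is set to 512K
$ findsame -l 512K -b 1M data
```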

The output format is JSON, see `-o/--outmode`; the default is `-o 3`. An
example using the test suite data:

```sh
$ cd findsame/tests
$ findsame data | jq .
{
  "dir:empty": [
    [
      "data/dir2/empty_dir",
      "data/dir2/empty_dir_copy",
      "data/empty_dir",
      "data/empty_dir_copy"
    ]
  ],
  "dir": [
    [
      "data/dir1",
      "data/dir1_copy"
    ]
  ],
  "file:empty": [
    [
      "data/dir2/empty_dir/empty_file",
      "data/dir2/empty_dir_copy/empty_file",
      "data/empty_dir/empty_file",
      "data/empty_dir_copy/empty_file",
      "data/empty_file",
      "data/empty_file_copy"
    ]
  ],
  "file": [
    [
      "data/dir1/file2",
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ],
    [
      "data/lena.png",
      "data/lena_copy.png"
    ],
    [
      "data/file1",
      "data/file1_copy"
    ]
  ]
}
```

This returns a dict whose keys are the path type (`file`, `dir`, plus their
`:empty` variants). Values are nested lists. Each sub-list contains paths
having the same hash. Note that we also report empty files and dirs.

Use [jq](https://stedolan.github.io/jq) for pretty-printing, e.g.

```sh
$ findsame data | jq .

# keep colors in less(1)
$ findsame data | jq . -C | less -R
```

When looking at large amounts of data (as in GiBs) for the first time, use the
`-l/--limit` option for speed, and use `less -n` as well so that `less` does
not wait for the whole input to load:

```sh
$ findsame -l512K data | jq . -C | less -nR
```
Post-processing is only limited by your ability to process JSON (using
`jq`, Python, ...).
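
As a sketch, the same kind of post-processing can be done in Python; the file
name `result.json` is an assumption for the example (e.g. created with
`findsame data > result.json`, default `-o3` output):

```python
# Post-process findsame's default (-o3) output in Python instead of jq.
import json

with open("result.json") as fd:
    result = json.load(fd)

# print all but the first path of each group of equal files/dirs
for kind, groups in result.items():
    for group in groups:
        for path in group[1:]:
            print(kind, path)
```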

Note that the order of key-value entries in the output from both
`findsame` and `jq` is random.

Note that currently, we skip symlinks.

Performance
===========

Parallel hash calculation
-------------------------

By default, we use `--nthreads` equal to the number of cores. See
"Benchmarks" below.

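For example, to deviate from the defaults (the values here are only for
illustration):

```sh
# 8 threads in a single process
$ findsame -t8 data
# 4 threads in each of 2 processes
$ findsame -t4 -p2 data
```
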
Limit data to be hashed
-----------------------

Apart from parallelization, by far the biggest speedup comes from using
`--limit`. Note that this may lead to false positives: files which are
equal in the first `LIMIT` bytes but differ later are reported as
duplicates. Finding a good enough value is a matter of trial and error.
Try 512K: this is still quite fast and seems to cover most real-world data.
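
One possible workflow (just a suggestion, not a built-in feature) is a fast
first pass with `--limit`, followed by a full-hash run restricted to the
reported candidate paths (assumes paths without whitespace):

```sh
# fast first pass, then confirm the candidates with full hashes
$ findsame -l512K data | jq -r '.[]|.[]|.[]' | xargs findsame | jq .
```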

Tests
=====

Run `nosetests`, `pytest` or any other test runner with test discovery.
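
For example:

```sh
# run the test suite via pytest's test discovery, from the repo root
$ pytest
```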

Benchmarks
==========

You may run the benchmark script to find the best blocksize and number of
threads and/or processes for hash calculations on your machine.

```sh
$ cd findsame/benchmark
$ ./clean.sh
$ ./benchmark.py
$ ./plot.py
```

This writes test files of various sizes to `benchmark/files` and runs a
couple of benchmarks (runtime ~10 min for all benchmarks). Make sure to
avoid any other IO-intensive tasks while the benchmarks run, of course.

**The default value of `maxsize` in `benchmark.py` (in the `__main__`
part) is only a few MiB to allow quick testing. It needs to be changed
to, say, 1 GiB in order to get meaningful benchmarks.**

Observations:

* blocksizes below 512 KiB (`-b/--blocksize 512K`) work best for all file
  sizes on most systems, even though the spread between best and worst
  timings is at most a factor of 1.25 (e.g. 1 vs. 1.25 seconds)
* multithreading (`-t/--nthreads`): up to 2x speedup on a dual-core box
  -- very efficient; use NTHREADS = number of cores for good baseline
  performance (the problem is mostly IO-bound)
* multiprocessing (`-p/--nprocs`): less efficient speedup, but on some
  systems NPROCS + NTHREADS is even a bit faster than NTHREADS alone, so
  testing is mandatory
* runtime increases linearly with file size, of course

Output modes
============

Default (`-o3`)
---------------

The default output format is `-o3` (same as the initial example above).

```sh
$ findsame -o3 data | jq .
{
  "dir:empty": [
    [
      "data/dir2/empty_dir",
      "data/dir2/empty_dir_copy",
      "data/empty_dir",
      "data/empty_dir_copy"
    ]
  ],
  "dir": [
    [
      "data/dir1",
      "data/dir1_copy"
    ]
  ],
  "file:empty": [
    [
      "data/dir2/empty_dir/empty_file",
      "data/dir2/empty_dir_copy/empty_file",
      "data/empty_dir/empty_file",
      "data/empty_dir_copy/empty_file",
      "data/empty_file",
      "data/empty_file_copy"
    ]
  ],
  "file": [
    [
      "data/dir1/file2",
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ],
    [
      "data/lena.png",
      "data/lena_copy.png"
    ],
    [
      "data/file1",
      "data/file1_copy"
    ]
  ]
}
```

Output with hashes (`-o2`)
--------------------------

```sh
$ findsame -o2 data | jq .
{
  "da39a3ee5e6b4b0d3255bfef95601890afd80709": {
    "dir:empty": [
      "data/dir2/empty_dir",
      "data/dir2/empty_dir_copy",
      "data/empty_dir",
      "data/empty_dir_copy"
    ],
    "file:empty": [
      "data/dir2/empty_dir/empty_file",
      "data/dir2/empty_dir_copy/empty_file",
      "data/empty_dir/empty_file",
      "data/empty_dir_copy/empty_file",
      "data/empty_file",
      "data/empty_file_copy"
    ]
  },
  "55341fe74a3497b53438f9b724b3e8cdaf728edc": {
    "dir": [
      "data/dir1",
      "data/dir1_copy"
    ]
  },
  "9619a9b308cdebee40f6cef018fef0f4d0de2939": {
    "file": [
      "data/dir1/file2",
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ]
  },
  "0a96c2e755258bd46abdde729f8ee97d234dd04e": {
    "file": [
      "data/lena.png",
      "data/lena_copy.png"
    ]
  },
  "312382290f4f71e7fb7f00449fb529fce3b8ec95": {
    "file": [
      "data/file1",
      "data/file1_copy"
    ]
  }
}
```

The output is one dict (JSON object) where all same-hash files/dirs are
found under the same key (the hash).
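
For example, to list only the hashes (the top-level keys) of the `-o2` output:

```sh
$ findsame -o2 data | jq 'keys'
```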

Dict values (`-o1`)
-------------------

The format `-o1` lists only the dict values from `-o2`, i.e. a list of
dicts.

```sh
$ findsame -o1 data | jq .
[
  {
    "dir:empty": [
      "data/dir2/empty_dir",
      "data/dir2/empty_dir_copy",
      "data/empty_dir",
      "data/empty_dir_copy"
    ],
    "file:empty": [
      "data/dir2/empty_dir/empty_file",
      "data/dir2/empty_dir_copy/empty_file",
      "data/empty_dir/empty_file",
      "data/empty_dir_copy/empty_file",
      "data/empty_file",
      "data/empty_file_copy"
    ]
  },
  {
    "dir": [
      "data/dir1",
      "data/dir1_copy"
    ]
  },
  {
    "file": [
      "data/file1",
      "data/file1_copy"
    ]
  },
  {
    "file": [
      "data/dir1/file2",
      "data/dir1/file2_copy",
      "data/dir1_copy/file2",
      "data/dir1_copy/file2_copy",
      "data/file2"
    ]
  },
  {
    "file": [
      "data/lena.png",
      "data/lena_copy.png"
    ]
  }
]
```

More usage examples
===================

Here we show examples of common post-processing tasks using `jq`. When
the `jq` command works for all three output modes, we don't specify the
`-o` option.

Count the total number of equal items (all paths in all groups):

```sh
$ findsame data | jq '.[]|.[]|.[]' | wc -l
```

Find only groups of equal dirs:

```sh
$ findsame -o1 data | jq '.[]|select(.dir)|.dir'
$ findsame -o2 data | jq '.[]|select(.dir)|.dir'
$ findsame -o3 data | jq '.dir|.[]'
[
  "data/dir1",
  "data/dir1_copy"
]
```

Groups of equal files:

```sh
$ findsame -o1 data | jq '.[]|select(.file)|.file'
$ findsame -o2 data | jq '.[]|select(.file)|.file'
$ findsame -o3 data | jq '.file|.[]'
[
  "data/dir1/file2",
  "data/dir1/file2_copy",
  "data/dir1_copy/file2",
  "data/dir1_copy/file2_copy",
  "data/file2"
]
[
  "data/lena.png",
  "data/lena_copy.png"
]
[
  "data/file1",
  "data/file1_copy"
]
```

Find the first element in each group of equal items (file or dir):

```sh
$ findsame data | jq '.[]|.[]|[.[0]]'
[
  "data/lena.png"
]
[
  "data/dir2/empty_dir"
]
[
  "data/dir2/empty_dir/empty_file"
]
[
  "data/dir1/file2"
]
[
  "data/file1"
]
[
  "data/dir1"
]

Or, more compact, without the length-1 lists:

```sh
$ findsame data | jq '.[]|.[]|.[0]'
"data/dir2/empty_dir"
"data/dir2/empty_dir/empty_file"
"data/dir1/file2"
"data/lena.png"
"data/file1"
"data/dir1"
```

Find *all but the first* element in each group of equal items (file or
dir):

```sh
$ findsame data | jq '.[]|.[]|.[1:]'
[
  "data/dir1_copy"
]
[
  "data/lena_copy.png"
]
[
  "data/dir1/file2_copy",
  "data/dir1_copy/file2",
  "data/dir1_copy/file2_copy",
  "data/file2"
]
[
  "data/dir2/empty_dir_copy/empty_file",
  "data/empty_dir/empty_file",
  "data/empty_dir_copy/empty_file",
  "data/empty_file",
  "data/empty_file_copy"
]
[
  "data/dir2/empty_dir_copy",
  "data/empty_dir",
  "data/empty_dir_copy"
]
[
  "data/file1_copy"
]

And more compact:

```sh
$ findsame data | jq '.[]|.[]|.[1:]|.[]'
"data/file1_copy"
"data/dir1/file2_copy"
"data/dir1_copy/file2"
"data/dir1_copy/file2_copy"
"data/file2"
"data/lena_copy.png"
"data/dir2/empty_dir_copy/empty_file"
"data/empty_dir/empty_file"
"data/empty_dir_copy/empty_file"
"data/empty_file"
"data/empty_file_copy"
"data/dir2/empty_dir_copy"
"data/empty_dir"
"data/empty_dir_copy"
"data/dir1_copy"
```

The last one can be used to collect all but the first element of each group
of equal files/dirs, e.g. copy them into a separate directory:

```sh
$ findsame data | jq '.[]|.[]|.[1:]|.[]' | xargs cp -rvt duplicates/
```
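
If you are confident in the result, the same list can be fed to `rm` instead.
This is only a suggestion (assumes paths without whitespace); double-check the
list before deleting anything:

```sh
# delete all but the first element of each group -- use with care!
$ findsame data | jq -r '.[]|.[]|.[1:]|.[]' | xargs rm -rv
```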

Other tools
===========

`fdupes`, `jdupes`, `duff`, `rdfind`, `rmlint`, `findup` (from `fslint`)