https://github.com/twpayne/find-duplicates
Find duplicate files quickly.
- Host: GitHub
- URL: https://github.com/twpayne/find-duplicates
- Owner: twpayne
- License: MIT
- Created: 2023-01-23T01:04:07.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-25T20:14:53.000Z (8 months ago)
- Last Synced: 2025-03-15T02:47:14.853Z (7 months ago)
- Topics: duplicate-detection, duplicate-files, duplicates, find
- Language: Go
- Homepage:
- Size: 82 KB
- Stars: 54
- Watchers: 3
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# `find-duplicates`
`find-duplicates` finds duplicate files quickly based on the
[xxHashes](https://xxhash.com/) of their contents.

## Installation

```console
$ go install github.com/twpayne/find-duplicates@latest
```

## Example
```console
$ find-duplicates
{
"cdb8979062cbdf9c169563ccc54704f0": [
".git/refs/remotes/origin/main",
".git/refs/heads/main",
".git/ORIG_HEAD"
]
}
```

## Usage
```
find-duplicates [options] [paths...]
```

`paths` are directories to walk recursively. If no `paths` are given then the
current directory is walked.

The output is a JSON object with a property for each observed xxHash, whose
value is an array of the filenames whose contents have that xxHash.

Options are:

`--keep-going` or `-k` keeps going after errors.

`--output=filename` or `-o filename` writes the output to `filename`. The
default is stdout.

`--threshold=n` or `-t n` sets the minimum number of files with the same
contents to be considered duplicates. The default is 2.

`--statistics` or `-s` prints statistics to stderr.
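The output shape and the `--threshold` behavior described above can be sketched in Go. This is a minimal illustration, not the tool's actual code; the hash keys and filenames are made up for the example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// filterByThreshold keeps only the hash groups that contain at least
// threshold filenames, mirroring the described --threshold option.
func filterByThreshold(groups map[string][]string, threshold int) map[string][]string {
	result := make(map[string][]string)
	for hash, paths := range groups {
		if len(paths) >= threshold {
			result[hash] = paths
		}
	}
	return result
}

func main() {
	// Hypothetical hash values and paths, for illustration only.
	groups := map[string][]string{
		"cdb8979062cbdf9c169563ccc54704f0": {
			".git/refs/remotes/origin/main",
			".git/refs/heads/main",
		},
		"0123456789abcdef0123456789abcdef": {"some-unique-file"},
	}
	// Emit the JSON object described above: one property per hash,
	// with an array of filenames as its value.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(filterByThreshold(groups, 2)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

With the default threshold of 2, the single-file group is dropped and only the genuine duplicate group appears in the output.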
## How does `find-duplicates` work?
`find-duplicates` aims to be as fast as possible by doing as little work as
possible, using each CPU core efficiently, and using all the CPU cores on your
machine.

It consists of multiple components:
1. Firstly, it walks the filesystem concurrently, spawning one goroutine per
subdirectory.
2. Secondly, with the observation that files can only be duplicates if they are
   the same size, it only reads file contents once it has found more than one
file with the same size. This significantly reduces both the number of
syscalls and the amount of data read. Furthermore, as the shortest possible
runtime is the time taken to read the largest file, larger files are read
earlier.
3. Thirdly, file contents are hashed with a fast, non-cryptographic hash.

All components run concurrently.
## Media
* ["Finding duplicate files unbelievably fast: a small CLI project using Go's concurrency"](https://www.youtube.com/watch?v=wJ7-Y55Esio) talk from [Zürich Gophers](https://www.meetup.com/zurich-gophers/).
## License
MIT