# yafd #

[![build status](https://travis-ci.org/uxcn/yafd.svg?branch=master)](https://travis-ci.org/uxcn/yafd)
[![build status](https://ci.appveyor.com/api/projects/status/tfikjw9me77nvuw5?svg=true)](https://ci.appveyor.com/project/uxcn/yafd)
[![coverage status](https://coveralls.io/repos/github/uxcn/yafd/badge.svg?branch=master)](https://coveralls.io/github/uxcn/yafd?branch=master)
[![issues](https://img.shields.io/github/issues/uxcn/yafd.svg)](https://github.com/uxcn/yafd/issues)

yafd is a (yet another) file deduplicator.

## Usage ##

For detailed info, see [USAGE](https://github.com/uxcn/yafd/blob/master/USAGE.md) or the
manpage (`man yafd`).

The easiest way to use yafd is to pass a directory or a set of directories to
it.

    jason@io ~ yafd .

It can recurse as well.

    jason@io ~ yafd -r .

Another easy way to use yafd is passing files to check as arguments. Shell
globbing helps.

    jason@io ~ yafd **/*.c **/*.h

You can also pipe paths to yafd via stdin. This makes it easy to limit the set
of files to check.

    jason@io ~ find . -size +1M | yafd

The output can also be piped to other commands to act on the duplicate files.
For example, this sums the sizes of the reported files in bytes.

    jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'
    12659698
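
The same size summation can be sketched in Python for cases where `xargs` and `awk` are not available. This is only an illustration: the function name `total_size` is hypothetical, and it assumes yafd prints one path per line, which the `xargs` pipeline above suggests but this README does not state.

```python
import os
from typing import Iterable


def total_size(paths: Iterable[str]) -> int:
    """Sum the apparent sizes in bytes of the given paths (hypothetical helper).

    Assumes one path per item, optionally ending in a newline,
    as when iterating over the lines of a pipe.
    """
    total = 0
    for path in paths:
        path = path.rstrip("\n")
        if not path:
            continue  # ignore blank lines between groups, if any
        try:
            total += os.path.getsize(path)
        except OSError:
            pass  # the file vanished or is unreadable; skip it
    return total
```

To use it as a filter, pass `sys.stdin` as `paths` and print the result.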

## Performance ##

yafd is not yet always the fastest deduplicator (see the HDD results below).
If performance is a concern, it may be worth considering another deduplicator
like [rmlint](https://github.com/sahib/rmlint). Performance can be tuned
using command arguments (`--bytes`, `--blocksize`, `--threads`, etc.),
although yafd with its defaults should be usable for most tasks.

Here are some metrics for reference.

**SSD (btrfs)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 4.30s | 267.88 MiB/s | 175.70 MiB/s |
| rmlint | 7.43s | 155.13 MiB/s | 101.74 MiB/s |
| fdupes | 30.34s | 37.99 MiB/s | 24.92 MiB/s |
| duff | 25.14s | 45.86 MiB/s | 30.08 MiB/s |
| yafd (cached) | 0.61s | 1.84 GiB/s | 1.20 GiB/s |
| rmlint (cached) | 2.46s | 466.21 MiB/s | 307.40 MiB/s |
| fdupes (cached) | 12.27s | 93.94 MiB/s | 61.61 MiB/s |
| duff (cached) | 6.51s | 176.17 MiB/s | 116.12 MiB/s |

**HDD (ext4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 1087.59s | 1.05 MiB/s | 711.99 KiB/s |
| rmlint | 65.03s | 163.46 MiB/s | 107.21 MiB/s |
| fdupes | 322.57s | 3.57 MiB/s | 2.34 MiB/s |
| duff | 954.70s | 1.20 MiB/s | 811.10 KiB/s |
| yafd (cached) | 7.05s | 163.46 MiB/s | 107.21 MiB/s |
| rmlint (cached) | 2.84s | 406.37 MiB/s | 266.53 MiB/s |
| fdupes (cached) | 12.44s | 92.64 MiB/s | 60.76 MiB/s |
| duff (cached) | 6.56s | 175.76 MiB/s | 115.28 MiB/s |

**NFS (v4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 197.08s | 5.85 MiB/s | 3.83 MiB/s |
| rmlint | 461.26s | 2.49 MiB/s | 1.63 MiB/s |
| fdupes | 648.24s | 1.77 MiB/s | 1.16 MiB/s |
| duff | 466.69s | 2.47 MiB/s | 1.62 MiB/s |
| yafd (cached) | 95.04s | 12.13 MiB/s | 7.95 MiB/s |
| rmlint (cached) | 423.90s | 2.71 MiB/s | 1.78 MiB/s |
| fdupes (cached) | 611.19s | 1.88 MiB/s | 1.23 MiB/s |
| duff (cached) | 403.72s | 2.85 MiB/s | 1.87 MiB/s |

(1) The Linux sources (4.3, 4.4) were searched for identical files

(2) For an equivalent comparison, the following command arguments were used
(see also the [benchmark scripts](https://github.com/uxcn/yafd/tree/master/perf/src/python/benchmark))

    yafd --recurse --zero
    rmlint --algorithm=paranoid --hidden -o fdupes:stdout
    fdupes --recurse
    duff -rpta -f#

(3) Linux 4.4.0 and Intel Ivy Bridge (i7-3632QM) were used for benchmarks
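
The throughput columns relate to the time column through the number of bytes read. The sketch below shows that arithmetic and one simple way to time a command. It is not the project's benchmark code: the names `throughput_mib_s` and `time_command` are hypothetical, and the data set size of roughly 1152 MiB is only inferred by multiplying time and throughput in the SSD rows.

```python
import subprocess
import time
from typing import Sequence

MIB = 1024 * 1024


def throughput_mib_s(total_bytes: int, seconds: float) -> float:
    """Return throughput in MiB/s for total_bytes processed in seconds."""
    return total_bytes / MIB / seconds


def time_command(argv: Sequence[str]) -> float:
    """Run a command, discard its output, and return the wall-clock time in seconds.

    Hypothetical helper; a real benchmark would also control the page cache
    and repeat each run several times.
    """
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start
```

For example, `throughput_mib_s(1152 * MIB, 4.30)` gives about 267.9, close to the uncached yafd figure in the SSD table.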

## Install ##

You can download a copy of the source
[here](https://github.com/uxcn/yafd/releases) or you can clone the repository
using git.

    jason@io ~ git clone https://github.com/uxcn/yafd.git

It's a good idea to check out a specific release.

    jason@io ~/yafd git checkout v0.1

In the project directory, run the autoconf script.

    jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'

Adding the architecture allows algorithms that rely on architecture-specific
implementations to be used. The easiest way to do this is normally
`-march=native`. You can also explicitly enable instruction sets
via autoconf.

    jason@io ~/yafd ./autoconf.sh --enable-sse4_2

To install to a directory other than /usr/local, you can manually configure the
prefix. If you do, make sure your `PATH` and `MANPATH` are set correctly.

    jason@io ~/yafd ./autoconf.sh --prefix=$HOME

Run `make install` to compile and install.

    jason@io ~/yafd make install

Currently yafd compiles and is tested on Linux, FreeBSD, OS X, and Windows.
Patches and pull requests for other platforms are definitely welcome.

## Versions ##

0.1 - alpha release

## FAQ ##

Why write another file deduplicator?

*A lot of the current ones were more complicated than I wanted, didn't perform
well, or weren't portable.*

Why doesn't yafd do *X*?

*Most likely nobody asked for X yet. If you think something's missing, send a
feature [request](https://github.com/uxcn/yafd/issues) or even better, a [pull
request](https://github.com/uxcn/yafd/pull/new/master).*

How does yafd work?

*The basic algorithm is to group files by their sizes, compute a hash on a small
(random) chunk of each file, and then compare files that have the same hash.
This is a bit of an oversimplification, though. For a better understanding, it
may help to read the
[code](https://github.com/uxcn/yafd/blob/master/src/c/worker.c).*
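
The three steps described above can be sketched in Python. This is a simplified illustration, not yafd's actual implementation (which is in C, in `worker.c`). The SHA-1 hash, the 4 KiB chunk size, the choice of one random offset per size group, and all function names are assumptions made for the example.

```python
import hashlib
import os
import random
from collections import defaultdict
from typing import Iterable, List, Optional

CHUNK = 4096  # assumed chunk size; yafd's real default may differ


def chunk_hash(path: str, offset: int, length: int = CHUNK) -> bytes:
    """Hash `length` bytes of the file starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha1(f.read(length)).digest()


def same_content(a: str, b: str, bufsize: int = 1 << 16) -> bool:
    """Compare two regular files of equal size byte for byte."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ba, bb = fa.read(bufsize), fb.read(bufsize)
            if ba != bb:
                return False
            if not ba:  # both files ended at the same point
                return True


def find_duplicates(paths: Iterable[str], seed: Optional[int] = None) -> List[List[str]]:
    """Return groups of paths whose contents are identical."""
    rng = random.Random(seed)

    # Step 1: group files by size; a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups: List[List[str]] = []
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue
        # Step 2: hash a small chunk at one random offset shared by the group.
        offset = rng.randrange(max(size - CHUNK, 0) + 1)
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[chunk_hash(path, offset)].append(path)
        # Step 3: fully compare the files whose chunk hashes match.
        for candidates in by_hash.values():
            sets: List[List[str]] = []
            for path in candidates:
                for existing in sets:
                    if same_content(existing[0], path):
                        existing.append(path)
                        break
                else:
                    sets.append([path])
            groups.extend(s for s in sets if len(s) > 1)
    return groups
```

Grouping by size first means most files are never opened, and the chunk hash cheaply rules out most same-size files before the full comparison.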

## Other deduplicators ##

* [rmlint](https://github.com/sahib/rmlint)
* [fdupes](https://github.com/adrianlopezroche/fdupes)
* [duff](https://github.com/elmindreda/duff)
* others...