https://github.com/uxcn/yafd

yet another file deduplicator

Host: GitHub
URL: https://github.com/uxcn/yafd
Owner: uxcn
License: gpl-3.0
Created: 2016-02-10T04:36:15.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2019-10-04T04:58:51.000Z (over 5 years ago)
Last Synced: 2024-11-12T11:43:06.851Z (7 months ago)
Topics: c, deduplicator, freebsd, linux, osx, windows
Language: C
Size: 216 KB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# yafd #

[![build status](https://travis-ci.org/uxcn/yafd.svg?branch=master)](https://travis-ci.org/uxcn/yafd)

[![build status](https://ci.appveyor.com/api/projects/status/tfikjw9me77nvuw5?svg=true)](https://ci.appveyor.com/project/uxcn/yafd)

[![coverage status](https://coveralls.io/repos/github/uxcn/yafd/badge.svg?branch=master)](https://coveralls.io/github/uxcn/yafd?branch=master)

[![issues](https://img.shields.io/github/issues/uxcn/yafd.svg)](https://github.com/uxcn/yafd/issues)

yafd is a (yet another) file deduplicator.

## Usage ##

For detailed info, see [USAGE](https://github.com/uxcn/yafd/blob/master/USAGE.md) or the
manpage (`man yafd`).

The easiest way to use yafd is to pass a directory or a set of directories to
it.

    jason@io ~ yafd .

It can recurse as well.

    jason@io ~ yafd -r .

Another easy way to use yafd is passing files to check as arguments.  Shell
globbing helps.

    jason@io ~ yafd **/*.c **/*.h

You can also pipe paths to yafd via stdin.  This makes it easy to limit the sets
of files to check.

    jason@io ~ find . -size +1M | yafd

The output can also be piped to other commands to do things with the duplicate files.

    jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'

    12659698

## Performance ##

As of yet, yafd is not always the fastest deduplicator (see the HDD results
below).  If performance is a concern, it may be worth considering another
deduplicator like [rmlint](https://github.com/sahib/rmlint).  Performance can be
tuned using command arguments (`--bytes`, `--blocksize`, `--threads`, etc.),
although yafd with defaults should be usable for most tasks.

Here are some metrics for reference.

**SSD (btrfs)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 4.30s | 267.88 MiB/s | 175.70 MiB/s |
| rmlint | 7.43s | 155.13 MiB/s | 101.74 MiB/s |
| fdupes | 30.34s | 37.99 MiB/s | 24.92 MiB/s |
| duff | 25.14s | 45.86 MiB/s | 30.08 MiB/s |
| yafd (cached) | 0.61s | 1.84 GiB/s | 1.20 GiB/s |
| rmlint (cached) | 2.46s | 466.21 MiB/s | 307.40 MiB/s |
| fdupes (cached) | 12.27s | 93.94 MiB/s | 61.61 MiB/s |
| duff (cached) | 6.51s | 176.17 MiB/s | 116.12 MiB/s |

**HDD (ext4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 1087.59s | 1.05 MiB/s | 711.99 KiB/s |
| rmlint | 65.03s | 163.46 MiB/s | 107.21 MiB/s |
| fdupes | 322.57s | 3.57 MiB/s | 2.34 MiB/s |
| duff | 954.70s | 1.20 MiB/s | 811.10 KiB/s |
| yafd (cached) | 7.05s | 163.46 MiB/s | 107.21 MiB/s |
| rmlint (cached) | 2.84s | 406.37 MiB/s | 266.53 MiB/s |
| fdupes (cached) | 12.44s | 92.64 MiB/s | 60.76 MiB/s |
| duff (cached) | 6.56s | 175.76 MiB/s | 115.28 MiB/s |

**NFS (v4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 197.08s | 5.85 MiB/s | 3.83 MiB/s |
| rmlint | 461.26s | 2.49 MiB/s | 1.63 MiB/s |
| fdupes | 648.24s | 1.77 MiB/s | 1.16 MiB/s |
| duff | 466.69s | 2.47 MiB/s | 1.62 MiB/s |
| yafd (cached) | 95.04s | 12.13 MiB/s | 7.95 MiB/s |
| rmlint (cached) | 423.90s | 2.71 MiB/s | 1.78 MiB/s |
| fdupes (cached) | 611.19s | 1.88 MiB/s | 1.23 MiB/s |
| duff (cached) | 403.72s | 2.85 MiB/s | 1.87 MiB/s |

(1) The Linux sources (4.3, 4.4) were searched for identical files.

(2) For an equivalent comparison, the following command arguments were used
(see also the [benchmark scripts](https://github.com/uxcn/yafd/tree/master/perf/src/python/benchmark)).

    yafd --recurse --zero

    rmlint --algorithm=paranoid --hidden -o fdupes:stdout

    fdupes --recurse

    duff -rpta -f#

(3) Benchmarks were run on Linux 4.4.0 with an Intel Ivy Bridge CPU (i7-3632QM).

## Install ##

You can download a copy of the source
[here](https://github.com/uxcn/yafd/releases) or you can clone the repository
using git.

    jason@io ~ git clone https://github.com/uxcn/yafd.git

It's a good idea to check out a specific release.

    jason@io ~/yafd git checkout v0.1

In the project directory, run the autoconf script.

    jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'

Adding the architecture allows algorithms that rely on architecture-specific
implementations to be used.  The easiest way to do this is normally
`-march=native`.  You can also explicitly enable instruction sets via autoconf.

    jason@io ~/yafd ./autoconf.sh --enable-sse4_2

To install to a directory other than /usr/local, you can manually configure the
prefix.  If you do, make sure your `PATH` and `MANPATH` are set correctly.

    jason@io ~/yafd ./autoconf.sh --prefix=$HOME

Run `make install` to compile and install.

    jason@io ~/yafd make install

Currently yafd compiles and is tested on Linux, FreeBSD, OSX, and Windows.
Patches and pull requests for other platforms are definitely welcome.

## Versions ##

0.1 - alpha release

## FAQ ##

Why write another file deduplicator?

*A lot of the current ones were more complicated than I wanted, didn't perform
well, or weren't portable.*

Why doesn't yafd do *X*?

*Most likely nobody asked for X yet.  If you think something's missing, send a
feature [request](https://github.com/uxcn/yafd/issues) or even better, a
[pull request](https://github.com/uxcn/yafd/pull/new/master).*

How does yafd work?

*The basic algorithm is to group files by their sizes, compute a hash on a small
(random) chunk of each file, and then compare files that have the same hash.
This is a bit of an oversimplification though.  For a better understanding, it
may help to try reading the
[code](https://github.com/uxcn/yafd/blob/master/src/c/worker.c).*
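
The three stages described above (size, chunk hash, full comparison) can be
illustrated with a rough sketch for a single pair of files.  This is not yafd's
code: the function names are hypothetical, FNV-1a stands in for whatever hash
yafd actually uses, and the chunk offset is passed in by the caller rather than
chosen at random.  A real deduplicator would also group many files at once
instead of testing pairs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Size of a file in bytes, or -1 on error. */
static long file_size(const char *path) {
    FILE *f = fopen(path, "rb");
    long size = -1;
    if (f == NULL) return -1;
    if (fseek(f, 0, SEEK_END) == 0) size = ftell(f);
    fclose(f);
    return size;
}

/* FNV-1a hash of up to `len` bytes starting at `offset`.
 * FNV-1a is an assumption made for this sketch, not yafd's hash. */
static uint64_t chunk_hash(const char *path, long offset, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a 64-bit offset basis */
    FILE *f = fopen(path, "rb");
    if (f == NULL) return h;
    if (fseek(f, offset, SEEK_SET) == 0) {
        for (size_t i = 0; i < len; i++) {
            int c = fgetc(f);
            if (c == EOF) break;
            h ^= (uint64_t)(unsigned char)c;
            h *= 0x100000001b3ULL;           /* FNV-1a 64-bit prime */
        }
    }
    fclose(f);
    return h;
}

/* Byte-by-byte comparison; returns 1 if both files are identical. */
static int same_contents(const char *a, const char *b) {
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int same = (fa != NULL && fb != NULL);
    while (same) {
        int ca = fgetc(fa);
        int cb = fgetc(fb);
        if (ca != cb) same = 0;              /* mismatch or different length */
        else if (ca == EOF) break;           /* both ended together */
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return same;
}

/* Cheap checks first; the full comparison runs only if both pass.
 * The same offset must be used for both files. */
int is_duplicate(const char *a, const char *b, long offset, size_t len) {
    long size_a = file_size(a);
    if (size_a < 0 || size_a != file_size(b)) return 0;   /* stage 1: size */
    assert(offset >= 0 && offset <= size_a);
    if (chunk_hash(a, offset, len) != chunk_hash(b, offset, len))
        return 0;                                         /* stage 2: hash */
    return same_contents(a, b);                           /* stage 3: bytes */
}
```

Calling `is_duplicate("a.txt", "b.txt", 0, 4096)` returns 1 only when all three
stages agree.  The ordering is the point of the design: most non-duplicates are
rejected by the size check or by hashing a small chunk, so the expensive full
read is reserved for files that are likely to match.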

## Other deduplicators ##

* [rmlint](https://github.com/sahib/rmlint)

* [fdupes](https://github.com/adrianlopezroche/fdupes)

* [duff](https://github.com/elmindreda/duff)

* others...
