https://github.com/uxcn/yafd
yet another file deduplicator
- Host: GitHub
- URL: https://github.com/uxcn/yafd
- Owner: uxcn
- License: gpl-3.0
- Created: 2016-02-10T04:36:15.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2019-10-04T04:58:51.000Z (over 5 years ago)
- Last Synced: 2024-11-12T11:43:06.851Z (7 months ago)
- Topics: c, deduplicator, freebsd, linux, osx, windows
- Language: C
- Size: 216 KB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# yafd #

yafd is a (yet another) file deduplicator.

## Usage ##
For detailed info, see [USAGE](https://github.com/uxcn/yafd/blob/master/USAGE.md) or the
manpage (`man yafd`).

The easiest way to use yafd is to pass a directory or a set of directories to
it.

    jason@io ~ yafd .

It can recurse as well.

    jason@io ~ yafd -r .

Another easy way to use yafd is passing files to check as arguments. Shell
globbing helps.

    jason@io ~ yafd **/*.c **/*.h

You can also pipe paths to yafd via stdin. This makes it easy to limit the sets
of files to check.

    jason@io ~ find . -size +1M | yafd

The output can also be piped to other commands to do things with the duplicate files.

    jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'
    12659698

## Performance ##
yafd is not yet always the fastest deduplicator (see the HDD results below).
If performance is a concern, it may be worth considering another deduplicator
like [rmlint](https://github.com/sahib/rmlint). Performance can be tuned
using command arguments (`--bytes`, `--blocksize`, `--threads`, etc.),
although yafd with its defaults should be usable for most tasks.

Here are some metrics for reference.

**SSD (btrfs)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 4.30s | 267.88 MiB/s | 175.70 MiB/s |
| rmlint | 7.43s | 155.13 MiB/s | 101.74 MiB/s |
| fdupes | 30.34s | 37.99 MiB/s | 24.92 MiB/s |
| duff | 25.14s | 45.86 MiB/s | 30.08 MiB/s |
| yafd (cached) | 0.61s | 1.84 GiB/s | 1.20 GiB/s |
| rmlint (cached) | 2.46s | 466.21 MiB/s | 307.40 MiB/s |
| fdupes (cached) | 12.27s | 93.94 MiB/s | 61.61 MiB/s |
| duff (cached) | 6.51s | 176.17 MiB/s | 116.12 MiB/s |

**HDD (ext4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 1087.59s | 1.05 MiB/s | 711.99 KiB/s |
| rmlint | 65.03s | 163.46 MiB/s | 107.21 MiB/s |
| fdupes | 322.57s | 3.57 MiB/s | 2.34 MiB/s |
| duff | 954.70s | 1.20 MiB/s | 811.10 KiB/s |
| yafd (cached) | 7.05s | 163.46 MiB/s | 107.21 MiB/s |
| rmlint (cached) | 2.84s | 406.37 MiB/s | 266.53 MiB/s |
| fdupes (cached) | 12.44s | 92.64 MiB/s | 60.76 MiB/s |
| duff (cached) | 6.56s | 175.76 MiB/s | 115.28 MiB/s |

**NFS (v4)**

| | time | throughput | throughput (dup) |
|---|---|---|---|
| yafd | 197.08s | 5.85 MiB/s | 3.83 MiB/s |
| rmlint | 461.26s | 2.49 MiB/s | 1.63 MiB/s |
| fdupes | 648.24s | 1.77 MiB/s | 1.16 MiB/s |
| duff | 466.69s | 2.47 MiB/s | 1.62 MiB/s |
| yafd (cached) | 95.04s | 12.13 MiB/s | 7.95 MiB/s |
| rmlint (cached) | 423.90s | 2.71 MiB/s | 1.78 MiB/s |
| fdupes (cached) | 611.19s | 1.88 MiB/s | 1.23 MiB/s |
| duff (cached) | 403.72s | 2.85 MiB/s | 1.87 MiB/s |

(1) The Linux sources were searched for identical files (4.3, 4.4)

(2) For an equivalent comparison, the following command arguments were used
(also [see](https://github.com/uxcn/yafd/tree/master/perf/src/python/benchmark))

    yafd --recurse --zero
    rmlint --algorithm=paranoid --hidden -o fdupes:stdout
    fdupes --recurse
    duff -rpta -f#

(3) Linux 4.4.0 and Intel Ivy Bridge (i7-3632QM) were used for benchmarks

## Install ##
You can download a copy of the source
[here](https://github.com/uxcn/yafd/releases) or you can clone the repository
using git.

    jason@io ~ git clone https://github.com/uxcn/yafd.git

It's a good idea to check out a specific release.

    jason@io ~/yafd git checkout v0.1

In the project directory, run the autoconf script.

    jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'

Specifying the architecture allows algorithms that rely on architecture-specific
implementations to be used. The easiest way to do this is normally
`-march=native`. You can also explicitly enable instruction sets
via autoconf.

    jason@io ~/yafd ./autoconf.sh --enable-sse4_2

To install to a directory other than /usr/local, you can manually configure the
prefix. If you do, make sure your `PATH` and `MANPATH` are set correctly.

    jason@io ~/yafd ./autoconf.sh --prefix=$HOME

Run `make install` to compile and install.

    jason@io ~/yafd $ make install

Currently yafd compiles and is tested on Linux, FreeBSD, OSX, and Windows.
Patches and pull requests for other platforms are definitely welcome.

## Versions ##
0.1 - alpha release
## FAQ ##
Why write another file deduplicator?

*A lot of the current ones were more complicated than I wanted, didn't perform
well, or weren't portable.*

Why doesn't yafd do *X*?

*Most likely nobody asked for X yet. If you think something's missing, send a
feature [request](https://github.com/uxcn/yafd/issues) or even better, a [pull
request](https://github.com/uxcn/yafd/pull/new/master).*

How does yafd work?

*The basic algorithm is to group files by their sizes, compute a hash on a small
(random) chunk of each file, and then compare files that have the same hash.
This is a bit of an oversimplification though. For a better understanding, it
may help to try reading the
[code](https://github.com/uxcn/yafd/blob/master/src/c/worker.c).*

## Other deduplicators ##
* [rmlint](https://github.com/sahib/rmlint)
* [fdupes](https://github.com/adrianlopezroche/fdupes)
* [duff](https://github.com/elmindreda/duff)
* others...
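
## Algorithm sketch ##

The FAQ describes the basic algorithm as three stages: group by size, hash a
small chunk, then compare files whose hashes match. The following is a minimal
sketch of that staging for a single pair of files. It is not yafd's actual code.
The names `file_size`, `chunk_hash`, `files_equal`, `is_duplicate` and
`CHUNK_SIZE` are hypothetical, FNV-1a stands in for whatever hash yafd uses, and
the chunk is taken from the middle of the file instead of a random offset. The
real tool groups many files at once rather than testing pairs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096 /* hypothetical chunk length, not yafd's default */

/* Return the size of a file in bytes, or -1 on error. */
static long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long size;

    if (f == NULL)
        return -1;
    size = (fseek(f, 0, SEEK_END) == 0) ? ftell(f) : -1;
    fclose(f);
    return size;
}

/* Store the FNV-1a hash of up to CHUNK_SIZE bytes at `offset` in *out.
 * Returns 0 on success, -1 on error. */
static int chunk_hash(const char *path, long offset, uint64_t *out)
{
    unsigned char buf[CHUNK_SIZE];
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    size_t n, i;
    int failed;
    FILE *f = fopen(path, "rb");

    assert(offset >= 0);
    if (f == NULL)
        return -1;
    if (fseek(f, offset, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    n = fread(buf, 1, sizeof buf, f);
    failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;
    for (i = 0; i < n; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
    }
    *out = h;
    return 0;
}

/* Compare two files byte by byte. Returns 1 if identical, 0 otherwise. */
static int files_equal(const char *a, const char *b)
{
    unsigned char ba[CHUNK_SIZE], bb[CHUNK_SIZE];
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int equal = (fa != NULL && fb != NULL);

    while (equal) {
        size_t na = fread(ba, 1, sizeof ba, fa);
        size_t nb = fread(bb, 1, sizeof bb, fb);

        if (na != nb || memcmp(ba, bb, na) != 0)
            equal = 0;
        else if (na == 0)
            break; /* both streams stopped at the same point */
    }
    if (equal && (ferror(fa) || ferror(fb)))
        equal = 0; /* a read error is not proof of equality */
    if (fa != NULL)
        fclose(fa);
    if (fb != NULL)
        fclose(fb);
    return equal;
}

/* Staged check: size first, then a chunk hash, then a full comparison. */
int is_duplicate(const char *a, const char *b)
{
    long sa = file_size(a), sb = file_size(b);
    long offset;
    uint64_t ha, hb;

    if (sa < 0 || sa != sb)
        return 0; /* stage 1: different sizes cannot be duplicates */
    offset = sa > CHUNK_SIZE ? (sa - CHUNK_SIZE) / 2 : 0;
    if (chunk_hash(a, offset, &ha) != 0 || chunk_hash(b, offset, &hb) != 0)
        return 0;
    if (ha != hb)
        return 0; /* stage 2: chunk hashes differ */
    return files_equal(a, b); /* stage 3: confirm byte by byte */
}
```

Calling `is_duplicate("a.c", "b.c")` returns 1 only when both files can be read
and have identical contents. The point of the ordering is that each stage is
cheaper than the next, so most non-duplicates are rejected before any file is
read in full.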