Hard Link Deduplicator
- Host: GitHub
- URL: https://github.com/glehmann/hld
- Owner: glehmann
- License: mit
- Created: 2018-12-17T12:45:46.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-09-02T11:00:27.000Z (4 months ago)
- Last Synced: 2024-10-10T23:31:13.834Z (3 months ago)
- Topics: dedup, deduplication, hardlinks, reflinks, rust
- Language: Rust
- Homepage:
- Size: 282 KB
- Stars: 8
- Watchers: 4
- Forks: 1
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
Hard Link Deduplicator
======================

`hld` finds the duplicated files and hardlinks them together in order to save
some disk space. And it's made to be fast!

Here is an example session on a modern (2017) laptop:
```fish
$ du -sh myproject ~/.m2
896M myproject
912M .m2
$ time hld -r -c ~/.m2 myproject
420.23 MB saved in the deduplication of 675 files
real 0.47
user 1.17
sys 0.22
```

420 MB saved (46% of the build directory size) in just 0.5 seconds :-)
[![CI Status](https://github.com/glehmann/hld/actions/workflows/ci.yml/badge.svg)](https://github.com/glehmann/hld/actions)
Features
--------

It works with all the available cores by default and uses the
[BLAKE3](https://blake3.io/) hashing function in order to be both very fast
and to have an extremely low chance of collision.

Because of its caching feature, it is an efficient way to deduplicate files
that might have been copied by some automated process, for example a maven
build.

Usage
-----

#### globs
`hld` takes a set of globs as arguments. The globs are used to find the
candidate files for deduplication. They support the `**` notation to traverse
any number of directories. For example:

* `hld "target/*.jar"` deduplicates all the `jar` files directly in the `target`
  directory;
* `hld "target/**/*.jar"` deduplicates all the `jar` files in the `target`
  directory and its subdirectories.

Several globs may be passed on the command line in order to work with
several directories and/or several file name patterns. For example:
`hld "target/*.jar" "images/**/*.png"`.

Note: the quotes are important to avoid glob expansion by the shell.
With large directories, the shell may not be able to pass all the
matching files as arguments.

#### caching
In addition to the raw globs of the previous section, some cached globs may
be used. They behave the same as the raw globs, but their BLAKE3 digest
values are saved for later reuse. They must only be used on files that are
guaranteed to *not* change. Cached globs are passed with the `--cache`
or `-c` option.

For example, `hld "target/*" --cache "stable/*"` will deduplicate
all the files in both `target` and `stable`, and will also cache the
digests of the files in `stable`. The cached digests of `stable` will
then be reused by a later `hld` call, in order to speed up the execution.

The quotes are very important in this case: without them, the globs would
be expanded by the shell, and only the first file of the set would be
cached.

The cache path may be specified with the `--cache-path` or `-C` option,
in order to deal with several sets of caches, depending on the execution
context.

The cache may be cleared with the `--clear-cache` option.
#### recursive
The `--recursive` or `-r` option simplifies the command line usage when working
with all the files in some directories. For example, the two following
commands are strictly equivalent:

```fish
hld -r -c ~/.m2 myproject
```

```fish
hld -c "$HOME/.m2/**/*" "myproject/**/*"
```

#### dry run
Using the `--dry-run` or `-n` option prevents `hld` from modifying anything on
the disk, cache included.

For example, `hld "target/*" --cache "stable/*" --dry-run` only shows how many
files would be deduplicated and how much space would be saved, but actually
does nothing.

#### log level
The amount of output displayed by `hld` can be controlled by the `--log-level`
or `-l` option. It accepts the following values, from the most verbose to
the most quiet: `trace`, `debug`, `info` (the default level), `warn`, `error`.

#### parallelism
By default `hld` uses all the available cores, in order to complete its task
as fast as possible. The `--parallel` or `-j` option lets you change the
number of threads to run in parallel.

For example, `hld -j1 "myproject/*"` forces `hld` to run single threaded.
#### shell completion
`hld` can generate the completion code for several shells (fish, zsh, bash, …).
Just run it with the `--completion` option followed by the shell type, and save
the produced code in the appropriate location. For example, for fish:

```fish
hld --completion fish > ~/.config/fish/completions/hld.fish
```

The completion is usually activated in new shell instances, but it may be
activated immediately by sourcing the file. Again for fish:

```fish
source ~/.config/fish/completions/hld.fish
```

Install
-------

`hld` is currently only available from source. To install it, you need
a [Rust installation](https://www.rust-lang.org/). `hld` compiles with
stable Rust. In general, `hld` tracks the latest stable release of the
Rust compiler.

```
$ git clone https://github.com/glehmann/hld
...
$ cd hld
$ cargo install --path .
...
$ $HOME/.cargo/bin/hld --version
hld 0.1.0
```

Building
--------

You need a [Rust installation](https://www.rust-lang.org/). `hld` compiles
with stable Rust. In general, `hld` tracks the latest stable release
of the Rust compiler.

To build `hld`:
```
$ git clone https://github.com/glehmann/hld
...
$ cd hld
$ cargo build --release
...
$ ./target/release/hld --version
hld 0.1.0
```

Testing
-------

To run the full test suite, use:
```
$ cargo test
...
test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```

from the repository root.
Releasing
---------

In order to produce a small, easy-to-download executable, do a release
build followed by:

```
$ strip target/release/hld
$ upx --ultra-brute target/release/hld
```

Code coverage
-------------

The code coverage may be computed with [kcov](https://simonkagstrom.github.io/kcov/).
Make sure the `kcov` executable is in the `PATH`, then run:

```fish
$ cargo test --features kcov -- --test-threads 1
```

The report is available in `target/x86_64-unknown-linux-gnu/debug/coverage/index.html`.
TODO
----

* factorize the computation of the digest for cached and non-cached files
* which duplicate do we keep when symlinking? The first one? From the caches if possible?