https://github.com/amoghe/dedup
Deduplicator
https://github.com/amoghe/dedup
compression compressor dedupe deduplication
Last synced: 5 months ago
JSON representation
Deduplicator
- Host: GitHub
- URL: https://github.com/amoghe/dedup
- Owner: amoghe
- License: mit
- Created: 2017-07-01T02:59:07.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-08-09T00:39:33.000Z (almost 9 years ago)
- Last Synced: 2024-11-15T09:52:21.013Z (over 1 year ago)
- Topics: compression, compressor, dedupe, deduplication
- Language: Go
- Size: 38.1 KB
- Stars: 4
- Watchers: 3
- Forks: 2
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# `dedup` - a deduplication tool (+ library)
[](https://travis-ci.org/amoghe/dedup) [](http://godoc.org/github.com/amoghe/dedup)
## Why
Deduplication can be thought of as a coarse grained compression that detects
duplicate data over much larger windows than most compressors work across. As
a result, deduplication is a good first step and passing deduplicated data into
a downstream compressor often results in much better compression performance
(in terms of compression ratio and sometimes speed)
## Library
`dedup` is a golang lib that allows for arbitrary data to be deduplicated
(input from an io.Reader). The `dedup.Deduplicator` and `dedup.Reduplicator`
are the workhorses that actually do all the work. For examples on how to use
them, see the dedup tool itself (in `cmd/dedup/main.go`).
#### Installation
As with any go(lang) pkg, you can `go get` by
```
shell> go get github.com/amoghe/dedup
```
#### Usage
```
import (
"github.com/amoghe/dedup"
)
// somewhere in your code
err := dedup.NewDeduplicator(windowSize, mask).Do(os.Stdin, os.Stdout)
err := dedup.NewReduplicator().Do(os.Stdin, os.Stdout)
```
## Binary
This codebase also builds a cmdline tool named `dedup` (see `cmd/dedup`) that
can be used to deduplicate data.
#### Installation
The tool can be installed by either building from source
```
shell> go get github.com/amoghe/dedup && \
cd $GOPATH/src/github.com/amoghe/dedup && \
go install
```
Alternatively, you can download a release binary from the Releases section of
this github project.
#### Usage
Consider this workload where we save two similar docker containers:
```
akshay@spitfire:~/$ time docker save redmine bitnami/redmine | gzip | wc --bytes
497548816 # <-- 474.49 MB (or MiB)
real 1m7.900s
user 0m58.536s
sys 0m1.780s
akshay@spitfire:~/$ time docker save redmine bitnami/redmine | dedup | gzip | wc --bytes
295793793 # <-- 282.09 MB (or MiB)
real 0m50.261s
user 0m56.312s
sys 0m3.688s
```
As you can see, some workloads can benefit greatly from a combination of
deduplication + compression (in terms of both compression ratio and speed)
## Compression
Note that this lib (and tool) probably won't ever support built-in support for compression of the output stream. You should pick an appropriate compressor "downstream" from this lib/tool. You'll find that standalone compressors such as
`gzip`, `bzip2`, `xz` (and their parallel implementations - `pigz`, `pbzip2`,
`pxz`) are readily available on most linux distributions. These compressors
support pipelining (i.e. i/o can be pipelined via the shell) so there is no need
for this library to provide this functionality.
## TODO:
- Currently the deduplication lib consumes memory that is proportional to the
size of the input file. (See issue #1)
- Document the usage and impact of the windowSize and zeroBits parameters used
by the `Deduplicator`
- Add progress reporting when input is a large file (not stdin)
- Make cmdline args fully compatible with other compression tools ('-k', '-v')
- Add tests! (unit tests, fuzz tests)
## LICENSE
See the LICENSE file