https://github.com/mpdn/sketch-duplicates
Find duplicate lines probabilistically
https://github.com/mpdn/sketch-duplicates
Last synced: about 1 year ago
JSON representation
Find duplicate lines probabilistically
- Host: GitHub
- URL: https://github.com/mpdn/sketch-duplicates
- Owner: mpdn
- License: mit
- Created: 2020-07-05T17:34:27.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-07-05T17:41:52.000Z (almost 6 years ago)
- Last Synced: 2025-03-07T17:04:06.564Z (over 1 year ago)
- Language: Rust
- Size: 10.7 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# sketch-duplicates
Find duplicate lines probabilistically.
## Motivation
Let's say you have a directorty of gzipped text files that you want to check for duplicate lines in.
The usual way to do this might look something like this:
```shell
zcat *.gz | sort | uniq -d
```
The problem with this, is that it can become very slow for large files. `sketch-duplicates` provides
a way to remove *most* unique lines, leaving *mostly* duplicate lines in the output.
`sketch-duplicates` is probabilistic and is therefore not guarenteed to remove *all* unique lines.
It is therefore still necessary to have a `sort | uniq -d` in the end but this will be much faster
due to the input having most unique lines removed.
The above can example can be written to use a sketch like this:
```shell
zcat *.gz | sketch-duplicates build > sketch
zcat *.gz | sketch-duplicates filter sketch | sort | uniq -d
```
Multiple sketches can be combined using `sketch-duplicates combine`. This can be used to parallelize
the construction of the sketch (here using GNU Parallel):
```shell
echo *.gz | parallel 'zcat {} | sketch-duplicates build' | sketch-duplicates combine > sketch
echo *.gz | parallel 'zcat {} | sketch-duplicates filter sketch' | sort | uniq -d
```
## Options
- `-s`, `--size`: Size of the sketch. Increasing this improves filtering accuracy but consumes more
memory. This is set to a conservative default of 8MiB and can often be increased depending on the
specific use case.
- `-p`, `--probes`: Number of probes to do in the sketch.
- `-0`, `--zero-terminated`: Use NULL bytes as line delimiters.
## Install
Install Cargo (eg. using [rustup](https://www.rust-lang.org/tools/install)), then run
`cargo install sketch-duplicates`.