# dedup
Deduplicate string data

### How to execute
The main executable is located in the `cmd/` dir, and it has the following flags:
* `--out` output file location
* `--tmp-file-bytes` maximum temporary file bytes (default 250000000)
* `--in` input file location

How to compile and run:
* `cd `
* `go build -o ./dedup github.com/veqryn/dedup/cmd`
* `./dedup --out=deduped.log --in=testdata/testdata.log`

### Input and Output format
The input should be a single newline-delimited file containing one string per line.
The output will be a single newline-delimited file containing sorted, deduplicated strings.
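
For example, given a hypothetical input file containing:

```
banana
apple
banana
cherry
apple
```

the output file would contain:

```
apple
banana
cherry
```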

### How it works
This package is given a file to write to, a file to read from, and a temporary file size limit for when it needs to spill to disk. It deduplicates strings/URLs by reading the input file line by line into a set (a value-less hashmap), writing the set out to a sorted temporary file each time it approaches the `--tmp-file-bytes` limit. It then merges the temporary files into the final output file, deduplicating lines as it goes.
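
A minimal sketch of the chunking phase is below. This is illustrative rather than the repository's actual code; names like `writeChunks` and the flush logic are assumptions:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sort"
)

// writeChunks reads inPath line by line into a set, spilling the set to a
// sorted temporary file whenever its approximate size nears maxBytes.
// It returns the paths of the chunk files it wrote.
// Illustrative sketch, not the package's actual implementation.
func writeChunks(inPath string, maxBytes int) ([]string, error) {
	in, err := os.Open(inPath)
	if err != nil {
		return nil, err
	}
	defer in.Close()

	var chunkPaths []string
	set := make(map[string]struct{})
	size := 0

	flush := func() error {
		if len(set) == 0 {
			return nil
		}
		// Sort the set's keys so each chunk file is sorted on disk.
		lines := make([]string, 0, len(set))
		for line := range set {
			lines = append(lines, line)
		}
		sort.Strings(lines)

		f, err := os.CreateTemp("", "dedup-chunk-*.tmp")
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, line := range lines {
			fmt.Fprintln(w, line)
		}
		if err := w.Flush(); err != nil {
			f.Close()
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		chunkPaths = append(chunkPaths, f.Name())
		set = make(map[string]struct{})
		size = 0
		return nil
	}

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		if _, ok := set[line]; !ok {
			set[line] = struct{}{}
			size += len(line) + 1 // account for the newline
		}
		if size >= maxBytes {
			if err := flush(); err != nil {
				return nil, err
			}
		}
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	if err := flush(); err != nil { // write any remaining lines
		return nil, err
	}
	return chunkPaths, nil
}

func main() {
	paths, err := writeChunks("testdata/testdata.log", 250_000_000)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("chunk files:", paths)
}
```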

##### Design considerations
When the deduplicated content is larger in bytes than our machine's memory, we will not be able to hold the final file in memory. This presents a problem: even if we split the input file and deduplicate each chunk, how do we recombine the chunks without reintroducing duplicates if we cannot hold them all in memory at the same time?

The solution chosen for this implementation deduplicates AND sorts the chunks before writing them. Then, when the chunks are merged back together, we need only read the current line from each chunk and compare it against the current lines of all other chunks; whichever line comes first lexicographically is written to the output (merged) file. Because every chunk is sorted, the merge is guaranteed to see any duplicates across the files in sequence, so we deduplicate by skipping all but the first occurrence.
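
A sketch of that merge step, again illustrative rather than the repository's actual code (the name `mergeChunks` is an assumption; a production version might use a min-heap from `container/heap` when merging many chunks instead of the linear scan shown here):

```go
package main

import (
	"bufio"
	"log"
	"os"
)

// mergeChunks merges already-sorted, already-deduplicated chunk files into a
// single sorted output file, skipping duplicates that appear across chunks.
// Illustrative sketch, not the package's actual implementation.
func mergeChunks(chunkPaths []string, outPath string) error {
	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()
	w := bufio.NewWriter(out)

	// Open a scanner per chunk and prime each with its first line.
	scanners := make([]*bufio.Scanner, 0, len(chunkPaths))
	current := make([]string, 0, len(chunkPaths))
	for _, p := range chunkPaths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		s := bufio.NewScanner(f)
		if s.Scan() {
			scanners = append(scanners, s)
			current = append(current, s.Text())
		}
	}

	var last string
	wroteAny := false
	for len(scanners) > 0 {
		// Find the chunk whose current line sorts first.
		min := 0
		for i := 1; i < len(scanners); i++ {
			if current[i] < current[min] {
				min = i
			}
		}
		// Write the line only if it differs from the last line written:
		// duplicates across chunks arrive in sequence because every
		// chunk is sorted.
		if !wroteAny || current[min] != last {
			if _, err := w.WriteString(current[min] + "\n"); err != nil {
				return err
			}
			last = current[min]
			wroteAny = true
		}
		// Advance the chosen scanner, dropping it when exhausted.
		if scanners[min].Scan() {
			current[min] = scanners[min].Text()
		} else {
			if err := scanners[min].Err(); err != nil {
				return err
			}
			scanners = append(scanners[:min], scanners[min+1:]...)
			current = append(current[:min], current[min+1:]...)
		}
	}
	return w.Flush()
}

func main() {
	if err := mergeChunks([]string{"chunk1.tmp", "chunk2.tmp"}, "deduped.log"); err != nil {
		log.Fatal(err)
	}
}
```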

The resulting output (merged) file is then fully deduplicated, and it is also sorted as a side effect of choosing this implementation.

A second side benefit of this implementation is that this program can be run against an input file of arbitrary size (petabytes or more) while using very little memory, since the in-memory set is flushed to disk whenever it approaches the `--tmp-file-bytes` limit.