https://github.com/mterron/swuniq
A command-line tool for deduplicating entries in a file or stream with constant memory usage
cli dedupe deduping deduplicate deduplication filter sliding-window uniq
- Host: GitHub
- URL: https://github.com/mterron/swuniq
- Owner: mterron
- License: mit
- Created: 2018-10-19T04:47:52.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-04-11T21:15:55.000Z (about 4 years ago)
- Last Synced: 2025-12-11T01:57:26.563Z (5 months ago)
- Topics: cli, dedupe, deduping, deduplicate, deduplication, filter, sliding-window, uniq
- Language: C
- Homepage:
- Size: 124 KB
- Stars: 5
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- License: LICENSE
# swuniq
Deduplicate matching lines (within a configurable window) from a file or standard input, writing to standard output.
Like `uniq`, but it works on unsorted input, so it can be used as a pipe filter with constant memory usage.
#### Why?
Sometimes you need to consume a data stream (a Certificate Transparency log, for example) that has non-consecutive duplicates, and you don't want to deal with them. The usual solution involving `awk` has unbounded memory usage, which can be a problem; swuniq's memory usage is bounded by the window size.
#### Memory Usage
swuniq uses a ring buffer of configurable size (the `-w` option) as a FIFO queue that stores a 64-bit hash per line, so memory use stays constant at 64 bits × the `-w` value (800 bytes of hashes for the default window of 100).
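The idea can be sketched in a few lines of C. This is a minimal illustration, not the code swuniq ships: the names (`sw_window`, `sw_init`, `sw_insert_if_new`, `sw_free`) are hypothetical, FNV-1a stands in for whatever 64-bit hash swuniq actually uses, and the linear scan of the window is an assumption made for brevity. The sketch records a hash only when a line is emitted, which appears consistent with the outputs in the Example section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sliding window: a ring buffer of 64-bit line hashes. */
typedef struct {
    uint64_t *slots; /* hashes of the most recently emitted lines */
    size_t size;     /* window size (the -w value) */
    size_t count;    /* slots filled so far, at most `size` */
    size_t next;     /* slot to overwrite next (the oldest once full) */
} sw_window;

/* FNV-1a 64-bit: an assumed stand-in for the real hash function. */
uint64_t sw_hash(const char *s, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Allocate the window; memory is fixed at size * 8 bytes. */
bool sw_init(sw_window *w, size_t size) {
    w->slots = NULL;
    w->size = size;
    w->count = 0;
    w->next = 0;
    if (size == 0)
        return false;
    w->slots = calloc(size, sizeof *w->slots);
    return w->slots != NULL;
}

void sw_free(sw_window *w) {
    free(w->slots);
    w->slots = NULL;
}

/* Return true if the line was not seen within the window (and record it),
 * false if its hash is already in the window (a duplicate to drop). */
bool sw_insert_if_new(sw_window *w, const char *line, size_t len) {
    assert(w != NULL && w->slots != NULL);
    uint64_t h = sw_hash(line, len);
    for (size_t i = 0; i < w->count; i++) {
        if (w->slots[i] == h)
            return false;
    }
    w->slots[w->next] = h;             /* overwrite the oldest entry */
    w->next = (w->next + 1) % w->size; /* advance FIFO position */
    if (w->count < w->size)
        w->count++;
    return true;
}
```

A filter built on this would read lines in a loop and print only those for which `sw_insert_if_new` returns true. Because only hashes are kept, two different lines with the same hash would be treated as duplicates; with a 64-bit hash that is unlikely but not impossible.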
#### Example
```sh
# swuniq -h
Usage: swuniq [-w N] [INPUT]
Filter matching lines (within a configurable window) from INPUT
(or standard input), writing to standard output.
-w N Size of the sliding window to use for deduplication
Note: By default swuniq will use a window of 100 lines.
# cat input.txt
apple
apple
apple
banana
banana
strawberry
blueberry
apple
banana
strawberry
blueberry
kiwifruit
orange
peach
watermelon
orange
watermelon
kiwifruit
banana
banana
banana
apple
kiwifruit
# swuniq < input.txt
apple
banana
strawberry
blueberry
kiwifruit
orange
peach
watermelon
# swuniq -w 4 < input.txt
apple
banana
strawberry
blueberry
kiwifruit
orange
peach
watermelon
banana
apple
kiwifruit
# swuniq -w 2 < input.txt
apple
banana
strawberry
blueberry
apple
banana
strawberry
blueberry
kiwifruit
orange
peach
watermelon
orange
kiwifruit
banana
apple
kiwifruit
```