https://github.com/snawoot/terse
Output randomly sampled lines from input stream or file
https://github.com/snawoot/terse
random-sampling reservoir-sampling
Last synced: 6 months ago
JSON representation
Output randomly sampled lines from input stream or file
- Host: GitHub
- URL: https://github.com/snawoot/terse
- Owner: Snawoot
- License: mit
- Created: 2023-02-02T13:33:43.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-02-03T15:37:41.000Z (about 3 years ago)
- Last Synced: 2024-11-27T14:48:58.771Z (over 1 year ago)
- Topics: random-sampling, reservoir-sampling
- Language: Go
- Homepage:
- Size: 21.5 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# terse
Output randomly sampled lines from input stream or file. Uses simple [reservoir sampling](http://www.cs.umd.edu/~samir/498/vitter.pdf) algorithm to process input with linear time complexity. Suitable for processing streams, seeing each line only once. Retains relative order of lines.
## Usage example
```
> seq 1000000 | terse -n 5
349893
539678
576919
738393
758023
```
## Performance
Comparison against `shuf -n` on real data: 5.1GB nginx log with 17451712 lines in it.
```
root@logger:~# ls -lh /var/log/remote/nginx/2023_02_02_18.log
-rw-r----- 1 root logs 5.1G Feb 2 18:59 /var/log/remote/nginx/2023_02_02_18.log
root@logger:~# wc -l /var/log/remote/nginx/2023_02_02_18.log
17451712 /var/log/remote/nginx/2023_02_02_18.log
root@logger:~# time terse -i /var/log/remote/nginx/2023_02_02_18.log -n 25 > /dev/null
real 0m2.656s
user 0m1.315s
sys 0m1.372s
root@logger:~# time shuf -n 25 /var/log/remote/nginx/2023_02_02_18.log > /dev/null
real 0m22.784s
user 0m21.059s
sys 0m1.703s
```
It processes about tens of millions of lines per second on modern computer. Most likely I/O will become bottleneck in such sampling rather than application performance will be an issue.
## Installation
#### Binaries
Pre-built binaries are available [here](https://github.com/Snawoot/terse/releases/latest).
#### Build from source
Alternatively, you may install terse from source. Run the following within the source directory:
```
make install
```
#### Docker
A docker image is available as well. Here is an example of running terse in a pipeline with docker:
```sh
seq 5 | docker run -i --rm yarmak/terse
```
## Synopsis
```
> terse -h
Usage:
terse [OPTION]...
Options:
-buffered
buffer control (default true)
-i string
use input file instead of stdin
-n int
number of lines to sample (default 25)
-o string
use output file instead of stdout
-seed value
use fixed random seed (default is a value from CSPRNG)
-version
show program version and exit
-z line delimiter is NUL, not newline
```