Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/joshwalters/edcount
Estimate distinct count of values on the command line
https://github.com/joshwalters/edcount
Last synced: 2 months ago
JSON representation
Estimate distinct count of values on the command line
- Host: GitHub
- URL: https://github.com/joshwalters/edcount
- Owner: joshwalters
- License: gpl-3.0
- Created: 2018-12-08T16:03:17.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-11-18T01:25:41.000Z (about 3 years ago)
- Last Synced: 2024-05-23T02:34:33.785Z (8 months ago)
- Language: C
- Homepage:
- Size: 99.6 KB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README
- Changelog: ChangeLog
- Contributing: CONTRIBUTING
- License: COPYING
- Authors: AUTHORS
Awesome Lists containing this project
- awesome-command-line-apps - edcount - Estimate distinct count of values from standard input. (\*nix/\*nux / Data Processing)
README
edcount
-------
Estimate distinct count of values from standard input. Provides a very fast
way to perform unique count estimates on the command line.A value is considered terminated with a newline.
Why use it?
-----------If you want to quickly process a large number of values and find out roughly
how many of them are unique.The program is well suited to estimating unique counts of billions of unique
values.How it works
------------The edcount program implements HyperLogLog, with some minor modifications,
as detailed by by Flajolet et. al. in the paper "HyperLogLog: the analysis of
a near-optimal cardinality estimation algorithm".Additionally, the memory footprint of the program is constant, at a few
megabytes. This memory use is constant regardless of the number of records
counted, and does not degrade in accuracy.Benchmark
---------Counting 10 million unique records with sort and wc:
```
$ seq 1 10000000 | shuf | sort -u | wc -l
10000000
real 1m16.213s
user 1m13.516s
sys 0m1.516s
```Counting 10 million unique records with edcount:
```
$ seq 1 10000000 | shuf | edcount
9988659
real 0m1.110s
user 0m1.172s
sys 0m0.180s
```This is a 0.11% difference from the actual unique value.
Fast Install Instructions
-------------------------Required software: `make`,`gcc`
```
curl -L -o edcount-1.2.0.tar.gz https://github.com/joshwalters/edcount/releases/download/v1.2.0/edcount-1.2.0.tar.gz
tar -xzvf edcount-1.2.0.tar.gz
cd edcount-1.2.0
./configure
make
make install
```You might have to use `sudo make install` instead of `make install`.
For more detail, see the INSTALL file.
Contributing / Development
--------------------------See the CONTRIBUTING file.
License
-------GPLv3+, see the COPYING file.