https://github.com/wroberts/count

UNIX line counting utilities
Host: GitHub
URL: https://github.com/wroberts/count
Owner: wroberts
License: mit
Created: 2014-11-27T14:27:01.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2018-08-04T12:04:32.000Z (over 6 years ago)
Last Synced: 2024-11-10T13:08:23.909Z (2 months ago)
Topics: c-plus-plus, count, counting-utilities, line-by-line, sort, text-processing, unix
Language: C++
Homepage:
Size: 36.1 KB
Stars: 2
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md

count - UNIX line counting utilities
====================================

Copyright (c) 2014 Will Roberts

Homepage: https://github.com/wroberts/count

This project is licensed under the terms of the MIT license (see
LICENSE.md).

Overview
--------

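`count` works similarly to `sort fruit | uniq -c` (where `fruit` is the
input file): it reports how many times each distinct line occurs.  The
output is tab-separated (count, then line) and in alphabetical order.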
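
As a rough illustration of what `count` computes, here is a minimal
Python sketch.  It is not the project's C++ implementation; the function
name `count_lines` is hypothetical, and Python's default string ordering
is assumed to stand in for "alphabetical order".

```python
from collections import Counter


def count_lines(lines):
    """Return 'COUNT<TAB>LINE' records, one per distinct line, sorted by line."""
    # Tally every distinct line, ignoring the trailing newline.
    counts = Counter(line.rstrip("\n") for line in lines)
    # Emit the count first, then the line, separated by a tab.
    return [f"{n}\t{text}" for text, n in sorted(counts.items())]
```

Calling `count_lines(["pear\n", "apple\n", "pear\n"])` yields one record
for `apple` with count 1 and one for `pear` with count 2.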

`addcount` sums two count files produced by `count`, assuming that the
files are sorted in alphabetical order.
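
Because both inputs are already sorted, the sum can be computed with a
single merge pass.  The Python sketch below shows that idea; whether
`addcount` is implemented this way is an assumption, and the names
`parse` and `add_counts` are hypothetical.

```python
def parse(record):
    """Split a 'COUNT<TAB>LINE' record into (line, count)."""
    n, text = record.rstrip("\n").split("\t", 1)
    return text, int(n)


def add_counts(left, right):
    """Merge two count listings, each sorted by line, summing shared lines."""
    a = [parse(r) for r in left]
    b = [parse(r) for r in right]
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            # Same line in both inputs: add the counts.
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the inputs still has records left.
    out.extend(a[i:])
    out.extend(b[j:])
    return [f"{n}\t{text}" for text, n in out]
```

The merge only gives correct totals when both inputs really are sorted
by line; unsorted inputs need the approach used by `sortalph`.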

`sortalph` takes count data as produced by `count` and sorts it
alphabetically; it can also be used to sum two (or more) count files
together (even if they're not in alphabetical order):

    cat COUNT1 COUNT2 | sortalph
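
A minimal Python sketch of this behaviour, assuming the
`COUNT<TAB>LINE` record format and using the hypothetical name
`sort_alph`; it is an illustration, not the project's code:

```python
from collections import defaultdict


def sort_alph(records):
    """Sum the counts of identical lines and return records sorted by line."""
    totals = defaultdict(int)
    for record in records:
        # Split off the leading count; the rest of the record is the line.
        n, text = record.rstrip("\n").split("\t", 1)
        totals[text] += int(n)
    return [f"{n}\t{text}" for text, n in sorted(totals.items())]
```

Since the totals are accumulated in a dictionary, the input order does
not matter, which is why concatenated count files can be summed this way.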

`sortnum` is a script that calls `sort -nr`.
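
For comparison, a rough Python counterpart of `sort -nr` applied to
count data might look like the sketch below.  The name `sort_num` is
hypothetical, and the order of records with equal counts may differ from
what `sort -nr` produces.

```python
def sort_num(records):
    """Return count records ordered by descending count."""
    # The count is the first tab-separated field of each record.
    return sorted(records, key=lambda r: int(r.split("\t", 1)[0]), reverse=True)
```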

`threshcount` reads a count file as produced by `count` and outputs
only those lines whose counts are greater than the given threshold
argument.
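
The filter can be sketched in Python as follows.  The function name
`thresh_count` is hypothetical; the comparison is strictly "greater
than", as described above.

```python
def thresh_count(records, threshold):
    """Keep only the records whose count is strictly greater than threshold."""
    kept = []
    for record in records:
        # The count is the first tab-separated field.
        n, _ = record.split("\t", 1)
        if int(n) > threshold:
            kept.append(record)
    return kept
```

With a threshold of 2, a record with count 2 is dropped and a record
with count 3 is kept.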

`shuffle` is a short Python script which reads in a file and outputs
its lines in random order.  `shuf` in the
[GNU Coreutils](https://www.gnu.org/software/coreutils/) is faster and
more flexible.
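
The core of such a script can be sketched in a few lines.  This is not
the project's `shuffle` script; the function name `shuffle_lines` and
the optional `seed` parameter are assumptions added for illustration.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return the given lines in random order, leaving the input untouched."""
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    shuffled = list(lines)     # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled
```

Like this sketch, any approach that loads the whole file before
shuffling needs memory proportional to the input size.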

Install
-------

From tarball:

    tar xf count-1.0.tar.gz
    cd count-1.0/
    ./configure
    make install

From GitHub:

    autoreconf --install
    mkdir build
    cd build
    ../configure
    make install

Speed Test
----------

`count` is faster than `sort | uniq -c`, but can use much more memory:

    $ cat BIGFILE | wc
     1653677 21751482 75598346

    $ time (cat BIGFILE | sort | uniq -c > /dev/null)
    real   0m50.933s
    user   0m55.267s
    sys    0m0.347s

    $ time (cat BIGFILE | count > /dev/null)
    real   0m9.233s
    user   0m9.357s
    sys    0m0.453s

Awk Equivalents
---------------

Most of the `count` tools can be replicated with trivial `awk` scripts.
Usually, the compiled binaries are faster.

`count` is equivalent to, though faster than:

    awk '{c[$0]++} END {OFS="\t"; for (x in c) print c[x], x}' | sort -k2

`sortalph` is equivalent to, though faster than:

    awk 'BEGIN{FS=OFS="\t"} {v=$1; $1=""; c[substr($0,2)]+=v} END {for (x in c) print c[x], x}' | sort -k2

`threshcount 2` is equivalent to, but slower than:

    awk '{if (2 < $1) print $0}'