https://github.com/ei-grad/bincount

bincount
========

No-copy parallelized bincount returning dict.

Install
-------

Prerequisites: a C compiler with OpenMP support.

Install with pip:

.. code-block::

    pip install bincount

Usage
-----

There are two functions: ``bincount`` (a parallel version) and
``bincount_single`` (which does not parallelize the calculation). Both return
a dict containing the number of occurrences of each byte value in the passed
bytes-like object:

.. code-block::

    >>> from bincount import bincount
    >>> bincount(open('a-tiny-file.txt', 'rb').read())
    {59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3, 73: 4, 76: 7, 84: 3, 86: 1, 95: 4}
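
For comparison, the shape of the returned mapping can be reproduced in pure
Python. This is only a reference sketch of the semantics, not the package's
implementation: the helper name ``byte_counts`` is hypothetical, and it assumes
(judging by the example output above) that only byte values which actually
occur appear as keys.

```python
from collections import Counter


def byte_counts(data):
    """Return {byte value: occurrences} for the bytes that occur in data."""
    # Iterating a bytes or bytearray object yields ints in range(256).
    return dict(Counter(data))


sample = b"ABBA;"
result = byte_counts(sample)
# Keys are integer byte values: 65 is "A", 66 is "B", 59 is ";".
print(sorted(result.items()))
```

A helper like this is single-threaded and slow on large inputs, so it is useful
mainly for checking results on small data.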

Motivation
----------

As of Nov 2018, ``np.bincount`` is unusable with large memmaps:

.. code-block::

    >>> import numpy as np
    >>> np.bincount(np.memmap('some-5gb-file.txt', mode='r'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError

The most efficient pure-Python equivalent of ``wc -l`` is a bit slow:

.. code-block::

    In [6]: %%time
       ...: sum(1 for i in open('some-5gb-file.txt', mode='rb'))
       ...:
    CPU times: user 3.5 s, sys: 878 ms, total: 4.38 s
    Wall time: 4.38 s
    Out[6]: 58941384

It is more than 3 times slower than ``wc -l``:

.. code-block::

    In [1]: %%time
       ...: !wc -l some-5gb-file.txt
       ...:
    58941384 some-5gb-file.txt
    CPU times: user 1.48 ms, sys: 3.48 ms, total: 4.96 ms
    Wall time: 1.24 s

Counting should be faster on modern multicore SMP systems, and with the
parallel ``bincount`` it is:

.. code-block::

    In [1]: import numpy as np

    In [2]: from bincount import bincount

    In [3]: %%time
       ...: bincount(np.memmap('some-5gb-file.txt', mode='r'))[10]
       ...:
    CPU times: user 6.83 s, sys: 354 ms, total: 7.19 s
    Wall time: 705 ms
    Out[4]: 58941384
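
The ``[10]`` index selects byte value 10, the ASCII line feed, which is why the
result matches the ``wc -l`` line count. The sketch below illustrates the same
idea with only the standard library, reading the file in fixed-size chunks so
that memory use stays bounded. The helper name ``count_bytes_chunked`` and the
chunk size are assumptions made for illustration; this is not how the package
itself is implemented, and it is far slower.

```python
import os
import tempfile


def count_bytes_chunked(path, chunk_size=1 << 20):
    """Count byte values in a file, reading at most chunk_size bytes at once."""
    table = [0] * 256  # one slot per possible byte value
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for value in chunk:  # iterating bytes yields ints in range(256)
                table[value] += 1
    # Keep only the byte values that actually occurred.
    return {value: n for value, n in enumerate(table) if n}


# Demo on a small temporary file with three newline-terminated lines.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\nbb\nccc\n")
try:
    demo_counts = count_bytes_chunked(tmp.name, chunk_size=4)
finally:
    os.remove(tmp.name)

newline = ord("\n")  # 10, the key used in the session above
print(demo_counts[newline])
```

The fixed 256-slot table keeps the working set constant no matter how large the
input file is.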