Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ei-grad/bincount
No-copy parallelized bincount returning dict
- Host: GitHub
- URL: https://github.com/ei-grad/bincount
- Owner: ei-grad
- Created: 2018-11-19T15:30:14.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-04-09T06:11:12.000Z (9 months ago)
- Last Synced: 2024-12-15T11:17:03.836Z (14 days ago)
- Topics: bincount, cython, numpy, python, statistics
- Language: Cython
- Homepage:
- Size: 3.91 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.rst
Awesome Lists containing this project
README
bincount
========

No-copy parallelized bincount returning dict.
Install
-------

Prerequisites: a C compiler with OpenMP support.
Install with pip:

.. code-block::

    pip install bincount
Usage
-----

There are two functions: ``bincount`` (a parallel version) and ``bincount_single``
(which does not parallelize the calculation). Both return a dict with the number of
occurrences of each byte value in the passed bytes-like object:

.. code-block::

    >>> from bincount import bincount
    >>> bincount(open('a-tiny-file.txt', 'rb').read())
    {59: 2, 65: 5, 66: 1, 67: 3, 68: 2, 69: 3, 73: 4, 76: 7, 84: 3, 86: 1, 95: 4}
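
For comparison, the same dict can be reproduced for small inputs with plain NumPy;
the sketch below is only a reference (the helper name ``bincount_dict`` is made up
for illustration and is not part of the package) and it copies the data, unlike the
package's no-copy implementation:

.. code-block::

    import numpy as np

    def bincount_dict(data):
        """Copying reference: count each byte value and keep the non-zero entries."""
        counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
        return {byte: int(n) for byte, n in enumerate(counts) if n}

    assert bincount_dict(b'aab\n') == {10: 1, 97: 2, 98: 1}

For large files this reference approach runs into the memory problem described under
Motivation below.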

Motivation
----------

As of Nov 2018, ``np.bincount`` is unusable with large memmaps:

.. code-block::

    >>> import numpy as np
    >>> np.bincount(np.memmap('some-5gb-file.txt', mode='r'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError
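
At the NumPy level this can be sidestepped by accumulating ``np.bincount`` over
fixed-size slices of the memmap, so that only one slice at a time is cast and copied;
a minimal sketch (the helper name and chunk size are chosen here for illustration):

.. code-block::

    import numpy as np

    def chunked_bincount(path, chunk_size=1 << 24):
        """Accumulate np.bincount over slices so only one chunk is cast at a time."""
        data = np.memmap(path, dtype=np.uint8, mode='r')
        counts = np.zeros(256, dtype=np.int64)
        for start in range(0, len(data), chunk_size):
            counts += np.bincount(data[start:start + chunk_size], minlength=256)
        return counts

This stays within memory but remains single-threaded, which is what the Cython/OpenMP
version targets.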

The most effective pure-Python solution for ``wc -l`` is a bit slow:

.. code-block::

    In [6]: %%time
       ...: sum(1 for i in open('some-5gb-file.txt', mode='rb'))
       ...:
    CPU times: user 3.5 s, sys: 878 ms, total: 4.38 s
    Wall time: 4.38 s
    Out[6]: 58941384

It is 3x slower than ``wc -l``:

.. code-block::

    In [1]: %%time
       ...: !wc -l some-5gb-file.txt
       ...:
    58941384 some-5gb-file.txt
    CPU times: user 1.48 ms, sys: 3.48 ms, total: 4.96 ms
    Wall time: 1.24 s

While it should be faster on modern multicore SMP systems (indexing the result at
byte value 10, i.e. ``\n``, gives the same line count):

.. code-block::

    In [1]: import numpy as np
    In [2]: from bincount import bincount
    In [3]: %%time
       ...: bincount(np.memmap('some-5gb-file.txt', mode='r'))[10]
       ...:
    CPU times: user 6.83 s, sys: 354 ms, total: 7.19 s
    Wall time: 705 ms
    Out[4]: 58941384