python-hll
- Host: GitHub
- URL: https://github.com/adroll/python-hll
- Owner: AdRoll
- License: MIT
- Created: 2019-09-11T20:19:42.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2022-12-26T20:47:53.000Z (almost 3 years ago)
- Last Synced: 2025-04-08T10:22:43.015Z (7 months ago)
- Language: Python
- Size: 1.98 MB
- Stars: 18
- Watchers: 7
- Forks: 6
- Open Issues: 4
- Metadata Files:

  - Readme: README.rst
  - Changelog: HISTORY.rst
  - Contributing: CONTRIBUTING.rst
  - License: LICENSE
 
==========
python-hll
==========
.. image:: https://img.shields.io/pypi/v/python_hll.svg
        :target: https://pypi.python.org/pypi/python_hll

.. image:: https://readthedocs.org/projects/python-hll/badge/?version=latest
        :target: https://python-hll.readthedocs.io/en/latest/?badge=latest
        :alt: Documentation Status

.. image:: https://img.shields.io/badge/github-python--hll-yellow
        :target: https://github.com/AdRoll/python-hll

A Python implementation of HyperLogLog
whose goal is to be storage compatible
with java-hll, js-hll
and postgresql-hll.

**NOTE:** This is a fairly literal translation/port of java-hll
to Python. Internally, bytes are represented as Java-style bytes (-128 to 127) rather than Python-style bytes (0 to 255).
This implementation is also quite slow: for example, in Java ``HLLSerializationTest`` takes 12 seconds to run,
while in Python ``test_hll_serialization`` takes 1.5 hours to run (about 400x slower).
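
The Java-style byte convention can be illustrated with a small sketch that uses only the standard library (Python 3). The helper names ``to_java_byte`` and ``to_python_byte`` are hypothetical and are not part of python-hll; they only show how the same 8 bits map between the signed (-128 to 127) and unsigned (0 to 255) views.

```python
def to_java_byte(value):
    """Map an unsigned byte (0..255) to its Java-style signed form (-128..127)."""
    if not 0 <= value <= 255:
        raise ValueError("expected an unsigned byte in 0..255")
    return value - 256 if value > 127 else value


def to_python_byte(value):
    """Map a Java-style signed byte (-128..127) back to the unsigned range 0..255."""
    if not -128 <= value <= 127:
        raise ValueError("expected a signed byte in -128..127")
    return value & 0xFF


raw = bytes([0x12, 0x8D, 0x7F])
signed = [to_java_byte(b) for b in raw]               # signed view of the same bits
round_trip = bytes(to_python_byte(b) for b in signed)  # back to unsigned bytes

print(signed)             # [18, -115, 127]
print(round_trip == raw)  # True
```

Only values above 127 change when switching views, so a round trip through both helpers returns the original bytes.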

* Runs on: Python 2.7 and 3
* Free software: MIT license
* Documentation: https://python-hll.readthedocs.io
* GitHub: https://github.com/AdRoll/python-hll

Overview
---------------

See java-hll for an overview of what HLLs are and how they work.

Usage
---------------

Hashing and adding a value to a new HLL::

    from python_hll.hll import HLL
    import mmh3

    value_to_hash = 'foo'
    hashed_value = mmh3.hash(value_to_hash)  # 32-bit signed MurmurHash3 value
    hll = HLL(13, 5)  # log2m=13, regwidth=5
    hll.add_raw(hashed_value)
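
As a rough guide to what the two constructor arguments control, the sketch below derives the register count, the largest value a register can hold, the approximate size of a fully populated register array, and the commonly quoted HyperLogLog relative standard error of about ``1.04 / sqrt(m)``. The function name ``hll_parameters`` is hypothetical, and the figures are textbook estimates rather than values reported by python-hll.

```python
import math


def hll_parameters(log2m, regwidth):
    """Return (registers, max register value, packed size in bytes, std error)."""
    m = 1 << log2m                           # number of registers
    max_register_value = (1 << regwidth) - 1  # largest value in regwidth bits
    full_size_bytes = (m * regwidth + 7) // 8  # bit-packed register array
    std_error = 1.04 / math.sqrt(m)            # textbook HyperLogLog estimate
    return m, max_register_value, full_size_bytes, std_error


m, max_value, size_bytes, error = hll_parameters(13, 5)
print(m, max_value, size_bytes)  # 8192 31 5120
print(round(error * 100, 2))     # 1.15 (percent)
```

Raising ``log2m`` by one doubles the register count and shrinks the expected error by a factor of about ``sqrt(2)``.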

Retrieving the cardinality of an HLL::

    cardinality = hll.cardinality()

Unioning two HLLs together (and retrieving the resulting cardinality)::

    hll1 = HLL(13, 5)  # log2m=13, regwidth=5
    hll2 = HLL(13, 5)  # log2m=13, regwidth=5
    # ... (add values to both sets) ...
    hll1.union(hll2)  # modifies hll1 to contain the union
    cardinality_union = hll1.cardinality()
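
Conceptually, the union of two HyperLogLog sketches with identical parameters keeps, for every register, the larger of the two values. The toy example below illustrates that idea on plain lists. It is not how python-hll stores or merges its data internally, and ``union_registers`` is a hypothetical name.

```python
def union_registers(a, b):
    """Merge two equally sized register arrays by taking the per-register maximum."""
    if len(a) != len(b):
        raise ValueError("both sketches must use the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


regs1 = [0, 3, 1, 0]
regs2 = [2, 1, 1, 0]
merged = union_registers(regs1, regs2)

print(merged)                                    # [2, 3, 1, 0]
print(union_registers(regs2, regs1) == merged)   # True: order does not matter
print(union_registers(merged, regs1) == merged)  # True: re-merging changes nothing
```

Because the merge is a per-register maximum, it is commutative and idempotent, which is why the two HLLs must be created with the same ``log2m`` and ``regwidth``.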

Reading an HLL from the hex representation defined by the
storage specification, v1.0.0
(for example, retrieved from a PostgreSQL database)::

    from python_hll.hll import HLL
    from python_hll.util import NumberUtil

    hex_input = '\\x128D7FFFFFFFFFF6A5C420'
    hex_string = hex_input[2:]  # strip the leading "\x"
    hll = HLL.from_bytes(NumberUtil.from_hex(hex_string, 0, len(hex_string)))
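
To make the hex format less opaque, here is a minimal standard-library sketch that decodes the first two header bytes. It assumes the v1.0.0 layout is: schema version in the high nibble and type ordinal in the low nibble of byte 1, then ``regwidth - 1`` in the top 3 bits and ``log2m`` in the low 5 bits of byte 2. The function ``parse_header`` and the ``TYPE_NAMES`` mapping are hypothetical and not part of python-hll; check them against the storage specification before relying on them.

```python
# Assumed type ordinals for schema version 1 (verify against the storage spec).
TYPE_NAMES = {0: "UNDEFINED", 1: "EMPTY", 2: "EXPLICIT", 3: "SPARSE", 4: "FULL"}


def parse_header(pg_hex):
    """Decode the assumed header fields from a PostgreSQL-style hex string."""
    hex_string = pg_hex[2:] if pg_hex.startswith("\\x") else pg_hex
    data = bytes.fromhex(hex_string)
    if len(data) < 3:
        raise ValueError("expected at least a 3-byte header")
    version_byte, parameter_byte = data[0], data[1]
    return {
        "schema_version": version_byte >> 4,
        "type": TYPE_NAMES.get(version_byte & 0x0F, "UNKNOWN"),
        "regwidth": (parameter_byte >> 5) + 1,
        "log2m": parameter_byte & 0x1F,
        "payload_length": len(data) - 3,  # bytes following the header
    }


header = parse_header("\\x128D7F")  # header bytes only, no payload
print(header["schema_version"], header["type"])  # 1 EXPLICIT
print(header["regwidth"], header["log2m"])       # 5 13
```

Under these assumptions, the header bytes ``12 8D`` correspond to the ``HLL(13, 5)`` parameters used throughout this section.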

Writing an HLL to the hex representation defined by the
storage specification, v1.0.0
(for example, to be inserted into a PostgreSQL database)::

    hll_bytes = hll.to_bytes()
    output = "\\x" + NumberUtil.to_hex(hll_bytes, 0, len(hll_bytes))
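
For comparison, the same formatting step can be sketched with the standard library alone (Python 3). This assumes the serialized HLL is available as a sequence of integers that may be Java-style signed bytes, as described in the note near the top of this README. ``to_postgres_hex`` is a hypothetical helper, not a python-hll function.

```python
def to_postgres_hex(byte_values):
    """Format signed or unsigned byte values as a PostgreSQL-style hex string."""
    unsigned = bytes(b & 0xFF for b in byte_values)  # normalise to 0..255
    return "\\x" + unsigned.hex().upper()


print(to_postgres_hex([18, -115, 127]))  # \x128D7F
```

Uppercase digits are used here only to match the sample string shown earlier in this section.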

Also see the API documentation (https://python-hll.readthedocs.io).

Development
---------------

See Contributing (``CONTRIBUTING.rst``) for how to get started building, testing, and deploying the code.