{"id":17928743,"url":"https://github.com/bwoodsend/hirola","last_synced_at":"2025-08-08T06:32:22.451Z","repository":{"id":57437362,"uuid":"339233870","full_name":"bwoodsend/hirola","owner":"bwoodsend","description":"NumPy vectorized hash table for fast set and dict operations.","archived":false,"fork":false,"pushed_at":"2024-10-06T22:16:26.000Z","size":256,"stargazers_count":19,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-03T11:51:36.041Z","etag":null,"topics":["dict","hashtable","numpy","python","set"],"latest_commit_sha":null,"homepage":"https://hirola.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bwoodsend.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-15T23:19:44.000Z","updated_at":"2024-11-15T11:46:44.000Z","dependencies_parsed_at":"2024-07-16T16:07:50.791Z","dependency_job_id":null,"html_url":"https://github.com/bwoodsend/hirola","commit_stats":{"total_commits":86,"total_committers":1,"mean_commits":86.0,"dds":0.0,"last_synced_commit":"c319d296c6b317839777fc34e706cb2eee19aefa"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwoodsend%2Fhirola","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwoodsend%2Fhirola/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwoodsend%2Fhirola/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/bwoodsend%2Fhirola/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bwoodsend","download_url":"https://codeload.github.com/bwoodsend/hirola/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229097388,"owners_count":18019735,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dict","hashtable","numpy","python","set"],"created_at":"2024-10-28T21:05:10.159Z","updated_at":"2024-12-10T16:59:55.999Z","avatar_url":"https://github.com/bwoodsend.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"==================\nWelcome to hirola!\n==================\n\n∘\n`MIT license \u003chttps://github.com/bwoodsend/hirola/blob/master/LICENSE\u003e`_\n∘\n`PyPI \u003chttps://pypi.org/project/hirola\u003e`_\n∘\n`Documentation \u003chttps://hirola.readthedocs.io/\u003e`_\n∘\n`Source code \u003chttps://github.com/bwoodsend/hirola\u003e`_\n∘\n`Bug reports \u003chttps://github.com/bwoodsend/hirola/issues\u003e`_\n∘\n`Support \u003chttps://github.com/bwoodsend/hirola/discussions\u003e`_\n\nA vectorized hash table written in C for `fast\n\u003chttps://hirola.readthedocs.io/en/latest/benchmarks.html\u003e`_ ``set``/``dict``\nlike operations on NumPy arrays.\n\nHirola provides fast indexing and de-duplication of keys.\nIt can be used as an extension of `numpy.unique()\n\u003chttps://numpy.org/doc/stable/reference/generated/numpy.unique.html\u003e`_ and a\nvery light (20-30KB download size) and much faster alternative 
to\n`pandas.Categorical()\n\u003chttps://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html\u003e`_.\nHirola obtains its speed in the same way that NumPy does – vectorising,\ntranslating into C and imposing the following constraints:\n\n* Keys must all be of the same predetermined type and size.\n* The maximum size of a table must be chosen in advance and managed explicitly.\n* To get any performance boost, operations should be done in bulk.\n* Elements cannot be removed.\n\n\nInstallation\n------------\n\nInstall Hirola with pip:\n\n.. code-block:: console\n\n    pip install hirola\n\n\nQuickstart\n----------\n\n``HashTable``\n*************\n\nA ``HashTable`` can be thought of as a ``dict`` but with only an enumeration for\nvalues.\nTo construct an empty hash table:\n\n.. code-block:: python\n\n    import numpy as np\n    import hirola\n\n    table = hirola.HashTable(\n        20,  # \u003c--- Maximum size for the table - up to 20 keys.\n        \"U10\",  # \u003c--- NumPy dtype - strings of up to 10 characters.\n    )\n\nKeys may be added individually...\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table.add(\"cat\")\n    0\n\n... But it's much more efficient to add in bulk.\nThe return value is an enumeration of when each key was first added.\nDuplicate keys are not re-added.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table.add([\"dog\", \"cat\", \"moose\", \"gruffalo\"])\n    array([1, 0, 2, 3])\n\n\nMultidimensional inputs give multidimensional outputs of matching shapes.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table.add([[\"rabbit\", \"cat\"],\n    ...            [\"gruffalo\", \"moose\"],\n    ...            [\"werewolf\", \"gremlin\"]])\n    array([[4, 0],\n           [3, 2],\n           [5, 6]])\n\nInspect all keys added so far via the ``keys`` attribute.\n(Note that, unlike ``dict.keys()``, it's a property instead of a method.)\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e table.keys\n    array(['cat', 'dog', 'moose', 'gruffalo', 'rabbit', 'werewolf', 'gremlin'],\n          dtype='\u003cU10')\n\nKey indices can be retrieved with ``table.get(key)`` or just ``table[key]``.\nAgain, retrieval is NumPy vectorised and is much faster if given large arrays of\ninputs rather than one at a time.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table.get(\"dog\")\n    1\n    \u003e\u003e\u003e table[[\"moose\", \"gruffalo\"]]\n    array([2, 3])\n\nLike the Python dict,\nusing ``table[key]`` raises a ``KeyError`` if keys are missing\nbut using ``table.get(key)`` returns a configurable default.\nUnlike Python's dict, the default default is ``-1``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table[\"tortoise\"]\n    KeyError: \"key = 'tortoise' is not in this table.\"\n    \u003e\u003e\u003e table.get(\"tortoise\")\n    -1\n    \u003e\u003e\u003e table.get(\"tortoise\", default=99)\n    99\n    \u003e\u003e\u003e table.get([\"cat\", \"bear\", \"tortoise\"], default=[100, 101, 102])\n    array([  0, 101, 102])\n\n\nChoosing a ``max`` size\n.......................\n\nUnlike Python's ``set`` and ``dict``, ``Hirola`` does not manage its size\nautomatically (although `it can be reconfigured to \u003cautomatic-resize\u003e`_).\nTo prevent wasted resizing (which is what Python does under the hood),\nyou have full control of and responsibility for how much space the table uses.\nObviously the table has to be large enough to fit all the keys in it.\nAdditionally, when a hash table gets too close to full it becomes much slower.\nDepending on how much you favour speed over memory you should add 20-50% extra\nheadroom.\nIf you intend to do a lot of looking up of the same small set of values then it\ncan continue to run faster if you increase ``max`` to 2-3x its minimal size.\n\n\nStructured key data types\n.........................\n\nTo indicate that an array axis should be considered as a single key,\nuse 
NumPy's structured dtypes.\nIn the following example, the data type ``(points.dtype, 3)``\nindicates that a 3D point - a triplet of floats -\nshould be considered as one object.\nSee ``help(hirola.HashTable.dtype)`` for more information on specifying dtypes.\nOnly the last axis or last axes may be thought of as single keys.\nFor other setups, first convert with ``numpy.transpose()``.\n\n.. code-block:: python\n\n    import numpy as np\n    import hirola\n\n    # Create a cloud of 3D points with duplicates. This is 3000 points in total,\n    # with up to 1000 unique points.\n    points = np.random.uniform(-30, 30, (1000, 3))[np.random.choice(1000, 3000)]\n\n    # Create an empty hash table.\n    # In practice, you generally don't know how many unique elements there are\n    # so we'll pretend we don't either and assume the worst case of all 3000 are\n    # unique. We'll also give 25% padding for speed.\n    table = hirola.HashTable(len(points) * 1.25, (points.dtype, 3))\n\n    # Add all points to the table.\n    ids = table.add(points)\n\nDuplicate-free contents can be accessed from ``table.keys``:\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table.keys  # \u003c--- These are `points` but with no duplicates.\n    array([[  3.47736554, -15.17112511,  -9.51454466],\n           [ -6.46948046,  23.64504329, -16.25743105],\n           [-27.02527253, -16.1967225 , -10.11544157],\n           ...,\n           [  3.75972597,   1.24130412,  -8.14337206],\n           [-13.62256791,  11.76551455, -13.31312988],\n           [  0.19851678,   4.06221179, -22.69006592]])\n    \u003e\u003e\u003e table.keys.shape\n    (954, 3)\n\nEach point's location in ``table.keys`` is returned by ``table.add()``,\nlike ``numpy.unique(..., return_inverse=True)``.\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e ids  # \u003c--- These are the indices in `table.keys` of each point in `points`.\n    array([  0,   1,   2, ..., 290, 242, 669])\n    \u003e\u003e\u003e np.array_equal(table.keys[ids], points)\n    True\n\nLook up the indices of points without adding them using ``table.get()``.\n\n\n.. _automatic-resize:\n\nHandling of nearly full hash tables\n...................................\n\n``HashTable``\\ s become very slow when almost full.\nAs of v0.3.0, an efficiency warning will notify you if a table becomes more than\n90% full.\nThis warning can be reconfigured into an error, silenced or set to resize the\ntable automatically to make more room.\nThese are demonstrated in the example constructors below:\n\n.. code-block:: python\n\n    # The default: Issue a warning when the table is 90% full.\n    hirola.HashTable(..., almost_full=(0.9, \"warn\"))\n\n    # Disable all \"almost full\" behaviours.\n    hirola.HashTable(..., almost_full=None)\n\n    # To consider a table exceeding 80% full as an error use:\n    hirola.HashTable(..., almost_full=(0.8, \"raise\"))\n\n    # To automatically triple in size whenever the table exceeds 80% full use:\n    hirola.HashTable(..., almost_full=(0.8, 3.0))\n\nResizing tables is slow (it's only marginally optimized beyond creating a new\nbigger table and ``.add()``\\ -ing the existing keys) which is why it's not\nenabled by default. It should be avoided unless you really have no idea how big\nyour table will need to be and favour the memory savings of not overestimating\nover raw speed.\n\n\nRecipes\n*******\n\nA ``HashTable`` can be used to replicate a `dict \u003cas-a-dict\u003e`_,\n`set \u003cas-a-set\u003e`_ or a `collections.Counter \u003cas-a-collections.Counter\u003e`_.\nThe examples below might turn into their own proper classes in the future but\nso far I've never come across a real use case where they would actually fit.\n\n\n.. 
_as-a-dict:\n\nUsing a ``HashTable`` as a ``dict``\n...................................\n\nA ``dict`` can be imitated using a ``HashTable()`` with a second array for\nvalues.\nThe output of ``HashTable.add()`` and ``HashTable.get()`` should be used as\nindices of ``values``:\n\n.. code-block:: python\n\n    import numpy as np\n    import hirola\n\n    # The `keys` - will be populated with names of countries.\n    countries = hirola.HashTable(40, (str, 20))\n    # The `values` - will be populated with the names of each country's capital city.\n    capitals = np.empty(countries.max, (str, 20))\n\nAdd or set items using the pattern ``values[table.add(key)] = value``:\n\n.. code-block:: python\n\n    capitals[countries.add(\"Algeria\")] = \"Al Jaza'ir\"\n\nOr in bulk:\n\n.. code-block:: python\n\n    new_keys = [\"Angola\", \"Botswana\", \"Burkina Faso\"]\n    new_values = [\"Luanda\", \"Gaborone\", \"Ouagadougou\"]\n    capitals[countries.add(new_keys)] = new_values\n\nLike Python dicts, the syntax to overwrite values is exactly the same as to\nwrite them.\n\nRetrieve values with ``values[table[key]]``:\n\n.. code-block:: python\n\n    \u003e\u003e\u003e capitals[countries[\"Botswana\"]]\n    'Gaborone'\n    \u003e\u003e\u003e capitals[countries[\"Botswana\", \"Algeria\"]]\n    array(['Gaborone', \"Al Jaza'ir\"], dtype='\u003cU20')\n\nView all keys and values with ``table.keys`` and ``values[:len(table)]``.\nA ``HashTable`` remembers the order keys were first added so this dict is\nautomatically an ordered dict.\n\n.. 
code-block:: python\n\n    # keys\n    \u003e\u003e\u003e countries.keys\n    array(['Algeria', 'Angola', 'Botswana', 'Burkina Faso'], dtype='\u003cU20')\n    # values\n    \u003e\u003e\u003e capitals[:len(countries)]\n    array([\"Al Jaza'ir\", 'Luanda', 'Gaborone', 'Ouagadougou'], dtype='\u003cU20')\n\nDepending on the usage scenario,\nit may or may not make sense to want an equivalent to  ``dict.items()``.\nIf you do want an equivalent,\nuse ``numpy.rec.fromarrays([table.keys, values[:len(table)]])``,\npossibly adding a ``names=`` option:\n\n.. code-block:: python\n\n    \u003e\u003e\u003e np.rec.fromarrays([countries.keys, capitals[:len(countries)]],\n    ...                   names=\"countries,capitals\")\n    rec.array([('Algeria', \"Al Jaza'ir\"), ('Angola', 'Luanda'),\n               ('Botswana', 'Gaborone'), ('Burkina Faso', 'Ouagadougou')],\n              dtype=[('countries', '\u003cU20'), ('capitals', '\u003cU20')])\n\nIf the keys and values have the same dtype then ``numpy.c_`` works too.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e np.c_[countries.keys, capitals[:len(countries)]]\n    array([['Algeria', \"Al Jaza'ir\"],\n           ['Angola', 'Luanda'],\n           ['Botswana', 'Gaborone'],\n           ['Burkina Faso', 'Ouagadougou']], dtype='\u003cU20')\n\n\n.. _as-a-set:\n\nUsing a ``HashTable`` as a ``set``\n..................................\n\nTo get set-like capabilities from a ``HashTable``,\nleverage the ``contains()`` method.\nFor these examples we will experiment with integer multiples of 3 and 7.\n\n.. code-block:: python\n\n    import numpy as np\n\n    of_3s = np.arange(0, 100, 3)\n    of_7s = np.arange(0, 100, 7)\n\nWe'll only require one array to be converted into a hash table.\nThe other can remain as an array.\nIf both are hash tables, simply use one table's ``keys`` attribute as the array.\n\n.. 
code-block:: python\n\n    import hirola\n\n    table_of_3s = hirola.HashTable(len(of_3s) * 1.25, of_3s.dtype)\n    table_of_3s.add(of_3s)\n\nUse ``table.contains()`` as a vectorised version of ``in``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e table_of_3s.contains(of_7s)\n    array([ True, False, False,  True, False, False,  True, False, False,\n            True, False, False,  True, False, False])\n\nFrom the above, the common set operations can be derived:\n\n*   ``set.intersection()`` - Values in the array and in the set:\n\n.. code-block:: python\n\n        \u003e\u003e\u003e of_7s[table_of_3s.contains(of_7s)]\n        array([ 0, 21, 42, 63, 84])\n\n*   Set subtraction - Values in the array which are not in the set:\n\n.. code-block:: python\n\n        \u003e\u003e\u003e of_7s[~table_of_3s.contains(of_7s)]\n        array([ 7, 14, 28, 35, 49, 56, 70, 77, 91, 98])\n\n*   ``set.union()`` - Values in either the table or in the tested array (with no\n    duplicates):\n\n.. code-block:: python\n\n        \u003e\u003e\u003e np.concatenate([table_of_3s.keys, of_7s[~table_of_3s.contains(of_7s)]], axis=0)\n        array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48,\n               51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99,\n                7, 14, 28, 35, 49, 56, 70, 77, 91, 98])\n\n\n.. _`as-a-collections.Counter`:\n\nUsing a ``HashTable`` as a ``collections.Counter``\n..................................................\n\nFor this example,\nlet's give ourselves something a bit more substantial to work on.\nCounting word frequencies in Shakespeare's Hamlet play is the\ntrendy example for ``collections.Counter`` and it's what we'll use too.\n\n.. 
code-block:: python\n\n    from urllib.request import urlopen\n    import re\n    import numpy as np\n\n    hamlet = urlopen(\"https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc1d6429ef0167e709abf9b/hamlet.txt\").read()\n    words = np.array(re.findall(rb\"([\\w']+)\", hamlet))\n\nA counter is just a ``dict`` with integer values and a ``dict`` is just a hash\ntable with a separate array for values.\n\n.. code-block:: python\n\n    import hirola\n\n    word_table = hirola.HashTable(len(words), words.dtype)\n    counts = np.zeros(word_table.max, dtype=int)\n\nThe only new functionality that is not defined in `using a hash table as a dict\n\u003cas-a-dict\u003e`_ is the ability to count keys as they are added.\nTo count new elements use the rather odd line\n``np.add.at(counts, table.add(keys), 1)``.\n\n.. code-block:: python\n\n    np.add.at(counts, word_table.add(words), 1)\n\nThis line does what you might expect ``counts[word_table.add(words)] += 1`` to\ndo but, because NumPy applies a fancy-indexed ``+=`` only once per distinct index,\nthe latter form fails to increment each count more than once if ``words``\ncontains duplicates.\n\nUse NumPy's indirect sorting functions to get most or least common keys.\n\n.. code-block:: python\n\n    # Get the most common word.\n    \u003e\u003e\u003e word_table.keys[counts[:len(word_table)].argmax()]\n    b'the'\n\n    # Get the top 10 most common words. 
Note that these are unsorted.\n    \u003e\u003e\u003e word_table.keys[counts[:len(word_table)].argpartition(-10)[-10:]]\n    array([b'it', b'and', b'my', b'of', b'in', b'a', b'to', b'the', b'I',\n           b'you'], dtype='|S14')\n\n    # Get all words in ascending order of commonness.\n    \u003e\u003e\u003e word_table.keys[counts[:len(word_table)].argsort()]\n    array([b'END', b'whereat', b\"griev'd\", ..., b'to', b'and', b'the'],\n          dtype='|S14')\n\n\nA Security Note\n---------------\n\nUnlike the builtin ``hash()`` used internally by Python's ``set`` and ``dict``,\n``hirola`` does not randomise a hash seed on startup,\nmaking an online server running ``hirola`` more vulnerable to denial-of-service\nattacks.\nIn such an attack, the attacker clogs up your server by sending it requests that\nthey know will cause hash collisions and therefore slow it down.\nWhereas a Python hash table's size is always predictably the next power of 2\nabove ``len(table) * 3 / 2``, a ``hirola.HashTable()`` may be any size meaning\nthat you can make an attack considerably more difficult by adding a little\nrandomness to the sizes of your hash tables.\nBut if you're writing an online server\nwhich performs dictionary lookup based on user input\nand your user-base doesn't like you much\nor you have some very spiteful below-the-belt competitors\nthen I recommend that you don't use this library.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwoodsend%2Fhirola","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbwoodsend%2Fhirola","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwoodsend%2Fhirola/lists"}