{"id":18419125,"url":"https://github.com/richard-hartmann/binfootprint","last_synced_at":"2025-04-07T13:31:29.111Z","repository":{"id":57415145,"uuid":"68849300","full_name":"richard-hartmann/binfootprint","owner":"richard-hartmann","description":"unique serialization of python objects","archived":false,"fork":false,"pushed_at":"2024-03-20T17:33:30.000Z","size":91,"stargazers_count":2,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-22T19:03:43.655Z","etag":null,"topics":["binary-data","caching","python","scientific-computing","serialization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/richard-hartmann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-09-21T19:09:57.000Z","updated_at":"2022-12-22T20:04:38.000Z","dependencies_parsed_at":"2023-12-10T13:45:39.975Z","dependency_job_id":null,"html_url":"https://github.com/richard-hartmann/binfootprint","commit_stats":{"total_commits":46,"total_committers":2,"mean_commits":23.0,"dds":"0.17391304347826086","last_synced_commit":"2fd27d84740e6c05449b58cdc400a637e64cb4d3"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richard-hartmann%2Fbinfootprint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richard-hartmann%2Fbinfootprint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/richard-hartmann%2Fbinfootprint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/richard-hartmann%2Fbinfootprint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/richard-hartmann","download_url":"https://codeload.github.com/richard-hartmann/binfootprint/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247661712,"owners_count":20975106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["binary-data","caching","python","scientific-computing","serialization"],"created_at":"2024-11-06T04:15:54.977Z","updated_at":"2025-04-07T13:31:28.810Z","avatar_url":"https://github.com/richard-hartmann.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# binfootprint - unique serialization of python objects\n\n[![PyPI version](https://badge.fury.io/py/binfootprint.svg)](https://badge.fury.io/py/binfootprint)\n\n## Why unique serialization\n\nWhen caching computationally expensive function calls, the input arguments (`*args`, `**kwargs`)\nserve as the key to look up the result of the function.\nTo perform efficient lookups, these keys (often a large number of nested python objects) need to be hashable.\nSince python's built-in hash function is randomly seeded (and applies to a few data types only), it is not\nsuited for persistent caching.\nAlternatively, standard hash functions, as provided by the \n[hashlib library](https://docs.python.org/3/library/hashlib.html), can be used.\nAs they rely on byte sequences as input, python objects need to be converted to such a sequence first.\nAdmittedly, python's pickle module provides such a 
serialization which, for our purpose, has the drawback that\nthe byte sequence is not guaranteed to be unique (e.g., a dictionary can be stored as different byte sequences,\nas the order of the (key, value) pairs is irrelevant).\n\nThe binfootprint module fills that gap.\nIt guarantees that a particular python object will have a unique binary representation which \ncan serve as input for any hash function.  \n\n## Quick start\n\n`binfootprint.dump(data)` generates a unique binary representation \n(binary footprint) of `data`.\n```python\nb = binfootprint.dump(['hallo', 42])\n```\n\nIts output can serve as suitable input for a hash function.\n```python\nhashlib.sha256(b).hexdigest()\n```\n\n`binfootprint.load(data)` reconstructs the original python object.\n```python\nob = binfootprint.load(b)\n```\n\nNumpy arrays can be serialized.\n```python\na = numpy.asarray([0, 2.3, 4])\nb = binfootprint.dump(a)\n```\n\nClasses which implement `__getstate__` (pickle interface) or `__bfkey__` can be\nserialized too.\n\n```python\nclass Point:\n    def __init__(self, x, y):\n        self.x = x\n        self.y = y\n    def __getstate__(self):\n        return [self.x, self.y]\n        \nob = Point(4, -2)\nb = binfootprint.dump(ob)\n```\nIf `__bfkey__` is implemented, it takes precedence over `__getstate__`.\n\n*New since version 1.2.0:* \n[`functools.partial`](https://docs.python.org/3/library/functools.html?highlight=partial#functools.partial) \nobjects can now be serialized too. 
This makes it possible to cache a function which takes a `functools.partial`\nas an argument.\n\n```python\ndef gaussian(x, a, sigma, x0):\n    return a * math.exp(-(x-x0)**2 / 2 / sigma**2)\n\n@binfootprint.util.ShelveCacheDec()\ndef quad(f, x_min, x_max, dx):\n    r = 0\n    x = x_min\n    while x \u003c x_max:\n        r += f(x)\n        x += dx\n    return dx*r\n\ng = functools.partial(gaussian, a=1, sigma=1, x0=-2.34)\nquad(g, x_min=-10, x_max=10, dx=0.001)\n\n```\n\n### cache decorator \n\nUtilizing the unique binary representation of python objects, a persistent \ncache for quite general functions is provided by the `ShelveCache` class.\nThe decorator `ShelveCacheDec` makes it really easy to use: \n\n```python\n@binfootprint.ShelveCacheDec(path='.cache')\ndef area(p):\n    return p.x * p.y\n```\n\n### parameter base class\n\nTo conveniently organize a set of parameters suitable as key for caching, \nyou can subclass `ABCParameter`. Why should you do that?\n\n- The `__bfkey__` method of `ABCParameter` ignores parameters that are `None`.\n  This lets you extend your function interface without losing access to \n  cached results from earlier stages.\n- You can add additional information to the `__non_key__` member, which\n  is not included in the binary representation of the parameter class.\n\n```python\nclass Param(bf.ABCParameter):\n    __slots__ = [\"x\", \"y\", \"__non_key__\"]\n\n    def __init__(self, x, y, msg=\"\"):\n        super().__init__()\n        self.x = x\n        self.y = y\n        self.__non_key__ = dict()\n        self.__non_key__[\"msg\"] = msg\n```\n\n\n## Which data types can be serialized\n\nPython's **fundamental data types** are supported:\n* integer \n* float (64bit)\n* complex (128bit)\n* strings\n* byte arrays\n* special built-in constants: `True`, `False`, `None`\n\nas well as their **nested combination** by means of the **native data structures**\n- tuple\n- list\n- dictionary\n- namedtuple.\n\nIn addition, the following types are 
supported:\n- numpy `ndarray`: \n  The serialization makes use of numpy's \n  [format.write_array()](https://numpy.org/devdocs/reference/generated/numpy.lib.format.write_array.html) \n  function using version 1.0.\n- `functools.partial` objects (*new since version 1.2.0*)\n\nFurthermore, any class that implements \n\n- `__getstate__` (python's pickle interface)\n\ncan be serialized as well, given that the returned data from `__getstate__` can be serialized\n**and the returned data is not `None`**.\nObjects of different classes are distinguished by adding the class name and the name of the module which defines \nthe class to the binary data.\nThis in turn also makes it possible to reconstruct the original object by means of the `__setstate__` method.\n\nIn case the `__getstate__` method is not suitable, you can implement\n\n- `__bfkey__`\n\nwhich should return the necessary data to distinguish different objects.\nThe spirit of `__bfkey__` is very similar to that of `__getstate__`, although it is meant\nfor serialization only, and not for reconstructing the original object.\n\nNote that if `__bfkey__` is implemented, it will be used regardless of `__getstate__`.\n\nNote: older versions of the dump format are no longer supported. If backwards compatibility is needed, check out older\ncode from git. 
If needed, converters should/will be written.\n\n### be careful with functions\n\nSince function objects seem to implement `__getstate__`, which however returns `None`, \ndumping a function will fail.\n**Whether this makes sense can be discussed.**\nImplementing your own callable or using `partial` objects can circumvent this.\n\n## Installation\n\n### pip\ninstall the latest version using pip\n\n    pip install binfootprint\n\n### poetry\nUsing poetry allows you to include this package in your project as a dependency.\n\n### git\ncheck out the code from github\n\n    git clone https://github.com/richard-hartmann/binfootprint.git\n\n### dependencies\n\n- python3\n- numpy\n\n## How to use the binfootprint module\n\n### data serialization\n\nGenerating the binary footprint is done using the `dump(obj)` function.\n\n#### very simple\n```python\nimport binfootprint as bf\nbf.dump(['hallo', 42])\n```\n\n#### more complex\n```python\nimport hashlib\nimport binfootprint as bf\n\nSIGMA_Z = 0x34\ndata = {\n    'Færøerne': {\n        'area': (1399, 'km^2'),\n        'population': 54000\n    },\n    SIGMA_Z: [[-1, 0],\n              [0, 1]],\n    'usefulness': None\n}\nb = bf.dump(data)\nprint(\"MD5 check sum:\", hashlib.md5(b).hexdigest())\n```\n\n### reconstruct serialized data\n\nAlthough the primary focus of this module is the binary representation,\nfor reasons of convenience or debugging it might be useful to restore the original\npython object from the binary data. \nCalling the `load(bin_data)` function achieves that task. 
\n  \n```python\nimport binfootprint as bf\n\ndata = ['hallo', 42]\nb = bf.dump(data)\ndata_prime = bf.load(b)\nprint(data_prime)\n```\n\n### python objects - `__getstate__`\n\nSince `__getstate__` is assumed to uniquely represent the state of an\nobject by means of the returned data, it can be used to generate a unique binary\nrepresentation.\n\n```python\nimport binfootprint as bf\n\nclass Point:\n    def __init__(self, x, y):\n        self.x = x\n        self.y = y\n    def __getstate__(self):\n        return [self.x, self.y]\n    def __setstate__(self, state):\n        self.x = state[0]\n        self.y = state[1]\n\nob = Point(4, -2)\nb = bf.dump(ob)\n```\nSince `__setstate__` is implemented as well, the original object can be\nreconstructed.\n```python\nob_prime = bf.load(b)\nprint(\"type:\", type(ob_prime))\n# type: \u003cclass '__main__.Point'\u003e\nprint(\"member x:\", ob_prime.x)\n# member x: 4\nprint(\"member y:\", ob_prime.y)\n# member y: -2\n```\n\n### implement `__bfkey__` if `__getstate__` is not suited\n\nIn case `__getstate__` returns data which is not sufficient to uniquely label\nan object, or if the data cannot be serialized by the binfootprint module,\nthe method `__bfkey__` should be implemented.\nIt is expected to return serializable data which uniquely identifies the state\nof the object.\nNote that if `__bfkey__` is present, `__getstate__` is ignored.\n\n**Importantly**, when deserializing the binary data from an object implementing \n`__bfkey__`, the python object **is not returned**, since there is no \n`__setstate__` equivalent. 
Instead, the class name, the name of the module defining \nthe class and the data returned by `__bfkey__` are recovered.\nThis should not pose a problem, since the main focus of the binfootprint module is\nthe unique binary serialization of an object.\nTo ensure deserialization, use python's `pickle` module.\n\n```python\nclass Point2(Point):\n    def __bfkey__(self):\n        return {'x': self.x, 'y': self.y}\n\nob = Point2(5, 3)\nb = bf.dump(ob)\n\nob_prime = bf.load(b)\nprint(\"load on bfkey:\", ob_prime)\n# load on bfkey: ('Point2', '__main__', {'x': 5, 'y': 3})\n```\n\n### numpy ndarrays\n\nNumpy `ndarray` objects are supported by relying on numpy's binary serialization \nusing [format.write_array()](https://numpy.org/devdocs/reference/generated/numpy.lib.format.write_array.html).\n\n```python\nimport hashlib\n\nimport binfootprint as bf\nimport numpy as np\n\na1 = np.asarray([0, 1, 1, 0])\nb1 = bf.dump(a1)\n```\n\nAs expected, changing the shape or data type yields a different binary representation:\n\n```python\na2 = a1.reshape(2,2)\nb2 = bf.dump(a2)\na3 = np.asarray(a1, dtype=np.complex128)\nb3 = bf.dump(a3)\nprint(\"            MD5 of int array :\", hashlib.md5(b1).hexdigest())\n# 949bfba1237c48007a066398f744a161\nprint(\"MD5 of int array shape (2,2) :\", hashlib.md5(b2).hexdigest())\n# e9049a19f82c6f282d65466a72360cd8\nprint(\"        MD5 of complex array :\", hashlib.md5(b3).hexdigest())\n# 2274ea54925d88ec4d53853050e55a82\n```\n\n# caching\n\nWith the binfootprint module, caching function calls is straightforward.\nAn implementation of such a cache using python's `shelve` for persistent storage\nis provided by the `ShelveCacheDec` class.\n\n```python\n@binfootprint.ShelveCacheDec()\ndef area(p):\n    print(\" * f(p(x={},y={})) called\".format(p.x, p.y))\n    return p.x * p.y\n```\n\nIt is safe to use the `ShelveCacheDec` with the same data location (`path`)   \non different functions, since the name of the function and the name of the \nmodule defining the function determine 
the name of the underlying database.  \n\nIn addition to caching, the decorator extends the function signature by the \nkwarg `_cache_flag`, which modifies the caching behavior as follows:\n\n- `_cache_flag = 'no_cache'`: Simple call of `fnc` with no caching.\n- `_cache_flag = 'update'`: Call `fnc` and update the cache with the new return value.\n- `_cache_flag = 'has_key'`: Return `True` if the call has already been cached, otherwise `False`.\n- `_cache_flag = 'cache_only'`: Raise a `KeyError` if the result has not been cached yet.\n\n```python\np = Point(10, 10)\nprint(\"first call results in\")\nprint(area(p))\n# * f(p(x=10,y=10)) called\n# 100\n\nprint(\"second call results in\")\nprint(area(p))\n# 100\np = Point(10, 11)\n\nprint(\"f(p(10, 11)) is in cache?\")\nprint(area(p, _cache_flag='has_key'))\n# False\n```\n\n# pitfalls\n\n### ints and floats\n\nSince the binary representations of ints and floats differ, `1` and `1.0`\nwill be treated as different things.\nThis means that the cached value of a function call with an argument being `1` is\nnot found when passing `1.0` as argument,\nalthough the result of the function will most likely be the same.\nObviously, the same holds true for numpy arrays of different `dtype`.\n\n# Parameter class\n\nThe abstract base class `ABCParameter` makes it convenient to manage a set \nof parameters.\n\nRelevant parameters, explicitly specified as data members via the `__slots__` \nmechanism, are returned by the `__bfkey__` method (see above).\nTheir order in the `__slots__` definition is irrelevant.\n**Importantly**, class members are included only if they are not `None`.\nIn this way a parameter class definition can be extended while still being \nable to reproduce the binary footprint of an older class definition.\n\nIf present, the class member `__non_key__` has a special meaning.\nIt is not included in the parameter-values list returned by `__bfkey__`.\nIt is expected to be dictionary-like and allows storing \nadditional / 
informative information.\nThis is also reflected by the string representation of the class.\n\n```python\nclass Param(binfootprint.ABCParameter):\n    __slots__ = [\"x\", \"y\", \"__non_key__\"]\n\n    def __init__(self, x, y, msg=\"\"):\n        super().__init__()\n        self.x = x\n        self.y = y\n        self.__non_key__ = dict()\n        self.__non_key__['msg'] = msg\n\n\np = Param(3, 4.5)\nbfp = binfootprint.dump(p)\nprint(\"{}\\n has hex hash value {}...\".format(\n    p, binfootprint.hash_hex_from_bin_data(bfp)[:6])\n)\n# x : 3\n# y : 4.5\n# --- extra info ---\n# msg : \n# has hex hash value 38dbe8...\n\np = Param(3, 4.5, msg=\"I told you, don't use x=3!\")\nbfp = binfootprint.dump(p)\nprint(\"{}\\n has hex hash value {}...\".format(\n    p, binfootprint.hash_hex_from_bin_data(bfp)[:6])\n)\n# x : 3\n# y : 4.5\n# --- extra info ---\n# msg : I told you, don't use x=3!\n# has hex hash value 38dbe8...\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frichard-hartmann%2Fbinfootprint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frichard-hartmann%2Fbinfootprint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frichard-hartmann%2Fbinfootprint/lists"}