{"id":13501335,"url":"https://github.com/ronny-rentner/UltraDict","last_synced_at":"2025-03-29T08:32:25.547Z","repository":{"id":41040058,"uuid":"464358357","full_name":"ronny-rentner/UltraDict","owner":"ronny-rentner","description":"Sychronized, streaming Python dictionary that uses shared memory as a backend","archived":false,"fork":false,"pushed_at":"2025-02-28T09:14:26.000Z","size":395,"stargazers_count":278,"open_issues_count":9,"forks_count":25,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-02-28T16:11:20.406Z","etag":null,"topics":["dict","python","python3","shared-memory"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ronny-rentner.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-28T06:03:17.000Z","updated_at":"2025-02-23T16:02:48.000Z","dependencies_parsed_at":"2024-06-19T09:54:40.955Z","dependency_job_id":"fd4817fb-da0d-405b-b3ca-fb9d49ebdd86","html_url":"https://github.com/ronny-rentner/UltraDict","commit_stats":{"total_commits":47,"total_committers":3,"mean_commits":"15.666666666666666","dds":0.276595744680851,"last_synced_commit":"7f99bb9e2236e18179c828214be97b944d6211a1"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronny-rentner%2FUltraDict","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronny-rentner%2FUltraDict/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/ronny-rentner%2FUltraDict/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronny-rentner%2FUltraDict/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ronny-rentner","download_url":"https://codeload.github.com/ronny-rentner/UltraDict/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246162092,"owners_count":20733351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dict","python","python3","shared-memory"],"created_at":"2024-07-31T22:01:33.762Z","updated_at":"2025-03-29T08:32:23.394Z","avatar_url":"https://github.com/ronny-rentner.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# UltraDict\n\nSynchronized, streaming Python dictionary that uses shared memory as a backend\n\n**Warning: This is an early hack. There are only a few unit tests and so on. 
Maybe not stable!**\n\nFeatures:\n* Fast (compared to other sharing solutions)\n* No running manager processes\n* Works in spawn and fork context\n* Safe locking between independent processes\n* Tested with Python \u003e= v3.8 on Linux, Windows and Mac\n* Convenient, no setters or getters necessary\n* Optional recursion for nested dicts\n\n[![PyPI Package](https://img.shields.io/pypi/v/ultradict.svg)](https://pypi.org/project/ultradict)\n[![Run Python Tests](https://github.com/ronny-rentner/UltraDict/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/ronny-rentner/UltraDict/actions/workflows/ci.yml)\n[![Python \u003e=3.8](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License](https://img.shields.io/github/license/ronny-rentner/UltraDict.svg)](https://github.com/ronny-rentner/UltraDict/blob/master/LICENSE.md)\n\n## General Concept\n\n`UltraDict` uses [multiprocessing.shared_memory](https://docs.python.org/3/library/multiprocessing.shared_memory.html#module-multiprocessing.shared_memory) to synchronize a dict between multiple processes.\n\nIt does so by using a *stream of updates* in a shared memory buffer. This is efficient because only changes have to be serialized and transferred.\n\nIf the buffer is full, `UltraDict` will automatically do a full dump to a new shared\nmemory space, reset the streaming buffer and continue to stream further updates. All users\nof the `UltraDict` will automatically load full dumps and continue using\nstreaming updates afterwards.\n\n## Issues\n\nOn Windows, if no process has any handles on the shared memory, the OS will garbage-collect all of the shared memory, making it inaccessible to\nfuture processes. To work around this issue, you can currently set `full_dump_size`, which will cause the creator\nof the dict to set a static full dump memory of the requested size. 
This full dump memory will live as long as the creator lives.\nThis approach has the downside that you need to plan ahead for your data size and if it does not fit into the full dump memory, it will break.\n\n## Alternatives\n\nThere are many alternatives:\n\n * [multiprocessing.Manager](https://docs.python.org/3/library/multiprocessing.html#managers)\n * [shared-memory-dict](https://github.com/luizalabs/shared-memory-dict)\n * [mpdict](https://github.com/gatopeich/mpdict)\n * Redis\n * Memcached\n\n## How to use?\n\n### Simple\n\nIn one Python REPL:\n```python\nPython 3.9.2 on linux\n\u003e\u003e\u003e\n\u003e\u003e\u003e from UltraDict import UltraDict\n\u003e\u003e\u003e ultra = UltraDict({ 1:1 }, some_key='some_value')\n\u003e\u003e\u003e ultra\n{1: 1, 'some_key': 'some_value'}\n\u003e\u003e\u003e\n\u003e\u003e\u003e # We need the shared memory name in the other process.\n\u003e\u003e\u003e ultra.name\n'psm_ad73da69'\n```\n\nIn another Python REPL:\n```python\nPython 3.9.2 on linux\n\u003e\u003e\u003e\n\u003e\u003e\u003e from UltraDict import UltraDict\n\u003e\u003e\u003e # Connect to the shared memory with the name above\n\u003e\u003e\u003e other = UltraDict(name='psm_ad73da69')\n\u003e\u003e\u003e other\n{1: 1, 'some_key': 'some_value'}\n\u003e\u003e\u003e other[2] = 2\n```\n\nBack in the first Python REPL:\n```python\n\u003e\u003e\u003e ultra[2]\n2\n```\n\n### Nested\n\nIn one Python REPL:\n```python\nPython 3.9.2 on linux\n\u003e\u003e\u003e\n\u003e\u003e\u003e from UltraDict import UltraDict\n\u003e\u003e\u003e ultra = UltraDict(recurse=True)\n\u003e\u003e\u003e ultra['nested'] = { 'counter': 0 }\n\u003e\u003e\u003e type(ultra['nested'])\n\u003cclass 'UltraDict.UltraDict'\u003e\n\u003e\u003e\u003e ultra.name\n'psm_0a2713e4'\n```\n\nIn another Python REPL:\n```python\nPython 3.9.2 on linux\n\u003e\u003e\u003e\n\u003e\u003e\u003e from UltraDict import UltraDict\n\u003e\u003e\u003e other = UltraDict(name='psm_0a2713e4')\n\u003e\u003e\u003e 
other['nested']['counter'] += 1\n```\n\nBack in the first Python REPL:\n```python\n\u003e\u003e\u003e ultra['nested']['counter']\n1\n```\n\n## Performance comparison\n\nLet's compare a classical Python dict, UltraDict, multiprocessing.Manager and Redis.\n\nNote that this comparison is not a real-life workload. It was executed on Debian Linux 11\nwith Redis installed from the Debian package and with the default configuration of Redis.\n\n```python\nPython 3.9.2 on linux\n\u003e\u003e\u003e\n\u003e\u003e\u003e from UltraDict import UltraDict\n\u003e\u003e\u003e ultra = UltraDict()\n\u003e\u003e\u003e for i in range(10_000): ultra[i] = i\n...\n\u003e\u003e\u003e len(ultra)\n10000\n\u003e\u003e\u003e ultra[500]\n500\n\u003e\u003e\u003e # Now let's do some performance testing\n\u003e\u003e\u003e import multiprocessing, redis, timeit\n\u003e\u003e\u003e orig = dict(ultra)\n\u003e\u003e\u003e len(orig)\n10000\n\u003e\u003e\u003e orig[500]\n500\n\u003e\u003e\u003e managed = multiprocessing.Manager().dict(orig)\n\u003e\u003e\u003e len(managed)\n10000\n\u003e\u003e\u003e r = redis.Redis()\n\u003e\u003e\u003e r.flushall()\n\u003e\u003e\u003e r.mset(orig)\n```\n\n### Read performance\n\n```python\n\u003e\u003e\u003e timeit.timeit('orig[1]', globals=globals()) # original\n0.03832335816696286\n\u003e\u003e\u003e timeit.timeit('ultra[1]', globals=globals()) # UltraDict\n0.5248982920311391\n\u003e\u003e\u003e timeit.timeit('managed[1]', globals=globals()) # Manager\n40.85506196087226\n\u003e\u003e\u003e timeit.timeit('r.get(1)', globals=globals()) # Redis\n49.3497632863\n\u003e\u003e\u003e timeit.timeit('ultra.data[1]', globals=globals()) # UltraDict data cache\n0.04309639008715749\n```\n\nWe are roughly a factor of 14 slower than a real, local dict, but far faster than using a Manager. 
If you need full read performance, you can access the underlying cache `ultra.data` directly and get almost the performance of the original dict, of course at the cost of not having real-time updates anymore.\n\n### Write performance\n\n```python\n\u003e\u003e\u003e min(timeit.repeat('orig[1] = 1', globals=globals())) # original\n0.028232071083039045\n\u003e\u003e\u003e min(timeit.repeat('ultra[1] = 1', globals=globals())) # UltraDict\n2.911152713932097\n\u003e\u003e\u003e min(timeit.repeat('managed[1] = 1', globals=globals())) # Manager\n31.641707635018975\n\u003e\u003e\u003e min(timeit.repeat('r.set(1, 1)', globals=globals())) # Redis\n124.3432381930761\n```\n\nWe are about a factor of 100 slower than a real, local Python dict, but still about a factor of 10 faster than using a Manager and much faster than Redis.\n\n### Testing performance\n\nThere is an automated performance test in `tests/performance/performance.py`. If you run it, you get something like this:\n\n```bash\npython ./tests/performance/performance.py\n\nTesting Performance with 1000000 operations each\n\nRedis (writes) = 24,351 ops per second\nRedis (reads) = 30,466 ops per second\nPython MPM dict (writes) = 19,371 ops per second\nPython MPM dict (reads) = 22,290 ops per second\nPython dict (writes) = 16,413,569 ops per second\nPython dict (reads) = 16,479,191 ops per second\nUltraDict (writes) = 479,860 ops per second\nUltraDict (reads) = 2,337,944 ops per second\nUltraDict (shared_lock=True) (writes) = 41,176 ops per second\nUltraDict (shared_lock=True) (reads) = 1,518,652 ops per second\n\nRanking:\n  writes:\n    Python dict = 16,413,569 (factor 1.0)\n    UltraDict = 479,860 (factor 34.2)\n    UltraDict (shared_lock=True) = 41,176 (factor 398.62)\n    Redis = 24,351 (factor 674.04)\n    Python MPM dict = 19,371 (factor 847.33)\n  reads:\n    Python dict = 16,479,191 (factor 1.0)\n    UltraDict = 2,337,944 (factor 7.05)\n    UltraDict (shared_lock=True) = 1,518,652 (factor 10.85)\n    Redis = 30,466 (factor 540.9)\n    Python MPM 
dict = 22,290 (factor 739.31)\n```\n\nI am interested in extending the performance testing to other solutions (like sqlite, memcached, etc.) and to more complex use cases with multiple processes working in parallel.\n\n## Parameters\n\n`UltraDict(*arg, name=None, create=None, buffer_size=10000, serializer=pickle, shared_lock=False, full_dump_size=None, auto_unlink=None, recurse=False, recurse_register=None, **kwargs)`\n\n`name`: Name of the shared memory. A random name will be chosen if not set. By default, if a name is given,\na new shared memory space is created if it does not exist yet. Otherwise the existing shared\nmemory space is attached.\n\n`create`: Can be either `True` or `False` or `None`. If set to `True`, a new UltraDict will be created\nand an exception is thrown if one exists already with the given name. If kept at the default value `None`,\neither a new UltraDict will be created if the name is not taken or an existing UltraDict will be attached.\n\nSetting `create=True` ensures that you do not accidentally attach to an existing UltraDict that might be left over.\n\n`buffer_size`: Size of the shared memory buffer used for streaming changes of the dict.\nThe buffer size limits the biggest change that can be streamed, so when you use large values or\ndeeply nested dicts you might need a bigger buffer. Otherwise, if the buffer is too small,\nit will fall back to a full dump. Creating full dumps can be slow, depending on the size of your dict.\n\nWhenever the buffer is full, a full dump will be created. A new shared memory is allocated just\nbig enough for the full dump. Afterwards the streaming buffer is reset. All other users of the\ndict will automatically load the full dump and continue streaming updates.\n\n(Also see the section [Memory management](#memory-management) below!)\n\n`serializer`: Use a different serializer than the default pickle, e.g. 
marshal, dill, jsons.\nThe module or object provided must support the methods *loads()* and *dumps()*.\n\n`shared_lock`: When writing to the same dict at the same time from multiple, independent processes,\nthey need a shared lock to synchronize and not overwrite each other's changes. Shared locks are slow.\nThey rely on the [atomics](https://github.com/doodspav/atomics) package for atomic locks. By default,\nUltraDict will use a multiprocessing.RLock() instead, which works well in fork context and is much faster.\n\n(Also see the section [Locking](#locking) below!)\n\n`full_dump_size`: If set, uses a static full dump memory instead of dynamically creating it. This\nmight be necessary on Windows depending on your write behaviour. On Windows, the full dump memory goes\naway when the process that created the full dump exits. Thus you must plan ahead which processes might\nbe writing to the dict and therefore creating full dumps.\n\n`auto_unlink`: If True, the creator of the shared memory will automatically unlink the handle at exit so\nit is not visible or accessible to new processes. All existing, still connected processes can continue to use the\ndict.\n\n`recurse`: If True, any nested dict objects will be automatically wrapped in an `UltraDict`, allowing transparent nested updates.\n\n`recurse_register`: Has to be either the `name` of an UltraDict or an UltraDict instance itself. Will be used internally to keep track of dynamically created, recursive UltraDicts for proper cleanup when using `recurse=True`. Usually does not have to be set by the user.\n\n## Memory management\n\n`UltraDict` uses shared memory buffers and those usually live in RAM. `UltraDict` does not use any management processes to keep track of buffers. 
Also, it cannot know when to free those shared memory buffers again because you might want the buffers to outlive the process that has created them.\n\nBy convention, you should set the parameter `auto_unlink` to True for exactly one of the processes that is using the `UltraDict`. The first process\nthat creates a given `UltraDict` will automatically get the flag `auto_unlink=True` unless you explicitly set it to `False`.\nWhen this process with the `auto_unlink=True` flag ends, it will try to unlink (free) all shared memory buffers.\n\nA special case is the recursive mode using the `recurse=True` parameter. This mode will use an additional internal `UltraDict` to keep\ntrack of recursively nested `UltraDict` instances. All child `UltraDicts` will write to this register the names of the shared memory buffers\nthey are creating. This allows the buffers to outlive the processes and still be correctly cleaned up at the end of the program.\n\n**Buffer sizes and read performance:**\n\nThere are 3 cases that can occur when you read from an `UltraDict`:\n\n1. No new updates: This is the fastest case. `UltraDict` was optimized for this case to find out as quickly as possible if there are no updates on the stream and then just return the desired data. If you want even better read performance, you can directly access the underlying `data` attribute of your `UltraDict`, though at the cost of not getting real-time updates anymore.\n\n2. Streaming update: This is usually fast, depending on the size and amount of data that was changed but not depending on the size of the whole `UltraDict`. Only the data that was actually changed has to be unserialized.\n\n3. Full dump load: This can be slow, depending on the total size of your data. If your `UltraDict` is big, it might take long to unserialize it.\n\nGiven the above 3 cases, you need to balance the size of your data and your write patterns with the streaming `buffer_size` of your UltraDict. 
If the streaming buffer is full, a full dump has to be created. Thus, if your full dumps are expensive due to their size, try to find a good `buffer_size` to avoid creating too many full dumps.\n\nOn the other hand, if for example you only change back and forth the value of one single key in your `UltraDict`, it might be useless to process a stream of all these back and forth changes. It might be much more efficient to simply do one full dump which might be very small because it only contains one key.\n\n## Locking\n\nEvery UltraDict instance has a `lock` attribute which is either a [multiprocessing.RLock](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.RLock) or an `UltraDict.SharedLock` if you set `shared_lock=True` when creating the UltraDict.\n\nRLock is the fastest locking method that is used by default but you can only use it if you fork your child processes. Forking is the default on Linux systems.\n\nIn contrast, on Windows systems, forking is not available and Python will automatically use the spawn method when creating child processes. You should then use the parameter `shared_lock=True` when using UltraDict. 
This requires that the external [atomics](https://github.com/doodspav/atomics) package is installed.\n\n### How to use the locking?\n```python\nfrom UltraDict import UltraDict\n\nultra = UltraDict(shared_lock=True)\n# The counter has to exist before it can be incremented\nultra['counter'] = 0\n\nwith ultra.lock:\n\tultra['counter'] += 1\n\n# The same as above with all default parameters\nwith ultra.lock(timeout=None, block=True, steal=False, sleep_time=0.000001):\n\tultra['counter'] += 1\n\n# Busy wait, will result in 99 % CPU usage, fastest option\n# Ideally number of processes using the UltraDict should be \u003c number of CPUs\nwith ultra.lock(sleep_time=0):\n\tultra['counter'] += 1\n\n# Non-blocking acquire: raises immediately if another process holds the lock\ntry:\n\tresult = ultra.lock.acquire(block=False)\n\tultra.lock.release()\nexcept UltraDict.Exceptions.CannotAcquireLock as e:\n\tprint(f'Process with PID {e.blocking_pid} is holding the lock')\n\n# Give up after 1.5 seconds instead of waiting forever\ntry:\n\twith ultra.lock(timeout=1.5):\n\t\tultra['counter'] += 1\nexcept UltraDict.Exceptions.CannotAcquireLockTimeout:\n\tprint('Stale lock?')\n\n# Steal the lock from its current holder once the timeout has expired\nwith ultra.lock(timeout=1.5, steal_after_timeout=True):\n\tultra['counter'] += 1\n\n```\n\n## Explicit cleanup\n\nSometimes, when your program crashes, no cleanup happens and you might have a corrupted shared memory buffer that only goes away if you manually delete it.\n\nOn Linux/Unix systems, those buffers usually live in a memory-based filesystem in the folder `/dev/shm`. 
You can simply delete the files there.\n\nAnother way to do this in code is like this:\n```python\n# Unlink both shared memory buffers possibly used by UltraDict\nname = 'my-dict-name'\nUltraDict.unlink_by_name(name, ignore_errors=True)\nUltraDict.unlink_by_name(f'{name}_memory', ignore_errors=True)\n```\n\n## Advanced usage\n\nSee [examples](/examples) folder\n\n```python\n\u003e\u003e\u003e ultra = UltraDict({ 'init': 'some initial data' }, name='my-name', buffer_size=100_000)\n\u003e\u003e\u003e # Let's use a value with 100k bytes length.\n\u003e\u003e\u003e # This will not fit into our 100k bytes buffer due to the serialization overhead.\n\u003e\u003e\u003e ultra[0] = ' ' * 100_000\n\u003e\u003e\u003e ultra.print_status()\n{'buffer': SharedMemory('my-name_memory', size=100000),\n 'buffer_size': 100000,\n 'control': SharedMemory('my-name', size=1000),\n 'full_dump_counter': 1,\n 'full_dump_counter_remote': 1,\n 'full_dump_memory': SharedMemory('psm_765691cd', size=100057),\n 'full_dump_memory_name_remote': 'psm_765691cd',\n 'full_dump_size': None,\n 'full_dump_static_size_remote': \u003cmemory at 0x7fcbf5ca6580\u003e,\n 'lock': \u003cRLock(None, 0)\u003e,\n 'lock_pid_remote': 0,\n 'lock_remote': 0,\n 'name': 'my-name',\n 'recurse': False,\n 'recurse_remote': \u003cmemory at 0x7fcbf5ca6700\u003e,\n 'serializer': \u003cmodule 'pickle' from '/usr/lib/python3.9/pickle.py'\u003e,\n 'shared_lock_remote': \u003cmemory at 0x7fcbf5ca6640\u003e,\n 'update_stream_position': 0,\n 'update_stream_position_remote': 0}\n```\n\nNote: All status keys ending with `_remote` are stored in the control shared memory space and shared across processes.\n\nOther things you can do:\n```python\n\u003e\u003e\u003e # Create a full dump\n\u003e\u003e\u003e ultra.dump()\n\n\u003e\u003e\u003e # Load latest full dump if one is available\n\u003e\u003e\u003e ultra.load()\n\n\u003e\u003e\u003e # Show statistics\n\u003e\u003e\u003e ultra.print_status()\n\n\u003e\u003e\u003e # Force load of latest 
full dump, even if we had already processed it.\n\u003e\u003e\u003e # There might also be streaming updates available after loading the full dump.\n\u003e\u003e\u003e ultra.load(force=True)\n\n\u003e\u003e\u003e # Apply full dump and stream updates to\n\u003e\u003e\u003e # underlying local dict, this is automatically\n\u003e\u003e\u003e # called by accessing the UltraDict in any usual way,\n\u003e\u003e\u003e # but can be useful to call after a forced load.\n\u003e\u003e\u003e ultra.apply_update()\n\n\u003e\u003e\u003e # Access underlying local dict directly for maximum performance\n\u003e\u003e\u003e ultra.data\n\n\u003e\u003e\u003e # Use any serializer you like, given it supports the loads() and dumps() methods\n\u003e\u003e\u003e import jsons\n\u003e\u003e\u003e ultra = UltraDict(serializer=jsons)\n\n\u003e\u003e\u003e # Close connection to shared memory; will return the data as a dict\n\u003e\u003e\u003e ultra.close()\n\n\u003e\u003e\u003e # Unlink all shared memory, it will not be visible to new processes afterwards\n\u003e\u003e\u003e ultra.unlink()\n\n```\n\n## Contributing\n\nContributions are always welcome!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fronny-rentner%2FUltraDict","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fronny-rentner%2FUltraDict","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fronny-rentner%2FUltraDict/lists"}