# ataraxis-data-structures

A Python library that provides classes and structures for storing, manipulating, and sharing data between Python 
processes.

![PyPI - Version](https://img.shields.io/pypi/v/ataraxis-data-structures)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/ataraxis-data-structures)
[![uv](https://tinyurl.com/uvbadge)](https://github.com/astral-sh/uv)
[![Ruff](https://tinyurl.com/ruffbadge)](https://github.com/astral-sh/ruff)
![type-checked: mypy](https://img.shields.io/badge/type--checked-mypy-blue?style=flat-square&logo=python)
![PyPI - License](https://img.shields.io/pypi/l/ataraxis-data-structures)
![PyPI - Status](https://img.shields.io/pypi/status/ataraxis-data-structures)
![PyPI - Wheel](https://img.shields.io/pypi/wheel/ataraxis-data-structures)

___

## Detailed Description

This library aggregates the classes and methods used by other Ataraxis and Sun lab libraries for working with data. 
This includes classes to manipulate data, share (move) it between different Python processes, and store it in 
non-volatile memory (on disk). Generally, these classes either implement novel functionality not available through 
other popular libraries or extend existing functionality to match the specific needs of other Ataraxis project 
libraries.

___

## Features

- Supports Windows, Linux, and macOS.
- Provides a process- and thread-safe way of sharing data between multiple processes through a NumPy array structure.
- Extends the standard Python dataclass to support saving and loading its data to / from YAML files.
- Provides a fast and scalable data logger optimized for saving serialized data from multiple parallel processes in 
  non-volatile memory.
- GPL 3 License.

___

## Table of Contents

- [Dependencies](#dependencies)
- [Installation](#installation)
- [Usage](#usage)
- [API Documentation](#api-documentation)
- [Developers](#developers)
- [Versioning](#versioning)
- [Authors](#authors)
- [License](#license)
- [Acknowledgments](#acknowledgments)

___

## Dependencies

For users, all library dependencies are installed automatically by all supported installation methods 
(see the [Installation](#installation) section).

***Note!*** Developers should see the [Developers](#developers) section for information on installing additional 
development dependencies.

___

## Installation

### Source

Note that installation from source is ***highly discouraged*** for anyone who is not an active project developer.

1. Download this repository to the local machine using the preferred method, such as git-cloning. Use one of the 
   [stable releases](https://github.com/Sun-Lab-NBB/ataraxis-data-structures/releases) that include precompiled binary 
   and source code distribution (sdist) wheels.
2. If the downloaded distribution is stored as a compressed archive, unpack it using the appropriate decompression tool.
3. ```cd``` to the root directory of the prepared project distribution.
4. Run ```python -m pip install .``` to install the project. Alternatively, if using a distribution with precompiled
   binaries, use ```python -m pip install WHEEL_PATH```, replacing 'WHEEL_PATH' with the path to the wheel file.

### pip

Use the following command to install the library using pip: ```pip install ataraxis-data-structures```

___

## Usage

This section is broken into subsections for each exposed utility class or module. Each subsection provides only a 
minimal (quickstart) overview that does not reflect the nuances of using each asset. To learn about the nuances, 
consult the [API documentation](#api-documentation) or see the [example implementations](examples).

### YamlConfig
The YamlConfig class extends the functionality of the standard Python dataclass module by bundling dataclass 
instances with methods to save and load their data to / from .yaml files.
Primarily, this functionality is implemented 
to support storing runtime configuration data in a non-volatile, human-readable, and editable format.

The YamlConfig class is designed to be **subclassed** by custom dataclasses to gain the .yaml saving and 
loading functionality realized through the inherited **to_yaml()** and **from_yaml()** methods:
```
from ataraxis_data_structures import YamlConfig
from dataclasses import dataclass
from pathlib import Path
import tempfile


# All YamlConfig functionality is accessed via subclassing.
@dataclass
class MyConfig(YamlConfig):
    integer: int = 0
    string: str = 'random'


# Instantiates the test class using custom values that do not match the default initialization values.
config = MyConfig(integer=123, string='hello')

# Saves the instance data to a YAML file in a temporary directory. The saved data can be modified by directly editing
# the saved .yaml file.
tempdir = tempfile.TemporaryDirectory()  # Creates a temporary directory for illustration purposes.
out_path = Path(tempdir.name).joinpath("my_config.yaml")  # Resolves the path to the output file.
config.to_yaml(file_path=out_path)

# Verifies that the output file has been created.
assert out_path.exists()

# Creates a new MyConfig instance using the data inside the .yaml file.
loaded_config = MyConfig.from_yaml(file_path=out_path)

# Ensures that the loaded data matches the original MyConfig instance data.
assert loaded_config.integer == config.integer
assert loaded_config.string == config.string
```

### SharedMemoryArray
The SharedMemoryArray class supports sharing data between multiple Python processes in a thread- and process-safe way.
To do so, it implements a shared memory buffer accessed via an n-dimensional NumPy array instance, allowing different 
processes to read and write any element(s) of the array.

#### SharedMemoryArray Creation
The SharedMemoryArray only needs to be instantiated __once__ by
the main runtime process (thread) and provided to all 
child processes as an input. The initialization process uses the specified prototype NumPy array and unique buffer 
name to generate a (new) NumPy array whose data is stored in a shared memory buffer accessible from any thread or 
process. *__Note!__* The array dimensions and datatype cannot be changed after initialization.
```
from ataraxis_data_structures import SharedMemoryArray
import numpy as np

# The prototype array and buffer name determine the layout of the SharedMemoryArray for its entire lifetime:
prototype = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
buffer_name = 'unique_buffer_name'  # Has to be unique across all concurrently used SharedMemoryArray instances.

# To initialize the array, use the create_array() method. Do not call the class initialization method directly!
sma = SharedMemoryArray.create_array(name=buffer_name, prototype=prototype)

# Ensures that the shared memory buffer is destroyed when the instance is garbage-collected.
sma.enable_buffer_destruction()

# The instantiated SharedMemoryArray object wraps an n-dimensional NumPy array with the same dimensions and data type
# as the prototype and uses the unique shared memory buffer name to identify the shared memory buffer to connect to from
# different processes.
assert sma.name == buffer_name
assert sma.shape == prototype.shape
assert sma.datatype == prototype.dtype

# Demonstrates the current values for the critical SharedMemoryArray parameters evaluated above:
print(sma)
```

#### SharedMemoryArray Connection, Disconnection, and Destruction
Each process using the SharedMemoryArray instance, __including__ the process that created it, must use the 
__connect()__ method to connect to the array before reading or writing data. At the end of its runtime, each connected
process must call the __disconnect()__ method to release the local reference to the shared buffer.
The **main** process 
also needs to call the __destroy()__ method to destroy the shared memory buffer.
```
import numpy as np
from ataraxis_data_structures import SharedMemoryArray

# Initializes a SharedMemoryArray
prototype = np.zeros(shape=6, dtype=np.uint64)
buffer_name = "unique_buffer"
sma = SharedMemoryArray.create_array(name=buffer_name, prototype=prototype)

# This method has to be called before attempting to manipulate the data inside the array.
sma.connect()

# The connection status of the array can be verified at any time by using the is_connected property:
assert sma.is_connected

# Each process that connected to the shared memory buffer must disconnect from it at the end of its runtime. On Windows
# platforms, when all processes are disconnected from the buffer, the buffer is automatically garbage-collected.
sma.disconnect()  # For each connect() call, there has to be a matching disconnect() call

assert not sma.is_connected

# On Unix platforms, the buffer persists even after being disconnected by all instances, unless it is explicitly
# destroyed.
sma.destroy()  # For each create_array() call, there has to be a matching destroy() call
```

#### Reading and Writing SharedMemoryArray Data
For routine data writing or reading operations, the SharedMemoryArray supports accessing its data via __indexing__ or 
__slicing__, just like a regular NumPy array. Critically, accessing the data in this way is process-safe, as the 
instance first acquires an exclusive multiprocessing Lock before interfacing with the data.
For more complex access 
scenarios, it is possible to use the __array()__ method to directly access and manipulate the underlying NumPy array 
object used by the instance.
```
import numpy as np
from ataraxis_data_structures import SharedMemoryArray

# Initializes a SharedMemoryArray
prototype = np.array([1, 2, 3, 4, 5, 6], dtype=np.uint64)
buffer_name = "unique_buffer"
sma = SharedMemoryArray.create_array(name=buffer_name, prototype=prototype)
sma.connect()

# The SharedMemoryArray data can be accessed directly using indexing or slicing, just like any regular NumPy array or
# Python iterable:

# Index
assert sma[2] == np.uint64(3)
assert isinstance(sma[2], np.uint64)
sma[2] = 123  # Written data must be convertible to the datatype of the underlying NumPy array
assert sma[2] == np.uint64(123)

# Slice
assert np.array_equal(sma[:4], np.array([1, 2, 123, 4], dtype=np.uint64))
assert isinstance(sma[:4], np.ndarray)

# It is also possible to directly access the underlying NumPy array, which allows using the full range of NumPy
# operations. The accessor method can be used from within a context manager to enforce exclusive access to the array's
# data via an internal multiprocessing lock mechanism:
with sma.array(with_lock=True) as array:
    print(f"Before clipping: {array}")

    # Clipping replaces the out-of-bounds value '123' with '10'. Assigning through array[:] writes the result back
    # into the shared buffer in-place, so the change is visible to all connected processes.
    array[:] = np.clip(array, 0, 10)

    print(f"After clipping: {array}")

# Cleans up the array buffer
sma.disconnect()
sma.destroy()
```

#### Using SharedMemoryArray from Multiple Processes
While all methods showcased above run in the same process, the main advantage of the SharedMemoryArray class is that 
it behaves the same way when used from different Python processes. See the [example](examples/shared_memory_array.py) 
script for more details.

### DataLogger
The DataLogger class initializes and manages the runtime of a logger process running in an independent Process and 
exposes a shared Queue object for buffering and piping data from any other process to the logger. Currently, the 
class is specifically designed for saving serialized byte arrays used by other Ataraxis libraries, most notably 
__ataraxis-video-system__ and __ataraxis-transport-layer__.

The sections below break down various aspects of working with the DataLogger instance. The individual sections can also
be seen as a combined [example](examples/data_logger.py) script.

#### Creating and Starting the DataLogger
The DataLogger is intended to be initialized only ***once*** in the main runtime thread (process) and provided to all 
child processes as an input. ***Note!*** While a single DataLogger instance is typically enough for most use cases, 
it is possible to use more than one DataLogger instance at the same time.
```
from pathlib import Path
import tempfile
from ataraxis_data_structures import DataLogger

# Due to the internal use of the 'Process' class, each DataLogger call has to be protected by the __main__ guard at
# the highest level of the call hierarchy.
if __name__ == "__main__":

    # As a minimum, each DataLogger has to be given the path to the output directory and a unique name to distinguish
    # the instance from any other concurrently active DataLogger instance.
    tempdir = tempfile.TemporaryDirectory()  # Creates a temporary directory for illustration purposes
    logger = DataLogger(output_directory=Path(tempdir.name), instance_name="my_name")

    # The DataLogger initialized above creates a new directory, 'tempdir/my_name_data_log', to store logged entries.

    # Before the DataLogger starts saving data, its saver process needs to be initialized via the start() method.
    # Until the saver is initialized, the instance buffers all incoming data in RAM (via the internal Queue object),
    # which may eventually exhaust the available memory.
    logger.start()

    # Each call to the start() method must be matched with a corresponding call to the stop() method. This method shuts
    # down the logger process and releases any resources held by the instance.
    logger.stop()
```

#### Data Logging
The DataLogger is explicitly designed to log serialized data of arbitrary size. To enforce the correct data formatting, 
all data submitted to the logger ***must*** be packaged into a __LogPackage__ class instance before it is put into the
DataLogger's input queue.
```
from pathlib import Path
import tempfile
import numpy as np
from ataraxis_data_structures import DataLogger, LogPackage
from ataraxis_time import get_timestamp, TimestampFormats

if __name__ == "__main__":
    # Initializes and starts the DataLogger.
    tempdir = tempfile.TemporaryDirectory()
    logger = DataLogger(output_directory=Path(tempdir.name), instance_name="my_name")
    logger.start()

    # The DataLogger uses a multiprocessing Queue to buffer and pipe the incoming data to the saver process. The queue
    # is accessible via the 'input_queue' property of each logger instance.
    logger_queue = logger.input_queue

    # The DataLogger is explicitly designed to log serialized data. All data submitted to the logger must be packaged
    # into a LogPackage instance to ensure that it adheres to the proper format expected by the logger instance.
    source_id = np.uint8(1)  # Has to be a uint8 type
    timestamp = np.uint64(get_timestamp(output_format=TimestampFormats.INTEGER))  # Has to be a uint64 type
    data = np.array([1, 2, 3, 4, 5], dtype=np.uint8)  # Has to be a uint8 NumPy array
    logger_queue.put(LogPackage(source_id, timestamp, data))

    # The timer used to timestamp the log entries has to be precise enough to resolve two consecutive data entries.
    # Due to these constraints, it is recommended to use a nanosecond or microsecond timer, such as the one offered
    # by the ataraxis-time library.
    timestamp = np.uint64(get_timestamp(output_format=TimestampFormats.INTEGER))
    data = np.array([6, 7, 8, 9, 10], dtype=np.uint8)
    logger_queue.put(LogPackage(source_id, timestamp, data))  # Same source id as the package above

    # Stops the data logger.
    logger.stop()

    # The DataLogger saves the input LogPackage instances as serialized NumPy byte array .npy files. The output
    # directory for the saved files can be queried from the DataLogger instance's 'output_directory' property.
    assert len(list(logger.output_directory.glob("**/*.npy"))) == 2
```

#### Log Archive Assembly
To optimize the log writing speed and minimize the time the data sits in volatile memory, all log entries are saved 
to disk as separate NumPy array .npy files. While this format is efficient for time-critical runtimes, it is not 
optimal for long-term storage and data transfer.
To help with optimizing the post-runtime data storage, the library 
offers the __assemble_log_archives()__ function, which aggregates the .npy files from the same data source into an
(uncompressed) .npz archive.

```
from pathlib import Path
import tempfile
import numpy as np
from ataraxis_data_structures import DataLogger, LogPackage, assemble_log_archives

if __name__ == "__main__":

    # Creates and starts the DataLogger instance.
    tempdir = tempfile.TemporaryDirectory()
    logger = DataLogger(output_directory=Path(tempdir.name), instance_name="my_name")
    logger.start()
    logger_queue = logger.input_queue

    # Generates and logs 255 data messages. This generates 255 unique .npy files under the logger's output directory.
    for i in range(255):
        logger_queue.put(LogPackage(np.uint8(1), np.uint64(i), np.array([i, i, i], dtype=np.uint8)))

    # Stops the data logger.
    logger.stop()

    # Depending on the runtime context, a DataLogger instance can generate a large number of individual .npy files as
    # part of its runtime. While having advantages for real-time data logging, this format of storing the data is not
    # ideal for later data transfer and manipulation. Therefore, it is recommended to always use the
    # assemble_log_archives() function to aggregate the individual .npy files into one or more .npz archives.
    assemble_log_archives(log_directory=logger.output_directory, remove_sources=True, memory_mapping=True, verbose=True)

    # The archive assembly creates a single .npz file named after the source_id (1_log.npz), using all available .npy
    # files. Generally, each unique data source is assembled into a separate .npz archive.
    assert len(list(logger.output_directory.glob("**/*.npy"))) == 0
    assert len(list(logger.output_directory.glob("**/*.npz"))) == 1
```

___

## API Documentation

See the [API documentation](https://ataraxis-data-structures-api-docs.netlify.app/) for the detailed description of the 
methods and classes exposed by components of this library.

___

## Developers

This section provides installation, dependency, and build-system instructions for project developers.

### Installing the Project

***Note!*** This installation method requires **mamba version 2.3.2 or above**. Currently, all Sun lab automation 
pipelines require that mamba is installed through the [miniforge3](https://github.com/conda-forge/miniforge) installer.

1. Download this repository to the local machine using the preferred method, such as git-cloning.
2. If the downloaded distribution is stored as a compressed archive, unpack it using the appropriate decompression tool.
3. ```cd``` to the root directory of the prepared project distribution.
4. Install the core Sun lab development dependencies into the ***base*** mamba environment via the 
   ```mamba install tox uv tox-uv``` command.
5. Use the ```tox -e create``` command to create the project-specific development environment, followed by the 
   ```tox -e install``` command to install the project into that environment as a library.

### Additional Dependencies

In addition to installing the project and all user dependencies, install the following dependencies:

1. [Python](https://www.python.org/downloads/) distributions, one for each version supported by the developed project. 
   Currently, this library supports the three latest stable versions. It is recommended to use a tool like
It is recommended to use a tool like \n   [pyenv](https://github.com/pyenv/pyenv) to install and manage the required versions.\n\n### Development Automation\n\nThis project comes with a fully configured set of automation pipelines implemented using \n[tox](https://tox.wiki/en/latest/user_guide.html). Check the [tox.ini file](tox.ini) for details about the \navailable pipelines and their implementation. Alternatively, call ```tox list``` from the root directory of the project\nto see the list of available tasks.\n\n**Note!** All pull requests for this project have to successfully complete the ```tox``` task before being merged. \nTo expedite the task’s runtime, use the ```tox --parallel``` command to run some tasks in-parallel.\n\n### Automation Troubleshooting\n\nMany packages used in 'tox' automation pipelines (uv, mypy, ruff) and 'tox' itself may experience runtime failures. In \nmost cases, this is related to their caching behavior. If an unintelligible error is encountered with \nany of the automation components, deleting the corresponding .cache (.tox, .ruff_cache, .mypy_cache, etc.) manually \nor via a CLI command typically solves the issue.\n\n___\n\n## Versioning\n\nThis project uses [semantic versioning](https://semver.org/). 
See the 
[tags on this repository](https://github.com/Sun-Lab-NBB/ataraxis-data-structures/tags) for the available project 
releases.

___

## Authors

- Ivan Kondratyev ([Inkaros](https://github.com/Inkaros))

___

## License

This project is licensed under the GPL-3.0 License: see the [LICENSE](LICENSE) file for details.

___

## Acknowledgments

- All Sun lab [members](https://neuroai.github.io/sunlab/people) for providing the inspiration and comments during the
  development of this library.
- The creators of all other dependencies and projects listed in the [pyproject.toml](pyproject.toml) file.

___