{"id":20848156,"url":"https://github.com/tlcfem/msglc","last_synced_at":"2025-05-12T02:32:24.430Z","repository":{"id":225946584,"uuid":"767318051","full_name":"TLCFEM/msglc","owner":"TLCFEM","description":"🗜️ (de)serialize json objects with lazy/partial loading containers using msgpack | msgpack.org[Python]","archived":false,"fork":false,"pushed_at":"2024-11-03T16:00:19.000Z","size":723,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-03T17:16:40.108Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://tlcfem.github.io/msglc/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TLCFEM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-05T04:27:35.000Z","updated_at":"2024-11-03T16:00:23.000Z","dependencies_parsed_at":"2024-03-05T05:31:18.262Z","dependency_job_id":"9299e8c6-27e0-4ab7-ad99-59e55ea35816","html_url":"https://github.com/TLCFEM/msglc","commit_stats":null,"previous_names":["tlcfem/msglc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TLCFEM%2Fmsglc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TLCFEM%2Fmsglc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TLCFEM%2Fmsglc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TLCFEM%2Fmsglc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TLCFEM",
"download_url":"https://codeload.github.com/TLCFEM/msglc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225116083,"owners_count":17423175,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-18T02:24:46.333Z","updated_at":"2024-11-18T02:24:47.062Z","avatar_url":"https://github.com/TLCFEM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# msglc --- (de)serialize json objects with lazy/partial loading containers using msgpack\n\n[![codecov](https://codecov.io/gh/TLCFEM/msglc/graph/badge.svg?token=JDPARZSVDR)](https://codecov.io/gh/TLCFEM/msglc)\n[![PyPI version](https://badge.fury.io/py/msglc.svg)](https://pypi.org/project/msglc/)\n\n## What\n\n`msglc` is a Python library that provides a way to serialize and deserialize json objects with lazy/partial loading\ncontainers using `msgpack` as the serialization format.\n\nIt can be used in environments that use `msgpack` to store/exchange data that is larger than a few MBs if any of the\nfollowings hold.\n\n1. After cold storage, each retrieval only accesses part of the stored data.\n2. Cannot afford to decode the whole file due to memory limitation, performance consideration, etc.\n3. 
Want to combine encoded data into a single blob without decoding and re-encoding the same piece of data.\n\nOne may want to check the [benchmark](https://github.com/TLCFEM/msglc/wiki/Benchmark).\n\n## Quick Start\n\n### Serialization\n\nUse `dump` to serialize a json object to a file.\n\n```python\nfrom msglc import dump\n\ndata = {\"a\": [1, 2, 3], \"b\": {\"c\": 4, \"d\": 5, \"e\": [0x221548313] * 10}}\ndump(\"data.msg\", data)\n```\n\nUse `combine` to combine several serialized files together.\nThe combined files can be further combined.\n\n#### Combine as `dict`\n\n```python\nfrom msglc import dump, combine, FileInfo\nfrom msglc.reader import LazyReader\n\ndump(\"dict.msg\", {str(v): v for v in range(1000)})\ndump(\"list.msg\", [float(v) for v in range(1000)])\n\ncombine(\"combined.msg\", [FileInfo(\"dict.msg\", \"dict\"), FileInfo(\"list.msg\", \"list\")])\n# support recursively combining files\n# ...\n\n# the combined file uses a dict layout\n# { 'dict' : {'1':1,'2':2,...}, 'list' : [1.0,2.0,3.0,...] }\n# so one can read it as follows, details in coming section\nwith LazyReader(\"combined.msg\") as reader:\n    assert reader['dict/101'] == 101  # also reader['dict'][101]\n    assert reader['list/101'] == 101.0  # also reader['list'][101]\n```\n\n#### Combine as `list`\n\n```python\nfrom msglc import dump, combine, FileInfo\nfrom msglc.reader import LazyReader\n\ndump(\"dict.msg\", {str(v): v for v in range(1000)})\ndump(\"list.msg\", [float(v) for v in range(1000)])\n\ncombine(\"combined.msg\", [FileInfo(\"dict.msg\"), FileInfo(\"list.msg\")])\n# support recursively combining files\n# ...\n\n# the combined file uses a list layout\n# [ {'1':1,'2':2,...}, [1.0,2.0,3.0,...] 
]\n# so one can read it as follows, details in coming section\nwith LazyReader(\"combined.msg\") as reader:\n    assert reader['0/101'] == 101  # also reader[0][101]\n    assert reader['1/101'] == 101.0  # also reader[1][101]\n```\n\n### Deserialization\n\nUse `LazyReader` to read a file.\n\n```python\nfrom msglc.reader import LazyReader, to_obj\n\nwith LazyReader(\"data.msg\") as reader:\n    data = reader.read()  # return a LazyDict, LazyList, dict, list or primitive value\n    data = reader[\"b/c\"]  # subscriptable if the actual data is subscriptable\n    # data = reader[2:]  # also support slicing if its underlying data is list compatible\n    data = reader.read(\"b/c\")  # or provide a path to visit a particular node\n    print(data)  # 4\n    b_dict = reader.read(\"b\")\n    print(b_dict.__class__)  # \u003cclass 'msglc.reader.LazyDict'\u003e\n    for k, v in b_dict.items():  # dict compatible\n        if k != \"e\":\n            print(k, v)  # c 4, d 5\n    b_json = to_obj(b_dict)  # ensure plain dict\n```\n\nPlease note that all data operations must be performed inside the `with` block.\n\nAll data is lazily loaded. Use the `to_obj()` function to ensure it is properly read, especially when the data is needed outside\nthe `with` block.\n\nIf there is no need to cache the read data, pass the argument `cached=False` to the initializer.\n\n```python\nfrom msglc.reader import LazyReader, to_obj\n\nwith LazyReader(\"data.msg\", cached=False) as reader:\n    data = to_obj(reader.read('some/path/to/the/target'))\n```\n\n## Why\n\nThe `msgpack` specification and the corresponding Python library `msgpack` provide a tool to serialize json objects into\nbinary data.\nHowever, the encoded data has to be fully decoded to reveal what is inside.\nThis becomes an issue when the data is large and only a small part of it is needed.\n\n`msglc` provides an enhanced format to embed structure information into the encoded data.\nThis allows lazy and partial decoding of the data of interest, 
which can be a significant performance improvement.\n\n## How\n\n### Overview\n\n`msglc` packs tables of contents and data into a single binary blob. The detailed layout is as follows.\n\n```text\n#####################################################################\n# magic bytes # 20 bytes # encoded data # encoded table of contents #\n#####################################################################\n```\n\n1. The magic bytes are used to identify the format of the file.\n2. The 20 bytes are used to store the start position and the length of the encoded table of contents.\n3. The encoded data is the original msgpack encoded data.\n\nThe table of contents is placed at the end of the file to allow direct writing of the encoded data to the file.\nThis makes the memory footprint small.\n\n### Buffering\n\nOne can configure the buffer size for reading and writing.\n\n```python\nfrom msglc.config import configure\n\nconfigure(write_buffer_size=2 ** 23)\nconfigure(read_buffer_size=2 ** 16)\n```\n\nCombining multiple files into a single one requires copying data from one file to another.\nAdjust `copy_chunk_size` to control memory footprint.\n\n```python\nfrom msglc.config import configure\n\nconfigure(copy_chunk_size=2 ** 24)  # 16 MB\n```\n\n### Table of Contents\n\nThere are two types of containers in json objects: array and object.\nThey correspond to `list` and `dict` in Python, respectively.\n\nThe table of contents mimics the structure of the original json object.\nHowever, only containers that exceed a certain size are included in the table of contents.\nThis size is configurable and can often be set to a multiple of the block size of the storage system.\n\n```python\nfrom msglc.config import configure\n\nconfigure(small_obj_optimization_threshold=2 ** 20)\n```\n\nThe above configuration sets a threshold of 1 MB: containers larger than 1 MB will be indexed in the table of\ncontents.\nTo achieve optimal performance, one shall configure this value 
according to the underlying file system.\n\nThe basic structure of the table of contents of any object is a `dict` with two keys: `t` (toc) and `p` (position).\nThe `t` field only exists when the object is a **sufficiently large container**.\n\nIf all the elements in the container are small, the `t` field will also be omitted.\n\nFor the purpose of demonstration, the size threshold is set to 2 bytes in the following examples.\n\n```python\n# an integer is not a container\ndata = 2154848\ntoc = {\"p\": [0, 5]}\n\n# a string is not a container\ndata = \"a string\"\ntoc = {\"p\": [5, 14]}\n\n# the inner lists contain small elements, so the `t` field is omitted\n# the outer list is larger than 2 bytes, so the `t` field is included\ndata = [[1, 1], [2, 2, 2, 2, 2]]\ntoc = {\"t\": [{\"p\": [15, 18]}, {\"p\": [18, 24]}], \"p\": [14, 24]}\n\n# the outer dict is larger than 2 bytes, so the `t` field is included\n# the `b` field is not a container\n# the `aa` field is a container, but all its elements are small, so the `t` field is omitted\ndata = {'a': {'aa': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]}, 'b': 2}\ntoc = {\"t\": {\"a\": {\"t\": {\"aa\": {\"p\": [31, 42]}}, \"p\": [27, 42]}, \"b\": {\"p\": [44, 45]}}, \"p\": [24, 45]}\n```\n\nDue to the presence of the size threshold, the table of contents only requires a small amount of extra space.\n\n### Reading\n\nThe table of contents is read first. The actual data is represented by `LazyDict` and `LazyList` classes, which have\nsimilar interfaces to the original `dict` and `list` classes in Python.\n\nAs long as the table of contents contains the `t` field, no actual data is read.\nEach piece of data is read only when it is accessed, and it is cached for future use.\nThus, the data is read lazily and will only be read once (unless fast loading is enabled).\n\n### Fast Loading\n\nThere are two ways to read a container into memory:\n\n1. Read the entire container into memory.\n2. 
Read each element of the container into memory one by one.\n\nThe first way only requires one system call, but data may be repeatedly read if some of its children have been read\nbefore.\nThe second way requires multiple system calls, but it ensures that each piece of data is read only once.\nDepending on various factors, one may be faster than the other.\n\nFast loading is a feature that allows the entire data to be read into memory at once.\nThis helps to avoid issuing multiple system calls to read the data, which can be slow if the latency is high.\n\n```python\nfrom msglc.config import configure\n\nconfigure(fast_loading=True)\n```\n\nOne shall also configure the threshold for fast loading.\n\n```python\nfrom msglc.config import configure\n\nconfigure(fast_loading_threshold=0.5)\n```\n\nThe threshold is a fraction between 0 and 1. The above 0.5 means if more than half of the children of a container have\nbeen read already, `to_obj` will use the second way to read the whole container. 
Otherwise, it will use the first way.\n\n### Detection of Long List with Small Elements\n\nLong lists with small elements, such as integers and floats, can be further optimized by grouping elements into blocks\nthat are of the size of `small_obj_optimization_threshold` so that small reads can be avoided.\n\nSet a `trivial_size` to the desired bytes to identify those long lists.\nFor example, the following sets it to 10 bytes, so that long lists of integers and floats will be grouped into blocks.\n64-bit integers and doubles require 8 bytes (data) + 1 byte (type) = 9 bytes.\n\n```python\nfrom msglc.config import configure\n\nconfigure(trivial_size=10)\n```\n\n### Disable GC\n\nTo improve performance, `gc` can be disabled during (de)serialization.\nIt is controlled by a global counter: as long as there is at least one active writer/reader, `gc` will stay disabled.\n\n```python\nfrom msglc.config import configure\n\nconfigure(disable_gc=True)\n```\n\n### Default Values\n\n```python\nfrom dataclasses import dataclass\n\n\n@dataclass\nclass Config:\n    small_obj_optimization_threshold: int = 2 ** 13  # 8KB\n    write_buffer_size: int = 2 ** 23  # 8MB\n    read_buffer_size: int = 2 ** 16  # 64KB\n    fast_loading: bool = True\n    fast_loading_threshold: float = 0.3\n    trivial_size: int = 20\n    disable_gc: bool = True\n    simple_repr: bool = True\n    copy_chunk_size: int = 2 ** 24  # 16MB\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlcfem%2Fmsglc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftlcfem%2Fmsglc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftlcfem%2Fmsglc/lists"}