{"id":17059237,"url":"https://github.com/danijar/granular","last_synced_at":"2025-04-12T17:51:56.455Z","repository":{"id":246445318,"uuid":"821151616","full_name":"danijar/granular","owner":"danijar","description":"Fast dataset format and loader","archived":false,"fork":false,"pushed_at":"2025-01-17T00:38:35.000Z","size":96,"stargazers_count":22,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T12:11:51.104Z","etag":null,"topics":["ai","artificial-intelligence","datasets","machine-learning","multimodal","python","research"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/granular","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danijar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-27T23:29:58.000Z","updated_at":"2025-02-01T19:02:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"37e2a584-9973-4daa-89e3-ffec43de8428","html_url":"https://github.com/danijar/granular","commit_stats":null,"previous_names":["danijar/bags","danijar/granular"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danijar%2Fgranular","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danijar%2Fgranular/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danijar%2Fgranular/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danijar%2Fgranular/manifests","owner_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/owners/danijar","download_url":"https://codeload.github.com/danijar/granular/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248610407,"owners_count":21132920,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","datasets","machine-learning","multimodal","python","research"],"created_at":"2024-10-14T10:33:31.219Z","updated_at":"2025-04-12T17:51:56.429Z","avatar_url":"https://github.com/danijar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI](https://img.shields.io/pypi/v/granular.svg)](https://pypi.python.org/pypi/granular/#history)\n\n# Granular\n\nGranular is a format for datasets, from simple to complex. Each Granular\ndataset is a collection of linked files in [bag file format][bag], a\nseekable container structure. 
Granular comes with a high-performance\ndata loader.\n\n```\npip install granular\n```\n\n[bag]: ...\n\n## Features\n\n- 🚀 **Performance:** High read and write throughput locally and on Cloud.\n- 🔎 **Seeking:** Fast random access from disk by datapoint index.\n- 🎞️ **Sequences:** Datapoints can contain seekable lists of modalities.\n- 🤸 **Flexibility:** User provides encoders and decoders; examples available.\n- 👥 **Sharding:** Store datasets into shards to split processing workloads.\n- 🔄 **Determinism:** Deterministic and resumable global shuffling per epoch.\n- ✅ **Correctness:** A suite of unit tests with high code coverage.\n\n## Quickstart\n\n```python3\nimport pathlib\nimport granular\nimport numpy as np\n\ndirectory = pathlib.Path('./dataset')\n```\n\nWriting\n\n```python\nspec = {\n    'foo': 'int',      # integer\n    'bar': 'utf8[]',   # *list* of strings\n    'baz': 'msgpack',  # packed structure\n    'abc': 'jpg',      # image\n    'xyz': 'array',    # array\n}\n\nwith granular.DatasetWriter(directory, spec, granular.encoders) as writer:\n  for i in range(10):\n    datapoint = {\n        'foo': i,\n        'bar': ['hello'] * i,\n        'baz': {'a': 1},\n        'abc': np.zeros((60, 80, 3), np.uint8),\n        'xyz': np.arange(0, 1 + i, dtype=np.float32),\n    }\n    writer.append(datapoint)\n\nprint(list(directory.glob('*')))\n# ['spec.json', 'refs.bag', 'foo.bag', 'bar.bag', 'baz.bag', 'abc.bag', 'xyz.bag']\n```\n\nReading\n\n```python\nwith granular.DatasetReader(directory, granular.decoders) as reader:\n  print(reader.spec)    # {'foo': 'int', 'bar': 'utf8[]', 'baz': 'msgpack', ...}\n  print(reader.size)    # Dataset size in bytes\n  print(len(reader))    # Number of datapoints\n\n  datapoint = reader[2]\n  print(datapoint['foo'])        # 2\n  print(datapoint['bar'])        # ['hello', 'hello']\n  print(datapoint['abc'].shape)  # (60, 80, 3)\n```\n\nLoading\n\n```python\ndef preproc(datapoint, seed):\n  return {'image': datapoint['abc'], 'label': 
datapoint['foo']}\n\nsource = granular.sources.Epochs(reader, shuffle=True, seed=0)\nsource = granular.sources.Transform(source, preproc)\n\nloader = granular.Loader(source, batch=8, workers=64)\n\nprint(loader.spec)\n# {'image': (np.uint8, (60, 80, 3)), 'label': (np.int64, ())}\n\ndataset = iter(loader)\nfor _ in range(100):\n  batch = next(dataset)\n  print(batch['image'].shape)  # (8, 60, 80, 3)\n```\n\n## Advanced\n\n### Filesystems\n\nCustom filesystems are supported by providing different `Path` implementations.\nFor example, on Google Cloud you can use the `Path` from [elements][elements]\nthat is optimized for data loading throughput:\n\n```python\nimport elements  # pip install elements\n\ndirectory = elements.Path('gs://\u003cbucket\u003e/dataset')\n\nreader = granular.DatasetReader(directory, ...)\nwriter = granular.DatasetWriter(directory, ...)\n```\n\n[elements]: https://github.com/danijar/elements\n\n### Formats\n\nGranular does not impose a serialization solution on the user. Any strings can\nbe used as types in `spec`, as long as their encoder and decoder functions are\nprovided, for example:\n\n```python\nimport msgpack\n\nencoders = {\n    'bytes': lambda x: x,\n    'utf8': lambda x: x.encode('utf-8'),\n    'msgpack': msgpack.packb,\n}\n\ndecoders = {\n    'bytes': lambda x: x,\n    'utf8': lambda x: x.decode('utf-8'),\n    'msgpack': msgpack.unpackb,\n}\n```\n\nExamples of common encode and decode functions are provided in\n[formats.py][formats]. These support NumPy arrays, images, videos, and more.\nThey can be used as `granular.encoders` and `granular.decoders`.\n\n[formats]: https://github.com/danijar/granular/blob/main/granular/formats.py\n\n### Resuming\n\nThe dataloader is fully deterministic and resumable, given only the step and\nseed integers. 
For this, checkpoint the state dictionary returned by\n`loader.save()` and pass this into `loader.load()` when restoring from a checkpoint.\n\n```python\nstate = loader.save()\nprint(state)  # {'step': 100, 'seed': 0}\nloader.load(state)\n```\n\n### Caching\n\nRetrieving a datapoint requires first reading from `refs.bag` to find the\nreferences into the other bag files, and then reading from each of the modality\nbag files. If some of the modalities are small enough, they can be cached in\nRAM by setting `cache_keys`. In general, it is recommended to cache `refs` as\nwell as all small modalities, such as integer labels.\n\nAdditionally, reading from a Bag file requires two read operations. The first\noperation looks at the index table at the end of the file to locate the byte\noffset of the record. The second operation retrieves the actual record. In\ngeneral, it is recommended to cache the index for all Bag files. Together, the\ntables take up `8 * len(spec) * len(reader)` bytes of RAM.\n\n```python\nreader = granular.DatasetReader(\n    directory, decoders,\n    cache_index=True,            # Cache index tables of all bag files in memory.\n    cache_keys=('refs', 'foo'),  # Fully cache refs.bag and foo.bag in memory.\n)\n```\n\n### Masking\n\nIt is possible to load the values of only a subset of keys of a datapoint. For\nthis, provide a mask in addition to the datapoint index. This reduces the\nnumber of read requests to only the bag files that are actually needed:\n\n```python\nprint(reader.spec)  # {'foo': 'int', 'bar': 'utf8', 'baz': 'array'}\n\nmask = {'foo': True, 'baz': True}\ndatapoint = reader[index, mask]\nprint('foo' in datapoint)  # True\nprint('bar' in datapoint)  # False\nprint('baz' in datapoint)  # True\n```\n\n### Sequences\n\nEach dataset is a list of datapoints. Each datapoint is a dictionary with\nstring keys and either individual byte values or lists of byte values. 
To use\nsequence values, add the `[]` suffix to the type in the `spec`:\n\n```python\nspec = {\n    'title': 'utf8',\n    'frames': 'jpg[]',\n    'captions': 'utf8[]',\n    'times': 'int[]',\n}\n```\n\nSequence fields can not only store values of variable length, but also allow\nreading ranges of the value without loading the whole sequence from disk using\nmasking:\n\n```python\navailable = reader.available(index)\nprint(available)\n# {'title': True, 'frames': range(54), 'captions': range(7), 'times': range(7)}\n\nmask = {\n    'title': True,            # Read the title modality.\n    'frames': range(32, 42),  # Read a range of 10 frames.\n    'captions': range(0, 7),  # Read all captions.\n    'times': True,            # Another way to read the full list.\n}\ndatapoint = reader[index, mask]\nprint(len(datapoint['frames']))  # 10\n```\n\nRanges are loaded using a single read operation, corresponding to a single\ndownload request on Cloud infrastructure.\n\n### Sharding\n\nLarge datasets can be stored as a list of smaller datasets to easily parallelize\nprocessing, by processing each smaller dataset individually in a different\nprocess or on a different machine. The shard length specifies the number of\ndatapoints per shard. A good default is to set the number of datapoints such\nthat each shard is around 10 GB in size.\n\n```python\n# Write into a sharded dataset.\nwriter = granular.ShardedDatasetWriter(directory, spec, encoders, shardlen=10000)\n\n# Read from a sharded dataset.\nreader = granular.ShardedDatasetReader(directory, decoders)\n```\n\nThe file structure of a sharded dataset is one folder per shard, named after\nthe shard number. 
Each shard itself is a dataset and can also be read using the\nnon-sharded `granular.DatasetReader`.\n\n```sh\n$ tree ./directory\n.\n├── 000000\n│  ├── spec.json\n│  ├── refs.bag\n│  ├── foo.bag\n│  ├── bar.bag\n│  └── baz.bag\n├── 000001\n│  ├── spec.json\n│  ├── refs.bag\n│  ├── foo.bag\n│  ├── bar.bag\n│  └── baz.bag\n└── ...\n```\n\nWhen processing a dataset with a large number of shards using a smaller number\nof workers, specify `shardstart` and `shardstep` so each worker reads and\nwrites its dedicated subset of shards.\n\n```python\n# Write into a sharded dataset.\nwriter = granular.ShardedDatasetWriter(\n    directory, spec, encoders, shardlen=10000,\n    shardstart=worker_id,   # Start writing at this shard.\n    shardstep=num_workers,  # Afterwards, jump this many shards ahead.\n)\n\n# Read from a sharded dataset.\nreader = granular.ShardedDatasetReader(\n    directory, decoders,\n    shardstart=worker_id,   # Start reading at this shard.\n    shardstep=num_workers,  # Afterwards, jump this many shards ahead.\n)\n```\n\n## Questions\n\nIf you have a question, please [file an issue][issues].\n\n[issues]: https://github.com/danijar/granular/issues\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanijar%2Fgranular","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanijar%2Fgranular","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanijar%2Fgranular/lists"}