{"id":22675607,"url":"https://github.com/stability-ai/datapipelines","last_synced_at":"2025-04-12T18:08:30.952Z","repository":{"id":178622183,"uuid":"656863565","full_name":"Stability-AI/datapipelines","owner":"Stability-AI","description":"Iterable datapipelines for pytorch training.","archived":false,"fork":false,"pushed_at":"2024-08-31T00:13:13.000Z","size":23,"stargazers_count":83,"open_issues_count":4,"forks_count":17,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-12T18:07:22.705Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Stability-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-21T19:55:42.000Z","updated_at":"2025-04-04T02:10:35.000Z","dependencies_parsed_at":null,"dependency_job_id":"0dd63d49-45ea-4d66-81c4-87d98ccb1caf","html_url":"https://github.com/Stability-AI/datapipelines","commit_stats":null,"previous_names":["stability-ai/datapipelines"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fdatapipelines","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fdatapipelines/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fdatapipelines/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fdatapipelines/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/Stability-AI","download_url":"https://codeload.github.com/Stability-AI/datapipelines/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248610340,"owners_count":21132921,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-09T17:57:46.325Z","updated_at":"2025-04-12T18:08:30.918Z","avatar_url":"https://github.com/Stability-AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# datapipelines\n\n\nIterable datapipelines for PyTorch training.\n\nThe functions `sdata.create_dataset()` and `sdata.create_loader()` provide interfaces for your PyTorch training code: the former returns a dataset, the latter a wrapper around a PyTorch dataloader.\n\nA dataset as returned by `sdata.create_dataset()` consists of 5 main modules, which should be defined in a YAML config:\n1. A base [datapipeline](./sdata/datapipeline.py#L306), which reads data as tar files from the local filesystem and assembles them into samples. Each sample comes as a Python dict.\n2. A list of [preprocessors](sdata/dataset.py#L129), which can be used either to transform the entries of a sample or to filter out unsuitable samples. The former are called `mappers`, the latter `filters`. This repository provides a basic set of [mappers](sdata/mappers) and [filters](sdata/filters) that cover general (not too application-specific) data transforms and filters.\n3. 
A list of [decoders](sdata/dataset.py#L127) whose elements can be defined either as a string matching one of the predefined webdataset [image decoders](https://github.com/webdataset/webdataset/blob/039d74319ae55e5696dcef89829be9671802cf70/webdataset/autodecode.py#L238) or as a custom decoder (in the config style) for handling more specific needs. Note that decoding is skipped altogether when setting `decoders=None` (or, in config-style YAML, `decoders: null`).\n4. A list of [postprocessors](sdata/dataset.py#L130), which are used to filter or transform the data after it has been decoded and should again be either `mappers` or `filters`.\n5. `error_handler`: A [webdataset-style function](https://github.com/webdataset/webdataset/blob/main/webdataset/handlers.py) for handling any errors which occur in the `datapipeline`, `preprocessors`, `decoders` or `postprocessors`.\n\nA wrapper around a PyTorch dataloader, which can be plugged into your training, is returned by [`sdata.create_loader()`](sdata/dataset.py#L51). You can pass the dataset either as an `IterableDataset`, as returned by `sdata.create_dataset()`, or via a config from which the dataset is instantiated. Apart from the known `batch_size`, `num_workers`, `partial` and `collation_fn` parameters for PyTorch dataloaders, the function can be configured via the following arguments:\n\n1. `batched_transforms`: batched `mappers` and `filters` which transform an entire training batch before it is passed to the dataloader, defined in the same style as the `preprocessors` and `postprocessors` from above.\n2. `loader_kwargs`: additional keyword arguments for the dataloader (such as `prefetch_factor`, ...)\n3. 
`error_handler`: A [webdataset-style function](https://github.com/webdataset/webdataset/blob/main/webdataset/handlers.py) for handling any errors which occur in the `batched_transforms`.\n\n\n## Examples\n\nThe configs in `examples/configs/` are the best place to see how the following examples work.\n\nFor a simple example, see [`examples/image_simple.py`](examples/image_simple.py); its config is [here](examples/configs/example.yaml).\n\n**NOTE:** You have to add your own dataset as tar files, which should follow the [webdataset-format](https://github.com/webdataset/webdataset). To find the parts which have to be adapted, search for comments containing `USER:` in the respective config.\n\n## Installation\n\n### PyTorch 2 and later\n\n```bash\npython3 -m venv .pt2\nsource .pt2/bin/activate\npip3 install wheel\npip3 install -r requirements_pt2.txt\n\n```\n\n### PyTorch 1.13\n\n```bash\npython3 -m venv .pt1\nsource .pt1/bin/activate\npip3 install wheel\npip3 install -r requirements_pt1.txt\n\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstability-ai%2Fdatapipelines","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstability-ai%2Fdatapipelines","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstability-ai%2Fdatapipelines/lists"}