{"id":13631886,"url":"https://github.com/mosaicml/streaming","last_synced_at":"2025-05-13T19:08:26.394Z","repository":{"id":59179201,"uuid":"501667548","full_name":"mosaicml/streaming","owner":"mosaicml","description":"A Data Streaming Library for Efficient Neural Network Training","archived":false,"fork":false,"pushed_at":"2025-05-12T18:14:50.000Z","size":8321,"stargazers_count":1293,"open_issues_count":93,"forks_count":160,"subscribers_count":27,"default_branch":"main","last_synced_at":"2025-05-12T19:24:21.830Z","etag":null,"topics":["dataset","deep-learning","machine-learning","neural-network","pytorch","streaming"],"latest_commit_sha":null,"homepage":"https://streaming.docs.mosaicml.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mosaicml.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-06-09T13:36:54.000Z","updated_at":"2025-05-12T18:14:53.000Z","dependencies_parsed_at":"2023-11-13T22:22:34.781Z","dependency_job_id":"122d8941-740a-41e3-906f-c964efb4b57d","html_url":"https://github.com/mosaicml/streaming","commit_stats":{"total_commits":206,"total_committers":16,"mean_commits":12.875,"dds":0.5825242718446602,"last_synced_commit":"3afa26cc3b36677c86d4ca842afccbdb763b952e"},"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosaicml%2Fstreaming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosaicml%2Fstreami
ng/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosaicml%2Fstreaming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosaicml%2Fstreaming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mosaicml","download_url":"https://codeload.github.com/mosaicml/streaming/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253806842,"owners_count":21967249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","deep-learning","machine-learning","neural-network","pytorch","streaming"],"created_at":"2024-08-01T22:02:42.506Z","updated_at":"2025-05-13T19:08:26.353Z","avatar_url":"https://github.com/mosaicml.png","language":"Python","funding_links":[],"categories":["Python","Data Stream Processing","分布式机器学习","Databricks / formerly Mosaic ML","7. Training \u0026 Fine-tuning Ecosystem"],"sub_categories":[],"readme":"\u003cbr /\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/mosaicml/streaming#gh-light-mode-only\" class=\"only-light\"\u003e\n      \u003cimg src=\"./docs/source/_static/images/streaming-logo-light-mode.png\" width=\"50%\"/\u003e\n    \u003c/a\u003e\n    \u003c!--pypi website does not support dark mode and does not understand GitHub tag. 
Hence, it renders both the images.\n    The below tag is being used to remove the dark mode image on pypi website.--\u003e\n    \u003c!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN --\u003e\n    \u003ca href=\"https://github.com/mosaicml/streaming#gh-dark-mode-only\" class=\"only-dark\"\u003e\n      \u003cimg src=\"./docs/source/_static/images/streaming-logo-dark-mode.png\" width=\"50%\"/\u003e\n    \u003c/a\u003e\n    \u003c!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_END --\u003e\n\u003c/p\u003e\n\n\u003ch2\u003e\u003cp align=\"center\"\u003eFast, accurate streaming of training data from cloud storage\u003c/p\u003e\u003c/h2\u003e\n\n\u003ch4\u003e\u003cp align='center'\u003e\n\u003ca href=\"https://www.mosaicml.com\"\u003e[Website]\u003c/a\u003e\n- \u003ca href=\"https://docs.mosaicml.com/projects/streaming/en/latest/getting_started/quick_start.html\"\u003e[Quick Start]\u003c/a\u003e\n- \u003ca href=\"https://streaming.docs.mosaicml.com/\"\u003e[Docs]\u003c/a\u003e\n- \u003ca href=\"https://www.databricks.com/company/careers/open-positions?department=Mosaic%20AI\u0026location=all\"\u003e[We're Hiring!]\u003c/a\u003e\n\u003c/p\u003e\u003c/h4\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/mosaicml-streaming/\"\u003e\n        \u003cimg alt=\"PyPi Version\" src=\"https://img.shields.io/pypi/pyversions/mosaicml-streaming\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/mosaicml-streaming/\"\u003e\n        \u003cimg alt=\"PyPi Package Version\" src=\"https://img.shields.io/pypi/v/mosaicml-streaming\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/mosaicml/streaming/actions?query=workflow%3ATest\"\u003e\n        \u003cimg alt=\"Unit test\" src=\"https://github.com/mosaicml/streaming/actions/workflows/pytest.yaml/badge.svg\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pepy.tech/project/mosaicml-streaming/\"\u003e\n        \u003cimg alt=\"PyPi Downloads\" 
src=\"https://static.pepy.tech/personalized-badge/mosaicml-streaming?period=month\u0026units=international_system\u0026left_color=grey\u0026right_color=blue\u0026left_text=Downloads/month\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://streaming.docs.mosaicml.com\"\u003e\n        \u003cimg alt=\"Documentation\" src=\"https://readthedocs.org/projects/streaming/badge/?version=stable\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://dub.sh/mcomm\"\u003e\n        \u003cimg alt=\"Chat @ Slack\" src=\"https://img.shields.io/badge/slack-chat-2eb67d.svg?logo=slack\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/mosaicml/streaming/blob/main/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/badge/License-Apache%202.0-green.svg?logo=slack\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://gurubase.io/g/streaming\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/badge/Gurubase-Ask%20Streaming%20Guru-006BFF\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\u003cbr /\u003e\n\n# 👋 Welcome\n\nWe built StreamingDataset to make training on large datasets from cloud storage as fast, cheap, and scalable as possible.\n\nIt’s specially designed for multi-node, distributed training for large models—maximizing correctness guarantees, performance, and ease of use. Now, you can efficiently train anywhere, independent of your training data location. Just stream in the data you need, when you need it. 
To learn more about why we built StreamingDataset, read our [announcement blog](https://www.mosaicml.com/blog/mosaicml-streamingdataset).\n\nStreamingDataset is compatible with any data type, including **images, text, video, and multimodal data**.\n\nWith support for major cloud storage providers ([AWS](https://aws.amazon.com/s3/), [OCI](https://www.oracle.com/cloud/storage/object-storage/), [GCS](https://cloud.google.com/storage), [Azure](https://azure.microsoft.com/en-us/products/storage/blobs), [Databricks](https://docs.databricks.com/en/storage/index.html), and any S3 compatible object store such as [Cloudflare R2](https://www.cloudflare.com/products/r2/), [Coreweave](https://docs.coreweave.com/storage/object-storage), [Backblaze b2](https://www.backblaze.com/b2/cloud-storage.html), etc. ) and designed as a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class, StreamingDataset seamlessly integrates into your existing training workflows.\n\n![The flow of samples from shards in the cloud to devices in your cluster](docs/source/_static/images/flow.gif)\n\n# 🚀 Getting Started\n\n## 💾 Installation\n\nStreaming can be installed with `pip`:\n\n\u003c!--pytest.mark.skip--\u003e\n```bash\npip install mosaicml-streaming\n```\n\n## 🏁 Quick Start\n\n### 1. 
Prepare Your Data\n\nConvert your raw dataset into one of our supported streaming formats:\n\n- MDS (Mosaic Data Shard) format which can encode and decode any Python object\n- CSV / TSV\n- JSONL\n\n\u003c!--pytest.mark.skip--\u003e\n```python\nimport numpy as np\nfrom PIL import Image\nfrom streaming import MDSWriter\n\n# Local or remote directory in which to store the compressed output files\ndata_dir = 'path-to-dataset'\n\n# A dictionary mapping input fields to their data types\ncolumns = {\n    'image': 'jpeg',\n    'class': 'int'\n}\n\n# Shard compression, if any\ncompression = 'zstd'\n\n# Save the samples as shards using MDSWriter\nwith MDSWriter(out=data_dir, columns=columns, compression=compression) as out:\n    for i in range(10000):\n        sample = {\n            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),\n            'class': np.random.randint(10),\n        }\n        out.write(sample)\n```\n\n### 2. Upload Your Data to Cloud Storage\n\nUpload your streaming dataset to the cloud storage of your choice ([AWS](https://aws.amazon.com/s3/), [OCI](https://www.oracle.com/cloud/storage/object-storage/), or [GCP](https://cloud.google.com/storage)). Below is one example of uploading a directory to an S3 bucket using the [AWS CLI](https://aws.amazon.com/cli/).\n\n\u003c!--pytest.mark.skip--\u003e\n```bash\n$ aws s3 cp --recursive path-to-dataset s3://my-bucket/path-to-dataset\n```\n\n### 3. 
Build a StreamingDataset and DataLoader\n\n\u003c!--pytest.mark.skip--\u003e\n```python\nfrom torch.utils.data import DataLoader\nfrom streaming import StreamingDataset\n\n# Remote path where full dataset is persistently stored\nremote = 's3://my-bucket/path-to-dataset'\n\n# Local working dir where dataset is cached during operation\nlocal = '/tmp/path-to-dataset'\n\n# Create streaming dataset\ndataset = StreamingDataset(local=local, remote=remote, shuffle=True)\n\n# Let's see what is in sample #1337...\nsample = dataset[1337]\nimg = sample['image']\ncls = sample['class']\n\n# Create PyTorch DataLoader\ndataloader = DataLoader(dataset)\n```\n\n### 📚 What next?\n\nGetting started guides, examples, API references, and other useful information can be found in our [docs](https://streaming.docs.mosaicml.com/).\n\nWe have end-to-end tutorials for training a model on:\n\n- [CIFAR-10](https://docs.mosaicml.com/projects/streaming/en/stable/how_to_guides/cifar10.html)\n- [FaceSynthetics](https://github.com/mosaicml/streaming/blob/main/examples/facesynthetics.ipynb)\n- [SyntheticNLP](https://docs.mosaicml.com/projects/streaming/en/stable/how_to_guides/synthetic_nlp.html)\n\nWe also have starter code for the following popular datasets, which can be found in the `streaming` [directory](https://github.com/mosaicml/streaming/tree/main/streaming):\n\n| Dataset | Task | Read | Write |\n| --- | --- | --- | --- |\n| LAION-400M | Text and image | [Read](https://github.com/mosaicml/diffusion-benchmark/blob/main/data.py) | [Write](https://github.com/mosaicml/streaming/tree/main/streaming/multimodal/convert/laion/laion400m) |\n| WebVid | Text and video | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/multimodal/webvid.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/multimodal/convert/webvid.py) |\n| C4 | Text | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/text/c4.py) | 
[Write](https://github.com/mosaicml/streaming/blob/main/streaming/text/convert/c4.py) |\n| EnWiki | Text | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/text/enwiki.py) | [Write](https://github.com/mosaicml/streaming/tree/main/streaming/text/convert/enwiki) |\n| Pile | Text | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/text/pile.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/text/convert/pile.py) |\n| ADE20K | Image segmentation | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/vision/ade20k.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/vision/convert/ade20k.py) |\n| CIFAR10 | Image classification | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/vision/cifar10.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/vision/convert/cifar10.py) |\n| COCO | Image classification | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/vision/coco.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/vision/convert/coco.py) |\n| ImageNet | Image classification | [Read](https://github.com/mosaicml/streaming/blob/main/streaming/vision/imagenet.py) | [Write](https://github.com/mosaicml/streaming/blob/main/streaming/vision/convert/imagenet.py) |\n\n**To start training on these datasets:**\n\n1. Convert raw data into .mds format using the corresponding script from the `convert` directory.\n\nFor example:\n\n\u003c!--pytest.mark.skip--\u003e\n```bash\n$ python -m streaming.multimodal.convert.webvid --in \u003cCSV file\u003e --out \u003cMDS output directory\u003e\n```\n\n2. 
Import the dataset class to start training the model.\n\n\u003c!--pytest.mark.skip--\u003e\n```python\nfrom streaming.multimodal import StreamingInsideWebVid\ndataset = StreamingInsideWebVid(local=local, remote=remote, shuffle=True)\n```\n\n# **🔑** Key Features\n\n---\n\n## Seamless data mixing\n\nEasily experiment with dataset mixtures with [`Stream`](https://docs.mosaicml.com/projects/streaming/en/latest/api_reference/generated/streaming.Stream.html#stream). Dataset sampling can be controlled in relative (proportion) or absolute (repeat or samples) terms. During streaming, the different datasets are streamed, shuffled, and mixed seamlessly just-in-time.\n\n\u003c!--pytest.mark.skip--\u003e\n```python\nfrom streaming import Stream, StreamingDataset\n\n# mix C4, github code, and internal datasets\nstreams = [\n  Stream(remote='s3://datasets/c4', proportion=0.4),\n  Stream(remote='s3://datasets/github', proportion=0.1),\n  Stream(remote='gcs://datasets/my_internal', proportion=0.5),\n]\n\ndataset = StreamingDataset(\n  streams=streams,\n  samples_per_epoch=1e8,\n)\n```\n\n## True Determinism\n\nA unique feature of our solution: samples are in the same order regardless of the number of GPUs, nodes, or CPU workers. This makes it easier to:\n\n- Reproduce and debug training runs and loss spikes\n- Load a checkpoint trained on 64 GPUs and debug on 8 GPUs with reproducibility\n\nSee the figure below: training a model on 1, 8, 16, 32, or 64 GPUs yields the **exact same loss curve** (up to the limitations of floating point math!).\n\n![Plot of elastic determinism](docs/source/_static/images/determinism.png)\n\n## Instant mid-epoch resumption\n\nIt can be expensive, and annoying, to wait for your job to resume while your dataloader spins after a hardware failure or loss spike. 
Thanks to our deterministic sample ordering, StreamingDataset lets you resume training in seconds, not hours, in the middle of a long training run.\n\nMinimizing resumption latency can save thousands of dollars in egress fees and idle GPU compute time compared to existing solutions.\n\n## High throughput\n\nOur MDS format cuts extraneous work to the bone, resulting in ultra-low sample latency and higher throughput compared to alternatives for workloads bottlenecked by the dataloader.\n\n| Tool | Throughput |\n| --- | --- |\n| StreamingDataset | ~19000 img/sec |\n| ImageFolder | ~18000 img/sec |\n| WebDataset | ~16000 img/sec |\n\n*Results shown are from ImageNet + ResNet-50 training, collected over 5 repetitions after the data is cached after the first epoch.*\n\n## Equal convergence\n\nModel convergence from using StreamingDataset is just as good as using local disk, thanks to our shuffling algorithm.\n\n![Plot of equal convergence](docs/source/_static/images/convergence.png)\n\nBelow are results from ImageNet + ResNet-50 training, collected over 5 repetitions.\n\n| Tool | Top-1 Accuracy |\n| --- | --- |\n| StreamingDataset | 76.51% +/- 0.09 |\n| ImageFolder | 76.57% +/- 0.10 |\n| WebDataset | 76.23% +/- 0.17 |\n\nStreamingDataset shuffles across all samples assigned to a node, whereas alternative solutions only shuffle samples in a smaller pool (within a single process). Shuffling across a wider pool spreads out adjacent samples more. In addition, our shuffling algorithm minimizes dropped samples. We have found both of these shuffling features advantageous for model convergence.\n\n## Random access\n\nAccess the data you need when you need it.\n\nEven if a sample isn’t downloaded yet, you can access `dataset[i]` to get sample `i`. 
The download will kick off immediately and the result will be returned when it’s done, similar to a map-style PyTorch dataset with samples numbered sequentially and accessible in any order.\n\n\u003c!--pytest.mark.skip--\u003e\n```python\ndataset = StreamingDataset(...)\nsample = dataset[19543]\n```\n\n## No divisibility requirements\n\nStreamingDataset will happily iterate over any number of samples. You do not have to permanently delete samples so that the dataset is divisible over a baked-in number of devices. Instead, in each epoch a different selection of samples is repeated (none dropped) so that each device processes the same count.\n\n\u003c!--pytest.mark.skip--\u003e\n```python\ndataset = StreamingDataset(...)\ndl = DataLoader(dataset, num_workers=...)\n```\n\n## Disk usage limits\n\nDynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument `cache_limit`. See the [shuffling](./docs/source/fundamentals/shuffling.md) guide for more details.\n\n```python\ndataset = StreamingDataset(\n    cache_limit='100gb',\n    ...\n)\n```\n\n# 🏆 Project Showcase\n\nHere are some projects and experiments that used StreamingDataset. Got something to add?  
Email [mcomm@databricks.com](mailto:mcomm@databricks.com) or join our [Community Slack](https://dub.sh/mcomm).\n\n- [BioMedLM](https://www.mosaicml.com/blog/introducing-pubmed-gpt): a Domain Specific Large Language Model for BioMedicine by MosaicML and Stanford CRFM\n- [Mosaic Diffusion Models](https://www.mosaicml.com/blog/training-stable-diffusion-from-scratch-costs-160k): Training Stable Diffusion from Scratch Costs \u003c$160k\n- [Mosaic LLMs](https://www.mosaicml.com/blog/gpt-3-quality-for-500k): GPT-3 quality for \u003c$500k\n- [Mosaic ResNet](https://www.mosaicml.com/blog/mosaic-resnet): Blazingly Fast Computer Vision Training with the Mosaic ResNet and Composer\n- [Mosaic DeepLabv3](https://www.mosaicml.com/blog/mosaic-image-segmentation): 5x Faster Image Segmentation Training with MosaicML Recipes\n- …more to come! Stay tuned!\n\n# 💫 Contributors\n\nWe welcome any contributions, pull requests, or issues.\n\nTo start contributing, see our [Contributing](https://github.com/mosaicml/streaming/blob/main/CONTRIBUTING.md) page.\n\nP.S.: [We're hiring](https://mosaicml.com/jobs)!\n\nIf you like this project, give us a star **⭐** and check out our other projects:\n\n- **[Composer](https://github.com/mosaicml/composer) -** a modern PyTorch library that makes scalable, efficient neural network training easy\n- **[MosaicML Examples](https://github.com/mosaicml/examples)** - reference examples for training ML models quickly and to high accuracy - featuring starter code for GPT / Large Language Models, Stable Diffusion, BERT, ResNet-50, and DeepLabV3\n- **[MosaicML Cloud](https://www.mosaicml.com/cloud)** - our training platform built to minimize training costs for LLMs, Diffusion Models, and other large models - featuring multi-cloud orchestration, effortless multi-node scaling, and under-the-hood optimizations for speeding up training time\n\n# ✍️ Citation\n\n```\n@misc{mosaicml2022streaming,\n    author = {The Mosaic ML Team},\n    title = {streaming},\n    year = 
{2022},\n    howpublished = {\\url{\u003chttps://github.com/mosaicml/streaming/\u003e}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmosaicml%2Fstreaming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmosaicml%2Fstreaming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmosaicml%2Fstreaming/lists"}