{"id":20261317,"url":"https://github.com/lightning-ai/litdata","last_synced_at":"2026-02-20T12:03:17.963Z","repository":{"id":223321058,"uuid":"758163683","full_name":"Lightning-AI/litData","owner":"Lightning-AI","description":"Transform datasets at scale. Optimize datasets for fast AI model training.","archived":false,"fork":false,"pushed_at":"2025-05-07T20:32:02.000Z","size":2417,"stargazers_count":468,"open_issues_count":54,"forks_count":65,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-05-07T21:34:55.188Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lightning-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-02-15T18:44:16.000Z","updated_at":"2025-05-07T20:28:25.000Z","dependencies_parsed_at":"2024-05-19T09:22:54.154Z","dependency_job_id":"ad08263d-9c4a-4623-884d-a3ffe34b5af5","html_url":"https://github.com/Lightning-AI/litData","commit_stats":null,"previous_names":["lightning-ai/lit-data","lightning-ai/litdata"],"tags_count":53,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-AI%2FlitData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-AI%2FlitData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-AI%2FlitData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lightning-AI%2Fl
itData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lightning-AI","download_url":"https://codeload.github.com/Lightning-AI/litData/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252961107,"owners_count":21832179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T11:25:14.623Z","updated_at":"2026-02-20T12:03:17.954Z","avatar_url":"https://github.com/Lightning-AI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\u003ch1\u003e\n  Speed up model training by fixing data loading\n\u003c/h1\u003e  \n\u003cimg src=\"https://pl-flash-data.s3.amazonaws.com/lit_data_logo.webp\" alt=\"LitData\" width=\"800px\"/\u003e\n\n\u0026nbsp;\n\u0026nbsp;\n\n\u003cpre\u003e\nTransform                              Optimize\n  \n✅ Parallelize data processing       ✅ Stream large cloud datasets          \n✅ Create vector embeddings          ✅ Accelerate training by 20x           \n✅ Run distributed inference         ✅ Pause and resume data streaming      \n✅ Scrape websites at scale          ✅ Use remote data without local loading\n\u003c/pre\u003e\n\n---\n\n![PyPI](https://img.shields.io/pypi/v/litdata)\n![Downloads](https://img.shields.io/pypi/dm/litdata)\n![License](https://img.shields.io/github/license/Lightning-AI/litdata)\n[![Discord](https://img.shields.io/discord/1077906959069626439?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)\n\n\u003cp align=\"center\"\u003e\n  \u003ca 
href=\"https://lightning.ai/\"\u003eLightning AI\u003c/a\u003e •\n  \u003ca href=\"#quick-start\"\u003eQuick start\u003c/a\u003e •\n  \u003ca href=\"#speed-up-model-training\"\u003eOptimize data\u003c/a\u003e •\n  \u003ca href=\"#transform-datasets\"\u003eTransform data\u003c/a\u003e •\n  \u003ca href=\"#key-features\"\u003eFeatures\u003c/a\u003e •\n  \u003ca href=\"#benchmarks\"\u003eBenchmarks\u003c/a\u003e •\n  \u003ca href=\"#start-from-a-template\"\u003eTemplates\u003c/a\u003e •\n  \u003ca href=\"#community\"\u003eCommunity\u003c/a\u003e\n\u003c/p\u003e\n\n\u0026nbsp;\n\n\u003ca target=\"_blank\" href=\"https://lightning.ai/docs/overview/optimize-data/optimize-datasets\"\u003e\n  \u003cimg src=\"https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/app-2/get-started-badge.svg\" height=\"36px\" alt=\"Get started\"/\u003e\n\u003c/a\u003e\n\n\u003c/div\u003e\n\n\u0026nbsp;\n\n# Why LitData?\nSpeeding up model training involves more than kernel tuning. Data loading frequently slows down training, because datasets are too large to fit on disk, consist of millions of small files, or stream slowly from the cloud. \n\nLitData provides tools to preprocess and optimize datasets into a format that streams efficiently from any cloud or local source. It also includes a map operator for distributed data processing before optimization. This makes data pipelines faster, cloud-agnostic, and can improve training throughput by up to 20×.\n\n\u0026nbsp;\n\n# Looking for GPUs?\nOver 340,000 developers use [Lightning Cloud](https://lightning.ai/?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata) - purpose-built for PyTorch and PyTorch Lightning. \n- [GPUs](https://lightning.ai/pricing?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata) from $0.19.   \n- [Clusters](https://lightning.ai/clusters?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata): frontier-grade training/inference clusters.   
\n- [AI Studio (vibe train)](https://lightning.ai/studios?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata): workspaces where AI helps you debug, tune and vibe train.\n- [AI Studio (vibe deploy)](https://lightning.ai/studios?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata): workspaces where AI helps you optimize, and deploy models.     \n- [Notebooks](https://lightning.ai/notebooks?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata): Persistent GPU workspaces where AI helps you code and analyze.\n- [Inference](https://lightning.ai/deploy?utm_source=litdata\u0026utm_medium=referral\u0026utm_campaign=litdata): Deploy models as inference APIs.\n\n# Quick start\nFirst, install LitData:\n\n```bash\npip install litdata\n```\n\nChoose your workflow:\n\n🚀 [Speed up model training](#speed-up-model-training)    \n🚀 [Transform datasets](#transform-datasets)\n\n\u0026nbsp;\n\n\u003cdetails\u003e\n  \u003csummary\u003eAdvanced install\u003c/summary\u003e\n\nInstall all the extras\n```bash\npip install 'litdata[extras]'\n```\n\n\u003c/details\u003e\n\n\u0026nbsp;\n\n----\n\n# Speed up model training\nStream datasets directly from cloud storage without local downloads. Choose the approach that fits your workflow:\n\n## Option 1: Start immediately with existing data ⚡⚡\nStream raw files directly from cloud storage - no pre-optimization needed.\n\n```python\nfrom litdata import StreamingRawDataset\nfrom torch.utils.data import DataLoader\n\n# Point to your existing cloud data\ndataset = StreamingRawDataset(\"s3://my-bucket/raw-data/\")\ndataloader = DataLoader(dataset, batch_size=32)\n\nfor batch in dataloader:\n    # Process raw bytes on-the-fly\n    pass\n```\n\n**Key benefits:**\n\n✅ **Instant access:**         Start streaming immediately without preprocessing.    \n✅ **Zero setup time:**        No data conversion or optimization required.    
\n✅ **Native format:**          Work with original file formats (images, text, etc.).    \n✅ **Flexible processing:**    Apply transformations on-the-fly during streaming.    \n✅ **Cloud-native:**           Stream directly from S3, GCS, or Azure storage.    \n\n## Option 2: Optimize for maximum performance ⚡⚡⚡  \nAccelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads with features like loading data subsets, accessing individual samples, and resumable streaming.\n\n**Step 1: Optimize your data (one-time setup)**\n\nTransform raw data into optimized chunks for maximum streaming speed.\nThis step formats the dataset for fast loading by writing data in an efficient chunked binary format.\n\n```python\nimport numpy as np\nfrom PIL import Image\nimport litdata as ld\n\ndef random_images(index):\n    # Replace with your actual image loading here (e.g., .jpg, .png, etc.)\n    # Recommended: use compressed formats like JPEG for better storage and optimized streaming speed\n    # You can also apply resizing or reduce image quality to further increase streaming speed and save space\n    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))\n    fake_labels = np.random.randint(10)\n\n    # You can use any key:value pairs. 
Note that their types must not change between samples, and Python lists must\n    # always contain the same number of elements with the same types\n    data = {\"index\": index, \"image\": fake_images, \"class\": fake_labels}\n\n    return data\n\nif __name__ == \"__main__\":\n    # The optimize function writes data in an optimized format\n    ld.optimize(\n        fn=random_images,                   # the function applied to each input\n        inputs=list(range(1000)),           # the inputs to the function (here it's a list of numbers)\n        output_dir=\"fast_data\",             # optimized data is stored here\n        num_workers=4,                      # the number of workers on the same machine\n        chunk_bytes=\"64MB\"                  # size of each chunk\n    )\n```\n\n**Step 2: Put the data on the cloud**\n\nUpload the data to a [Lightning Studio](https://lightning.ai) (backed by S3) or your own S3 bucket:\n```bash\naws s3 cp --recursive fast_data s3://my-bucket/fast_data\n```\n\n**Step 3: Stream the data during training**\n\nLoad the data by replacing the PyTorch Dataset and DataLoader with the StreamingDataset and StreamingDataLoader.\n\n```python\nimport litdata as ld\n\ndataset = ld.StreamingDataset('s3://my-bucket/fast_data', shuffle=True, drop_last=True)\n\n# Custom collate function to handle the batch (optional)\ndef collate_fn(batch):\n    return {\n        \"image\": [sample[\"image\"] for sample in batch],\n        \"class\": [sample[\"class\"] for sample in batch],\n    }\n\n\ndataloader = ld.StreamingDataLoader(dataset, collate_fn=collate_fn)\nfor sample in dataloader:\n    img, cls = sample[\"image\"], sample[\"class\"]\n```\n\n**Key benefits:**\n\n✅ **Accelerate training:**       Optimized datasets load 20x faster.      \n✅ **Stream cloud datasets:**     Work with cloud data without downloading it.    \n✅ **PyTorch-first:**             Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.    
\n✅ **Easy collaboration:**        Share and access datasets in the cloud, streamlining team projects.     \n✅ **Scale across GPUs:**         Streamed data automatically scales to all GPUs.      \n✅ **Flexible storage:**          Use S3, GCS, Azure, or your own cloud account for data storage.    \n✅ **Compression:**               Reduce your data footprint by using advanced compression algorithms.  \n✅ **Run local or cloud:**        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         \n✅ **Enterprise security:**       Self host or process data on your cloud account with Lightning Studios.  \n\n\u0026nbsp;\n\n----\n\n# Transform datasets\nAccelerate data processing tasks (data scraping, image resizing, embedding creation, distributed inference) by parallelizing (map) the work across many machines at once.\n\nHere's an example that resizes every image in a large dataset:\n\n```python\nimport os\n\nfrom PIL import Image\nimport litdata as ld\n\n# use a local or S3 folder\ninput_dir = \"my_large_images\"     # or \"s3://my-bucket/my_large_images\"\noutput_dir = \"my_resized_images\"  # or \"s3://my-bucket/my_resized_images\"\n\ninputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]\n\n# resize the input image and save it under the same file name in output_dir\ndef resize_image(image_path, output_dir):\n  output_image_path = os.path.join(output_dir, os.path.basename(image_path))\n  Image.open(image_path).resize((224, 224)).save(output_image_path)\n\nld.map(\n    fn=resize_image,\n    inputs=inputs,\n    output_dir=output_dir,\n)\n```\n\n**Key benefits:**\n\n✅ Parallelize processing:    Reduce processing time by transforming data across multiple machines simultaneously.    \n✅ Scale to large data:       Increase the size of datasets you can efficiently handle.    \n✅ Flexible use cases:        Resize images, create embeddings, scrape the internet, etc.    \n✅ Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.   
      \n✅ Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  \n\n\u0026nbsp;\n\n----\n\n# Key Features\n\n## Features for optimizing and streaming datasets for model training\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Stream raw datasets from cloud storage (beta) \u003ca id=\"stream-raw\" href=\"#stream-raw\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n  \u0026nbsp;\n\nEffortlessly stream raw files (images, text, etc.) directly from S3, GCS, and Azure cloud storage without any optimization or conversion. Ideal for workflows requiring instant access to original data in its native format.\n\n**Prerequisites:**\n\nInstall the required dependencies to stream raw datasets from cloud storage like **Amazon S3** or **Google Cloud Storage**:\n\n```bash\n# for aws s3\npip install \"litdata[extra]\" s3fs\n\n# for gcloud storage\npip install \"litdata[extra]\" gcsfs\n```\n\n**Usage Example:**\n```python\nfrom torch.utils.data import DataLoader\nfrom litdata import StreamingRawDataset\n\ndataset = StreamingRawDataset(\"s3://bucket/files/\")\n\n# Use with PyTorch DataLoader\nloader = DataLoader(dataset, batch_size=32)\nfor batch in loader:\n    # Each item is raw bytes\n    pass\n```\n\n\u003e Use `StreamingRawDataset` to stream your data as-is. Use `StreamingDataset` for fastest streaming after optimizing your data.\n\n\nYou can also customize how files are grouped by subclassing `StreamingRawDataset` and overriding the `setup` method. 
This is useful for pairing related files (e.g., image and mask, audio and transcript) or any custom grouping logic.\n\n```python\nfrom typing import Union\nfrom torch.utils.data import DataLoader\nfrom litdata import StreamingRawDataset\nfrom litdata.raw.indexer import FileMetadata\n\nclass SegmentationRawDataset(StreamingRawDataset):\n    def setup(self, files: list[FileMetadata]) -\u003e Union[list[FileMetadata], list[list[FileMetadata]]]:\n        # TODO: Implement your custom grouping logic here.\n        # For example, group files by prefix, extension, or any rule you need.\n        # Return a list of groups, where each group is a list of FileMetadata.\n        # Example:\n        #   return [[image, mask], ...]\n        pass\n\n# Initialize the custom dataset\ndataset = SegmentationRawDataset(\"s3://bucket/files/\")\nloader = DataLoader(dataset, batch_size=32)\nfor item in loader:\n    # Each item in the batch is a pair: [image_bytes, mask_bytes]\n    pass\n```\n\n**Smart Index Caching**\n\n`StreamingRawDataset` automatically caches the file index for instant startup. The first run scans the files, then builds and caches the index; subsequent runs load the cached index instantly.\n\n**Two-Level Cache:**\n- **Local:** Stored in your cache directory for instant access\n- **Remote:** Automatically saved to cloud storage (e.g., `s3://bucket/files/index.json.zstd`) for reuse\n\n**Force Rebuild:**\n```python\n# When dataset files have changed\ndataset = StreamingRawDataset(\"s3://bucket/files/\", recompute_index=True)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Stream large cloud datasets \u003ca id=\"stream-large\" href=\"#stream-large\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nUse data stored on the cloud without needing to download it all to your computer, saving time and space.\n\nImagine you're working on a project with a huge amount of data stored online. 
Instead of waiting hours to download it all, you can start working with the data almost immediately by streaming it.\n\nOnce you've optimized the dataset with LitData, stream it as follows:\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\ndataset = StreamingDataset('s3://my-bucket/my-data', shuffle=True)\ndataloader = StreamingDataLoader(dataset, batch_size=64)\n\nfor batch in dataloader:\n    process(batch)  # Replace with your data processing logic\n\n```\n\n\nAdditionally, you can inject client connection settings for [S3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html#boto3.session.Session.client) or GCP when initializing your dataset. This is useful for specifying custom endpoints and credentials per dataset.\n\n```python\nfrom litdata import StreamingDataset\n\n# boto3 compatible storage options for a custom S3-compatible endpoint\nstorage_options = {\n    \"endpoint_url\": \"your_endpoint_url\",\n    \"aws_access_key_id\": \"your_access_key_id\",\n    \"aws_secret_access_key\": \"your_secret_access_key\",\n}\n\ndataset = StreamingDataset('s3://my-bucket/my-data', storage_options=storage_options)\n```\n\nAlso, you can specify a custom cache directory when initializing your dataset. 
This is useful when you want to store the cache in a specific location.\n```python\nfrom litdata import StreamingDataset\n\n# Initialize the StreamingDataset with the custom cache directory\ndataset = StreamingDataset('s3://my-bucket/my-data', cache_dir=\"/path/to/cache\")\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Stream Hugging Face 🤗 datasets \u003ca id=\"stream-hf\" href=\"#stream-hf\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\nTo use your favorite  Hugging Face dataset with LitData, simply pass its URL to `StreamingDataset`.\n\n\u003cdetails\u003e\n  \u003csummary\u003eHow to get HF dataset URI?\u003c/summary\u003e\n\nhttps://github.com/user-attachments/assets/3ba9e2ef-bf6b-41fc-a578-e4b4113a0e72\n\n\u003c/details\u003e\n\n**Prerequisites:**\n\nInstall the required dependencies to stream Hugging Face datasets:\n```sh\npip install \"litdata[extra]\" huggingface_hub\n\n# Optional: To speed up downloads on high-bandwidth networks\npip install hf_transfer\nexport HF_HUB_ENABLE_HF_TRANSFER=1\n```\n\n**Stream Hugging Face dataset:**\n\n```python\nimport litdata as ld\n\n# Define the Hugging Face dataset URI\nhf_dataset_uri = \"hf://datasets/leonardPKU/clevr_cogen_a_train/data\"\n\n# Create a streaming dataset\ndataset = ld.StreamingDataset(hf_dataset_uri)\n\n# Print the first sample\nprint(\"Sample\", dataset[0])\n\n# Stream the dataset using StreamingDataLoader\ndataloader = ld.StreamingDataLoader(dataset, batch_size=4)\nfor sample in dataloader:\n    pass \n```\n\nYou don’t need to worry about indexing the dataset or any other setup. 
**LitData** will **handle all the necessary steps automatically** and `cache` the `index.json` file, so you won't have to index it again.\n\nThis ensures that the next time you stream the dataset, the indexing step is skipped.\n\n\u0026nbsp;\n\n### Indexing the HF dataset (Optional)\n\nIf the Hugging Face dataset hasn't been indexed yet, you can index it first using the `index_hf_dataset` function, and then stream it using the code above.\n\n```python\nimport litdata as ld\n\nhf_dataset_uri = \"hf://datasets/leonardPKU/clevr_cogen_a_train/data\"\n\nld.index_hf_dataset(hf_dataset_uri)\n```\n\n- Indexing the Hugging Face dataset ahead of time makes streaming a bit faster, as it avoids the need for real-time indexing during streaming.\n\n- To use a gated Hugging Face dataset, ensure the `HF_TOKEN` environment variable is set.\n\n**Note**: For Hugging Face datasets, `indexing` and `streaming` are supported only for datasets in **`Parquet format`**.\n\n\u0026nbsp;\n\n### Full Workflow for Hugging Face Datasets\n\nFor full control over the cache path (where the `index.json` file will be stored) and other configurations, follow these steps:\n\n1. Index the Hugging Face dataset first:\n\n```python\nimport litdata as ld\n\nhf_dataset_uri = \"hf://datasets/open-thoughts/OpenThoughts-114k/data\"\n\nld.index_parquet_dataset(hf_dataset_uri, \"hf-index-dir\")\n```\n\n2. 
To stream HF datasets now, pass the `HF dataset URI`, the path where the `index.json` file is stored, and `ParquetLoader` as the `item_loader` to the **`StreamingDataset`**:\n\n```python\nimport litdata as ld\nfrom litdata.streaming.item_loader import ParquetLoader\n\nhf_dataset_uri = \"hf://datasets/open-thoughts/OpenThoughts-114k/data\"\n\ndataset = ld.StreamingDataset(hf_dataset_uri, item_loader=ParquetLoader(), index_path=\"hf-index-dir\")\n\nfor batch in ld.StreamingDataLoader(dataset, batch_size=4):\n  pass\n```\n\n\u0026nbsp;\n\n### LitData `Optimize` v/s `Parquet`\n\u003c!-- TODO: Update benchmark --\u003e\nBelow is the benchmark for the `Imagenet dataset (155 GB)`, demonstrating that **`optimizing the dataset using LitData is faster and results in smaller output size compared to raw Parquet files`**.\n\n| **Operation**                    | **Size (GB)** | **Time (seconds)** | **Throughput (images/sec)** |\n|-----------------------------------|---------------|---------------------|-----------------------------|\n| LitData Optimize Dataset          | 45            | 283.17             | 4000-4700                  |\n| Parquet Optimize Dataset          | 51            | 465.96             | 3600-3900                  |\n| Index Parquet Dataset (overhead)  | N/A           | 6                  | N/A                         |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Streams on multi-GPU, multi-node \u003ca id=\"multi-gpu\" href=\"#multi-gpu\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\nData optimized and loaded with Lightning automatically streams efficiently in distributed training across GPUs or multi-node.\n\nThe `StreamingDataset` and `StreamingDataLoader` automatically make sure each rank receives the same quantity of varied batches of data, so it works out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning 
Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) to do distributed training.\n\nHere you can see an illustration showing how the Streaming Dataset works with multi node / multi gpu under the hood.\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\n# For the training dataset, don't forget to enable shuffle and drop_last !!! \ntrain_dataset = StreamingDataset('s3://my-bucket/my-train-data', shuffle=True, drop_last=True)\ntrain_dataloader = StreamingDataLoader(train_dataset, batch_size=64)\n\nfor batch in train_dataloader:\n    process(batch)  # Replace with your data processing logic\n\nval_dataset = StreamingDataset('s3://my-bucket/my-val-data', shuffle=False, drop_last=False)\nval_dataloader = StreamingDataLoader(val_dataset, batch_size=64)\n\nfor batch in val_dataloader:\n    process(batch)  # Replace with your data processing logic\n```\n\n![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Stream from multiple cloud providers \u003ca id=\"cloud-providers\" href=\"#cloud-providers\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\nThe `StreamingDataset` provides support for reading optimized datasets from common cloud storage providers like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage. Below are examples of how to use StreamingDataset with each cloud provider.\n\n```python\nimport os\nimport litdata as ld\n\n# Read data from AWS S3 using boto3\naws_storage_options={\n    \"aws_access_key_id\": os.environ['AWS_ACCESS_KEY_ID'],\n    \"aws_secret_access_key\": os.environ['AWS_SECRET_ACCESS_KEY'],\n}\n# You can also pass the session options. 
(for boto3 only)\naws_session_options = {\n  \"profile_name\": os.environ['AWS_PROFILE_NAME'],  # Required only for custom profiles\n  \"region_name\": os.environ['AWS_REGION_NAME'],    # Required only for custom regions\n}\ndataset = ld.StreamingDataset(\"s3://my-bucket/my-data\", storage_options=aws_storage_options, session_options=aws_session_options)\n\n# Read data from AWS S3 with unsigned requests using boto3\nimport botocore.config  # provides botocore.config.Config and botocore.UNSIGNED\n\naws_storage_options={\n  \"config\": botocore.config.Config(\n        retries={\"max_attempts\": 1000, \"mode\": \"adaptive\"}, # Configure retries for S3 operations\n        signature_version=botocore.UNSIGNED, # Use unsigned requests\n  )\n}\ndataset = ld.StreamingDataset(\"s3://my-bucket/my-data\", storage_options=aws_storage_options)\n\naws_storage_options={\n    \"AWS_ACCESS_KEY_ID\": os.environ['AWS_ACCESS_KEY_ID'],\n    \"AWS_SECRET_ACCESS_KEY\": os.environ['AWS_SECRET_ACCESS_KEY'],\n    \"S3_ENDPOINT_URL\": os.environ['AWS_ENDPOINT_URL'],  # Required only for custom endpoints\n}\ndataset = ld.StreamingDataset(\"s3://my-bucket/my-data\", storage_options=aws_storage_options)\n\n\n# Read data from GCS\ngcp_storage_options={\n    \"project\": os.environ['PROJECT_ID'],\n}\ndataset = ld.StreamingDataset(\"gs://my-bucket/my-data\", storage_options=gcp_storage_options)\n\n# Read data from Azure\nazure_storage_options={\n    \"account_url\": f\"https://{os.environ['AZURE_ACCOUNT_NAME']}.blob.core.windows.net\",\n    \"credential\": os.environ['AZURE_ACCOUNT_ACCESS_KEY']\n}\ndataset = ld.StreamingDataset(\"azure://my-bucket/my-data\", storage_options=azure_storage_options)\n```\n\n\u003c/details\u003e  \n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Pause, resume data streaming \u003ca id=\"pause-resume\" href=\"#pause-resume\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nStream data during long training runs and, if interrupted, pick up right where you 
left off without any issues.\n\nLitData provides a stateful `Streaming DataLoader`, i.e., you can `pause` and `resume` your training whenever you want.\n\nInfo: The `Streaming DataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md) to pretrain LLMs. Restarting from an older checkpoint was critical to finish pretraining the full model after several failures (network, CUDA errors, etc.).\n\n```python\nimport os\nimport torch\nfrom litdata import StreamingDataset, StreamingDataLoader\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\", shuffle=True)\ndataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)\n\n# Restore the dataloader state if a saved state exists\nif os.path.isfile(\"dataloader_state.pt\"):\n    state_dict = torch.load(\"dataloader_state.pt\")\n    dataloader.load_state_dict(state_dict)\n\n# Iterate over the data\nfor batch_idx, batch in enumerate(dataloader):\n\n    # Store the state every 1000 batches\n    if batch_idx % 1000 == 0:\n        torch.save(dataloader.state_dict(), \"dataloader_state.pt\")\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Use shared queue for Optimizing \u003ca id=\"shared-queue\" href=\"#shared-queue\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nIf you are using multiple workers to optimize your dataset, you can use a shared queue to speed up the process.\n\nThis is especially useful when optimizing large datasets in parallel, where some workers may be slower than others.\n\nIt can also improve fault tolerance when workers fail due to out-of-memory (OOM) errors.\n\n```python\nimport numpy as np\nfrom PIL import Image\nimport litdata as ld\n\ndef random_images(index):\n    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))\n    fake_labels = np.random.randint(10)\n\n    data = {\"index\": index, \"image\": fake_images, \"class\": fake_labels}\n\n    return data\n\nif 
__name__ == \"__main__\":\n    # The optimize function writes data in an optimized format.\n    ld.optimize(\n        fn=random_images,                   # the function applied to each input\n        inputs=list(range(1000)),           # the inputs to the function (here it's a list of numbers)\n        output_dir=\"fast_data\",             # optimized data is stored here\n        num_workers=4,                      # The number of workers on the same machine\n        chunk_bytes=\"64MB\" ,                 # size of each chunk\n        keep_data_ordered=False,             # Use a shared queue to speed up the process\n    )\n```\n\n\n### Performance Difference between using a shared queue and not using it:\n\n**Note**: The following benchmarks were collected using the ImageNet dataset on an A10G machine with 16 workers.\n\n| Configuration    | Optimize Time (sec) | Stream 1 (img/sec) | Stream 2 (img/sec) |\n|------------------|---------------------|---------------------|---------------------|\n| shared_queue (`keep_data_ordered=False`)     | 1281                | 5392                | 5732                |\n| no shared_queue (`keep_data_ordered=True (default)`)  | 1187                | 5257                | 5746                |\n\n📌 Note: The **shared_queue** option impacts optimization time, not streaming speed.\n\u003e While the streaming numbers may appear slightly different, this variation is incidental and not caused by shared_queue.\n\u003e\n\u003e Streaming happens after optimization and does not involve inter-process communication where shared_queue plays a role.\n\n- 📄 Using a shared queue helps balance the load across workers, though it may slightly increase optimization time due to the overhead of pickling items sent between processes.\n\n- ⚡ However, it can significantly improve optimizing performance — especially when some workers are slower than others.\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Use a 
\u003ccode\u003eQueue\u003c/code\u003e as input for optimizing data \u003ca id=\"queue-input\" href=\"#queue-input\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nSometimes you don’t have a static list of inputs to optimize — instead, you have a stream of data coming in over time. In such cases, you can use a multiprocessing.Queue to feed data into the optimize() function.\n\n- This is especially useful when you're collecting data from a remote source like a web scraper, socket, or API.\n\n- You can also use this setup to store `replay buffer` data during reinforcement learning and later stream it back for training.\n\n```python\nfrom multiprocessing import Process, Queue\nfrom litdata.processing.data_processor import ALL_DONE\nimport litdata as ld\nimport time\n\ndef yield_numbers():\n    for i in range(1000):\n        time.sleep(0.01)\n        yield (i, i**2)\n\ndef data_producer(q: Queue):\n    for item in yield_numbers():\n        q.put(item)\n\n    q.put(ALL_DONE)  # Sentinel value to signal completion\n\ndef fn(index):\n    return index  # Identity function for demo\n\nif __name__ == \"__main__\":\n    q = Queue(maxsize=100)\n\n    producer = Process(target=data_producer, args=(q,))\n    producer.start()\n\n    ld.optimize(\n        fn=fn,                   # Function to process each item\n        queue=q,                 # 👈 Stream data from this queue\n        output_dir=\"fast_data\",  # Where to store optimized data\n        num_workers=2,\n        chunk_size=100,\n        mode=\"overwrite\",\n    )\n\n    producer.join()\n```\n\n📌 Note: Using queues to optimize your dataset impacts optimization time, not streaming speed.\n\n\u003e Irrespective of number of workers, you only need to put one sentinel value to signal completion.\n\u003e\n\u003e It'll be handled internally by LitData.\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ LLM Pre-training \u003ca id=\"llm-training\" href=\"#llm-training\"\u003e🔗\u003c/a\u003e 
\u003c/summary\u003e\n\u0026nbsp;\n\nLitData is highly optimized for LLM pre-training. First, we need to tokenize the entire dataset and then we can consume it.\n\n```python\nimport json\nfrom pathlib import Path\nimport zstandard as zstd\nfrom litdata import optimize, TokensLoader\nfrom tokenizer import Tokenizer\nfrom functools import partial\n\n# 1. Define a function to convert the text within the jsonl files into tokens\ndef tokenize_fn(filepath, tokenizer=None):\n    with zstd.open(open(filepath, \"rb\"), \"rt\", encoding=\"utf-8\") as f:\n        for row in f:\n            text = json.loads(row)[\"text\"]\n            if json.loads(row)[\"meta\"][\"redpajama_set_name\"] == \"RedPajamaGithub\":\n                continue  # exclude the GitHub data since it overlaps with starcoder\n            text_ids = tokenizer.encode(text, bos=False, eos=True)\n            yield text_ids\n\nif __name__ == \"__main__\":\n    # 2. Generate the inputs (we are going to optimize all the compressed json files from SlimPajama dataset )\n    input_dir = \"./slimpajama-raw\"\n    inputs = [str(file) for file in Path(f\"{input_dir}/SlimPajama-627B/train\").rglob(\"*.zst\")]\n\n    # 3. Store the optimized data wherever you want under \"/teamspace/datasets\" or \"/teamspace/s3_connections\"\n    outputs = optimize(\n        fn=partial(tokenize_fn, tokenizer=Tokenizer(f\"{input_dir}/checkpoints/Llama-2-7b-hf\")), # Note: You can use HF tokenizer or any others\n        inputs=inputs,\n        output_dir=\"./slimpajama-optimized\",\n        chunk_size=(2049 * 8012),\n        # This is important to inform LitData that we are encoding contiguous 1D array (tokens). 
\n        # LitData skips storing per-sample metadata: all the tokens are concatenated to form one large tensor.\n        item_loader=TokensLoader(),\n    )\n```\n\n```python\nimport os\nfrom litdata import StreamingDataset, StreamingDataLoader, TokensLoader\nfrom tqdm import tqdm\n\n# Increase the block size by one because we need the next token as well\ndataset = StreamingDataset(\n  input_dir=\"./slimpajama-optimized\",  # the output_dir passed to optimize() above\n  item_loader=TokensLoader(block_size=2048 + 1),\n  shuffle=True,\n  drop_last=True,\n)\n\ntrain_dataloader = StreamingDataLoader(dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())\n\n# Iterate over the SlimPajama dataset\nfor batch in tqdm(train_dataloader):\n    pass\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Filter illegal data \u003ca id=\"filter-data\" href=\"#filter-data\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nSometimes, you have bad data that you don't want to include in the optimized dataset. With LitData, yield only the samples you want to keep.\n\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef should_keep(index) -\u003e bool:\n  # Replace with your own logic\n  return index % 2 == 0\n\n\ndef fn(data):\n    if should_keep(data):\n        yield data\n\nif __name__ == \"__main__\":\n    optimize(\n        fn=fn,\n        inputs=list(range(1000)),\n        output_dir=\"only_even_index_optimized\",\n        chunk_bytes=\"64MB\",\n        num_workers=1\n    )\n\n    dataset = StreamingDataset(\"only_even_index_optimized\")\n    data = list(dataset)\n    print(data)\n    # [0, 2, 4, 6, 8, 10, ..., 992, 994, 996, 998]\n```\n\nYou can also use `try`/`except` to skip samples that raise an error.  
\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef fn(data):\n    try:\n        yield 1 / data\n    except ZeroDivisionError:\n        pass  # skip inputs that cannot be divided\n\nif __name__ == \"__main__\":\n    optimize(\n        fn=fn,\n        inputs=[0, 0, 0, 1, 2, 4, 0],\n        output_dir=\"only_defined_ratio_optimized\",\n        chunk_bytes=\"64MB\",\n        num_workers=1\n    )\n\n    dataset = StreamingDataset(\"only_defined_ratio_optimized\")\n    data = list(dataset)\n    # The zeros are filtered out because they raise a ZeroDivisionError\n    print(data)\n    # [1.0, 0.5, 0.25]\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Combine datasets \u003ca id=\"combine-datasets\" href=\"#combine-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nMix and match different sets of data to experiment and create better models.\n\nCombine datasets with `CombinedStreamingDataset`. As an example, this mixture of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) \u0026 [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.\n\n```python\nfrom litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader\nfrom tqdm import tqdm\nimport os\n\ntrain_datasets = [\n    StreamingDataset(\n        input_dir=\"s3://tinyllama-template/slimpajama/train/\",\n        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs\n        shuffle=True,\n        drop_last=True,\n    ),\n    StreamingDataset(\n        input_dir=\"s3://tinyllama-template/starcoder/\",\n        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs\n        shuffle=True,\n        drop_last=True,\n    ),\n]\n\n# Mix SlimPajama data and StarCoder data with these proportions:\nweights = (0.693584, 0.306416)\ncombined_dataset = 
CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights, iterate_over_all=False)\n\ntrain_dataloader = StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())\n\n# Iterate over the combined datasets\nfor batch in tqdm(train_dataloader):\n    pass\n```\n\n**Batching Methods**\n\nThe `CombinedStreamingDataset` supports two different batching methods through the `batching_method` parameter:\n\n**Stratified Batching (Default)**:\nWith `batching_method=\"stratified\"` (the default), each batch contains samples from multiple datasets according to the specified weights:\n\n```python\n# Default stratified batching - batches mix samples from all datasets\ncombined_dataset = CombinedStreamingDataset(\n    datasets=[dataset1, dataset2],\n    batching_method=\"stratified\"  # This is the default\n)\n```\n\n**Per-Stream Batching**:\nWith `batching_method=\"per_stream\"`, each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:\n\n```python\n# Per-stream batching - each batch contains samples from only one dataset\ncombined_dataset = CombinedStreamingDataset(\n    datasets=[dataset1, dataset2],\n    batching_method=\"per_stream\"\n)\n\n# This ensures each batch has consistent structure, helpful for datasets with varying:\n# - Image sizes\n# - Sequence lengths\n# - Data types\n# - Feature dimensions\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Parallel streaming \u003ca id=\"parallel-streaming\" href=\"#parallel-streaming\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nWhile `CombinedStreamingDataset` fetches a sample from only one of the datasets it wraps at each iteration, `ParallelStreamingDataset` fetches a sample from every wrapped dataset at each iteration:\n\n```python\nfrom litdata import StreamingDataset, ParallelStreamingDataset, StreamingDataLoader\nfrom tqdm import 
tqdm\n\nparallel_dataset = ParallelStreamingDataset(\n    [\n        StreamingDataset(input_dir=\"input_dir_1\"),\n        StreamingDataset(input_dir=\"input_dir_2\"),\n    ],\n)\n\ndataloader = StreamingDataLoader(parallel_dataset)\n\nfor batch_1, batch_2 in tqdm(dataloader):\n    pass\n```\n\nThis is useful for generating new data on the fly from one sample of each dataset. To do so, provide a `transform` function to `ParallelStreamingDataset`:\n\n```python\nfrom typing import Any, Tuple\n\ndef transform(samples: Tuple[Any, ...]):\n    sample_1, sample_2 = samples  # as many samples as wrapped datasets\n    return sample_1 + sample_2  # example transformation\n\nparallel_dataset = ParallelStreamingDataset([dset_1, dset_2], transform=transform)\n\ndataloader = StreamingDataLoader(parallel_dataset)\n\nfor transformed_batch in tqdm(dataloader):\n    pass\n```\n\nIf the transformation requires random number generation, you can use the internal random number generators provided by `ParallelStreamingDataset`. These are seeded using the current dataset state at the beginning of each epoch, which makes the data transformation reproducible and resumable. To use them, define a `transform` that takes a dictionary of random number generators as its second argument:\n\n```python\nfrom typing import Any, Dict, Tuple\n\ndef transform(samples: Tuple[Any, ...], rngs: Dict[str, Any]):\n    sample_1, sample_2 = samples  # as many samples as wrapped datasets\n    rng = rngs[\"random\"]  # \"random\", \"numpy\" and \"torch\" keys available\n    return rng.random() * sample_1 + rng.random() * sample_2  # example transformation\n\nparallel_dataset = ParallelStreamingDataset([dset_1, dset_2], transform=transform)\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Cycle datasets \u003ca id=\"cycle-datasets\" href=\"#cycle-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\n`ParallelStreamingDataset` can also be used to cycle a `StreamingDataset`. 
This decouples the epoch length from the number of samples in the dataset.\n\nTo do so, set the `length` option to the desired number of samples to yield per epoch. If `length` is greater than the number of samples in the dataset, the dataset is cycled. At the beginning of a new epoch, the dataset resumes from where it left off at the end of the previous epoch.\n\n```python\nfrom litdata import StreamingDataset, ParallelStreamingDataset, StreamingDataLoader\nfrom tqdm import tqdm\n\ndataset = StreamingDataset(input_dir=\"input_dir\")\n\ncycled_dataset = ParallelStreamingDataset([dataset], length=100)\n\nprint(len(cycled_dataset))  # 100\n\ndataloader = StreamingDataLoader(cycled_dataset)\n\n# Each item is a 1-tuple because a single dataset is wrapped\nfor (batch,) in tqdm(dataloader):\n    pass\n```\n\nYou can even set `length` to `float(\"inf\")` for an infinite dataset!\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Merge datasets \u003ca id=\"merge-datasets\" href=\"#merge-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nMerge multiple optimized datasets into one.\n\n```python\nimport numpy as np\nfrom PIL import Image\n\nfrom litdata import StreamingDataset, merge_datasets, optimize\n\n\ndef random_images(index):\n    return {\n        \"index\": index,\n        \"image\": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),\n        \"class\": np.random.randint(10),\n    }\n\n\nif __name__ == \"__main__\":\n    out_dirs = [\"fast_data_1\", \"fast_data_2\", \"fast_data_3\", \"fast_data_4\"]  # or [\"s3://my-bucket/fast_data_1\", ...]\n    for out_dir in out_dirs:\n        optimize(fn=random_images, inputs=list(range(250)), output_dir=out_dir, num_workers=4, chunk_bytes=\"64MB\")\n\n    merged_out_dir = \"merged_fast_data\" # or \"s3://my-bucket/merged_fast_data\"\n    merge_datasets(input_dirs=out_dirs, output_dir=merged_out_dir)\n\n    dataset = StreamingDataset(merged_out_dir)\n    print(len(dataset))\n    # out: 
1000\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Transform datasets while Streaming \u003ca id=\"transform-streaming\" href=\"#transform-streaming\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nTransform datasets on-the-fly while streaming them, allowing for efficient data processing without the need to store intermediate results.\n\n- You can use the `transform` argument in `StreamingDataset` to apply a `transformation function` or `a list of transformation functions` to each sample as it is streamed.\n\n```python\n# Define a simple transform function\ntorch_transform = transforms.Compose([\n  transforms.Resize((256, 256)),       # Resize to 256x256\n  transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)\n  transforms.Normalize(                # Normalize using ImageNet stats\n      mean=[0.485, 0.456, 0.406], \n      std=[0.229, 0.224, 0.225]\n  )\n])\n\ndef transform_fn(x, *args, **kwargs):\n    \"\"\"Define your transform function.\"\"\"\n    return torch_transform(x)  # Apply the transform to the input image\n\n# Create dataset with appropriate configuration\ndataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=[transform_fn])\n```\n\nOr, you can create a subclass of `StreamingDataset` and override its `transform` method to apply custom transformations to each sample.\n\n```python\nclass StreamingDatasetWithTransform(StreamingDataset):\n        \"\"\"A custom dataset class that inherits from StreamingDataset and applies a transform.\"\"\"\n\n        def __init__(self, *args, **kwargs):\n            super().__init__(*args, **kwargs)\n\n            self.torch_transform = transforms.Compose([\n                transforms.Resize((256, 256)),       # Resize to 256x256\n                transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)\n                transforms.Normalize(                # Normalize using ImageNet stats\n                 
   mean=[0.485, 0.456, 0.406],\n                    std=[0.229, 0.224, 0.225]\n                )\n            ])\n\n        # Define your transform method\n        def transform(self, x, *args, **kwargs):\n            \"\"\"A simple transform function.\"\"\"\n            return self.torch_transform(x)\n\n\ndataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Split datasets for train, val, test \u003ca id=\"split-datasets\" href=\"#split-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\nSplit a dataset into train, val, and test splits with `train_test_split`.\n\n```python\nfrom litdata import StreamingDataset, train_test_split\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\") # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n# out: 100,000\n\ntrain_dataset, val_dataset, test_dataset = train_test_split(dataset, splits=[0.3, 0.2, 0.5])\n\nprint(len(train_dataset))\n# out: 30,000\n\nprint(len(val_dataset))\n# out: 20,000\n\nprint(len(test_dataset))\n# out: 50,000\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Load a subset of the remote dataset \u003ca id=\"load-subset\" href=\"#load-subset\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\nWork on a smaller, manageable portion of your data to save time and resources.\n\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\", subsample=0.01) # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n# out: 1000\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Upsample from your source datasets \u003ca id=\"upsample-datasets\" href=\"#upsample-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\nSet `subsample` to a value `N` greater than 1 to control the size of one iteration of a `StreamingDataset` by repeating its data. 
With `subsample=N`, the dataset contains `floor(N)` possibly shuffled copies of the source data, followed by a subsample of the remainder.\n\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\", subsample=2.5, shuffle=True)\n\nprint(len(dataset)) # display the length of your data\n# out: 250000\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Easily modify optimized cloud datasets \u003ca id=\"modify-datasets\" href=\"#modify-datasets\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nAdd new data to an existing dataset or start fresh if needed, providing flexibility in data management.\n\nLitData optimized datasets are assumed to be immutable. However, you can choose to modify them by setting `mode` to either `append` or `overwrite`.\n\n```python\nfrom litdata import optimize, StreamingDataset\n\ndef compress(index):\n    return index, index**2\n\nif __name__ == \"__main__\":\n    # Add some data\n    optimize(\n        fn=compress,\n        inputs=list(range(100)),\n        output_dir=\"./my_optimized_dataset\",\n        chunk_bytes=\"64MB\",\n    )\n\n    # Later on, you add more data\n    optimize(\n        fn=compress,\n        inputs=list(range(100, 200)),\n        output_dir=\"./my_optimized_dataset\",\n        chunk_bytes=\"64MB\",\n        mode=\"append\",\n    )\n\n    ds = StreamingDataset(\"./my_optimized_dataset\")\n    assert len(ds) == 200\n    assert ds[:] == [(i, i**2) for i in range(200)]\n```\n\nThe `overwrite` mode deletes the existing data and starts fresh.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Stream parquet datasets \u003ca id=\"stream-parquet\" href=\"#stream-parquet\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nStream Parquet datasets directly with LitData, with no need to convert them into LitData’s optimized binary format! 
If your dataset is already in Parquet format, you can efficiently index and stream it using `StreamingDataset` and `StreamingDataLoader`.\n\n**Assumption:**\n\nYour dataset directory contains one or more Parquet files.\n\n**Prerequisites:**\n\nInstall the required dependencies to stream Parquet datasets from cloud storage like **Amazon S3** or **Google Cloud Storage**:\n\n```bash\n# For Amazon S3\npip install \"litdata[extra]\" s3fs\n\n# For Google Cloud Storage\npip install \"litdata[extra]\" gcsfs\n```\n\n**Index Your Dataset**: \n\nIndex your Parquet dataset to create an index file that LitData can use to stream the dataset.\n\n```python\nimport litdata as ld\n\n# Point to your data stored in the cloud\npq_dataset_uri = \"s3://my-bucket/my-parquet-data\"  # or \"gs://my-bucket/my-parquet-data\"\n\nld.index_parquet_dataset(pq_dataset_uri)\n```\n\n**Stream the Dataset**\n\nUse `StreamingDataset` with `ParquetLoader` to load and stream the dataset efficiently:\n\n\n```python\nimport litdata as ld\nfrom litdata.streaming.item_loader import ParquetLoader\n\n# Specify your dataset location in the cloud\npq_dataset_uri = \"s3://my-bucket/my-parquet-data\"  # or \"gs://my-bucket/my-parquet-data\"\n\n# Set up the streaming dataset\ndataset = ld.StreamingDataset(pq_dataset_uri, item_loader=ParquetLoader())\n\nprint(\"Sample\", dataset[0])\n\ndataloader = ld.StreamingDataLoader(dataset, batch_size=4)\nfor sample in dataloader:\n    pass\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Use compression \u003ca id=\"compression\" href=\"#compression\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nReduce your data footprint by using advanced compression algorithms.\n\n```python\nimport litdata as ld\n\ndef compress(index):\n    return index, index**2\n\nif __name__ == \"__main__\":\n    # Add some data\n    ld.optimize(\n        fn=compress,\n        inputs=list(range(100)),\n        output_dir=\"./my_optimized_dataset\",\n        
chunk_bytes=\"64MB\",\n        num_workers=1,\n        compression=\"zstd\"\n    )\n```\n\nUsing [zstd](https://github.com/facebook/zstd), this simple example achieves a compression ratio of about 4.3x.\n\n| Without compression | With zstd |\n| -------- | -------- |\n| 2.8 kB | 646 B |\n\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Access samples without full data download \u003ca id=\"access-samples\" href=\"#access-samples\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nLook at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(\"s3://my-bucket/my-data\") # data are stored in the cloud\n\nprint(len(dataset)) # display the length of your data\n\nprint(dataset[42]) # show the sample at index 42\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Use any data transforms \u003ca id=\"data-transforms\" href=\"#data-transforms\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nCustomize how your data is processed to better fit your needs.\n\nSubclass `StreamingDataset` and override its `__getitem__` method to add any extra data transformations.\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\nimport torchvision.transforms.v2.functional as F\n\nclass ImagenetStreamingDataset(StreamingDataset):\n\n    def __getitem__(self, index):\n        image = super().__getitem__(index)\n        return F.resize(image, (224, 224))\n\ndataset = ImagenetStreamingDataset(...)\ndataloader = StreamingDataLoader(dataset, batch_size=4)\n\nfor batch in dataloader:\n    print(batch.shape)\n    # Out: (4, 3, 224, 224)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Profile data loading speed \u003ca id=\"profile-loading\" href=\"#profile-loading\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nMeasure and optimize how fast 
your data is being loaded, improving efficiency.\n\nThe `StreamingDataLoader` supports profiling of your data loading process. Use the `profile_batches` argument to specify the number of batches you want to profile:\n\n```python\nfrom litdata import StreamingDataset, StreamingDataLoader\n\nStreamingDataLoader(..., profile_batches=5)\n```\n\nThis generates a Chrome trace called `result.json`. To visualize it, open `chrome://tracing` in the Chrome browser and load the trace file.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Reduce memory use for large files \u003ca id=\"reduce-memory\" href=\"#reduce-memory\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nHandle large data files efficiently without using too much of your computer's memory.\n\nWhen processing large files like compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), use the Python `yield` keyword to process and store one item at a time, reducing the memory footprint of the entire program.\n\n```python\nfrom pathlib import Path\nimport pyarrow.parquet as pq\nfrom litdata import optimize\nfrom tokenizer import Tokenizer\nfrom functools import partial\n\n# 1. Define a function to convert the text within the parquet files into tokens\ndef tokenize_fn(filepath, tokenizer=None):\n    parquet_file = pq.ParquetFile(filepath)\n    # Process per batch to reduce RAM usage\n    for batch in parquet_file.iter_batches(batch_size=8192, columns=[\"content\"]):\n        for text in batch.to_pandas()[\"content\"]:\n            yield tokenizer.encode(text, bos=False, eos=True)\n\n# 2. Generate the inputs\ninput_dir = \"/teamspace/s3_connections/tinyllama-template\"\ninputs = [str(file) for file in Path(f\"{input_dir}/starcoderdata\").rglob(\"*.parquet\")]\n\n# 3. 
Store the optimized data wherever you want under \"/teamspace/datasets\" or \"/teamspace/s3_connections\"\noutputs = optimize(\n    fn=partial(tokenize_fn, tokenizer=Tokenizer(f\"{input_dir}/checkpoints/Llama-2-7b-hf\")), # Note: Use HF tokenizer or any others\n    inputs=inputs,\n    output_dir=\"/teamspace/datasets/starcoderdata\",\n    chunk_size=(2049 * 8012), # Number of tokens to store by chunks. This is roughly 64MB of tokens per chunk.\n)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Limit local cache space \u003ca id=\"limit-cache\" href=\"#limit-cache\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nLimit the amount of disk space used by temporary files, preventing storage issues.\n\nAdapt the local caching limit of the `StreamingDataset`. This is useful to make sure the downloaded data chunks are deleted when used and the disk usage stays low.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(..., max_cache_size=\"10GB\")\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Change cache directory path \u003ca id=\"cache-directory\" href=\"#cache-directory\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nSpecify the directory where cached files should be stored, ensuring efficient data retrieval and management. 
This is particularly useful for organizing your data storage and improving access times.\n\n```python\nfrom litdata import StreamingDataset\nfrom litdata.streaming.cache import Dir\n\ncache_dir = \"/path/to/your/cache\"\ndata_dir = \"s3://my-bucket/my_optimized_dataset\"\n\ndataset = StreamingDataset(input_dir=Dir(path=cache_dir, url=data_dir))\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Optimize loading on networked drives \u003ca id=\"networked-drives\" href=\"#networked-drives\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nOptimize data handling for computers on a local network to improve performance for on-site setups.\n\nOn-prem compute nodes can mount and use a network drive. A network drive is a shared storage device on a local area network. In order to reduce their network overload, the `StreamingDataset` supports `caching` the data chunks.\n\n```python\nfrom litdata import StreamingDataset\n\ndataset = StreamingDataset(input_dir=\"local:/data/shared-drive/some-data\")\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Optimize dataset in distributed environment \u003ca id=\"distributed-optimization\" href=\"#distributed-optimization\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nLightning can distribute large workloads across hundreds of machines in parallel. 
This can reduce the time to complete a data processing task from weeks to minutes by scaling to enough machines.\n\nTo apply the optimize operator across multiple machines, simply provide the num_nodes and machine arguments to it as follows:\n\n```python\nimport os\nfrom litdata import optimize, Machine\n\ndef compress(index):\n    return (index, index ** 2)\n\noptimize(\n    fn=compress,\n    inputs=list(range(100)),\n    num_workers=2,\n    output_dir=\"my_output\",\n    chunk_bytes=\"64MB\",\n    num_nodes=2,\n    machine=Machine.DATA_PREP, # You can select between dozens of optimized machines\n)\n```\n\nIf the `output_dir` is a local path, the optimized dataset will be present in: `/teamspace/jobs/{job_name}/nodes-0/my_output`. Otherwise, it will be stored in the specified `output_dir`.\n\nRead the optimized dataset:\n\n```python\nfrom litdata import StreamingDataset\n\noutput_dir = \"/teamspace/jobs/litdata-optimize-2024-07-08/nodes.0/my_output\"\n\ndataset = StreamingDataset(output_dir)\n\nprint(dataset[:])\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Encrypt, decrypt data at chunk/sample level \u003ca id=\"encrypt-decrypt\" href=\"#encrypt-decrypt\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nSecure data by applying encryption to individual samples or chunks, ensuring sensitive information is protected during storage.\n\nThis example shows how to use the `FernetEncryption` class for sample-level encryption with a data optimization function.\n\n```python\nfrom litdata import optimize\nfrom litdata.utilities.encryption import FernetEncryption\nimport numpy as np\nfrom PIL import Image\n\n# Initialize FernetEncryption with a password for sample-level encryption\nfernet = FernetEncryption(password=\"your_secure_password\", level=\"sample\")\ndata_dir = \"s3://my-bucket/optimized_data\"\n\ndef random_image(index):\n    \"\"\"Generate a random image for demonstration purposes.\"\"\"\n    fake_img = 
Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))  # upper bound is exclusive\n    return {\"image\": fake_img, \"class\": index}\n\n# Optimize data while applying encryption\noptimize(\n    fn=random_image,\n    inputs=list(range(5)),  # Example inputs: [0, 1, 2, 3, 4]\n    num_workers=1,\n    output_dir=data_dir,\n    chunk_bytes=\"64MB\",\n    encryption=fernet,\n)\n\n# Save the encryption key to a file for later use\nfernet.save(\"fernet.pem\")\n```\n\nLoad the encrypted data using the `StreamingDataset` class as follows:\n\n```python\nfrom litdata import StreamingDataset\nfrom litdata.utilities.encryption import FernetEncryption\n\n# Same location that was used as output_dir when optimizing\ndata_dir = \"s3://my-bucket/optimized_data\"\n\n# Load the encryption key\nfernet = FernetEncryption(password=\"your_secure_password\", level=\"sample\")\nfernet.load(\"fernet.pem\")\n\n# Create a streaming dataset for reading the encrypted samples\nds = StreamingDataset(input_dir=data_dir, encryption=fernet)\n```\n\nTo implement your own encryption method, subclass the `Encryption` class and define the necessary methods:\n\n```python\nfrom litdata.utilities.encryption import Encryption\n\nclass CustomEncryption(Encryption):\n    def encrypt(self, data):\n        # Implement your custom encryption logic here\n        return data\n\n    def decrypt(self, data):\n        # Implement your custom decryption logic here\n        return data\n```\n\nThis allows the data to remain secure while maintaining flexibility in the encryption method.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Debug \u0026 Profile LitData with logs \u0026 Litracer \u003ca id=\"debug-profile\" href=\"#debug-profile\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\nLitData comes with built-in logging and profiling capabilities to help you debug and profile your data streaming workloads.\n\n\u003cimg width=\"1439\" alt=\"431247797-0e955e71-2f9a-4aad-b7c1-a8218fed2e2e\" src=\"https://github.com/user-attachments/assets/4e40676c-ba0b-49af-acac-975977173669\" /\u003e\n\n- e.g., with LitData 
Streaming\n\n```python\nimport litdata as ld\nfrom litdata.debugger import enable_tracer\n\n# WARNING: Remove existing trace `litdata_debug.log` file if it exists before re-tracing\nenable_tracer()\n\nif __name__ == \"__main__\":\n    dataset = ld.StreamingDataset(\"s3://my-bucket/my-data\", shuffle=True)\n    dataloader = ld.StreamingDataLoader(dataset, batch_size=64)\n\n    for batch in dataloader:\n        print(batch)  # Replace with your data processing logic\n```\n\n1. Generate Debug Log:\n\n    - Run your Python program and it'll create a log file containing detailed debug information.\n\n    ```bash\n      python main.py\n    ```\n\n2. Install [Litracer](https://github.com/deependujha/litracer/):\n\n    - Option 1: Using Go (recommended)\n        - Install Go on your system.\n        - Run the following command to install Litracer:\n\n        ```bash\n          go install github.com/deependujha/litracer@latest\n        ```\n\n    - Option 2: Download Binary\n        - Visit the [LitRacer GitHub Releases](https://github.com/deependujha/litracer/releases) page.\n        - Download the appropriate binary for your operating system and follow the installation instructions.\n\n3. Convert Debug Log to trace JSON:\n\n    - Use litracer to convert the generated log file into a trace JSON file. This command uses 100 workers for conversion:\n\n    ```bash\n      litracer litdata_debug.log -o litdata_trace.json -w 100\n    ```\n\n4. Visualize the trace:\n\n    - Use either `chrome://tracing` in the Chrome browser or `ui.perfetto.dev` to view the `litdata_trace.json` file for in-depth performance insights. 
You can also use `SQL queries` to analyze the logs.\n    - `Perfetto` is recommended over `chrome://tracing` for visualization and analysis.\n\n- Key Points:\n\n    - For very large trace.json files (`\u003e 2GB`), refer to the [Perfetto documentation](https://perfetto.dev/docs/visualization/large-traces) for using native accelerators.\n    - If you are trying to connect Perfetto to the RPC server, use Chrome rather than Brave: Perfetto in Brave has been observed not to autodetect the RPC server.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Lightning AI Data Connections - Direct download and upload \u003ca id=\"lightning-connections\" href=\"#lightning-connections\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\n\u0026nbsp;\n\n[Lightning Studios](https://lightning.ai/) have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories see a significant performance increase because uploads and downloads go directly to and from the bucket that backs the folder.\n\nFor example, output artifacts from this code will be uploaded directly to the `my-data-1` S3 bucket.\n\n```python\nfrom litdata import optimize\n\ndef should_keep(data):\n    if data % 2 == 0:\n        yield data\n\nif __name__ == \"__main__\":\n    optimize(\n        fn=should_keep,\n        inputs=list(range(1000)),\n        output_dir=\"/teamspace/s3_connections/my-data-1/output\",\n        chunk_bytes=\"64MB\",\n        num_workers=1\n    )\n```\n\n\nSimilarly, data will be downloaded directly from the `my-bucket-1` S3 bucket in this example code.\n\n```python\nfrom litdata import StreamingRawDataset\n\nif __name__ == \"__main__\":\n    data_dir = \"/teamspace/s3_connections/my-bucket-1/data\"\n\n    raw_dataset = StreamingRawDataset(data_dir)\n\n    data = list(raw_dataset)\n    print(data)\n```\n\nReferences to any of the following directories will work similarly:\n1. 
`/teamspace/lightning_storage/...`\n2. `/teamspace/s3_connections/...`\n3. `/teamspace/gcs_connections/...`\n4. `/teamspace/s3_folders/...`\n5. `/teamspace/gcs_folders/...`\n\u003c/details\u003e\n\n\u0026nbsp;\n\n\n## Features for transforming datasets\n\n\u003cdetails\u003e\n  \u003csummary\u003e ✅ Parallelize data transformations (map) \u003ca id=\"map\" href=\"#map\"\u003e🔗\u003c/a\u003e \u003c/summary\u003e\n\u0026nbsp;\n\nApply the same change to different parts of the dataset at once to save time and effort.\n\nThe `map` operator can be used to apply a function over a list of inputs.\n\nHere is an example where the `map` operator is used to apply a `resize_image` function over a folder of large images.\n\n```python\nimport os\n\nfrom litdata import map\nfrom PIL import Image\n\n# Note: Inputs could also refer to files on s3 directly.\ninput_dir = \"my_large_images\"\ninputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]\n\n# resize_image takes one input (image_path) and the output directory.\n# Files written to output_dir are persisted.\ndef resize_image(image_path, output_dir):\n  output_image_path = os.path.join(output_dir, os.path.basename(image_path))\n  Image.open(image_path).resize((224, 224)).save(output_image_path)\n\nmap(\n    fn=resize_image,\n    inputs=inputs,\n    output_dir=\"s3://my-bucket/my_resized_images\",\n)\n```\n\n\u003c/details\u003e\n\n\u0026nbsp;\n\n----\n\n# Benchmarks\nThis section benchmarks how quickly LitData optimizes a dataset and how fast the resulting data streams ([Reproduce the benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries)).\n\n## Streaming speed\n### LitData Chunks\nData optimized and streamed with LitData achieves a 20x speedup over non-optimized data and a 2x speedup over other streaming solutions.\n\nSpeed to stream Imagenet 1.2M from AWS S3:\n\n| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch 
(torch16) | Images / sec 2nd Epoch (torch16) |\n|---|---|---|---|---|\n| LitData | **5839** | **6692** | **6282** | **7221** |\n| Web Dataset | 3134 | 3924 | 3343 | 4424 |\n| Mosaic ML | 2898 | 5099 | 2809 | 5158 |\n\n\u003cdetails\u003e\n  \u003csummary\u003e Benchmark details\u003c/summary\u003e\n\u0026nbsp;\n\n- The [Imagenet-1.2M dataset](https://www.image-net.org/) contains `1,281,167 images`.\n- To align with other benchmarks, we measured the streaming speed (`images per second`) of several frameworks loading from [AWS S3](https://aws.amazon.com/s3/).\n\n\u003c/details\u003e\n\u0026nbsp;\n\nSpeed to stream Imagenet 1.2M from other cloud storage providers:\n\n| Storage Provider | Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) |\n|---|---|---|---|\n| Cloudflare R2 | LitData | **5335** | **5630** |\n\nSpeed to stream Imagenet 1.2M from local disk with ffcv vs LitData:\n\n| Framework | Dataset Mode | Dataset Size @ 256px | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) |\n|---|---|---|---|---|\n| LitData | PIL RAW | 168 GB | 6647 | 6398 |\n| LitData | JPEG 90% | 12 GB | 6553 | 6537 |\n| ffcv (os_cache=True) | RAW | 170 GB | 7263 | 6698 |\n| ffcv (os_cache=False) | RAW | 170 GB | 7556 | 8169 |\n| ffcv (os_cache=True) | JPEG 90% | 20 GB | 7653 | 8051 |\n| ffcv (os_cache=False) | JPEG 90% | 20 GB | 8149 | 8607 |\n\n### Raw Dataset\n\nSpeed to stream raw Imagenet 1.2M from different cloud storage providers:\n\n\n| Storage | Images / s (without transform) | Images / s (with transform) |\n|---------|-------------------|----------------|\n| AWS S3  | ~6400 +/- 100     | ~3200 +/- 100  |\n| Google Cloud Storage | ~5650 +/- 100     | ~3100 +/- 100  |\n\n\u003e **Note:**\n\u003e Use `StreamingRawDataset` if you want to stream your data as-is. 
Use `StreamingDataset` if you want the fastest streaming and are okay with optimizing your data first.\n\n\u0026nbsp;\n\n## Time to optimize data\nLitData optimizes the Imagenet dataset 3-5x faster than other frameworks:\n\nTime to optimize 1.2 million ImageNet images (lower is better):\n\n| Framework | Train Conversion Time | Val Conversion Time | Dataset Size | # Files |\n|---|---|---|---|---|\n| LitData  |  **10:05 min** | **00:30 min** | **143.1 GB**  | 2.339  |\n| Web Dataset  | 32:36 min | 01:22 min | 147.8 GB | 1.144 |\n| Mosaic ML  | 49:49 min | 01:04 min | **143.1 GB** | 2.298 |\n\n\u0026nbsp;\n\n----\n\n# Parallelize transforms and data optimization on cloud machines\n\u003cdiv align=\"center\"\u003e\n\u003cimg alt=\"Lightning\" src=\"https://pl-flash-data.s3.amazonaws.com/data-prep.jpg\" width=\"700px\"\u003e\n\u003c/div\u003e\n\n## Parallelize data transforms\n\nTransformations with LitData are linearly parallelizable across machines.\n\nFor example, let's say that it takes 56 hours to embed a dataset on a single A10G machine. With LitData,\nthis can be sped up by adding more machines in parallel:\n\n| Number of machines | Hours |\n|-----------------|--------------|\n| 1               | 56           |\n| 2               | 28           |\n| 4               | 14           |\n| ...               | ...            
|\n| 64              | 0.875        |\n\nTo scale the number of machines, run the processing script on [Lightning Studios](https://lightning.ai/):\n\n```python\nfrom litdata import map, Machine\n\nmap(\n  ...\n  num_nodes=32,\n  machine=Machine.DATA_PREP, # Select between dozens of optimized machines\n)\n```\n\n## Parallelize data optimization\nTo scale the number of machines for data optimization, use [Lightning Studios](https://lightning.ai/):\n\n```python\nfrom litdata import optimize, Machine\n\noptimize(\n  ...\n  num_nodes=32,\n  machine=Machine.DATA_PREP, # Select between dozens of optimized machines\n)\n```\n\n\u0026nbsp;\n\nExample: [Process the LAION 400 million image dataset in 2 hours on 32 machines, each with 32 CPUs](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset).\n\n\u0026nbsp;\n\n----\n\n# Start from a template\nBelow are templates for real-world applications of LitData at scale.\n\n## Templates: Transform datasets\n\n| Studio | Data type | Time (minutes) | Machines | Dataset |\n| ------------------------------------ | ----------------- | ----------------- | -------------- | -------------- |\n| [Download LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) | Image \u0026 Text | 120 | 32 |[LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) |\n| [Tokenize 2M Swedish Wikipedia Articles](https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles) | Text | 7 | 4 | [Swedish Wikipedia](https://huggingface.co/datasets/wikipedia) |\n| [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) | Text | 15 | 3 | [English Wikipedia](https://huggingface.co/datasets/wikipedia) |\n\n## Templates: Optimize + stream data\n\n| Studio | Data type | Time (minutes) | Machines | Dataset |\n| -------------------------------- | ----------------- | ----------------- | -------------- | 
-------------- |\n| [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) | Image \u0026 Label | 10 | 1 | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) |\n| [Optimize GeoSpatial data for model training](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) | Image \u0026 Mask | 120 | 32 | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) |\n| [Optimize TinyLlama 1T dataset for training](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | Text | 240 | 32 | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) \u0026 [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) |\n| [Optimize parquet files for model training](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) | Parquet Files | 12 | 16 | Randomly Generated data |\n\n\u0026nbsp;\n\n----\n\n# Community\nLitData is a community project that welcomes contributions. Let's make the world's most advanced AI data processing framework.\n\n💬 [Get help on Discord](https://discord.com/invite/XncpTy7DSt)    \n📋 [License: Apache 2.0](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)\n\n\n----\n\n## Citation\n\n```\n@misc{litdata2023,\n  author       = {Thomas Chaton and Lightning AI},\n  title        = {LitData: Transform datasets at scale. Optimize datasets for fast AI model training.},\n  year         = {2023},\n  howpublished = {\\url{https://github.com/Lightning-AI/litdata}},\n  note         = {Accessed: 2025-04-09}\n}\n```\n\n----\n\n## Papers with LitData\n\n* [Towards Interpretable Protein Structure Prediction with Sparse Autoencoders](https://arxiv.org/pdf/2503.08764) | [GitHub](https://github.com/johnyang101/reticular-sae) | (Nithin Parsan, David J. Yang and John J. 
Yang)\n\n----\n\n# Governance\n\n## Maintainers\n\n* Thomas Chaton ([tchaton](https://github.com/tchaton))\n* Bhimraj Yadav ([bhimrazy](https://github.com/bhimrazy))\n* Deependu ([deependujha](https://github.com/deependujha))\n\n\n## Emeritus Maintainers\n\n* Luca Antiga ([lantiga](https://github.com/lantiga))\n* Justus Schock ([justusschock](https://github.com/justusschock))\n* Jirka Borda ([Borda](https://github.com/Borda))\n\n\u003cdetails\u003e\n  \u003csummary\u003eAlumni\u003c/summary\u003e\n\n* Adrian Wälchli ([awaelchli](https://github.com/awaelchli))\n\n\u003c/details\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flightning-ai%2Flitdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flightning-ai%2Flitdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flightning-ai%2Flitdata/lists"}