https://github.com/raideno/cached-dataset

A PyTorch dataset wrapper that efficiently caches samples to disk for faster loading during subsequent epochs, ideal for handling large datasets that don’t fit into memory.

Host: GitHub
URL: https://github.com/raideno/cached-dataset
Owner: raideno
Created: 2025-03-02T15:31:58.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-11T16:20:32.000Z (over 1 year ago)
Last Synced: 2025-03-11T17:35:11.075Z (over 1 year ago)
Language: Python
Homepage:
Size: 31.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

> 🚨 **WARNING: This package is still under development and NOT ready for production use!** 🚨

# Cached Dataset

When a dataset applies computation-hungry transformations, you can wrap it with cached-dataset to cache the transformed samples, either on disk or in memory, and so avoid recomputing the transformations at every epoch of training.

Depending on the context, this can save a lot of time, at the cost of the storage (disk space or memory) needed to hold the cached samples.

The package supports multiprocessing, so it can apply and cache your transformations in parallel.
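
The caching idea can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the class names `ToyDiskCache` and `SquaresDataset`, the one-pickle-file-per-sample layout, and the file naming are all assumptions made for this example. It relies only on the map-style dataset protocol (`__len__` and `__getitem__`), so it runs without PyTorch.

```python
import pickle
import tempfile
from pathlib import Path


class ToyDiskCache:
    """Hypothetical wrapper: stores each sample of a map-style dataset as a pickle file."""

    def __init__(self, dataset, base_path):
        self.dataset = dataset
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        path = self.base_path / f"sample-{index}.pkl"
        if path.exists():
            # Cache hit: read the stored sample and skip the transformation.
            with path.open("rb") as file:
                return pickle.load(file)
        # Cache miss: compute the sample once and persist it for later epochs.
        sample = self.dataset[index]
        with path.open("wb") as file:
            pickle.dump(sample, file)
        return sample


class SquaresDataset:
    """Stand-in for a dataset with an expensive transformation; counts computations."""

    def __init__(self, size):
        self.size = size
        self.computations = 0

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        self.computations += 1
        return index * index


source = SquaresDataset(4)
with tempfile.TemporaryDirectory() as directory:
    cached = ToyDiskCache(source, directory)
    first_epoch = [cached[i] for i in range(len(cached))]   # computes and stores
    second_epoch = [cached[i] for i in range(len(cached))]  # served from disk
```

After both passes, `source.computations` stays at 4: the second pass never touches the underlying dataset, which is where the time saving comes from.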

## Installation

```
pip install git+https://github.com/raideno/cached-dataset.git
```

## Usage

```python
from cached_dataset import DiskCachedDataset

# NOTE: your usual torch dataset with transforms for which you want to cache the transformed version
dataset = ...

# NOTE: the directory where you want to cache your dataset.
CACHING_DIRECTORY = "./cached-dataset"

cached_dataset = DiskCachedDataset.load_dataset_or_cache_it(
    dataset=dataset,
    base_path=CACHING_DIRECTORY,
    verbose=True,
    num_workers=0
)

# enumerate provides the index `i` used in the message below
for i, sample in enumerate(cached_dataset):
    print(f"[sample-{i}]: {sample}")
```

Depending on your CPU / GPU power, you can set the `num_workers` parameter to a value greater than 0 to speed up the caching process.
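
To illustrate why more workers can help, the sketch below fills a cache directory using a process pool from the Python standard library. It is not the package's implementation: `populate_cache`, `cache_one`, `expensive_transform`, `load_sample`, and the file layout are hypothetical names chosen for this example, and `num_workers=0` is assumed to mean "run in the current process".

```python
import pickle
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def expensive_transform(value):
    # Stand-in for a costly transformation (hypothetical).
    return value * value


def cache_one(task):
    # Compute one sample and write it to its own file; return its index.
    index, value, base_path = task
    with (Path(base_path) / f"sample-{index}.pkl").open("wb") as file:
        pickle.dump(expensive_transform(value), file)
    return index


def populate_cache(values, base_path, num_workers=0):
    """Cache every transformed value; returns the cached indices in order."""
    Path(base_path).mkdir(parents=True, exist_ok=True)
    tasks = [(index, value, str(base_path)) for index, value in enumerate(values)]
    if num_workers == 0:
        # Assumed convention: no worker processes, run sequentially.
        return [cache_one(task) for task in tasks]
    # Each worker process transforms and writes a share of the samples.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(cache_one, tasks))


def load_sample(base_path, index):
    # Read a previously cached sample back from disk.
    with (Path(base_path) / f"sample-{index}.pkl").open("rb") as file:
        return pickle.load(file)
```

Because each sample is written to a separate file, workers never write to the same path. When calling `populate_cache` with `num_workers` greater than 0 from a script, place the call under an `if __name__ == "__main__":` guard, as process pools require on platforms that spawn worker processes.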

**Note:** for now, the only available caching location is disk; in-memory caching isn't supported yet.
