https://github.com/raideno/cached-dataset
A PyTorch dataset wrapper that efficiently caches samples to disk for faster loading during subsequent epochs, ideal for handling large datasets that don’t fit into memory.
- Host: GitHub
- URL: https://github.com/raideno/cached-dataset
- Owner: raideno
- Created: 2025-03-02T15:31:58.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T16:20:32.000Z (over 1 year ago)
- Last Synced: 2025-03-11T17:35:11.075Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 31.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
> 🚨 **WARNING: This package is still under development and NOT ready for production use!** 🚨
# Cached Dataset
The idea: when your dataset applies computation-hungry transformations, you can wrap it with cached-dataset to cache the transformed samples (currently on disk; in-memory caching is not supported yet) and thus avoid recomputing the transformations during each training epoch.
Depending on the context this can save a lot of time, at the cost of extra storage for the cached samples.
The package supports multiprocessing, so it can apply and cache your transformations in parallel.
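To make the caching idea concrete, here is a minimal sketch of a "compute once, then read from disk" wrapper. It is not the package's implementation: `SlowSquares`, `NaiveDiskCache`, the `sample-{index}.pkl` file layout, and the use of `pickle` are all assumptions chosen to keep the example dependency-free.

```python
import pickle
import tempfile
from pathlib import Path


class SlowSquares:
    """Hypothetical stand-in for a dataset with an expensive transform."""

    def __init__(self, n):
        self.n = n
        self.calls = 0  # counts how often the "transform" actually runs

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        self.calls += 1
        return index * index  # pretend this is costly


class NaiveDiskCache:
    """Illustrative wrapper: compute each sample once, then read it from disk."""

    def __init__(self, dataset, base_path):
        self.dataset = dataset
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)  # also lets plain `for` loops terminate
        path = self.base_path / f"sample-{index}.pkl"
        if path.exists():
            # Cache hit: skip the wrapped dataset entirely.
            with path.open("rb") as file:
                return pickle.load(file)
        # Cache miss: compute the sample and store it for later epochs.
        sample = self.dataset[index]
        with path.open("wb") as file:
            pickle.dump(sample, file)
        return sample


source = SlowSquares(5)
with tempfile.TemporaryDirectory() as directory:
    cached = NaiveDiskCache(source, directory)
    first_epoch = [cached[i] for i in range(len(cached))]
    second_epoch = [cached[i] for i in range(len(cached))]
```

After both passes, `source.calls` stays at 5: the second "epoch" is served entirely from the files written during the first one.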
## Installation
```
pip install git+https://github.com/raideno/cached-dataset.git
```
## Usage
```python
from cached_dataset import DiskCachedDataset

# NOTE: your usual torch dataset, with the transforms whose output you want to cache.
dataset = ...

# NOTE: the directory where you want to cache your dataset.
CACHING_DIRECTORY = "./cached-dataset"

cached_dataset = DiskCachedDataset.load_dataset_or_cache_it(
    dataset=dataset,
    base_path=CACHING_DIRECTORY,
    verbose=True,
    num_workers=0
)

# enumerate() provides the index `i` used in the message below.
for i, sample in enumerate(cached_dataset):
    print(f"[sample-{i}]: {sample}")
```
Depending on your CPU / GPU power, you may want to set the `num_workers` parameter to a value greater than 0 in order to speed up the caching process.
**Note:** for now the only available caching location is on disk; in-memory caching isn't supported yet.
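
As a rough illustration of why more workers can speed up caching, the sketch below fills a cache directory either in the main process or across several processes. It is only an assumption about how such a step could look, not the package's code: `Squares`, `cache_one`, `populate_cache`, and the file naming are hypothetical, and the "0 means main process" behaviour simply mirrors the convention of PyTorch's `DataLoader`.

```python
import pickle
import tempfile
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from pathlib import Path


class Squares:
    """Hypothetical picklable dataset standing in for a real one."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        return index * index


def cache_one(dataset, base_path, index):
    """Compute one sample and write it to its own file; returns the index."""
    path = Path(base_path) / f"sample-{index}.pkl"
    with path.open("wb") as file:
        pickle.dump(dataset[index], file)
    return index


def populate_cache(dataset, base_path, num_workers=0):
    """Fill the cache directory, optionally spreading work over processes."""
    Path(base_path).mkdir(parents=True, exist_ok=True)
    job = partial(cache_one, dataset, base_path)
    indices = range(len(dataset))
    if num_workers == 0:
        # Assumed convention (as in torch's DataLoader): 0 = main process only.
        return [job(i) for i in indices]
    # Each worker process computes and writes a share of the samples.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(job, indices))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        done = populate_cache(Squares(8), directory, num_workers=0)
        print(f"cached {len(done)} samples")
```

With `num_workers` greater than 0, the dataset and `cache_one` must be picklable and defined at module level, and on platforms that spawn worker processes the call has to sit under the `if __name__ == "__main__":` guard as shown.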