{"id":26367537,"url":"https://github.com/raideno/cached-dataset","last_synced_at":"2025-03-16T21:17:37.722Z","repository":{"id":280463659,"uuid":"941561023","full_name":"raideno/cached-dataset","owner":"raideno","description":"A PyTorch dataset wrapper that efficiently caches samples to disk for faster loading during subsequent epochs, ideal for handling large datasets that don’t fit into memory.","archived":false,"fork":false,"pushed_at":"2025-03-11T16:20:32.000Z","size":32,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T17:35:11.075Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raideno.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-02T15:31:58.000Z","updated_at":"2025-03-11T16:20:36.000Z","dependencies_parsed_at":"2025-03-03T15:54:38.087Z","dependency_job_id":null,"html_url":"https://github.com/raideno/cached-dataset","commit_stats":null,"previous_names":["raideno/cached-dataset"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raideno%2Fcached-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raideno%2Fcached-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raideno%2Fcached-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raideno%2Fcached-dataset/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raideno","download_url":"https://codeload.github.com/raideno/cached-dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243933449,"owners_count":20370988,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-16T21:17:37.209Z","updated_at":"2025-03-16T21:17:37.714Z","avatar_url":"https://github.com/raideno.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003e 🚨 **WARNING: This package is still under development and NOT ready for production use!** 🚨\n\n# Cached Dataset\n\nWhen a dataset has computation-hungry transformations, you can wrap it with cached-dataset to cache its transformed version, either on disk or in memory, and thus avoid recomputing the transformations during each epoch of training.\n\nDepending on the context, this can save a lot of time, but at the cost of memory consumption.\n\nThe package supports multiprocessing, so it can apply and cache your transformations in parallel.\n\n## Installation\n\n```\npip install git+https://github.com/raideno/cached-dataset.git\n```\n\n## Usage\n\n```python\nfrom cached_dataset import DiskCachedDataset\n\n# NOTE: your usual torch dataset, with the transforms whose output you want to cache\ndataset = ...\n\n# NOTE: the directory where you want to cache your dataset.\nCACHING_DIRECTORY = \"./cached-dataset\"\n\ncached_dataset = 
DiskCachedDataset.load_dataset_or_cache_it(\n    dataset=dataset,\n    base_path=CACHING_DIRECTORY,\n    verbose=True,\n    num_workers=0\n)\n\n# Iterate with an index so that `i` is defined in the message below.\nfor i, sample in enumerate(cached_dataset):\n    print(f\"[sample-{i}]: {sample}\")\n```\n\nDepending on your CPU / GPU power, you can set the `num_workers` parameter to a value other than 0 to speed up the caching process.\n\n**Note:** for now, the only available caching location is on disk; in-memory caching isn't supported yet.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraideno%2Fcached-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraideno%2Fcached-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraideno%2Fcached-dataset/lists"}