https://github.com/toshas/cluster_tools
- Host: GitHub
- URL: https://github.com/toshas/cluster_tools
- Owner: toshas
- License: BSD-3-Clause
- Created: 2021-02-11T12:51:09.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-02-11T13:44:42.000Z (over 4 years ago)
- Last Synced: 2025-04-19T23:53:04.248Z (about 2 months ago)
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 11
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Cluster tools
dotfiles for cluster environments
# Why
Many distributed environments for machine learning impose constraints on the data processing workflow.
# Overcoming file (inode) limit
One such constraint is a quota on the total number of files (inodes) allowed on the distributed storage medium. This becomes a problem when working with modern datasets, which often consist of millions of small files.
## Workflow
1. Run `pack_dataset_to_hdf5.py` to convert a directory of dataset files into a single HDF5 blob (a sketch of the layout this produces follows the list);
2. Modify the dataset accessor (e.g., a `torch.utils.data.Dataset` subclass) so that `open` calls read from the blob instead of individual files; a minimal wrapper is sketched below. The following snippet reads an image file out of an HDF5 blob:

```python
import io

import h5py
from PIL import Image


def read_binary(rel_path, hdf5_path):
    # Fetch the raw bytes stored under the file's relative path, then
    # close the handle in all cases via the finally block
    hfile = None
    try:
        hfile = h5py.File(hdf5_path, 'r')
        return hfile[rel_path]['raw'][0]
    finally:
        if hfile is not None:
            hfile.close()


def read_image(rel_path, hdf5_path):
    # Decode the stored bytes into an RGB PIL image
    data = read_binary(rel_path, hdf5_path)
    return Image.open(io.BytesIO(data)).convert('RGB')
```
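Step 2 then amounts to wrapping these readers in the dataset accessor. Below is a minimal sketch of such a `torch.utils.data.Dataset`, using `read_image` from the snippet above; the class name and the `rel_paths` index list are illustrative assumptions, not part of this repository:

```python
import torch.utils.data


class Hdf5ImageDataset(torch.utils.data.Dataset):
    """Serves images from a single HDF5 blob instead of many small files."""

    def __init__(self, hdf5_path, rel_paths):
        self.hdf5_path = hdf5_path
        self.rel_paths = rel_paths  # relative paths used as keys in the blob

    def __len__(self):
        return len(self.rel_paths)

    def __getitem__(self, idx):
        # Replaces the usual per-file `Image.open(path)` call
        return read_image(self.rel_paths[idx], self.hdf5_path)
```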
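For reference, the readers above assume a blob layout with one HDF5 group per original file, keyed by its relative path and holding a single-element `raw` dataset of bytes. A minimal packer producing that layout could look as follows; this is only a sketch of what `pack_dataset_to_hdf5.py` might do, not its actual implementation, and the variable-length `uint8` dtype is an assumption:

```python
import os

import h5py
import numpy as np


def pack_dataset_to_hdf5(src_dir, hdf5_path):
    # Walk the source directory and store each file's bytes under its
    # relative path, the same key the readers use for lookup
    vlen_bytes = h5py.special_dtype(vlen=np.dtype('uint8'))
    with h5py.File(hdf5_path, 'w') as hfile:
        for root, _, files in os.walk(src_dir):
            for name in files:
                abs_path = os.path.join(root, name)
                rel_path = os.path.relpath(abs_path, src_dir)
                with open(abs_path, 'rb') as f:
                    payload = np.frombuffer(f.read(), dtype=np.uint8)
                group = hfile.create_group(rel_path)
                raw = group.create_dataset('raw', shape=(1,), dtype=vlen_bytes)
                raw[0] = payload
```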
## Alternatives
Check the following alternatives before using the snippets from this repository:
- TFRecord
- Petastorm
- tar