https://github.com/jnordberg/feistpy
Dataset sampling for machine learning using Feistel cipher
- Host: GitHub
- URL: https://github.com/jnordberg/feistpy
- Owner: jnordberg
- Created: 2023-10-30T21:33:39.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-11-01T10:36:33.000Z (about 1 year ago)
- Last Synced: 2024-10-19T04:04:36.116Z (25 days ago)
- Language: Python
- Size: 6.84 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# feistpy
Dataset sampling for machine learning using Feistel networks.
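As a rough illustration of the idea (not feistpy's or gfc's actual implementation), a Feistel network combined with cycle-walking can map every index in `[0, n)` to a unique shuffled index without ever storing the permutation. The function names, the BLAKE2b round function, and the round count below are assumptions chosen for this sketch.

```python
import hashlib


def _round(value: int, key: int, round_no: int, mask: int) -> int:
    """Hypothetical round function: any deterministic keyed function works."""
    data = f"{key}:{round_no}:{value}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & mask


def feistel_permute(index: int, n: int, key: int, rounds: int = 4) -> int:
    """Map index in [0, n) to a unique position in [0, n)."""
    if not 0 <= index < n:
        raise ValueError("index must be in [0, n)")
    # Split the value into two halves of half_bits each, so that the
    # Feistel domain [0, 2**(2 * half_bits)) covers [0, n).
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    x = index
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            # One Feistel round: invertible regardless of what _round returns.
            left, right = right, left ^ _round(right, key, r, mask)
        x = (left << half_bits) | right
        # Cycle-walking: re-apply the permutation until we land inside [0, n).
        if x < n:
            return x


# Shuffled order of ten items, computed one index at a time.
print([feistel_permute(i, 10, key=42) for i in range(10)])
```

Because each lookup needs only the index, the size, and the key, memory use stays constant no matter how large `n` is.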
## Installation
```bash
pip install feistpy
```

## Usage
Standalone:
```python
from feistpy import FeistelSampler

sampler = FeistelSampler(
    1_000_000_000,  # total number of samples
    rank=0,  # rank of this process
    num_replicas=32,  # aka world size
)

# iterate over indices
for i in sampler:
    print(i)
```
With PyTorch:
```python
from feistpy import FeistelSampler
from torch.utils.data import DataLoader
import torch.distributed as dist

dataset = ...  # some dataset

sampler = FeistelSampler(
    dataset,
    rank=dist.get_rank(),
    num_replicas=dist.get_world_size(),
)

loader = DataLoader(
    dataset,
    batch_size=8192,
    num_workers=8,
    sampler=sampler,
)

for epoch in range(100):
    # reseed the shuffle so each epoch sees a different order
    sampler.set_epoch(epoch)
    for batch in loader:
        # do something with batch
        ...
```
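To make the roles of `rank`, `num_replicas`, and `set_epoch` more concrete, here is a minimal sketch of the conventional approach used by PyTorch's `DistributedSampler`: every rank builds the same epoch-seeded order and takes a strided slice of it. How feistpy assigns positions to ranks is not documented here, so the stride and the `seed + epoch` seeding are assumptions, and `shard_for_rank` is a hypothetical helper. Unlike a Feistel-based sampler, this version materialises the whole permutation in memory.

```python
import random
from typing import List


def shard_for_rank(
    n: int, rank: int, num_replicas: int, epoch: int, seed: int = 0
) -> List[int]:
    """Indices visited by one rank in one epoch (illustrative only)."""
    order = list(range(n))
    # Every rank seeds with seed + epoch, so all ranks build the same order.
    random.Random(seed + epoch).shuffle(order)
    # Strided slice: rank r takes positions r, r + num_replicas, ...
    return order[rank::num_replicas]


# Four ranks together cover all ten items exactly once.
shards = [shard_for_rank(10, rank, 4, epoch=0) for rank in range(4)]
print(shards)
```

Calling the helper again with the same arguments returns the same shard, which is what makes the shuffle reproducible across processes.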
## Benefits
- Small memory footprint and fast sampling
- Deterministic shuffling across ranks and epochs
- Supports up to 2^64 items
- Advanced sampling strategies (see [sampler.py](./src/feistpy/sampler.py))

## Acknowledgements
This library uses the excellent [gfc](https://github.com/maxmouchet/gfc) library for the
fast generation of Feistel permutations.

## License
MIT