https://github.com/saforem2/ezpz
Train across all your devices, ezpz 🍋
https://github.com/saforem2/ezpz
deepspeed distributed-training launcher machine-learning mpi mpi4py parallelism python pytorch rich slurm
Last synced: about 1 month ago
JSON representation
Train across all your devices, ezpz 🍋
- Host: GitHub
- URL: https://github.com/saforem2/ezpz
- Owner: saforem2
- License: mit
- Created: 2023-09-12T20:23:20.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-09-09T20:37:15.000Z (9 months ago)
- Last Synced: 2025-09-15T10:00:08.083Z (9 months ago)
- Topics: deepspeed, distributed-training, launcher, machine-learning, mpi, mpi4py, parallelism, python, pytorch, rich, slurm
- Language: Python
- Homepage: https://saforem2.github.io/ezpz/
- Size: 6.06 MB
- Stars: 24
- Watchers: 1
- Forks: 7
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# 🍋 ezpz
> Write once, run anywhere.
`ezpz` makes distributed PyTorch launches portable across any supported
hardware {NVIDIA, AMD, Intel, MPS, CPU} with **zero code changes**.
## Features
- **Multi-hardware** — automatic device detection and backend selection
(CUDA/NCCL, XPU/CCL, MPS, CPU/Gloo)
- **Zero code changes** — same script runs on a laptop, a single GPU, or a
thousand-node supercomputer
- **HPC integration** — native PBS and SLURM support with automatic hostfile
discovery and rank assignment
- **Metric tracking** — built-in `History` class for recording, plotting, and
saving training metrics
- **CLI tools** — `ezpz launch`, `ezpz test`, `ezpz doctor` for launching
jobs, smoke-testing, and diagnostics
## Quick Install
```bash
uv pip install git+https://github.com/saforem2/ezpz
```
## Quick Start
```python
import torch
import ezpz
rank = ezpz.setup_torch() # auto-detects device + backend
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 10).to(device)
model = ezpz.wrap_model(model) # FSDP (default)
```
```bash
# Same command everywhere -- Mac laptop, NVIDIA cluster, Intel Aurora:
ezpz launch python3 train.py
```
## CLI
```bash
ezpz launch python3 train.py # launch distributed training
ezpz test # smoke-test your setup
ezpz doctor # diagnose environment issues
```
## Documentation
Full documentation is available at [**ezpz.cool**](https://ezpz.cool).