https://github.com/saforem2/ezpz
Write once, run anywhere; ezpz 🍋
ai-tools deepspeed distributed-training fsdp launcher machine-learning mpi mpi4py parallelism python pytorch rich slurm torch training
- Host: GitHub
- URL: https://github.com/saforem2/ezpz
- Owner: saforem2
- License: mit
- Created: 2023-09-12T20:23:20.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2026-06-24T12:19:39.000Z (3 days ago)
- Last Synced: 2026-06-24T14:11:47.363Z (3 days ago)
- Topics: ai-tools, deepspeed, distributed-training, fsdp, launcher, machine-learning, mpi, mpi4py, parallelism, python, pytorch, rich, slurm, torch, training
- Language: Python
- Homepage: http://ezpz.cool/
- Size: 15.3 MB
- Stars: 34
- Watchers: 1
- Forks: 8
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Security: SECURITY.md
- Agents: AGENTS.md
README
# 🍋 ezpz
> Write once, run anywhere.
`ezpz` makes distributed PyTorch launches portable across any supported
hardware {NVIDIA, AMD, Intel, MPS, CPU} with **zero code changes**.
## Features
- **Multi-hardware** — automatic device detection and backend selection
(CUDA/NCCL, XPU/CCL, MPS, CPU/Gloo)
- **Zero code changes** — same script runs on a laptop, a single GPU, or a
thousand-node supercomputer
- **HPC integration** — native PBS and SLURM support with automatic hostfile
discovery and rank assignment
- **Metric tracking** — built-in `History` class for recording, plotting, and
saving training metrics
- **CLI tools** — `ezpz launch`, `ezpz test`, `ezpz doctor` for launching
jobs, smoke-testing, and diagnostics
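
To make the device and backend pairing in the first feature concrete, here is a minimal sketch of how such detection could work. It is not `ezpz`'s actual implementation: the helper names `detect_device_type` and `pick_backend` are hypothetical, the XPU backend name depends on the torch/oneCCL build, and pairing MPS with `gloo` is an assumption, since the list above names no collective backend for MPS.

```python
# Illustrative sketch only; ezpz's real detection logic may differ.

def detect_device_type() -> str:
    """Return "cuda", "xpu", "mps", or "cpu" based on what torch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed: fall back to CPU
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPUs, and AMD GPUs on ROCm builds of torch
    xpu = getattr(torch, "xpu", None)  # only present in newer torch builds
    if xpu is not None and xpu.is_available():
        return "xpu"  # Intel GPUs
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple silicon
    return "cpu"


# Assumed mapping, following the pairs listed in the Features section.
# "ccl" may be "xccl" on some builds; "mps" -> "gloo" is an assumption.
BACKEND_FOR_DEVICE = {"cuda": "nccl", "xpu": "ccl", "mps": "gloo", "cpu": "gloo"}


def pick_backend(device_type: str) -> str:
    """Map a device type to a collective backend, defaulting to gloo."""
    return BACKEND_FOR_DEVICE.get(device_type, "gloo")


device_type = detect_device_type()
backend = pick_backend(device_type)
```

Keeping the mapping in a plain dictionary makes the fallback explicit: any device type without an entry ends up on `gloo`, which runs on CPU.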
## Quick Install
```bash
uv pip install git+https://github.com/saforem2/ezpz
```
## Quick Start
```python
import torch
import ezpz
rank = ezpz.setup_torch() # auto-detects device + backend
device = ezpz.get_torch_device()
model = torch.nn.Linear(128, 10).to(device)
model = ezpz.wrap_model(model) # FSDP (default)
# Multi-dim parallelism (TP/PP/CP) on XPU? Use ezpz.init_device_mesh_safe
# instead of torch's init_device_mesh — works around xccl's missing
# split_group on Aurora/Sunspot. See https://ezpz.cool/troubleshooting/.
```
```bash
# Same command everywhere -- Mac laptop, NVIDIA cluster, Intel Aurora:
ezpz launch python3 train.py
```
For a side-by-side diff against the equivalent raw-torch boilerplate, see
the [API Cheat Sheet](https://ezpz.cool/quickstart/#api-cheat-sheet).
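
For a rough sense of what that boilerplate involves, the sketch below resolves rank information from the environment variables set by common launchers and then initializes `torch.distributed` by hand. It is not taken from the cheat sheet and is not `ezpz`'s code. The helper names (`first_int`, `resolve_ranks`, `init_raw_torch`) are hypothetical, and the lookup order across launchers is an assumption.

```python
import os

# Variables set by common launchers, in lookup order (order is an assumption).
# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; the rest come from MPI or SLURM.
RANK_VARS = ("RANK", "PMI_RANK", "OMPI_COMM_WORLD_RANK", "SLURM_PROCID")
SIZE_VARS = ("WORLD_SIZE", "PMI_SIZE", "OMPI_COMM_WORLD_SIZE", "SLURM_NTASKS")
LOCAL_RANK_VARS = ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID")


def first_int(env, names, default):
    """Return the first variable in `names` found in `env`, as an int."""
    for name in names:
        if name in env:
            return int(env[name])
    return default


def resolve_ranks(env=None):
    """Return (rank, world_size, local_rank), defaulting to a single process."""
    env = os.environ if env is None else env
    return (
        first_int(env, RANK_VARS, 0),
        first_int(env, SIZE_VARS, 1),
        first_int(env, LOCAL_RANK_VARS, 0),
    )


def init_raw_torch(backend="gloo"):
    """Hand-written setup; torch is imported lazily so the helpers run without it."""
    import torch.distributed as dist

    rank, world_size, local_rank = resolve_ranks()
    os.environ.setdefault("MASTER_ADDR", "localhost")  # single-node default
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```

Multi-node runs also need `MASTER_ADDR` to point at a reachable host, which this sketch leaves to the launcher.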
## CLI
```bash
ezpz launch python3 train.py # launch distributed training
ezpz submit -N 2 -q debug -- python3 train.py # submit batch job to PBS/SLURM
ezpz test # smoke-test your setup
ezpz benchmark # run + compare example benchmarks
ezpz doctor # diagnose environment issues
```
## Why ezpz?
Compared to the alternatives:
- **vs raw `torchrun` / `mpirun` / `srun`**: one launcher that detects your
scheduler, builds the right command, and works on a laptop too.
- **vs `accelerate`**: smaller API surface, no config files, and designed for
HPC schedulers from the ground up rather than retrofitted.
- **vs `DeepSpeed`**: not an alternative — `ezpz` wraps your distributed
init so you can still use DeepSpeed (or anything else) underneath.
See the [full comparison](https://ezpz.cool/compare/) for details.
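
The first comparison point, detecting the scheduler and building the matching command, can be pictured with a small sketch. This is an illustration under stated assumptions, not what `ezpz launch` actually runs: `detect_scheduler` and `build_launch_command` are hypothetical names, and a real launch line on a given machine usually carries extra flags (hostfile, processes per node, CPU binding) that are omitted here.

```python
import os


def detect_scheduler(env=None) -> str:
    """Guess the active scheduler from the variables it exports into a job."""
    env = os.environ if env is None else env
    if "PBS_JOBID" in env or "PBS_NODEFILE" in env:
        return "pbs"
    if "SLURM_JOB_ID" in env:
        return "slurm"
    return "local"  # laptop or workstation, no batch scheduler


def build_launch_command(scheduler, nprocs, command):
    """Prefix `command` with a launcher suited to the scheduler (assumed mapping)."""
    if scheduler == "slurm":
        return ["srun", "-n", str(nprocs), *command]
    if scheduler == "pbs":
        return ["mpiexec", "-n", str(nprocs), *command]
    return ["mpirun", "-np", str(nprocs), *command]


argv = build_launch_command(detect_scheduler(), 2, ["python3", "train.py"])
```

Returning an argument list rather than a shell string means the result can be passed directly to `subprocess.run` without quoting concerns.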
## Documentation
Full documentation is available at [**ezpz.cool**](https://ezpz.cool).
Useful entry points:
- 🏃 [Quickstart](https://ezpz.cool/quickstart/) — install → script → launch in 5 minutes
- 🎓 [Distributed Training Tutorial](https://ezpz.cool/guides/distributed-training/) — progressive hello-world → FSDP+TP
- 🍳 [Recipes](https://ezpz.cool/recipes/) — copy-pasteable patterns (checkpointing, gradient accumulation, MFU tracking)
- 🔧 [Troubleshooting](https://ezpz.cool/troubleshooting/) — XPU FSDP2 hangs, NCCL/CCL errors, scheduler issues
- 📝 [Examples](https://ezpz.cool/examples/) — runnable end-to-end (FSDP, ViT, Diffusion, HF Trainer)