{"id":21986930,"url":"https://github.com/saforem2/ezpz","last_synced_at":"2026-06-26T06:01:09.106Z","repository":{"id":194416398,"uuid":"690760093","full_name":"saforem2/ezpz","owner":"saforem2","description":"Write once, run anywhere; ezpz 🍋 ","archived":false,"fork":false,"pushed_at":"2026-06-24T12:19:39.000Z","size":16039,"stargazers_count":34,"open_issues_count":1,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-24T14:11:47.363Z","etag":null,"topics":["ai-tools","deepspeed","distributed-training","fsdp","launcher","machine-learning","mpi","mpi4py","parallelism","python","pytorch","rich","slurm","torch","training"],"latest_commit_sha":null,"homepage":"http://ezpz.cool/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saforem2.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2023-09-12T20:23:20.000Z","updated_at":"2026-06-24T12:19:43.000Z","dependencies_parsed_at":"2023-10-14T23:54:06.671Z","dependency_job_id":"deeda46d-a8bf-41d4-9b56-49e209b1380f","html_url":"https://github.com/saforem2/ezpz","commit_stats":null,"previous_names":["saforem2/ezpz"],"tags_count":51,"template":false,"template_full_name":null,"purl":"pkg:github/saforem2/ezpz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fezpz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fezpz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fezpz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fezpz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saforem2","download_url":"https://codeload.github.com/saforem2/ezpz/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saforem2%2Fezpz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34805072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-26T02:00:06.560Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","deepspeed","distributed-training","fsdp","launcher","machine-learning","mpi","mpi4py","parallelism","python","pytorch","rich","slurm","torch","training"],"created_at":"2024-11-29T18:22:52.913Z","updated_at":"2026-06-26T06:01:09.101Z","avatar_url":"https://github.com/saforem2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🍋 ezpz\n\n\u003e Write once, run anywhere.\n\n`ezpz` makes distributed PyTorch launches portable across any supported\nhardware {NVIDIA, AMD, Intel, MPS, CPU} with **zero code changes**.\n\n## Features\n\n- **Multi-hardware** — automatic device detection and backend selection\n  (CUDA/NCCL, XPU/CCL, MPS, CPU/Gloo)\n- **Zero code changes** — same script runs on a laptop, a single GPU, or a\n  thousand-node supercomputer\n- **HPC integration** — native PBS and SLURM support with automatic hostfile\n  discovery and rank assignment\n- **Metric tracking** — built-in `History` class for recording, plotting, and\n  saving training metrics\n- **CLI tools** — `ezpz launch`, `ezpz test`, `ezpz doctor` for launching\n  jobs, smoke-testing, and diagnostics\n\n## Quick Install\n\n```bash\nuv pip install git+https://github.com/saforem2/ezpz\n```\n\n## Quick Start\n\n```python\nimport torch\nimport ezpz\n\nrank = ezpz.setup_torch()           # auto-detects device + backend\ndevice = ezpz.get_torch_device()\nmodel = torch.nn.Linear(128, 10).to(device)\nmodel = ezpz.wrap_model(model)       # FSDP (default)\n\n# Multi-dim parallelism (TP/PP/CP) on XPU? Use ezpz.init_device_mesh_safe\n# instead of torch's init_device_mesh — works around xccl's missing\n# split_group on Aurora/Sunspot. See https://ezpz.cool/troubleshooting/.\n```\n\n```bash\n# Same command everywhere -- Mac laptop, NVIDIA cluster, Intel Aurora:\nezpz launch python3 train.py\n```\n\nFor a side-by-side diff against the equivalent raw-torch boilerplate, see\nthe [API Cheat Sheet](https://ezpz.cool/quickstart/#api-cheat-sheet).\n\n## CLI\n\n```bash\nezpz launch python3 train.py    # launch distributed training\nezpz submit -N 2 -q debug -- python3 train.py   # submit batch job to PBS/SLURM\nezpz test                       # smoke-test your setup\nezpz benchmark                  # run + compare example benchmarks\nezpz doctor                     # diagnose environment issues\n```\n\n## Why ezpz?\n\nCompared to the alternatives:\n\n- **vs raw `torchrun` / `mpirun` / `srun`**: one launcher that detects your\n  scheduler, builds the right command, and works on a laptop too.\n- **vs `accelerate`**: lower surface area, no config files, designed for\n  HPC schedulers from the ground up rather than retrofitted.\n- **vs `DeepSpeed`**: not an alternative — `ezpz` wraps your distributed\n  init so you can still use DeepSpeed (or anything else) underneath.\n\nSee the [full comparison](https://ezpz.cool/compare/) for details.\n\n## Documentation\n\nFull documentation is available at [**ezpz.cool**](https://ezpz.cool).\n\nUseful entry points:\n\n- 🏃‍♂️ [Quickstart](https://ezpz.cool/quickstart/) — install → script → launch in 5 minutes\n- 🎓 [Distributed Training Tutorial](https://ezpz.cool/guides/distributed-training/) — progressive hello-world → FSDP+TP\n- 🍳 [Recipes](https://ezpz.cool/recipes/) — copy-pasteable patterns (checkpointing, gradient accumulation, MFU tracking)\n- 🔧 [Troubleshooting](https://ezpz.cool/troubleshooting/) — XPU FSDP2 hangs, NCCL/CCL errors, scheduler issues\n- 📝 [Examples](https://ezpz.cool/examples/) — runnable end-to-end (FSDP, ViT, Diffusion, HF Trainer)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fezpz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaforem2%2Fezpz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaforem2%2Fezpz/lists"}