https://github.com/hansbug/torchrun_demos
- Host: GitHub
- URL: https://github.com/hansbug/torchrun_demos
- Owner: HansBug
- Created: 2025-05-12T07:40:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-12T09:14:17.000Z (about 1 year ago)
- Last Synced: 2025-05-12T09:39:26.020Z (about 1 year ago)
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# torchrun_demos
Demos for torch DDP
Install the dependencies (Python 3.10+ recommended)
```shell
pip install -r requirements.txt
```
Run a standard elastic task with no injected crashes (8 GPUs, 1 node, crash ratio 0)
```shell
torchrun --nproc_per_node=8 ddp_demo_elastic.py --workdir runs/elastic_demo_1 --fuck-ratio 0.0
```
Run an elastic task in which any process may crash with 1% probability at the end of each epoch (8 GPUs, 1 node)
```shell
torchrun --nproc_per_node=8 --max-restarts=1000 ddp_demo_elastic.py --workdir runs/elastic_demo_1x
```
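The crash-and-restart behaviour exercised by this command can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `ddp_demo_elastic.py`: the helper names (`load_start_epoch`, `save_checkpoint`, `maybe_crash`, `run_epochs`), the `checkpoint.json` file name, and its layout are all hypothetical, and the training step itself is left out. The idea is that progress is saved at the end of every epoch, so a worker that exits with a non-zero code can pick up where it left off once `torchrun` restarts the worker group.

```python
import json
import os
import random
import sys


def load_start_epoch(workdir):
    """Return the epoch to resume from, or 0 when no checkpoint exists."""
    path = os.path.join(workdir, "checkpoint.json")  # hypothetical file name
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_epoch"]


def save_checkpoint(workdir, next_epoch):
    """Record which epoch should run next (in a real job, on rank 0 only)."""
    os.makedirs(workdir, exist_ok=True)
    path = os.path.join(workdir, "checkpoint.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_epoch": next_epoch}, f)
    os.replace(tmp, path)  # rename into place so readers never see a partial file


def maybe_crash(ratio, rng=random):
    """Exit with a non-zero code with probability `ratio`."""
    if rng.random() < ratio:
        sys.exit(1)


def run_epochs(workdir, total_epochs, crash_ratio, rng=random):
    """Run the remaining epochs, checkpointing before each possible crash."""
    for epoch in range(load_start_epoch(workdir), total_epochs):
        # ... one epoch of training would go here ...
        save_checkpoint(workdir, epoch + 1)
        maybe_crash(crash_ratio, rng)
```

Saving before the simulated crash means a restarted run repeats no finished epoch; a crash ratio of `0.0` disables the injection entirely, since `rng.random()` never returns a negative value.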
Run a standard elastic task with no injected crashes (2 nodes, 2 GPUs/node, crash ratio 0)
```shell
torchrun \
  --nnodes=${MLP_WORKER_NUM} \
  --node_rank=${MLP_ROLE_INDEX} \
  --nproc_per_node=${MLP_WORKER_GPU} \
  --master_addr=${MLP_WORKER_0_PRIMARY_HOST} \
  --master_port=${MLP_WORKER_0_PORT} \
  --max_restarts=1000 \
  --rdzv_id=${MLP_TASK_ID} \
  ddp_demo_elastic.py \
  --workdir runs/elastic_demo_2nodes --fuck-ratio 0.0
```
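Each worker started by `torchrun` can find its place in the job from environment variables such as `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, and `TORCHELASTIC_RESTART_COUNT`. The sketch below shows one way a script might read them; `DistEnv` and `read_dist_env` are hypothetical names introduced for illustration and are not taken from `ddp_demo_elastic.py`. The single-process fallbacks are likewise an assumption, added so the snippet also runs without `torchrun`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int           # global rank of this worker across all nodes
    local_rank: int     # rank of this worker within its own node
    world_size: int     # total number of workers in the job
    restart_count: int  # how many times the worker group has been restarted


def read_dist_env(environ=os.environ):
    """Read the per-worker settings exported by torchrun.

    Falls back to single-process defaults when a variable is missing
    (an assumption made for this sketch).
    """
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
        restart_count=int(environ.get("TORCHELASTIC_RESTART_COUNT", 0)),
    )
```

With 2 nodes and 2 GPUs per node, `world_size` would be 4, `rank` would range over 0 to 3, and `local_rank` would be 0 or 1 on each node. The local rank is what a script would typically use to pick its GPU.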
Run an elastic task in which any process may crash with 1% probability at the end of each epoch (2 nodes, 2 GPUs/node)
```shell
torchrun \
  --nnodes=1:${MLP_WORKER_NUM} \
  --node_rank=${MLP_ROLE_INDEX} \
  --nproc_per_node=${MLP_WORKER_GPU} \
  --master_addr=${MLP_WORKER_0_PRIMARY_HOST} \
  --master_port=${MLP_WORKER_0_PORT} \
  --max_restarts=1000 \
  --rdzv_id=${MLP_TASK_ID} \
  ddp_demo_elastic.py \
  --workdir runs/elastic_demo_2nodes_x
```