{"id":20796883,"url":"https://github.com/hkproj/pytorch-transformer-distributed","last_synced_at":"2025-05-06T09:18:36.109Z","repository":{"id":211367323,"uuid":"728909209","full_name":"hkproj/pytorch-transformer-distributed","owner":"hkproj","description":"Distributed training (multi-node) of a Transformer model","archived":false,"fork":false,"pushed_at":"2024-04-10T16:56:21.000Z","size":4231,"stargazers_count":66,"open_issues_count":0,"forks_count":29,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-06T09:18:18.953Z","etag":null,"topics":["collective-communication","data-parallelism","deep-learning","distributed-data-parallel","distributed-training","gradient-accumulation","machine-learning","model-parallelism","pytorch","tutorial"],"latest_commit_sha":null,"homepage":"https://www.youtube.com/watch?v=toUSzwR0EV8","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hkproj.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-08T00:52:38.000Z","updated_at":"2025-05-04T16:16:23.000Z","dependencies_parsed_at":"2023-12-19T05:07:33.533Z","dependency_job_id":"14f1436f-d388-4182-b05f-1cf4f6ef71ce","html_url":"https://github.com/hkproj/pytorch-transformer-distributed","commit_stats":null,"previous_names":["hkproj/pytorch-transformer-distributed"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkproj%2Fpytorch-transformer-distributed","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkproj%2Fpytor
ch-transformer-distributed/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkproj%2Fpytorch-transformer-distributed/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkproj%2Fpytorch-transformer-distributed/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hkproj","download_url":"https://codeload.github.com/hkproj/pytorch-transformer-distributed/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252655006,"owners_count":21783374,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["collective-communication","data-parallelism","deep-learning","distributed-data-parallel","distributed-training","gradient-accumulation","machine-learning","model-parallelism","pytorch","tutorial"],"created_at":"2024-11-17T16:29:20.375Z","updated_at":"2025-05-06T09:18:36.084Z","avatar_url":"https://github.com/hkproj.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pytorch-transformer-distributed\n\nDistributed training of an attention model. Forked from: [hkproj/pytorch-transformer](https://github.com/hkproj/pytorch-transformer)\n\n## Instructions for Paperspace\n\n### Machines\n\nMake sure to create everything in the same region. I used `East Coast (NY2)`.\n\n1. Create 1x Private network. Assign both computers to the private network when creating the machines.\n2. Create 2x nodes of `P4000x2` (multi-GPU) with `ML-in-a-Box` as operating system\n3. 
Create 1 Network drive (250 GB)\n\n### Setup\n\nLog in to each machine and perform the following steps:\n\n1. `sudo apt-get update`\n2. `sudo apt-get install net-tools`\n3. If you get an error about `seahorse` while installing `net-tools`, do the following:\n   1. `sudo rm /var/lib/dpkg/info/seahorse.list`\n   2. `sudo apt-get install seahorse --reinstall`\n4. Get each machine's private IP address using `ifconfig`\n5. Add the IP address and hostname of each slave node to the `/etc/hosts` file on the master node\n6. Mount the network drive\n   1. `sudo apt-get install smbclient`\n   2. `sudo apt-get install cifs-utils`\n   3. `sudo mkdir /mnt/training-data`\n   4. Replace the following values in the command below:\n      1. `NETWORK_DRIVE_IP` with the IP address of the network drive\n      2. `NETWORK_SHARE_NAME` with the name of the network share\n      3. `NETWORK_DRIVE_USERNAME` with the username of the network drive\n   5. `sudo mount -t cifs //NETWORK_DRIVE_IP/NETWORK_SHARE_NAME /mnt/training-data -o uid=1000,gid=1000,rw,user,username=NETWORK_DRIVE_USERNAME`\n      1. Type the drive's password when prompted\n7. `git clone https://github.com/hkproj/pytorch-transformer-distributed`\n8. `cd pytorch-transformer-distributed`\n9. `pip install -r requirements.txt`\n10. Log in to Weights \u0026 Biases\n    1. `wandb login`\n    2. Copy the API key from the browser and paste it into the terminal\n11. Run one of the training commands below\n\n### Local training\n\nRun the following command on any machine. 
Make sure not to run it on both machines, otherwise they will end up overwriting each other's checkpoints.\n\n`torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder \"/mnt/training-data/weights\"`\n\n### Distributed training\n\nRun the following command on each machine (replace `IP_ADDR_MASTER_NODE` with the IP address of the master node):\n\n`torchrun --nproc_per_node=2 --nnodes=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=IP_ADDR_MASTER_NODE:48123 train.py --batch_size 8 --model_folder \"/mnt/training-data/weights\"`\n\n### Monitoring\n\nLog in to Weights \u0026 Biases to monitor the training progress: https://app.wandb.ai/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkproj%2Fpytorch-transformer-distributed","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhkproj%2Fpytorch-transformer-distributed","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkproj%2Fpytorch-transformer-distributed/lists"}