{"id":13870051,"url":"https://github.com/asprenger/distributed-training-patterns","last_synced_at":"2025-07-15T20:31:03.909Z","repository":{"id":129601698,"uuid":"149883102","full_name":"asprenger/distributed-training-patterns","owner":"asprenger","description":"Experiments with low level communication patterns that are useful for distributed training.","archived":false,"fork":false,"pushed_at":"2018-11-14T18:56:34.000Z","size":9,"stargazers_count":5,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-06T21:22:58.948Z","etag":null,"topics":["distributed-training","horovod","mpi","mpi4py","nccl","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asprenger.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2018-09-22T14:12:48.000Z","updated_at":"2023-08-25T02:44:01.000Z","dependencies_parsed_at":"2023-03-20T18:03:58.984Z","dependency_job_id":null,"html_url":"https://github.com/asprenger/distributed-training-patterns","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asprenger%2Fdistributed-training-patterns","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asprenger%2Fdistributed-training-patterns/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asprenger%2Fdistributed-training-patterns/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asprenger%2F
distributed-training-patterns/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asprenger","download_url":"https://codeload.github.com/asprenger/distributed-training-patterns/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226068160,"owners_count":17568706,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distributed-training","horovod","mpi","mpi4py","nccl","tensorflow"],"created_at":"2024-08-05T20:01:26.911Z","updated_at":"2024-11-23T16:31:00.765Z","avatar_url":"https://github.com/asprenger.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Installations\n\n## Install OpenMPI from source\n\nInstall OpenMPI to `/usr/local`:\n\n\twget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz\n\ttar xzf openmpi-3.1.2.tar.gz\n\tcd openmpi-3.1.2\n\t./configure --prefix=/usr/local\n\tmake all\n\tsudo make install\n\nExecuting `mpirun` requires setting `LD_LIBRARY_PATH`:\n\n\texport LD_LIBRARY_PATH=/usr/local/lib\n\n## Install mpi4py\n\nMPI for Python provides MPI bindings for Python. 
Check out the docs: [MPI for Python](https://mpi4py.readthedocs.io/en/stable/).\n\nInstall the module:\n\n\tpip install mpi4py\n\n## Install NCCL\n\n    tar -xvf  nccl_2.2.12-1+cuda9.0_x86_64.txz\n    sudo mkdir /usr/local/nccl-2.2.12\n    sudo cp -r nccl_2.2.12-1+cuda9.0_x86_64/* /usr/local/nccl-2.2.12\n\nCreate a file `/etc/ld.so.conf.d/nccl.conf` with content:\n\n    /usr/local/nccl-2.2.12/lib\n\nRun `ldconfig` to refresh the dynamic linker cache so that the NCCL libraries can be found:\n\n    sudo ldconfig \n\nCreate a symbolic link for the NCCL header file:\n\n    sudo ln -s /usr/local/nccl-2.2.12/include/nccl.h /usr/include/nccl.h\n\n## Install Horovod\n\nHere is the link to the [Horovod docs](https://github.com/uber/horovod/tree/master/docs)\n\nFor installation on machines with GPUs read this: [Horovod GPU page](https://github.com/uber/horovod/blob/master/docs/gpus.md)\n\nInstall Horovod with NCCL support:\n\n    HOROVOD_NCCL_HOME=/usr/local/nccl-2.2.12 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod\n\n\n# OpenMPI\n\nThere are a number of things to consider when running Python MPI applications on multiple machines:\n * When using Anaconda, the Anaconda `./bin` directory must be in the `PATH` before the system Python executables\n * The MPI `./bin` directory must be in the `PATH`\n * The MPI `./lib` directory must be in the `LD_LIBRARY_PATH`\n * Your Python code must exist on all machines in the same location\n\nMPI uses a non-interactive shell for launching processes on remote machines. A reliable way to set the `PATH`\nand `LD_LIBRARY_PATH` variables is to add them to `/etc/environment`.\n\nMPI uses ssh to launch processes on remote machines. Install SSH keys so that ssh works without passwords\nbetween machines that serve MPI processes.\n\nThe MPI processes on different machines communicate over TCP connections. 
Make sure there are no firewalls\nblocking the communication.\n\nYou should also disable SSH host key checking by creating a file `~/.ssh/config` with content:\n\n\tHost *\n\t   StrictHostKeyChecking no\n\t   UserKnownHostsFile=/dev/null\n\nRunning `env` from one MPI machine on another MPI machine is a good test to check all of the points above:\n\n\tssh ${REMOTE_SERVER_HOST} env\n\n## Byte Transfer Layer (BTL)\n\n * `vader` BTL is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. \n   This BTL can only be used between processes executing on the same node.\n * `sm` BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth mechanism for transferring data between \n   two processes via shared memory. This BTL can only be used between processes executing on the same node.\n * `tcp` BTL directs Open MPI to use TCP-based communications over IP interfaces / networks.\n\n\n## MPI Examples\n\nCreate an MPI hostfile `mpi_hosts` that specifies the network addresses and the number of slots:\n\n\t${HOSTNAME1} slots=${NB_SLOTS}\n\t${HOSTNAME2} slots=${NB_SLOTS}\n\t...\n\n### Point to point\n\nSend data from one process to another.\n\n\tmpirun -np 2 --hostfile mpi_hosts --mca btl self,tcp python mpi_point_to_point.py\n\n### Broadcasting\n\nBroadcasting takes a variable and sends an exact copy of it to all processes.\n\n\tmpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_broadcast.py\n\t\u003e Rank:  0 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]\n\t\u003e Rank:  1 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]\n\t\u003e Rank:  2 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]\n\t\u003e Rank:  3 , data received:  [0. 
0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]\n \t\n### Scattering\n\nScatter takes an array and distributes contiguous sections of it to different processes. \n\n\tmpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_scatter.py\n\t\u003e Rank:  0 , recvbuf received:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]\n\t\u003e Rank:  1 , recvbuf received:  [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]\n\t\u003e Rank:  2 , recvbuf received:  [21. 22. 23. 24. 25. 26. 27. 28. 29. 30.]\n\t\u003e Rank:  3 , recvbuf received:  [31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]\n\n\n### Gathering\n\nThe reverse of a scatter is a gather, which takes subsets of an array that are distributed across the processes, \nand gathers them back into the full array.\n\n\tmpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_gather.py\n\t\u003e Rank:  0 , sendbuf:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]\n\t\u003e Rank:  1 , sendbuf:  [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]\n\t\u003e Rank:  2 , sendbuf:  [21. 22. 23. 24. 25. 26. 27. 28. 29. 30.]\n\t\u003e Rank:  3 , sendbuf:  [31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]\n\t\u003e Rank:  0 , recvbuf received:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]\n\n\n### Reduce\n\nThe reduce operation takes values from an array on each process and reduces them to a single result on the root process.\n\n\tmpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_reduce.py\n\t\u003e Rank:  0  value =  0.0\n\t\u003e Rank:  1  value =  1.0\n\t\u003e Rank:  2  value =  2.0\n\t\u003e Rank:  3  value =  3.0\n\t\u003e Rank 0: value_sum = 6.0\n\t\u003e Rank 0: value_max = 3.0\n\n### Allreduce\n\nThe allreduce operation takes values from an array on each process, reduces them to a single result, and sends the result \nto each process. 
Note that the communication pattern is much more complex compared to the reduce operation.\n\n\tmpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_allreduce.py\n\t\u003e Rank  0 value= 0.0\n\t\u003e Rank  1 value= 1.0\n\t\u003e Rank  2 value= 2.0\n\t\u003e Rank  3 value= 3.0\n\t\u003e Rank 0 value_sum= 6.0\n\t\u003e Rank 0 value_max= 3.0\n\t\u003e Rank 1 value_sum= 6.0\n\t\u003e Rank 2 value_sum= 6.0\n\t\u003e Rank 2 value_max= 3.0\n\t\u003e Rank 3 value_sum= 6.0\n\t\u003e Rank 3 value_max= 3.0\n\t\u003e Rank 1 value_max= 3.0\n\n\n# Horovod primitive examples\n\n## Horovod allreduce operation\n\n    mpirun -np 2 \\\n        --hostfile mpi_hosts \\\n        -bind-to none -map-by slot \\\n        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \\\n        -mca pml ob1 --mca btl self,tcp \\\n        python -u hvd_allreduce.py\n\n\n## Horovod broadcast_global_variables operation\n\n    mpirun -np 2 \\\n        --hostfile mpi_hosts \\\n        -bind-to none -map-by slot \\\n        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \\\n        -mca pml ob1 --mca btl self,tcp \\\n        python -u hvd_broadcast.py\n\n\n## Horovod allgather operation\n\n    mpirun -np 2 \\\n        --hostfile mpi_hosts \\\n        -bind-to none -map-by slot \\\n        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \\\n        -mca pml ob1 --mca btl self,tcp \\\n        python -u hvd_allgather.py\n\n\n# Horovod Tensorflow example tensorflow_mnist.py\n\nTo run on a machine with 8 GPUs:\n\n    time mpirun -np 2 \\\n        --hostfile mpi_hosts \\\n        -bind-to none -map-by slot \\\n        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \\\n        -mca pml ob1 --mca btl self,tcp \\\n        python 
tensorflow_mnist.py\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasprenger%2Fdistributed-training-patterns","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasprenger%2Fdistributed-training-patterns","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasprenger%2Fdistributed-training-patterns/lists"}