{"id":13590542,"url":"https://github.com/NVIDIA/multi-gpu-programming-models","last_synced_at":"2025-04-08T13:31:23.836Z","repository":{"id":37382188,"uuid":"133061133","full_name":"NVIDIA/multi-gpu-programming-models","owner":"NVIDIA","description":"Examples demonstrating available options to program multiple GPUs in a single node or a cluster","archived":false,"fork":false,"pushed_at":"2025-02-21T14:32:35.000Z","size":285,"stargazers_count":674,"open_issues_count":0,"forks_count":119,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-04-07T04:05:18.766Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-11T16:07:11.000Z","updated_at":"2025-04-06T11:04:10.000Z","dependencies_parsed_at":"2024-08-14T15:51:52.876Z","dependency_job_id":"28a01c61-4c50-432c-b859-763f8ed5a9cc","html_url":"https://github.com/NVIDIA/multi-gpu-programming-models","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fmulti-gpu-programming-models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fmulti-gpu-programming-models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fmulti-gpu-programming-models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NV
IDIA%2Fmulti-gpu-programming-models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/multi-gpu-programming-models/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247851620,"owners_count":21006789,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T16:00:47.435Z","updated_at":"2025-04-08T13:31:23.829Z","avatar_url":"https://github.com/NVIDIA.png","language":"Cuda","readme":"# Multi GPU Programming Models\nThis project implements the well-known multi GPU Jacobi solver with different multi GPU Programming Models:\n* `single_threaded_copy`        Single Threaded using cudaMemcpy for inter GPU communication\n* `multi_threaded_copy`         Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication\n* `multi_threaded_copy_overlap` Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication with overlapping communication\n* `multi_threaded_p2p`          Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication\n* `multi_threaded_p2p_opt`      Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication with delayed norm execution\n* `multi_threaded_um`           Multi Threaded with OpenMP relying on transparent peer mappings with Unified Memory for inter GPU communication\n* `mpi`                         Multi Process with MPI using CUDA-aware MPI for inter GPU communication\n* `mpi_overlap`                 Multi Process with MPI using 
CUDA-aware MPI for inter GPU communication with overlapping communication\n* `nccl`                        Multi Process with MPI and NCCL using NCCL for inter GPU communication\n* `nccl_overlap`                Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication\n* `nccl_graphs`                 Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication combined with CUDA Graphs\n* `nvshmem`                     Multi Process with MPI and NVSHMEM using NVSHMEM for inter GPU communication.\n* `multi_node_p2p`              Multi Process Multi Node variant using the low-level CUDA Driver [Virtual Memory Management](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#virtual-memory-management) and [Multicast Object Management](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MULTICAST.html#group__CUDA__MULTICAST) APIs. This example is for developers of libraries like NCCL or NVSHMEM. It shows how higher-level programming models like NVSHMEM work internally within a (multinode) NVLink domain. 
Application developers generally should use the higher-level MPI, NCCL, or NVSHMEM interfaces instead of this API.\n\nEach variant is a stand-alone `Makefile` project and most variants have been discussed in various GTC Talks, e.g.:\n* `single_threaded_copy`, `multi_threaded_copy`, `multi_threaded_copy_overlap`, `multi_threaded_p2p`, `multi_threaded_p2p_opt`, `mpi`, `mpi_overlap` and `nvshmem` on DGX-1V at GTC Europe 2017 in 23031 - Multi GPU Programming Models\n* `single_threaded_copy`, `multi_threaded_copy`, `multi_threaded_copy_overlap`, `multi_threaded_p2p`, `multi_threaded_p2p_opt`, `mpi`, `mpi_overlap` and `nvshmem` on DGX-2 at GTC 2019 in S9139 - Multi GPU Programming Models\n* `multi_threaded_copy`, `multi_threaded_copy_overlap`, `multi_threaded_p2p`, `multi_threaded_p2p_opt`, `mpi`, `mpi_overlap`, `nccl`, `nccl_overlap` and `nvshmem` on DGX A100 at GTC 2021 in [A31140 - Multi-GPU Programming Models](https://www.nvidia.com/en-us/on-demand/session/gtcfall21-a31140/)\n\nSome examples in this repository are the basis for an interactive tutorial: [FZJ-JSC/tutorial-multi-gpu](https://github.com/FZJ-JSC/tutorial-multi-gpu).\n\n# Requirements\n* CUDA: version 11.0 (9.2 if built with `DISABLE_CUB=1`) or later is required by all variants.\n  * `nccl_graphs` requires NCCL 2.15.1, CUDA 11.7 and CUDA Driver 515.65.01 or newer.\n  * `multi_node_p2p` requires CUDA 12.4, a CUDA Driver 550.54.14 or newer and the NVIDIA IMEX daemon running.\n* OpenMP-capable compiler: Required by the Multi Threaded variants. The examples have been developed and tested with gcc.\n* MPI: The `mpi` and `mpi_overlap` variants require a CUDA-aware[^1] implementation. For NVSHMEM, NCCL and `multi_node_p2p`, a non-CUDA-aware MPI is sufficient. 
The examples have been developed and tested with OpenMPI.\n* NVSHMEM (version 0.4.1 or later): Required by the NVSHMEM variant.\n* NCCL (version 2.8 or later): Required by the NCCL variant.\n\n# Building\nEach variant comes with a `Makefile` and can be built by simply issuing `make`, e.g.\n```sh\nmulti-gpu-programming-models$ cd multi_threaded_copy\nmulti_threaded_copy$ make\nnvcc -DHAVE_CUB -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -ldl -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 jacobi.cu -o jacobi\nmulti_threaded_copy$ ls jacobi\njacobi\n```\n\n# Run instructions\nAll variants have the following command line options:\n* `-niter`: How many iterations to carry out (default 1000)\n* `-nccheck`: How often to check for convergence (default 1)\n* `-nx`: Size of the domain in x direction (default 16384)\n* `-ny`: Size of the domain in y direction (default 16384)\n* `-csv`: Print performance results as CSV\n* `-use_hp_streams`: In `mpi_overlap` use high-priority streams to hide kernel launch latencies of boundary kernels.\n\nThe `nvshmem` variant additionally provides:\n* `-use_block_comm`: Use block-cooperative `nvshmemx_float_put_nbi_block` instead of `nvshmem_float_p` for communication.\n* `-norm_overlap`: Enable delayed norm execution as also implemented in `multi_threaded_p2p_opt`.\n* `-neighborhood_sync`: Use custom neighbor-only sync instead of `nvshmemx_barrier_all_on_stream`.\n\nThe `multi_node_p2p` variant additionally provides:\n* `-use_mc_red`: Use a device-side barrier and allreduce leveraging Multicast Objects instead of MPI primitives.\n\nThe `nccl` variants additionally provide:\n* `-user_buffer_reg`: Avoid extra internal copies in NCCL communication with [User Buffer Registration](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/bufferreg.html#user-buffer-registration). Required NCCL APIs are available with NCCL 2.19.1 or later. 
NCCL 2.23.4 added support for the communication pattern used here.\n\nThe provided script `bench.sh` contains some examples executing all the benchmarks presented in the GTC Talks referenced above.\n\n# Developer guide\nThe code applies the style guide defined in the [`.clang-format`](.clang-format) file. [`clang-format`](https://clang.llvm.org/docs/ClangFormat.html) version 7 or later should be used to format the code prior to submitting it, e.g. with\n```sh\nmulti-gpu-programming-models$ cd multi_threaded_copy\nmulti_threaded_copy$ clang-format -style=file -i jacobi.cu\n```\n\n[^1]: A check for CUDA-aware support is done at compile and run time (see [the OpenMPI FAQ](https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support) for details). If your CUDA-aware MPI implementation does not support this check, which requires `MPIX_CUDA_AWARE_SUPPORT` and `MPIX_Query_cuda_support()` to be defined in `mpi-ext.h`, it can be skipped by setting `SKIP_CUDA_AWARENESS_CHECK=1`.\n","funding_links":[],"categories":["Cuda","Frameworks","Graphics"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fmulti-gpu-programming-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2Fmulti-gpu-programming-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fmulti-gpu-programming-models/lists"}