{"id":13645209,"url":"https://github.com/NVIDIA/gdrcopy","last_synced_at":"2025-04-21T13:32:14.854Z","repository":{"id":24336717,"uuid":"27734219","full_name":"NVIDIA/gdrcopy","owner":"NVIDIA","description":"A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology","archived":false,"fork":false,"pushed_at":"2024-10-21T03:04:17.000Z","size":695,"stargazers_count":875,"open_issues_count":54,"forks_count":144,"subscribers_count":55,"default_branch":"master","last_synced_at":"2024-10-29T18:09:47.614Z","etag":null,"topics":["gpu-memory","gpudirect-rdma","kernel-mode-driver","libraries","linux","nvidia"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2014-12-08T20:46:42.000Z","updated_at":"2024-10-28T19:03:25.000Z","dependencies_parsed_at":"2024-01-14T09:57:21.208Z","dependency_job_id":"8a3bddeb-6f7a-41e6-9a9b-de21d8748aff","html_url":"https://github.com/NVIDIA/gdrcopy","commit_stats":{"total_commits":467,"total_committers":18,"mean_commits":"25.944444444444443","dds":0.5460385438972163,"last_synced_commit":"77d804a07afc98601320233efbc835ca5ac847a4"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fgdrcopy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fgdrcopy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fgdrc
opy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fgdrcopy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/gdrcopy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223867888,"owners_count":17216973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu-memory","gpudirect-rdma","kernel-mode-driver","libraries","linux","nvidia"],"created_at":"2024-08-02T01:02:31.256Z","updated_at":"2024-11-09T18:30:32.039Z","avatar_url":"https://github.com/NVIDIA.png","language":"C++","readme":"# GDRCopy\n\nA low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA\ntechnology.\n\n\n## Introduction\n\nWhile GPUDirect RDMA is meant for direct access to GPU memory from\nthird-party devices, it is possible to use these same APIs to create\nperfectly valid CPU mappings of the GPU memory.\n\nThe advantage of a CPU driven copy is the very small overhead\ninvolved. That might be useful when low latencies are required.\n\n\n## What is inside\n\nGDRCopy offers the infrastructure to create user-space mappings of GPU memory,\nwhich can then be manipulated as if it was plain host memory (caveats apply\nhere).\n\nA simple by-product of it is a copy library with the following characteristics:\n- very low overhead, as it is driven by the CPU. 
As a reference, currently a \n  cudaMemcpy can incur a 6-7us overhead.\n\n- An initial memory *pinning* phase is required, which is potentially expensive,\n  10us-1ms depending on the buffer size.\n\n- Fast H-D, because of write-combining. H-D bandwidth is 6-8GB/s on Ivy\n  Bridge Xeon but it is subject to NUMA effects.\n\n- Slow D-H, because the GPU BAR, which backs the mappings, can't be\n  prefetched and so burst read transactions are not generated through\n  PCIe.\n\nThe library comes with a few tests, such as:\n- gdrcopy_sanity, which contains unit tests for the library and the driver.\n- gdrcopy_copybw, a minimal application which calculates the R/W bandwidth for a specific buffer size.\n- gdrcopy_copylat, a benchmark application which calculates the R/W copy latency for a range of buffer sizes.\n- gdrcopy_apiperf, an application for benchmarking the latency of each GDRCopy API call.\n- gdrcopy_pplat, a benchmark application which calculates the round-trip ping-pong latency between GPU and CPU.\n\n## Requirements\n\nGPUDirect RDMA requires an [NVIDIA Data Center GPU](https://www.nvidia.com/en-us/data-center/) or an [NVIDIA RTX GPU](https://www.nvidia.com/en-us/design-visualization/rtx/) (formerly Tesla and Quadro) based on Kepler or newer generations; see [GPUDirect\nRDMA](http://developer.nvidia.com/gpudirect). For more general information,\nplease refer to the official GPUDirect RDMA [design\ndocument](http://docs.nvidia.com/cuda/gpudirect-rdma).\n\nThe device driver requires GPU display driver \u003e= 418.40 on ppc64le and \u003e= 331.14 on other platforms. The library and tests\nrequire CUDA \u003e= 6.0.\n\nDKMS is a prerequisite for installing the GDRCopy kernel module package. On RHEL or SLE,\nhowever, users have the option to build a kmod package and install it instead of the DKMS\npackage. See the [Build and installation](#build-and-installation) section for more details.\n\n```shell\n# On RHEL\n# dkms can be installed from epel-release. 
See https://fedoraproject.org/wiki/EPEL.\n$ sudo yum install dkms\n\n# On Debian - No additional dependency\n\n# On SLE / Leap\n# On SLE dkms can be installed from PackageHub.\n$ sudo zypper install dkms rpmbuild\n```\n\nCUDA and the GPU display driver must be installed before building and/or installing GDRCopy.\nThe installation instructions can be found at https://developer.nvidia.com/cuda-downloads.\n\nGPU display driver header files are also required. They are installed as part\nof the driver (or CUDA) installation with the *runfile* installer. If you install the driver\nvia package management, we suggest\n- On RHEL, `sudo dnf module install nvidia-driver:latest-dkms`.\n- On Debian, `sudo apt install nvidia-dkms-\u003cyour-nvidia-driver-version\u003e`.\n- On SLE, `sudo zypper install nvidia-gfx\u003cyour-nvidia-driver-version\u003e-kmp`.\n\nThe supported architectures are Linux x86\\_64, ppc64le, and arm64. The supported\nplatforms are RHEL8, RHEL9, Ubuntu20\\_04, Ubuntu22\\_04,\nSLE-15 (any SP) and Leap 15.x.\n\nRoot privileges are necessary to load/install the kernel-mode device\ndriver.\n\n\n## Build and installation\n\nWe provide three ways of building and installing GDRCopy.\n\n### rpm package\n\n```shell\n# For RHEL:\n$ sudo yum groupinstall 'Development Tools'\n$ sudo yum install dkms rpm-build make\n\n# For SLE:\n$ sudo zypper in dkms rpmbuild\n\n$ cd packages\n$ CUDA=\u003ccuda-install-top-dir\u003e ./build-rpm-packages.sh\n$ sudo rpm -Uvh gdrcopy-kmod-\u003cversion\u003edkms.noarch.\u003cplatform\u003e.rpm\n$ sudo rpm -Uvh gdrcopy-\u003cversion\u003e.\u003carch\u003e.\u003cplatform\u003e.rpm\n$ sudo rpm -Uvh gdrcopy-devel-\u003cversion\u003e.noarch.\u003cplatform\u003e.rpm\n```\nThe DKMS package is the default kernel module package that `build-rpm-packages.sh`\ngenerates. 
To create the kmod package, pass the `-m` option to the script.\nUnlike the DKMS package, the kmod package contains a prebuilt GDRCopy kernel\nmodule which is specific to the NVIDIA driver version and the Linux kernel\nversion used to build it.\n\n\n### deb package\n\n```shell\n$ sudo apt install build-essential devscripts debhelper fakeroot pkg-config dkms\n$ cd packages\n$ CUDA=\u003ccuda-install-top-dir\u003e ./build-deb-packages.sh\n$ sudo dpkg -i gdrdrv-dkms_\u003cversion\u003e_\u003carch\u003e.\u003cplatform\u003e.deb\n$ sudo dpkg -i libgdrapi_\u003cversion\u003e_\u003carch\u003e.\u003cplatform\u003e.deb\n$ sudo dpkg -i gdrcopy-tests_\u003cversion\u003e_\u003carch\u003e.\u003cplatform\u003e.deb\n$ sudo dpkg -i gdrcopy_\u003cversion\u003e_\u003carch\u003e.\u003cplatform\u003e.deb\n```\n\n### from source\n\n```shell\n$ make prefix=\u003cinstall-to-this-location\u003e CUDA=\u003ccuda-install-top-dir\u003e all install\n$ sudo ./insmod.sh\n```\n\n### Notes\n\nCompiling the gdrdrv driver requires the NVIDIA driver source code, which is typically installed at\n`/usr/src/nvidia-\u003cversion\u003e`. Our makefile automatically detects and uses that source code. If multiple\nversions are installed, it is possible to pass the correct path by defining the NVIDIA_SRC_DIR variable, e.g. `export\nNVIDIA_SRC_DIR=/usr/src/nvidia-520.61.05/nvidia` before building the gdrdrv module.\n\nThere are two major flavors of the NVIDIA driver: 1) proprietary, and 2)\n[opensource](https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/). We detect the flavor\nwhen compiling gdrdrv based on the source code of the NVIDIA driver. Different flavors come with different features and\nrestrictions:\n- gdrdrv compiled with the opensource flavor provides full functionality and high performance on all platforms. 
However,\n  you will not be able to load this gdrdrv driver when the proprietary NVIDIA driver is loaded.\n- gdrdrv compiled with the proprietary flavor can always be loaded regardless of the flavor of NVIDIA driver you have\n  loaded. However, it may have suboptimal performance on coherent platforms such as Grace-Hopper. Functionally, it will not\n  work correctly on Intel CPUs with Linux kernel built with confidential compute (CC) support, i.e.\n  `CONFIG_ARCH_HAS_CC_PLATFORM=y`, *WHEN* CC is enabled at runtime.\n\n\n## Tests\n\nExecute provided tests:\n```shell\n$ gdrcopy_sanity \nTotal: 28, Passed: 28, Failed: 0, Waived: 0\n\nList of passed tests:\n    basic_child_thread_pins_buffer_cumemalloc\n    basic_child_thread_pins_buffer_vmmalloc\n    basic_cumemalloc\n    basic_small_buffers_mapping\n    basic_unaligned_mapping\n    basic_vmmalloc\n    basic_with_tokens\n    data_validation_cumemalloc\n    data_validation_vmmalloc\n    invalidation_access_after_free_cumemalloc\n    invalidation_access_after_free_vmmalloc\n    invalidation_access_after_gdr_close_cumemalloc\n    invalidation_access_after_gdr_close_vmmalloc\n    invalidation_fork_access_after_free_cumemalloc\n    invalidation_fork_access_after_free_vmmalloc\n    invalidation_fork_after_gdr_map_cumemalloc\n    invalidation_fork_after_gdr_map_vmmalloc\n    invalidation_fork_child_gdr_map_parent_cumemalloc\n    invalidation_fork_child_gdr_map_parent_vmmalloc\n    invalidation_fork_child_gdr_pin_parent_with_tokens\n    invalidation_fork_map_and_free_cumemalloc\n    invalidation_fork_map_and_free_vmmalloc\n    invalidation_two_mappings_cumemalloc\n    invalidation_two_mappings_vmmalloc\n    invalidation_unix_sock_shared_fd_gdr_map_cumemalloc\n    invalidation_unix_sock_shared_fd_gdr_map_vmmalloc\n    invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc\n    invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc\n\n\n$ gdrcopy_copybw\nGPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00\nGPU id:1; 
name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00\nGPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00\nGPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00\nGPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00\nGPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00\nGPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00\nGPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00\nselecting device 0\ntesting size: 131072\nrounded size: 131072\ngpu alloc fn: cuMemAlloc\ndevice ptr: 7f1153a00000\nmap_d_ptr: 0x7f1172257000\ninfo.va: 7f1153a00000\ninfo.mapped_size: 131072\ninfo.page_size: 65536\ninfo.mapped: 1\ninfo.wc_mapping: 1\npage offset: 0\nuser-space pointer:0x7f1172257000\nwriting test, size=131072 offset=0 num_iters=10000\nwrite BW: 9638.54MB/s\nreading test, size=131072 offset=0 num_iters=100\nread BW: 530.135MB/s\nunmapping buffer\nunpinning buffer\nclosing gdrdrv\n\n\n$ gdrcopy_copylat\nGPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00\nGPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00\nGPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00\nGPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00\nGPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00\nGPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00\nGPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00\nGPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00\nselecting device 0\ndevice ptr: 0x7fa2c6000000\nallocated size: 16777216\ngpu alloc fn: cuMemAlloc\n\nmap_d_ptr: 0x7fa2f9af9000\ninfo.va: 7fa2c6000000\ninfo.mapped_size: 16777216\ninfo.page_size: 65536\ninfo.mapped: 1\ninfo.wc_mapping: 1\npage offset: 0\nuser-space pointer: 0x7fa2f9af9000\n\ngdr_copy_to_mapping num iters for each size: 10000\nWARNING: Measuring the API invocation overhead as observed by the CPU. 
Data\nmight not be ordered all the way to the GPU internal visibility.\nTest             Size(B)     Avg.Time(us)\ngdr_copy_to_mapping             1         0.0889\ngdr_copy_to_mapping             2         0.0884\ngdr_copy_to_mapping             4         0.0884\ngdr_copy_to_mapping             8         0.0884\ngdr_copy_to_mapping            16         0.0905\ngdr_copy_to_mapping            32         0.0902\ngdr_copy_to_mapping            64         0.0902\ngdr_copy_to_mapping           128         0.0952\ngdr_copy_to_mapping           256         0.0983\ngdr_copy_to_mapping           512         0.1176\ngdr_copy_to_mapping          1024         0.1825\ngdr_copy_to_mapping          2048         0.2549\ngdr_copy_to_mapping          4096         0.4366\ngdr_copy_to_mapping          8192         0.8141\ngdr_copy_to_mapping         16384         1.6155\ngdr_copy_to_mapping         32768         3.2284\ngdr_copy_to_mapping         65536         6.4906\ngdr_copy_to_mapping        131072        12.9761\ngdr_copy_to_mapping        262144        25.9459\ngdr_copy_to_mapping        524288        51.9100\ngdr_copy_to_mapping       1048576       103.8028\ngdr_copy_to_mapping       2097152       207.5990\ngdr_copy_to_mapping       4194304       415.2856\ngdr_copy_to_mapping       8388608       830.6355\ngdr_copy_to_mapping      16777216      1661.3285\n\ngdr_copy_from_mapping num iters for each size: 100\nTest             Size(B)     Avg.Time(us)\ngdr_copy_from_mapping           1         0.9069\ngdr_copy_from_mapping           2         1.7170\ngdr_copy_from_mapping           4         1.7169\ngdr_copy_from_mapping           8         1.7164\ngdr_copy_from_mapping          16         0.8601\ngdr_copy_from_mapping          32         1.7024\ngdr_copy_from_mapping          64         3.1016\ngdr_copy_from_mapping         128         3.4944\ngdr_copy_from_mapping         256         3.6400\ngdr_copy_from_mapping         512         2.4394\ngdr_copy_from_mapping        1024     
    2.8022\ngdr_copy_from_mapping        2048         4.6615\ngdr_copy_from_mapping        4096         7.9783\ngdr_copy_from_mapping        8192        14.9209\ngdr_copy_from_mapping       16384        28.9571\ngdr_copy_from_mapping       32768        56.9373\ngdr_copy_from_mapping       65536       114.1008\ngdr_copy_from_mapping      131072       234.9382\ngdr_copy_from_mapping      262144       496.4011\ngdr_copy_from_mapping      524288       985.5196\ngdr_copy_from_mapping     1048576      1970.7057\ngdr_copy_from_mapping     2097152      3942.5611\ngdr_copy_from_mapping     4194304      7888.9468\ngdr_copy_from_mapping     8388608     18361.5673\ngdr_copy_from_mapping    16777216     36758.8342\nunmapping buffer\nunpinning buffer\nclosing gdrdrv\n\n\n$ gdrcopy_apiperf -s 8\nGPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00\nGPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00\nGPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00\nGPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00\nGPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00\nGPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00\nGPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00\nGPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00\nselecting device 0\ndevice ptr: 0x7f1563a00000\nallocated size: 65536\nSize(B) pin.Time(us)    map.Time(us)    get_info.Time(us)   unmap.Time(us)\nunpin.Time(us)\n65536   1346.034060 3.603800    0.340270    4.700930    676.612800\nHistogram of gdr_pin_buffer latency for 65536 bytes\n[1303.852000    -   2607.704000]    93\n[2607.704000    -   3911.556000]    0\n[3911.556000    -   5215.408000]    0\n[5215.408000    -   6519.260000]    0\n[6519.260000    -   7823.112000]    0\n[7823.112000    -   9126.964000]    0\n[9126.964000    -   10430.816000]   0\n[10430.816000   -   11734.668000]   0\n[11734.668000   -   13038.520000]   0\n[13038.520000   -   14342.372000]   2\n\nclosing gdrdrv\n\n\n\n$ numactl -N 1 -l gdrcopy_pplat\nGPU id:0; 
name: NVIDIA A40; Bus id: 0000:09:00\nselecting device 0\ndevice ptr: 0x7f99d2600000\ngpu alloc fn: cuMemAlloc\nmap_d_ptr: 0x7f9a054fb000\ninfo.va: 7f99d2600000\ninfo.mapped_size: 4\ninfo.page_size: 65536\ninfo.mapped: 1\ninfo.wc_mapping: 1\npage offset: 0\nuser-space pointer: 0x7f9a054fb000\nCPU does gdr_copy_to_mapping and GPU writes back via cuMemHostAlloc'd buffer.\nRunning 1000 iterations with data size 4 bytes.\nRound-trip latency per iteration is 1.08762 us\nunmapping buffer\nunpinning buffer\nclosing gdrdrv\n```\n\n## NUMA effects\n\nDepending on the platform architecture, such as where the GPUs are placed in\nthe PCIe topology, performance may suffer if the processor driving\nthe copy is not the one hosting the GPU, for example in a\nmulti-socket server.\n\nIn the example below, GPU ID 0 is hosted by\nCPU socket 0. By explicitly setting the OS process and memory\naffinity, it is possible to run the test on the optimal processor:\n\n```shell\n$ numactl -N 0 -l gdrcopy_copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))\nGPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00\nGPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00\nGPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00\nGPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00\nGPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00\nGPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00\nGPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00\nGPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00\nselecting device 0\ntesting size: 65536\nrounded size: 65536\ngpu alloc fn: cuMemAlloc\ndevice ptr: 7f5817a00000\nmap_d_ptr: 0x7f583b186000\ninfo.va: 7f5817a00000\ninfo.mapped_size: 65536\ninfo.page_size: 65536\ninfo.mapped: 1\ninfo.wc_mapping: 1\npage offset: 0\nuser-space pointer:0x7f583b186000\nwriting test, size=65536 offset=0 num_iters=1000\nwrite BW: 9768.3MB/s\nreading test, size=65536 offset=0 num_iters=1000\nread BW: 548.423MB/s\nunmapping 
buffer\nunpinning buffer\nclosing gdrdrv\n```\n\nor on the other socket:\n```shell\n$ numactl -N 1 -l gdrcopy_copybw -d 0 -s $((64 * 1024)) -o $((0 * 1024)) -c $((64 * 1024))\nGPU id:0; name: Tesla V100-SXM2-32GB; Bus id: 0000:06:00\nGPU id:1; name: Tesla V100-SXM2-32GB; Bus id: 0000:07:00\nGPU id:2; name: Tesla V100-SXM2-32GB; Bus id: 0000:0a:00\nGPU id:3; name: Tesla V100-SXM2-32GB; Bus id: 0000:0b:00\nGPU id:4; name: Tesla V100-SXM2-32GB; Bus id: 0000:85:00\nGPU id:5; name: Tesla V100-SXM2-32GB; Bus id: 0000:86:00\nGPU id:6; name: Tesla V100-SXM2-32GB; Bus id: 0000:89:00\nGPU id:7; name: Tesla V100-SXM2-32GB; Bus id: 0000:8a:00\nselecting device 0\ntesting size: 65536\nrounded size: 65536\ngpu alloc fn: cuMemAlloc\ndevice ptr: 7fbb63a00000\nmap_d_ptr: 0x7fbb82ab0000\ninfo.va: 7fbb63a00000\ninfo.mapped_size: 65536\ninfo.page_size: 65536\ninfo.mapped: 1\ninfo.wc_mapping: 1\npage offset: 0\nuser-space pointer:0x7fbb82ab0000\nwriting test, size=65536 offset=0 num_iters=1000\nwrite BW: 9224.36MB/s\nreading test, size=65536 offset=0 num_iters=1000\nread BW: 521.262MB/s\nunmapping buffer\nunpinning buffer\nclosing gdrdrv\n```\n\n\n## Restrictions and known issues\n\nGDRCopy works with regular CUDA device memory only, as returned by cudaMalloc.\nIn particular, it does not work with CUDA managed memory.\n\n`gdr_pin_buffer()` accepts any address returned by cudaMalloc and its family.\nIn contrast, `gdr_map()` requires that the pinned address is aligned to the GPU page.\nNeither the CUDA Runtime nor the Driver API guarantees that GPU memory allocation\nfunctions return aligned addresses. Users are responsible for proper alignment\nof addresses passed to the library.\n\nTwo cudaMalloc'd memory regions may be contiguous. Users may call\n`gdr_pin_buffer` and `gdr_map` with an address and size that extend across these\ntwo regions. This use case is not well supported in GDRCopy. On rare occasions,\nusers may experience 1.) an error in `gdr_map`, or 2.) 
low copy performance\nbecause `gdr_map` cannot provide a write-combined mapping.\n\nIn some GPU driver versions, pinning the same GPU address multiple times\nconsumes additional BAR1 space because the space is not properly\nreused. If you encounter this issue, we suggest that you try the latest version\nof the NVIDIA GPU driver.\n\nOn POWER9, where the CPU and GPU are connected via NVLink, CUDA 9.2 and GPU Driver\nv396.37 are the minimum requirements to achieve full performance.\nGDRCopy works with earlier CUDA and GPU driver versions, but the achievable\nbandwidth is substantially lower.\n\nIf gdrdrv is compiled with the proprietary flavor of the NVIDIA driver, GDRCopy does not fully support Linux with the\nconfidential computing (CC) configuration on Intel CPUs. In particular, it does not function if\n`CONFIG_ARCH_HAS_CC_PLATFORM=y` and CC is enabled at runtime. However, it works if CC is disabled or\n`CONFIG_ARCH_HAS_CC_PLATFORM=n`. This issue does not apply to AMD CPUs. To avoid this issue, please compile and load\ngdrdrv with the opensource flavor of the NVIDIA driver.\n\nTo allow the loading of unsupported 3rd-party modules in SLE, set `allow_unsupported_modules 1` in\n/etc/modprobe.d/unsupported-modules. After making this change, modules missing the \"supported\" flag will be allowed to\nload.\n\n\n## Bug filing\n\nTo report issues with or suspected bugs in any NVIDIA software, we\nrecommend using the bug filing system,\nwhich is available to NVIDIA registered developers on the developer site.\n\nIf you are not a member, you can [sign\nup](https://developer.nvidia.com/accelerated-computing-developer).\n\nOnce a member, you can submit issues using [this\nform](https://developer.nvidia.com/nvbugs/cuda/add). 
Be sure to select\nGPUDirect in the \"Relevant Area\" field.\n\nYou can later track their progress using the __My Bugs__ link on the left of\nthis [view](https://developer.nvidia.com/user).\n\n## Acknowledgment\n\nIf you find this software useful in your work, please cite:\nR. Shi et al., \"Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters,\" 2014 21st International Conference on High Performance Computing (HiPC), Dona Paula, 2014, pp. 1-10, doi: 10.1109/HiPC.2014.7116873.\n","funding_links":[],"categories":["C++"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fgdrcopy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2Fgdrcopy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fgdrcopy/lists"}