{"id":13819073,"url":"https://github.com/rapidsai/rmm","last_synced_at":"2025-05-14T05:10:26.243Z","repository":{"id":37412665,"uuid":"160453496","full_name":"rapidsai/rmm","owner":"rapidsai","description":"RAPIDS Memory Manager","archived":false,"fork":false,"pushed_at":"2025-05-11T12:00:06.000Z","size":6137,"stargazers_count":578,"open_issues_count":56,"forks_count":218,"subscribers_count":29,"default_branch":"branch-25.06","last_synced_at":"2025-05-11T12:29:25.125Z","etag":null,"topics":["cuda","memory-allocation","memory-management","rapids"],"latest_commit_sha":null,"homepage":"https://docs.rapids.ai/api/rmm/stable/","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rapidsai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-12-05T03:13:32.000Z","updated_at":"2025-05-11T12:00:07.000Z","dependencies_parsed_at":"2023-09-22T05:45:19.764Z","dependency_job_id":"2c0a97b4-da36-42c9-ad18-da9971488766","html_url":"https://github.com/rapidsai/rmm","commit_stats":{"total_commits":2405,"total_committers":101,"mean_commits":23.81188118811881,"dds":0.7505197505197505,"last_synced_commit":"6c904f7b960c3b4944ea19281a6c7e0e16b55275"},"previous_names":[],"tags_count":95,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rapidsai%2Frmm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rapidsai%2Frmm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/r
apidsai%2Frmm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rapidsai%2Frmm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rapidsai","download_url":"https://codeload.github.com/rapidsai/rmm/tar.gz/refs/heads/branch-25.06","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253568660,"owners_count":21928906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","memory-allocation","memory-management","rapids"],"created_at":"2024-08-04T08:00:39.268Z","updated_at":"2025-05-14T05:10:26.222Z","avatar_url":"https://github.com/rapidsai.png","language":"C++","readme":"# \u003cdiv align=\"left\"\u003e\u003cimg src=\"img/rapids_logo.png\" width=\"90px\"/\u003e\u0026nbsp;RMM: RAPIDS Memory Manager\u003c/div\u003e\n\n**NOTE:** For the latest stable [README.md](https://github.com/rapidsai/rmm/blob/main/README.md) ensure you are on the `main` branch.\n\n## Resources\n\n- [RMM Reference Documentation](https://docs.rapids.ai/api/rmm/stable/): Python and C++ API references, tutorials, and topic guides.\n- [RAPIDS Installation Guide](https://docs.rapids.ai/install/): Instructions for installing RMM.\n- [GitHub Repository](https://github.com/rapidsai/rmm): Download the RMM source code.\n- [Issue Tracker](https://github.com/rapidsai/rmm/issues): Report issues or request features.\n- [RAPIDS Community](https://rapids.ai/learn-more/#get-involved): Get help, contribute, and collaborate.\n\n## Overview\n\nAchieving optimal performance in GPU-centric workflows frequently requires customizing how 
host and\ndevice memory are allocated. For example, using \"pinned\" host memory for asynchronous\nhost \u003c-\u003e device memory transfers, or using a device memory pool sub-allocator to reduce the cost of\ndynamic device memory allocation.\n\nThe goal of the RAPIDS Memory Manager (RMM) is to provide:\n- A common interface that allows customizing [device](#device_memory_resource) and\n  [host](#host_memory_resource) memory allocation\n- A collection of [implementations](#available-resources) of the interface\n- A collection of [data structures](#device-data-structures) that use the interface for memory allocation\n\nFor information on the interface RMM provides and how to use RMM in your C++ code, see\n[below](#using-rmm-in-c).\n\nFor a walkthrough about the design of the RAPIDS Memory Manager, read [Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) on the NVIDIA Developer Blog.\n\n## Installation\n\n### Conda\n\nRMM can be installed with conda. You can get a minimal conda installation with [miniforge](https://github.com/conda-forge/miniforge).\n\nInstall RMM with:\n\n```bash\nconda install -c rapidsai -c conda-forge -c nvidia rmm cuda-version=12.9\n```\n\nWe also provide [nightly conda packages](https://anaconda.org/rapidsai-nightly) built from the HEAD\nof our latest development branch.\n\nNote: RMM is supported only on Linux, and only tested with Python versions 3.10, 3.11, 3.12, and 3.13.\n\nNote: The RMM package from conda requires building with GCC 9 or later. Otherwise, your application may fail to build.\n\nSee the [RAPIDS Installation Guide](https://docs.rapids.ai/install/) for system requirements.\n\n## Building from Source\n\n### Get RMM Dependencies\n\nCompiler requirements:\n\n* `gcc`     version 9.3+\n* `nvcc`    version 11.4+\n* `cmake`   version 3.30.4+\n\nCUDA/GPU requirements:\n\n* CUDA 11.4+. 
You can obtain CUDA from\n  [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)\n\nGPU Support:\n* RMM is tested and supported only on Volta architecture and newer (Compute Capability 7.0+). It\n  may work on earlier architectures.\n\nPython requirements:\n* `rapids-build-backend` (available from PyPI or the `rapidsai` conda channel)\n* `scikit-build-core`\n* `cuda-python`\n* `cython`\n\nFor more details, see [pyproject.toml](python/rmm/pyproject.toml).\n\n\n### Script to build RMM from source\n\nTo install RMM from source, ensure the dependencies are met and follow the steps below:\n\n- Clone the repository\n```bash\n$ git clone https://github.com/rapidsai/rmm.git\n$ cd rmm\n```\n\n- Create the conda development environment `rmm_dev`\n```bash\n# create the conda environment (assuming in base `rmm` directory)\n$ conda env create --name rmm_dev --file conda/environments/all_cuda-129_arch-x86_64.yaml\n# activate the environment\n$ conda activate rmm_dev\n```\n\n- Build and install `librmm` using cmake \u0026 make. CMake depends on the `nvcc` executable being on\n  your path or defined in the `CUDACXX` environment variable.\n\n```bash\n\n$ mkdir build                                       # make a build directory\n$ cd build                                          # enter the build directory\n$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path     # configure cmake ... use $CONDA_PREFIX if you're using Anaconda\n$ make -j                                           # compile the library librmm.so ... '-j' will start a parallel job using the number of physical cores available on your system\n$ make install                                      # install the library librmm.so to '/install/path'\n```\n\n- Building and installing `librmm` and `rmm` using build.sh. build.sh creates a build directory at\n  the root of the git repository. 
build.sh depends on the `nvcc` executable being on your path or defined in\n  the `CUDACXX` environment variable.\n\n```bash\n\n$ ./build.sh -h                                     # Display help and exit\n$ ./build.sh -n librmm                              # Build librmm without installing\n$ ./build.sh -n rmm                                 # Build rmm without installing\n$ ./build.sh -n librmm rmm                          # Build librmm and rmm without installing\n$ ./build.sh librmm rmm                             # Build and install librmm and rmm\n```\n\n- To run tests (Optional):\n```bash\n$ cd build                                          # if you are not already in the build directory\n$ make test\n```\n\n- Build, install, and test the `rmm` Python package:\n```bash\n# In the root rmm directory\n$ python -m pip install -e ./python/rmm\n$ pytest -v\n```\n\nDone! You are ready to develop for the RMM OSS project.\n\n### Caching third-party dependencies\n\nRMM uses [CPM.cmake](https://github.com/cpm-cmake/CPM.cmake) to\nhandle third-party dependencies like spdlog, Thrust, GoogleTest,\nGoogleBenchmark. In general you won't have to worry about it. If CMake\nfinds an appropriate version on your system, it uses it (you can\nhelp it along by setting `CMAKE_PREFIX_PATH` to point to the\ninstalled location). Otherwise those dependencies will be downloaded as\npart of the build.\n\nIf you frequently start new builds from scratch, consider setting the\nenvironment variable `CPM_SOURCE_CACHE` to an external download\ndirectory to avoid repeated downloads of the third-party dependencies.\n\n## Using RMM in a downstream CMake project\n\nThe installed RMM library provides a set of config files that make it easy to\nintegrate RMM into your own CMake project. 
Add the following to `CMakeLists.txt`:\n\n```cmake\nfind_package(rmm [VERSION])\n# ...\ntarget_link_libraries(\u003cyour-target\u003e (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)\n```\n\nSince RMM is a header-only library, this does not actually link RMM,\nbut it makes the headers available and pulls in transitive dependencies.\nIf RMM is not installed in a default location, use\n`CMAKE_PREFIX_PATH` or `rmm_ROOT` to point to its location.\n\nOne of RMM's dependencies is the Thrust library, so the above\nautomatically pulls in `Thrust` by means of a dependency on the\n`rmm::Thrust` target. By default it uses the standard configuration of\nThrust. If you want to customize it, you can set the variables\n`THRUST_HOST_SYSTEM` and `THRUST_DEVICE_SYSTEM`; see\n[Thrust's CMake documentation](https://github.com/NVIDIA/cccl/blob/main/thrust/thrust/cmake/README.md).\n\n### Using CPM to manage RMM\n\nRMM uses [CPM.cmake](https://github.com/cpm-cmake/CPM.cmake) to manage\nits dependencies, including [CCCL](https://github.com/nvidia/cccl), and you can\nuse CPM for your project's dependency on RMM.\n\nThere is an issue with using CPM's *single-argument compact syntax* for\nRMM/CCCL as it transitively marks targets as `SYSTEM` dependencies.\nThis causes the CCCL headers pulled in through CPM to be of lower priority\nto the preprocessor than the (potentially outdated) CCCL headers provided\nby the CUDA SDK. 
To avoid this issue, use CPM's *multi-argument syntax*\ninstead:\n\n```cmake\nCPMAddPackage(NAME rmm [VERSION]\n              GITHUB_REPOSITORY rapidsai/rmm\n              SYSTEM Off)\n# ...\ntarget_link_libraries(\u003cyour-target\u003e (PRIVATE|PUBLIC|INTERFACE) rmm::rmm)\n```\n\n# Using RMM in C++\n\nThe first goal of RMM is to provide a common interface for device and host memory allocation.\nThis allows both _users_ and _implementers_ of custom allocation logic to program to a single\ninterface.\n\nTo this end, RMM defines two abstract interface classes:\n- [`rmm::mr::device_memory_resource`](#device_memory_resource) for device memory allocation\n- [`rmm::mr::host_memory_resource`](#host_memory_resource) for host memory allocation\n\nThese classes are based on the\n[`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) interface\nclass introduced in C++17 for polymorphic memory allocation.\n\n## `device_memory_resource`\n\n`rmm::mr::device_memory_resource` is the base class that defines the interface for allocating and\nfreeing device memory.\n\nIt has two key functions:\n\n1. `void* device_memory_resource::allocate(std::size_t bytes, cuda_stream_view s)`\n   - Returns a pointer to an allocation of at least `bytes` bytes.\n\n2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)`\n   - Reclaims a previous allocation of size `bytes` pointed to by `p`.\n   - `p` *must* have been returned by a previous call to `allocate(bytes)`, otherwise behavior is\n     undefined\n\nIt is up to a derived class to provide implementations of these functions. See\n[available resources](#available-resources) for example `device_memory_resource` derived classes.\n\nUnlike `std::pmr::memory_resource`, `rmm::mr::device_memory_resource` does not allow specifying an\nalignment argument. All allocations are required to be aligned to at least 256B. 
Furthermore,\n`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream\non which to perform the (de)allocation.\n\n## Stream-ordered Memory Allocation\n\n`rmm::mr::device_memory_resource` is a base class that provides stream-ordered memory allocation.\nThis allows optimizations such as re-using memory deallocated on the same stream without the\noverhead of synchronization.\n\nA call to `device_memory_resource::allocate(bytes, stream_a)` returns a pointer that is valid to use\non `stream_a`. Using the memory on a different stream (say `stream_b`) is Undefined Behavior unless\nthe two streams are first synchronized, for example by using `cudaStreamSynchronize(stream_a)` or by\nrecording a CUDA event on `stream_a` and then calling `cudaStreamWaitEvent(stream_b, event)`.\n\nThe stream specified to `device_memory_resource::deallocate` should be a stream on which it is valid\nto use the deallocated memory immediately for another allocation. Typically this is the stream\non which the allocation was *last* used before the call to `deallocate`. The passed stream may be\nused internally by a `device_memory_resource` for managing available memory with minimal\nsynchronization, and it may also be synchronized at a later time, for example using a call to\n`cudaStreamSynchronize()`.\n\nFor this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to\n`device_memory_resource::deallocate`. If the stream on which the allocation was last used has been\ndestroyed before calling `deallocate` or it is known that it will be destroyed, it is likely better\nto synchronize the stream (before destroying it) and then pass a different stream to `deallocate`\n(e.g. 
the default stream).\n\nNote that device memory data structures such as `rmm::device_buffer` and `rmm::device_uvector`\nfollow these stream-ordered memory allocation semantics and rules.\n\nFor further information about stream-ordered memory allocation semantics, read\n[Using the NVIDIA CUDA Stream-Ordered Memory\nAllocator](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/)\non the NVIDIA Developer Blog.\n\n## Available Device Resources\n\nRMM provides several `device_memory_resource` derived classes to satisfy various user requirements.\nFor more detailed information about these resources, see their respective documentation.\n\n### `cuda_memory_resource`\n\nAllocates and frees device memory using `cudaMalloc` and `cudaFree`.\n\n### `managed_memory_resource`\n\nAllocates and frees device memory using `cudaMallocManaged` and `cudaFree`.\n\nNote that `managed_memory_resource` cannot be used with NVIDIA Virtual GPU Software (vGPU, for use\nwith virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by\nNVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu).\n\n### `pool_memory_resource`\n\nA coalescing, best-fit pool sub-allocator.\n\n### `fixed_size_memory_resource`\n\nA memory resource that can only allocate a single fixed size. Average allocation and deallocation\ncost is constant.\n\n### `binning_memory_resource`\n\nConfigurable to use multiple upstream memory resources for allocations that fall within different\nbin sizes. Often configured with multiple bins backed by `fixed_size_memory_resource`s and a single\n`pool_memory_resource` for allocations larger than the largest bin size.\n\n## Default Resources and Per-device Resources\n\nRMM users commonly need to configure a `device_memory_resource` object to use for all allocations\nwhere another resource has not explicitly been provided. 
A common example is configuring a\n`pool_memory_resource` to use for all allocations to get fast dynamic allocation.\n\nTo enable this use case, RMM provides the concept of a \"default\" `device_memory_resource`. This\nresource is used when another is not explicitly provided.\n\nAccessing and modifying the default resource is done through two functions:\n- `device_memory_resource* get_current_device_resource()`\n   - Returns a pointer to the default resource for the current CUDA device.\n   - The initial default memory resource is an instance of `cuda_memory_resource`.\n   - This function is thread safe with respect to concurrent calls to it and\n     `set_current_device_resource()`.\n   - For more explicit control, you can use `get_per_device_resource()`, which takes a device ID.\n\n- `device_memory_resource* set_current_device_resource(device_memory_resource* new_mr)`\n   - Updates the default memory resource pointer for the current CUDA device to `new_mr`\n   - Returns the previous default resource pointer\n   - If `new_mr` is `nullptr`, then resets the default resource to `cuda_memory_resource`\n   - This function is thread safe with respect to concurrent calls to it and\n     `get_current_device_resource()`\n   - For more explicit control, you can use `set_per_device_resource()`, which takes a device ID.\n\n### Example\n\n```c++\nrmm::mr::cuda_memory_resource cuda_mr;\n// Construct a resource that uses a coalescing best-fit pool allocator\n// With the pool initially half of available device memory\nauto initial_size = rmm::percent_of_free_device_memory(50);\nrmm::mr::pool_memory_resource\u003crmm::mr::cuda_memory_resource\u003e pool_mr{\u0026cuda_mr, initial_size};\nrmm::mr::set_current_device_resource(\u0026pool_mr); // Updates the current device resource pointer to `pool_mr`\nrmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`\n```\n\n### Multiple Devices\n\nA `device_memory_resource` should only be used when 
the active CUDA device is the same device\nthat was active when the `device_memory_resource` was created. Otherwise behavior is undefined.\n\nIf a `device_memory_resource` is used with a stream associated with a different CUDA device than the\ndevice for which the memory resource was created, behavior is undefined.\n\nCreating a `device_memory_resource` for each device requires care to set the current device before\ncreating each resource, and to maintain the lifetime of the resources as long as they are set as\nper-device resources. Here is an example loop that creates `unique_ptr`s to `pool_memory_resource`\nobjects for each device and sets them as the per-device resource for that device.\n\n```c++\nusing pool_mr = rmm::mr::pool_memory_resource\u003crmm::mr::cuda_memory_resource\u003e;\nstd::vector\u003cstd::unique_ptr\u003cpool_mr\u003e\u003e per_device_pools;\nfor(int i = 0; i \u003c N; ++i) {\n  cudaSetDevice(i); // set device i before creating MR\n  // Use a vector of unique_ptr to maintain the lifetime of the MRs\n  // Note: for brevity, omitting creation of upstream and computing initial_size\n  per_device_pools.push_back(std::make_unique\u003cpool_mr\u003e(upstream, initial_size));\n  // Set the per-device resource for device i; pass the raw pointer, not the unique_ptr\n  set_per_device_resource(cuda_device_id{i}, per_device_pools.back().get());\n}\n```\n\nNote that the CUDA device that is current when creating a `device_memory_resource` must also be\ncurrent any time that `device_memory_resource` is used to deallocate memory, including in a\ndestructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store\n(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the\nconstructor is called, and then ensuring that the stored device is active whenever an allocation or\ndeallocation is performed (including in the destructor). 
The user must therefore only ensure that\nthe device active during _creation_ of an `rmm::device_buffer` matches the active device of the\nmemory resource being used.\n\nHere is an incorrect example that creates a memory resource on device `0` and then uses it to\nallocate a `device_buffer` on device `1`:\n\n```c++\n{\n  RMM_CUDA_TRY(cudaSetDevice(0));\n  auto mr = rmm::mr::cuda_memory_resource{};\n  {\n    RMM_CUDA_TRY(cudaSetDevice(1));\n    // Invalid, current device is 1, but MR is only valid for device 0\n    rmm::device_buffer buf(16, rmm::cuda_stream_default, \u0026mr);\n  }\n}\n```\n\nA correct example creates the device buffer with device `0` active. After that it is safe to switch\ndevices and let the buffer go out of scope and destruct with a different device active. For example,\nthis code is correct:\n\n```c++\n{\n  RMM_CUDA_TRY(cudaSetDevice(0));\n  auto mr = rmm::mr::cuda_memory_resource{};\n  rmm::device_buffer buf(16, rmm::cuda_stream_default, \u0026mr);\n  RMM_CUDA_TRY(cudaSetDevice(1));\n  ...\n  // No need to switch back to device 0 before ~buf runs\n}\n```\n\n#### Use of `rmm::device_vector` with multiple devices\n\n`rmm::device_vector` uses an `rmm::mr::thrust_allocator` to enable `thrust::device_vector` to\nallocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory\nresource apply: the active device must match the device that was active at resource construction time. 
To\nfacilitate use in an RAII setting, `rmm::mr::thrust_allocator` records the active device at\nconstruction time and ensures that device is active whenever it allocates or deallocates memory.\nUsage of `rmm::device_vector` with multiple devices is therefore the same as `rmm::device_buffer`.\nOne must _create_ `device_vector`s with the correct device active, but it is safe to destroy them\nwith a different active device.\n\nFor example, recapitulating the previous example using `rmm::device_vector`:\n\n```c++\n{\n  RMM_CUDA_TRY(cudaSetDevice(0));\n  auto mr = rmm::mr::cuda_memory_resource{};\n  rmm::device_vector\u003cint\u003e vec(16, rmm::mr::thrust_allocator\u003cint\u003e(rmm::cuda_stream_default, \u0026mr));\n  RMM_CUDA_TRY(cudaSetDevice(1));\n  ...\n  // No need to switch back to device 0 before ~vec runs\n}\n```\n\n\u003e [!NOTE]\n\u003e Although allocation and deallocation in the `thrust_allocator` run with the correct active device,\n\u003e modification of `rmm::device_vector` might necessitate a kernel launch, and this must run with the\n\u003e correct device active. For example, `.resize()` might both allocate _and_ launch a kernel to\n\u003e initialize new elements: the user must arrange for this kernel launch to occur with the correct\n\u003e device for the memory resource active.\n\n## `cuda_stream_view` and `cuda_stream`\n\n`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's\npurpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,\nwhich can lead to ambiguity in APIs when it is assigned `0`.)  All RMM stream-ordered APIs take a\n`rmm::cuda_stream_view` argument.\n\n`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides\nRAII semantics (constructor creates the CUDA stream, destructor destroys it). 
An `rmm::cuda_stream`\ncan never represent the CUDA default stream or per-thread default stream; it only ever represents\na single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.\n\n## `cuda_stream_pool`\n\n`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to\ncreate a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the\nstream pool can be faster than creating the streams on the fly. The size of the pool is configurable.\nDepending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of\n`rmm::cuda_stream_view` that represent identical CUDA streams.\n\n## Thread Safety\n\nAll current device memory resources are thread safe unless documented otherwise. More specifically,\ncalls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to\neither of these functions from other threads. They are _not_ thread safe with respect to\nconstruction and destruction of the memory resource object.\n\nNote that a class `thread_safe_resource_adaptor` is provided which can be used to adapt a memory\nresource that is not thread safe to be thread safe (as described above). 
This adapter is not needed\nwith any current RMM device memory resources.\n\n## Allocators\n\nC++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object.\nRMM provides several `Allocator` and `Allocator`-like classes.\n\n### `polymorphic_allocator`\n\nA [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator).\nUnlike the standard C++ `Allocator` interface, the `allocate` and `deallocate` functions take a `cuda_stream_view` indicating the stream on which the (de)allocation occurs.\n\n### `stream_allocator_adaptor`\n\n`stream_allocator_adaptor` can be used to adapt a stream-ordered allocator to present a standard `Allocator` interface to consumers that may not be designed to work with a stream-ordered interface.\n\nExample:\n```c++\nrmm::cuda_stream stream;\nrmm::mr::polymorphic_allocator\u003cint\u003e stream_alloc;\n\n// Constructs an adaptor that forwards all (de)allocations to `stream_alloc` on `stream`.\nauto adapted = rmm::mr::stream_allocator_adaptor(stream_alloc, stream);\n\n// Allocates 100 bytes using `stream_alloc` on `stream`\nauto p = adapted.allocate(100);\n...\n// Deallocates using `stream_alloc` on `stream`\nadapted.deallocate(p,100);\n```\n\n### `thrust_allocator`\n\n`thrust_allocator` is a device memory allocator that uses the strongly typed `thrust::device_ptr`, making it usable with containers like `thrust::device_vector`.\n\nSee [below](#using-rmm-with-thrust) for more information on using RMM with Thrust.\n\n## Device Data Structures\n\n### `device_buffer`\n\nAn untyped, uninitialized RAII class for stream ordered device memory allocation.\n\n#### Example\n\n```c++\ncuda_stream_view s{...};\n// Allocates at least 100 bytes on stream `s` using the *default* resource\nrmm::device_buffer b{100,s};\nvoid* p = b.data();                   // Raw, 
untyped pointer to underlying device memory\n\nkernel\u003c\u003c\u003c..., s.value()\u003e\u003e\u003e(b.data()); // `b` is only safe to use on `s`\n\nrmm::mr::device_memory_resource * mr = new my_custom_resource{...};\n// Allocates at least 100 bytes on stream `s` using the resource `mr`\nrmm::device_buffer b2{100, s, mr};\n```\n\n### `device_uvector\u003cT\u003e`\nA typed, uninitialized RAII class for allocation of a contiguous set of elements in device memory.\nSimilar to a `thrust::device_vector`, but as an optimization, does not default initialize the\ncontained elements. This optimization restricts the types `T` to trivially copyable types.\n\n#### Example\n\n```c++\ncuda_stream_view s{...};\n// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the\n// default resource\nrmm::device_uvector\u003cint32_t\u003e v(100, s);\n// Initializes the elements to 0\nthrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});\n\nrmm::mr::device_memory_resource * mr = new my_custom_resource{...};\n// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`\nrmm::device_uvector\u003cint32_t\u003e v2{100, s, mr};\n```\n\n### `device_scalar`\nA typed, RAII class for allocation of a single element in device memory.\nThis is similar to a `device_uvector` with a single element, but provides convenience functions like\nmodifying the value in device memory from the host, or retrieving the value from device to host.\n\n#### Example\n```c++\ncuda_stream_view s{...};\n// Allocates uninitialized storage for a single `int32_t` in device memory\nrmm::device_scalar\u003cint32_t\u003e a{s};\na.set_value(42, s); // Updates the value in device memory to `42` on stream `s`\n\nkernel\u003c\u003c\u003c...,s.value()\u003e\u003e\u003e(a.data()); // Pass raw pointer to underlying element in device memory\n\nint32_t v = a.value(s); // Retrieves the value from device to host on stream 
`s`\n```\n\n## `host_memory_resource`\n\n`rmm::mr::host_memory_resource` is the base class that defines the interface for allocating and\nfreeing host memory.\n\nSimilar to `device_memory_resource`, it has two key functions for (de)allocation:\n\n1. `void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)`\n   - Returns a pointer to an allocation of at least `bytes` bytes aligned to the specified\n     `alignment`\n\n2. `void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)`\n   - Reclaims a previous allocation of size `bytes` pointed to by `p`.\n\n\nUnlike `device_memory_resource`, the `host_memory_resource` interface and behavior are identical to\n`std::pmr::memory_resource`.\n\n## Available Host Resources\n\n### `new_delete_resource`\n\nUses the global `operator new` and `operator delete` to allocate host memory.\n\n### `pinned_memory_resource`\n\nAllocates \"pinned\" host memory using `cuda(Malloc/Free)Host`.\n\n## Host Data Structures\n\nRMM does not currently provide any data structures that interface with `host_memory_resource`.\nIn the future, RMM will provide a host-side structure similar to `device_buffer` and an allocator\nthat can be used with STL containers.\n\n## Using RMM with Thrust\n\nRAPIDS and other CUDA libraries make heavy use of Thrust. Thrust uses CUDA device memory in two\nsituations:\n\n 1. As the backing store for `thrust::device_vector`, and\n 2. 
As temporary storage inside some algorithms, such as `thrust::sort`.\n\nRMM provides `rmm::mr::thrust_allocator` as a conforming Thrust allocator that uses\n`device_memory_resource`s.\n\n### Thrust Algorithms\n\nTo instruct a Thrust algorithm to use `rmm::mr::thrust_allocator` to allocate temporary storage, you\ncan use the custom Thrust CUDA device execution policy: `rmm::exec_policy(stream)`.\n\n```c++\nthrust::sort(rmm::exec_policy(stream), ...);\n```\n\nThe `stream` argument is both the stream that `rmm::mr::thrust_allocator` uses for temporary\nallocations and the stream on which the Thrust algorithm executes.\n\n## Logging\n\nRMM includes two forms of logging: memory event logging and debug logging.\n\n### Memory Event Logging and `logging_resource_adaptor`\n\nMemory event logging writes details of every allocation or deallocation to a CSV (comma-separated\nvalues) file. In C++, memory event logging is enabled by using the `logging_resource_adaptor` as a\nwrapper around any other `device_memory_resource` object.\n\nEach row in the log represents either an allocation or a deallocation. The columns of the file are\n\"Thread, Time, Action, Pointer, Size, Stream\".\n\nThe CSV output files of the `logging_resource_adaptor` can be used as input to `REPLAY_BENCHMARK`,\nwhich is available when building RMM from source, in the `gbenchmarks` folder in the build directory.\nThis log replayer can be useful for profiling and debugging allocator issues.\n\nThe following C++ example creates a logging version of a `cuda_memory_resource` that outputs the log\nto the file \"logs/test1.csv\".\n\n```c++\nstd::string filename{\"logs/test1.csv\"};\nrmm::mr::cuda_memory_resource upstream;\nrmm::mr::logging_resource_adaptor\u003crmm::mr::cuda_memory_resource\u003e log_mr{\u0026upstream, filename};\n```\n\nIf a file name is not specified, the environment variable `RMM_LOG_FILE` is queried for the file\nname. 
If `RMM_LOG_FILE` is not set, then an exception is thrown by the `logging_resource_adaptor`
constructor.

In Python, memory event logging is enabled when the `logging` parameter of `rmm.reinitialize()` is
set to `True`. The log file name can be set using the `log_file_name` parameter. See
`help(rmm.reinitialize)` for full details.

### Debug Logging

RMM leverages [`rapids-logger`](https://github.com/rapidsai/rapids-logger) to log trace and debug
information to a file. This information can show when errors occur, when additional memory is
allocated from upstream resources, etc. By default, output is logged to stderr, but the environment
variable `RMM_DEBUG_LOG_FILE` can be set to specify a path and file name to write the logs to
instead.

There is a CMake configuration variable `RMM_LOGGING_LEVEL`, which can be set to enable compilation
of more detailed logging. The default is `INFO`. Available levels are `TRACE`, `DEBUG`, `INFO`,
`WARN`, `ERROR`, `CRITICAL` and `OFF`.

Note that to see logging below the `INFO` level, the application must also set the logging level at
run time. C++ applications must call `rmm::default_logger().set_level()`; for example, to enable all
levels of logging down to `TRACE`, call `rmm::default_logger().set_level(rapids_logger::level_enum::trace)` (and compile
librmm with `-DRMM_LOGGING_LEVEL=TRACE`). Python applications must call `rmm.set_logging_level()`;
for example, to enable all levels of logging down to `TRACE`, call `rmm.set_logging_level("trace")`
(and compile the RMM Python module with `-DRMM_LOGGING_LEVEL=TRACE`).

Note that debug logging is different from the CSV memory allocation logging provided by
`rmm::mr::logging_resource_adaptor`. 
The latter is for logging a history of allocation /
deallocation actions which can be useful for replay with RMM's replay benchmark.

## RMM and CUDA Memory Bounds Checking

Memory allocations taken from a memory resource that allocates a pool of memory (such as
`pool_memory_resource` and `arena_memory_resource`) are part of the same low-level CUDA memory
allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to
be detected by CUDA tools such as
[CUDA Compute Sanitizer](https://docs.nvidia.com/cuda/compute-sanitizer/index.html) memcheck.

Exceptions to this are `cuda_memory_resource`, which wraps `cudaMalloc`, and
`cuda_async_memory_resource`, which uses `cudaMallocAsync` with CUDA's built-in memory pool
functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these
resources are detectable with Compute Sanitizer Memcheck.

It may be possible in the future to add support for memory bounds checking with other memory
resources using NVTX APIs.

# Using RMM in Python

There are two ways to use RMM in Python code:

1. Using the `rmm.DeviceBuffer` API to explicitly create and manage
   device memory allocations
2. Transparently via external libraries such as CuPy and Numba

RMM provides a `MemoryResource` abstraction to control _how_ device
memory is allocated in both of the above cases.

## DeviceBuffer

A DeviceBuffer represents an **untyped, uninitialized device memory
allocation**.  
DeviceBuffers can be created by providing the
size of the allocation in bytes:

```python
>>> import rmm
>>> buf = rmm.DeviceBuffer(size=100)
```

The size of the allocation and the memory address associated with it
can be accessed via the `.size` and `.ptr` attributes respectively:

```python
>>> buf.size
100
>>> buf.ptr
140202544726016
```

DeviceBuffers can also be created by copying data from host memory:

```python
>>> import rmm
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='float64')
>>> buf = rmm.DeviceBuffer.to_device(a.tobytes())
>>> buf.size
24
```

Conversely, the data underlying a DeviceBuffer can be copied to the
host:

```python
>>> np.frombuffer(buf.tobytes())
array([1., 2., 3.])
```

## MemoryResource objects

`MemoryResource` objects are used to configure how device memory allocations are made by
RMM.

By default, if a `MemoryResource` is not set explicitly, RMM uses the `CudaMemoryResource`, which
uses `cudaMalloc` for allocating device memory.

`rmm.reinitialize()` provides an easy way to initialize RMM with specific memory resource options
across multiple devices. See `help(rmm.reinitialize)` for full details.

For lower-level control, the `rmm.mr.set_current_device_resource()` function can be
used to set a different MemoryResource for the current CUDA device.  For
example, enabling the `ManagedMemoryResource` tells RMM to use
`cudaMallocManaged` instead of `cudaMalloc` for allocating memory:

```python
>>> import rmm
>>> rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())
```

> :warning: The default resource must be set for any device **before**
> allocating any device memory on that device.  
Setting or changing the
> resource after device allocations have been made can lead to unexpected
> behaviour or crashes. See [Multiple Devices](#multiple-devices).

As another example, `PoolMemoryResource` allows you to allocate a
large "pool" of device memory up-front. Subsequent allocations will
draw from this pool of already allocated memory.  The example
below shows how to construct a PoolMemoryResource with an initial size
of 1 GiB and a maximum size of 4 GiB. The pool uses
`CudaMemoryResource` as its underlying ("upstream") memory resource:

```python
>>> import rmm
>>> pool = rmm.mr.PoolMemoryResource(
...     rmm.mr.CudaMemoryResource(),
...     initial_pool_size="1GiB", # equivalent to initial_pool_size=2**30
...     maximum_pool_size="4GiB"
... )
>>> rmm.mr.set_current_device_resource(pool)
```

Other MemoryResources include:

* `FixedSizeMemoryResource` for allocating fixed blocks of memory
* `BinningMemoryResource` for allocating blocks within specified "bin" sizes from different memory
  resources

MemoryResources are highly configurable and can be composed together in different ways.
See `help(rmm.mr)` for more information.

## Using RMM with third-party libraries

### Using RMM with CuPy

You can configure [CuPy](https://cupy.dev/) to use RMM for memory
allocations by setting the CuPy CUDA allocator to
`rmm_cupy_allocator`:

```python
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> import cupy
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)
```

**Note:** This only configures CuPy to use the current RMM resource for allocations.
It does not initialize nor change the current resource, e.g., enabling a memory pool.
See [here](#memoryresource-objects) for more information on changing the current memory resource.

### Using RMM with Numba

You can configure Numba to use RMM for memory 
allocations using the
Numba [EMM Plugin](https://numba.readthedocs.io/en/stable/cuda/external-memory.html#setting-emm-plugin).

This can be done in two ways:

1. Setting the environment variable `NUMBA_CUDA_MEMORY_MANAGER`:

  ```bash
  $ NUMBA_CUDA_MEMORY_MANAGER=rmm.allocators.numba python (args)
  ```

2. Using the `set_memory_manager()` function provided by Numba:

  ```python
  >>> from numba import cuda
  >>> from rmm.allocators.numba import RMMNumbaManager
  >>> cuda.set_memory_manager(RMMNumbaManager)
  ```

**Note:** This only configures Numba to use the current RMM resource for allocations.
It does not initialize nor change the current resource, e.g., enabling a memory pool.
See [here](#memoryresource-objects) for more information on changing the current memory resource.

### Using RMM with PyTorch

[PyTorch](https://pytorch.org/docs/stable/notes/cuda.html) can use RMM
for memory allocation.  For example, to configure PyTorch to use an
RMM-managed pool:

```python
import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```

PyTorch and RMM will now share the same memory pool.

You can, of course, use a custom memory resource with PyTorch as well:

```python
import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

# note that you can configure PyTorch to use RMM either before or
# after changing RMM's memory resource.  
PyTorch will use whatever
# memory resource is configured to be the "current" memory resource at
# the time of allocation.
torch.cuda.change_current_allocator(rmm_torch_allocator)

# configure RMM to use a managed memory resource, wrapped with a
# statistics resource adaptor that can report information about the
# amount of memory allocated:
mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)

x = torch.tensor([1, 2]).cuda()

# the memory resource reports information about PyTorch allocations:
print(mr.allocation_counts)
# {'current_bytes': 16,
#  'current_count': 1,
#  'peak_bytes': 16,
#  'peak_count': 1,
#  'total_bytes': 16,
#  'total_count': 1}
```

## Taking ownership of C++ objects from Python

When interacting with a C++ library that uses RMM from Python, one
must be careful when taking ownership of `rmm::device_buffer` objects
on the Python side. The `rmm::device_buffer` does not contain an
owning reference to the memory resource used for its allocation (only
a `device_async_resource_ref`), and the allocating user is expected to
keep this memory resource alive for at least the lifetime of the
buffer. When taking ownership of such a buffer in Python, we have no
way (in the general case) of ensuring that the memory resource will
outlive the buffer we are now holding.

To avoid any issues, we need two things:

1. The C++ library we are interfacing with should accept a memory
   resource that is used for allocations that are returned to the
   user.
2. When calling into the library from Python, we should provide a
   memory resource whose lifetime we control. 
This memory resource
   should then be provided when we take ownership of any allocated
   `rmm::device_buffer`s.

For example, suppose we have a C++ function that allocates
`device_buffer`s, which has a utility overload that defaults the
memory resource to the current device resource:

```c++
std::unique_ptr<rmm::device_buffer> allocate(
  std::size_t size,
  rmm::mr::device_async_resource_ref mr = get_current_device_resource())
{
    return std::make_unique<rmm::device_buffer>(size, rmm::cuda_stream_default, mr);
}
```

The Python `DeviceBuffer` class has a convenience Cython function,
`c_from_unique_ptr`, to construct a `DeviceBuffer` from a
`unique_ptr<rmm::device_buffer>`, taking ownership of it. To do this
safely, we must ensure that the allocation that was done on the C++
side uses a memory resource we control. So:

```cython
# Bad, doesn't control lifetime
buffer_bad = DeviceBuffer.c_from_unique_ptr(allocate(10))

# Good, allocation happens with a memory resource we control
# mr is a DeviceMemoryResource
buffer_good = DeviceBuffer.c_from_unique_ptr(
    allocate(10, mr.get_mr()),
    mr=mr,
)
```

Note two differences between the bad and good cases:

1. In the good case we pass the memory resource to the allocation
   function.
2. In the good case, we pass _the same_ memory resource to the
   `DeviceBuffer` constructor so that its lifetime is tied to the
   lifetime of the buffer.

### Potential pitfalls of relying on `get_current_device_resource`

Functions in both the C++ and Python APIs that perform allocation
typically default the memory resource argument to the value of
`get_current_device_resource`. This is to simplify the interface for
callers. 
When using a C++ library from Python, this defaulting is
safe, _as long as_ it is only the Python process that ever calls
`set_current_device_resource`.

This is because the current device resource on the C++ side has a
lifetime which is expected to be managed by the user. The resources
set by `rmm::mr::set_current_device_resource` are stored in a static
`std::map` whose keys are device ids and values are raw pointers to
the memory resources. Consequently,
`rmm::mr::get_current_device_resource` returns an object with no
lifetime provenance. This is, for the reasons discussed above, not
usable from Python. To handle this on the Python side, the
Python-level `set_current_device_resource` sets the C++ resource _and_
stores the Python object in a static global dictionary. The Python
`get_current_device_resource` then _does not use_
`rmm::mr::get_current_device_resource` and instead looks up the
current device resource in this global dictionary.

Hence, if the C++ library we are interfacing with calls
`rmm::mr::set_current_device_resource`, the C++ and Python sides of
the program can disagree on what `get_current_device_resource`
returns. The only safe thing to do if using the simplified interfaces
is therefore to ensure that `set_current_device_resource` is only ever
called on the Python side.
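The ownership rule running through this section can be illustrated with a minimal, library-free Python sketch. `OwnedResource` and `OwnedBuffer` below are hypothetical stand-ins (they are not part of the RMM API) that model only the invariant in question: whoever owns a buffer must also hold a reference to the memory resource that allocated it, so the resource cannot be destroyed first:

```python
# Minimal, library-free sketch of the ownership rule described above.
# OwnedResource / OwnedBuffer are hypothetical stand-ins, NOT RMM classes.

class OwnedResource:
    """Stands in for a memory resource; must outlive every buffer it allocated."""
    def __init__(self):
        self.destroyed = False

    def deallocate(self, ptr):
        # Deallocating through a destroyed resource would be a use-after-free
        # in the real C++ code; here we just check the invariant.
        assert not self.destroyed, "resource destroyed before its buffer"

class OwnedBuffer:
    """Stands in for a buffer that ties the resource's lifetime to its own."""
    def __init__(self, size, mr):
        self.size = size
        self.mr = mr         # owning reference keeps the resource alive
        self.ptr = object()  # pretend allocation

    def free(self):
        self.mr.deallocate(self.ptr)
        self.ptr = None

mr = OwnedResource()
buf = OwnedBuffer(10, mr)  # the "good" pattern: the buffer holds mr
del mr                     # the caller may drop its reference...
buf.free()                 # ...but buf.mr still keeps the resource alive
```

This mirrors passing `mr=mr` to `DeviceBuffer.c_from_unique_ptr` in the Cython example above: the buffer's owning reference, not the caller's, is what guarantees the resource outlives the allocation.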