{"id":38524884,"url":"https://github.com/yosh-matsuda/gpu-ptr","last_synced_at":"2026-01-17T06:45:40.795Z","repository":{"id":231977506,"uuid":"778811411","full_name":"yosh-matsuda/gpu-ptr","owner":"yosh-matsuda","description":"Cross-platform GPU smart pointer with C++20 range support","archived":false,"fork":false,"pushed_at":"2026-01-11T15:14:20.000Z","size":72,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-11T18:39:32.095Z","etag":null,"topics":["cpp","cpp20","cuda","gpu","header-only","hip"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yosh-matsuda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-03-28T13:03:07.000Z","updated_at":"2026-01-11T15:14:24.000Z","dependencies_parsed_at":"2024-05-03T06:42:55.660Z","dependency_job_id":"d8ed9d69-d25f-4494-a82d-7a3ff28ea13f","html_url":"https://github.com/yosh-matsuda/gpu-ptr","commit_stats":null,"previous_names":["yosh-matsuda/gpu-ptr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/yosh-matsuda/gpu-ptr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-ptr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-ptr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-ptr/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-ptr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yosh-matsuda","download_url":"https://codeload.github.com/yosh-matsuda/gpu-ptr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-ptr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28502867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T04:31:57.058Z","status":"ssl_error","status_checked_at":"2026-01-17T04:31:45.816Z","response_time":85,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cpp20","cuda","gpu","header-only","hip"],"created_at":"2026-01-17T06:45:40.725Z","updated_at":"2026-01-17T06:45:40.783Z","avatar_url":"https://github.com/yosh-matsuda.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gpu-ptr: Make GPU programming more modern C++ friendly\n\ngpu-ptr is a header-only C++20 library that brings RAII and Range-based abstractions to GPU memory management and data layouts, enabling code safety and performance optimizations with zero overhead. 
By abstracting away raw pointers and memory layouts, gpu-ptr allows developers to focus on algorithm logic rather than resource bookkeeping.\n\nOptimized GPU performance with modern C++ syntax.\n\n## Features\n\n*   Smart pointer-like wrappers:\n    *   Full RAII (Resource Acquisition Is Initialization) support for GPU memory management, ensuring automatic cleanup.\n*   Performance-Oriented Memory Layouts:\n    *   AoS to SoA Conversion: Convert Array-of-Structures (AoS) to Structure-of-Arrays (SoA) to enable coalesced memory access for maximum GPU throughput. AoS stores data as contiguous structures, while SoA separates each field into its own array for better memory access patterns.\n    *   Jagged Array Wrappers: Manage multi-dimensional data with varying row lengths using a single, efficient 1-D memory allocation, which improves cache hit rates.\n*   C++20 Integration:\n    *   Compatible with modern standards, including ranges and iterator concepts, even in GPU kernel code.\n    *   Range adapters for grid-stride access patterns (e.g., block-thread, grid-thread, grid-block).\n*   Dual backend:\n    *   Support for NVIDIA CUDA and AMD HIP.\n*   Header-only library with no external dependencies.\n\n### Requirements\n\n*   CUDA 12.0 or later / HIP 6.2 or later\n*   C++20 compatible compiler (e.g., GCC 12+, Clang 14+)\n\n## Quick Start\n\n### Installation\n\nBecause gpu-ptr is a header-only library, you can simply copy the `include` directory into your project. 
If you are using CMake, you can add the following lines to your `CMakeLists.txt`:\n\n```cmake\ntarget_link_libraries(your_target PRIVATE gpu_ptr::gpu_ptr)\n```\n\nAlternatively, you can use CMake's `FetchContent` module instead of manually downloading the library:\n\n```cmake\ninclude(FetchContent)\nFetchContent_Declare(\n    gpu_ptr\n    GIT_REPOSITORY https://github.com/yosh-matsuda/gpu-ptr.git\n    GIT_TAG v0.3.0\n)\nFetchContent_MakeAvailable(gpu_ptr)\ntarget_link_libraries(your_target PRIVATE gpu_ptr::gpu_ptr)\n```\n\n### Example: Device memory management with smart pointers\n\ngpu-ptr provides several smart pointer-like classes to manage GPU memory, including `array` and `managed_array` for arrays with range concepts, and `value` and `managed_value` for single value pointers.  \nThese classes automatically handle memory allocation and deallocation on the GPU. The `managed_` variants use unified memory, allowing seamless access from both host and device.\n\n```cpp\n#include \u003ccooperative_groups.h\u003e\n#include \u003cgpu_ptr.hpp\u003e\n#include \u003ciostream\u003e\n\nusing namespace gpu_ptr;\n\n// Example kernel: initialize all elements\ntemplate \u003cstd::ranges::input_range T\u003e\n__global__ void kernel(T array)\n{\n    for (auto\u0026 v : array | views::grid_thread_stride)\n        v += 1;\n}\n\nvoid example()\n{\n    // Allocate managed (or unmanaged) memory for 1024 integers\n    auto array = managed_array\u003cint\u003e(1024);\n\n    // Launch kernel to set values\n    kernel\u003c\u003c\u003c1, 128\u003e\u003e\u003e(array);\n\n    // Wrapper for cudaDeviceSynchronize/hipDeviceSynchronize\n    api::gpuDeviceSynchronize();\n\n    // Print results\n    for (const auto\u0026 v: array) std::cout \u003c\u003c v \u003c\u003c \" \";\n}\n```\n\n### Example: Conversion from host to device memory and vice versa\n\nArrays and values classes can be easily converted from and to C++ containers (e.g., `std::vector`, `std::array`). 
The data is copied from host to device during construction.\n\n```cpp\n#include \u003cgpu_ptr.hpp\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_ptr;\n\nvoid example()\n{\n    // Create vector on host\n    auto vec = std::vector\u003cint\u003e(100);\n    for (auto i = 0; auto\u0026 v: vec) v = i++;\n\n    // Convert from host vector to device array\n    auto array = managed_array(vec);\n\n    // Call kernel to perform operations on GPU\n    // ...\n\n    // Convert from device array to host vector\n    vec = array.to\u003cstd::vector\u003e();\n}\n```\n\n### Example: Grid-stride range adapters\n\nThe kernel code can utilize C++20 range views for grid-stride access patterns (so-called grid-stride loop). gpu-ptr provides several [Range Adapter Closure Object](https://en.cppreference.com/w/cpp/named_req/RangeAdaptorClosureObject.html) such as `views::block_thread_stride`, `views::grid_thread_stride`, and `views::grid_block_stride` to facilitate this without any overhead. The following example demonstrates how to achieve memory coalescing when initializing nested arrays using grid-stride access.\n\n```cpp\n#include \u003cgpu_ptr.hpp\u003e\n#include \u003ciostream\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_ptr;\n\n// Example kernel: initialize nested array\ntemplate \u003cstd::ranges::input_range T\u003e\nrequires std::ranges::input_range\u003cstd::ranges::range_value_t\u003cT\u003e\u003e\n__global__ void kernel_example(T array)\n{\n    for (auto\u0026 a : array | views::grid_block_stride)\n    {\n        for (auto\u0026 v : a | views::block_thread_stride)\n        {\n            v *= 2;\n        }\n    }\n\n    /* The above code is equivalent to the following:\n    using namespace cooperative_groups;\n    const auto grid = this_grid();\n    const auto block = this_thread_block();\n    for (auto i = grid.block_rank(); i \u003c array.size(); i += grid.num_blocks())\n    {\n        for (auto j = block.thread_rank(); j \u003c array[i].size(); j += 
block.size())\n        {\n            array[i][j] *= 2;\n        }\n    }\n    */\n}\n\nvoid example()\n{\n    // Create nested vector on host\n    auto vec_vec = std::vector(256, std::vector\u003cint\u003e(1024, 1));\n\n    // Convert from nested host vector to nested device array\n    auto nested_array = managed_array(vec_vec);\n\n    // Launch kernel to initialize nested array\n    kernel_example\u003c\u003c\u003c32, 256\u003e\u003e\u003e(nested_array);\n    api::gpuDeviceSynchronize();\n\n    // Print results\n    for (const auto\u0026 inner_array : nested_array)\n    {\n        for (const auto\u0026 v : inner_array) std::cout \u003c\u003c v \u003c\u003c \" \";\n        std::cout \u003c\u003c std::endl;\n    }\n}\n```\n\n### Example: AoS and SoA\n\ngpu-ptr supports both Array of structures (AoS) and Structure of arrays (SoA) for memory layout optimization via `array` and `structure_of_arrays` classes, respectively. In either case, gpu-ptr provides a structure retrieval interface via array indices. Thus, `structure_of_arrays\u003ctuple-derived\u003e` can be used as a drop-in replacement for `array\u003ctuple-derived\u003e` with optimizing memory layout, enabling seamless memory layout optimization without altering your algorithm's implementation.\n\nThe memory layout comparison between `array` (AoS) and `structure_of_arrays` (SoA) is as follows:\n\n![array of structure vs. structure_of_arrays](https://github.com/user-attachments/assets/219085eb-80c7-44e5-9e3b-6607bd8174bf)\n\n```cpp\n#include \u003cgpu_ptr.hpp\u003e\n#include \u003ctuple\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_ptr;\n\n// std::tuple or its derived struct can be used as structure type\n// The below example shows a tuple-derived struct with three members and their accessors\ntemplate \u003ctypename... 
Ts\u003e\nrequires (sizeof...(Ts) == 3)\nstruct CustomTuple : public std::tuple\u003cTs...\u003e\n{\n    using std::tuple\u003cTs...\u003e::tuple;\n    __host__ __device__ auto\u0026 get_a() { return std::get\u003c0\u003e(*this); }\n    __host__ __device__ auto\u0026 get_b() { return std::get\u003c1\u003e(*this); }\n    __host__ __device__ auto\u0026 get_c() { return std::get\u003c2\u003e(*this); }\n};\nusing Struct = CustomTuple\u003cint, float, double\u003e;\n\n// Example kernel: process both AoS and SoA\ntemplate \u003cstd::ranges::input_range T\u003e\n__global__ void kernel_example(T array)\n{\n    for (auto\u0026\u0026 v : array | views::grid_thread_stride)\n    {\n        // Access structure members for both AoS and SoA\n        v.get_a() *= 2;\n        v.get_b() *= 2.0f;\n        v.get_c() *= 2.0;\n    }\n}\n\nvoid example()\n{\n    // Create vector of structures\n    auto vec = std::vector\u003cStruct\u003e(100, {1, 2.0f, 3.0});\n\n    // Array of structures (AoS): single array for entire structure\n    auto aos = managed_array\u003cStruct\u003e(vec);\n    kernel_example\u003c\u003c\u003c1, 32\u003e\u003e\u003e(aos);\n\n    // Structure of arrays (SoA): multiple arrays for each member internally\n    auto soa = managed_structure_of_arrays\u003cStruct\u003e(vec);\n    kernel_example\u003c\u003c\u003c1, 32\u003e\u003e\u003e(soa);\n}\n```\n\n\u003e [!TIP]\n\u003e Which is better, AoS or SoA? It depends on the access pattern to the structure members within a warp, the size of the structure, and the number of available registers. If all threads in a warp access the same member of the structure, SoA is generally better for maximizing coalesced memory access. However, if each thread accesses entire structures or different members, AoS may be more efficient due to better cache locality. Benchmarking both layouts with representative workloads is recommended to determine the optimal choice for your specific use case. 
Either way, **gpu-ptr makes it easy to switch between AoS and SoA without changing the access interface**.\n\n### Example: Jagged array\n\ngpu-ptr provides the `jagged_array` class to manage multi-dimensional arrays with varying row lengths, using a **single memory allocation to maximize coalesced access**.\nThis behaves like a wrapper for `managed_array` or `managed_jagged_array` with multi-dimensional indexing. A `jagged_array` is constructed from a 1-D array with row sizes or from a multi-dimensional container (e.g., `std::vector\u003cstd::vector\u003cT\u003e\u003e`).\n\nThe logical and physical data layout of `jagged_array` is as follows:\n\n![data layout of jagged array](https://github.com/user-attachments/assets/7773537d-7259-4d2c-a695-8572906a6057)\n\n```cpp\n#include \u003cgpu_ptr.hpp\u003e\n#include \u003ciostream\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_ptr;\n\n// Example kernel: modify all elements\ntemplate \u003cstd::ranges::input_range T\u003e\n__global__ void kernel(T array)\n{\n    for (auto\u0026 v : array | views::grid_thread_stride)\n        v *= 2;\n}\n\nauto vec = std::vector\u003cint\u003e{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};\nauto vec_vec = std::vector\u003cstd::vector\u003cint\u003e\u003e{{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}};\n\nvoid example()\n{\n    // Create jagged array from nested std::vector\n    auto jarray = jagged_array\u003cmanaged_array\u003cint\u003e\u003e(vec_vec);\n    // Equivalent to the above:\n    // auto jarray = jagged_array\u003cmanaged_array\u003cint\u003e\u003e({1, 2, 3, 4}, vec);\n\n    // Launch kernel to double all values\n    kernel\u003c\u003c\u003c1, 32\u003e\u003e\u003e(jarray);\n    api::gpuDeviceSynchronize();\n\n    // Access each row and each element\n    for (std::size_t i = 0; i \u003c jarray.num_rows(); ++i)\n    {\n        for (std::size_t j = 0; j \u003c jarray.size(i); ++j)\n        {\n            std::cout \u003c\u003c jarray[{i, j}] \u003c\u003c \" \";\n        }\n        std::cout \u003c\u003c 
std::endl;\n    }\n}\n```\n\n### Zero-overhead\n\ngpu-ptr is designed to have zero-overhead compared to traditional raw pointer usage. The following example shows equivalent kernels using raw pointers and `managed_array` as verified by PTX analysis. The generated PTX assembly code for both kernels is identical, confirming that there is no performance penalty when using gpu-ptr abstractions with range adapters.\n\n```cpp\n// Traditional raw pointer kernel\n__global__ void func0(int* data, std::uint32_t size)\n{\n    const auto block = cooperative_groups::this_thread_block();\n    for (std::uint32_t i = block.thread_rank(); i \u003c size; i += block.size())\n    {\n        data[i] = 1;\n    }\n}\n\n// Using managed_array\n__global__ void func1(managed_array\u003cint, std::uint32_t\u003e arr)\n{\n    const auto block = cooperative_groups::this_thread_block();\n    for (std::uint32_t i = block.thread_rank(); i \u003c arr.size(); i += block.size())\n    {\n        arr[i] = 1;\n    }\n}\n\n// Using managed_array with range adapters\n__global__ void func3(managed_array\u003cint, std::uint32_t\u003e arr)\n{\n    for (auto\u0026 v : arr | views::block_thread_stride)\n    {\n        v = 1;\n    }\n}\n```\n\n\u003cdetails\u003e\n\n\u003csummary\u003eThe identical PTX assembly code (except parameters) generated for the above three kernels is as follows:\u003c/summary\u003e\n\n```text\n// Function Definition: _Z5func3... 
(Mangled C++ name for a template function)\n// It accepts a 'managed_array' struct by value, which is 24 bytes in size.\n.visible .entry _Z5func3IN7gpu_ptr13managed_arrayIijEEEvT_(\n    .param .align 8 .b8 _Z5func3IN7gpu_ptr13managed_arrayIijEEEvT__param_0[24]\n)\n{\n    // Register Declarations\n    .reg .pred  %p\u003c3\u003e;   // Predicate registers for logic\n    .reg .b32  %r\u003c17\u003e;  // 32-bit integer registers\n    .reg .b64  %rd\u003c6\u003e;  // 64-bit registers for pointers/addresses\n\n    // --- Extracting values from the Struct Parameter ---\n    // The struct is laid out in memory:\n    // Offset 0: Array Size (32-bit)\n    // Offset 8: Base Pointer (64-bit)\n    ld.param.u64  %rd1, [_Z5func3IN7gpu_ptr13managed_arrayIijEEEvT__param_0+8]; // Load data pointer\n    ld.param.u32  %r5, [_Z5func3IN7gpu_ptr13managed_arrayIijEEEvT__param_0];    // Load array size N\n\n    // --- Calculate 1D Thread Index (%r16) ---\n    // Formula: idx = (tid.z * ntid.y + tid.y) * ntid.x + tid.x\n    mov.u32  %r2, %ntid.y;              // %r2 = blockDim.y\n    mov.u32  %r9, %tid.z;               // %r9 = threadIdx.z\n    mov.u32  %r10, %tid.y;              // %r10 = threadIdx.y\n    mad.lo.s32  %r11, %r2, %r9, %r10;   // %r11 = (blockDim.y * threadIdx.z) + threadIdx.y\n\n    mov.u32  %r3, %ntid.x;              // %r3 = blockDim.x\n    mov.u32  %r12, %tid.x;              // %r12 = threadIdx.x\n    mad.lo.s32  %r16, %r11, %r3, %r12;  // %r16 = (%r11 * blockDim.x) + threadIdx.x\n\n    // --- Initial Boundary Check ---\n    setp.ge.u32  %p1, %r16, %r5;        // if (idx \u003e= N)\n    @%p1 bra  $L__BB2_3;                // Exit if initial index is out of bounds\n\n    // --- Calculate Stride (Total Threads in Block) ---\n    // stride = blockDim.x * blockDim.y * blockDim.z\n    mul.lo.s32  %r13, %r3, %r2;         // %r13 = blockDim.x * blockDim.y\n    mov.u32  %r14, %ntid.z;             // %r14 = blockDim.z\n    mul.lo.s32  %r6, %r13, %r14;        // %r6 = total threads 
(stride)\n\n    // Convert generic pointer to global memory pointer\n    cvta.to.global.u64  %rd3, %rd1;     // %rd3 = global(data)\n\n// --- Main Loop: Grid-Stride Writing ---\n$L__BB2_2:\n    // Memory address calculation: addr = base + (idx * 4)\n    mul.wide.u32  %rd4, %r16, 4;        // Calculate 64-bit byte offset\n    add.s64  %rd5, %rd3, %rd4;          // Add offset to base pointer\n\n    // Store integer 1 into memory\n    mov.u32  %r15, 1;                   // Value = 1\n    st.global.u32  [%rd5], %r15;        // data[idx] = 1;\n\n    // Increment index by stride\n    add.s32  %r16, %r16, %r6;           // idx += stride;\n\n    // Loop Condition\n    setp.lt.u32  %p2, %r16, %r5;        // if (idx \u003c N)\n    @%p2 bra  $L__BB2_2;                // Repeat if within bounds\n\n$L__BB2_3:\n    ret;                                // Kernel end\n}\n```\n\n\u003c/details\u003e\n\n### Tips\n\nTo reduce the number of registers used by the kernel, consider setting `size_type` to `std::uint32_t` instead of `default_size_type (= std::size_t)` when declaring GPU pointer types. For example, use `managed_array\u003cT, std::uint32_t\u003e` when the number of elements is less than 2\u003csup\u003e32\u003c/sup\u003e. To change `default_size_type` to `std::uint32_t`, define the `GPU_USE_32BIT_SIZE_TYPE_DEFAULT` macro before including `gpu_ptr.hpp`.\n\n### Backends selection\n\nDefine `ENABLE_HIP` macro to use HIP backend. By default, CUDA backend is used. You can define this in your CMakeLists.txt or compiler flags.\n\n## Reference\n\n### `array` / `managed_array`\n\n```cpp\ntemplate \u003ctypename T, typename size_type = size_type_default\u003e\nrequires std::is_trivially_copyable_v\u003cT\u003e\nclass array;\n\ntemplate \u003ctypename T, typename size_type = size_type_default\u003e\nclass managed_array;\n```\n\nThe `array` and `managed_array` classes provide smart pointer-like wrappers for managing arrays on the GPU. 
They support C++20 ranges and iterator concepts, allowing seamless integration with modern C++ code and exporting to/from range-based containers.  The managed variant uses unified memory for easy access from both host and device. The non-managed variant allocates memory on the device using `cudaMalloc/hipMalloc` and `cudaMemcpy/hipMemcpy` for data transfer, which requires the type `T` to be trivially copyable for safety.\n\n#### Constructors\n\n```cpp\n// (1) default constructor\narray();\nmanaged_array();\n\n// (2) copy/move constructors\n__host__ __device__ array(const array\u0026 other);\n__host__ __device__ array(array\u0026\u0026 other) noexcept;\n__host__ __device__ managed_array(const managed_array\u0026 other);\n__host__ __device__ managed_array(managed_array\u0026\u0026 other) noexcept;\n\n// (3) construct with size\n__host__ explicit array(std::size_t n);\n__host__ array(std::size_t n, const T\u0026 init_value);\n__host__ array(std::size_t n, default_init_tag default_init);\n__host__ explicit managed_array(std::size_t n);\n__host__ managed_array(std::size_t n, const T\u0026 init_value);\n__host__ managed_array(std::size_t n, default_init_tag default_init);\n\n// (4) construct from range (e.g., std::vector, std::array)\ntemplate \u003cstd::ranges::input_range Range\u003e\n__host__ explicit array(const Range\u0026 range);\ntemplate \u003cstd::ranges::input_range Range\u003e\n__host__ explicit managed_array(const Range\u0026 range);\n__host__ array(std::initializer_list\u003cT\u003e list);\n__host__ managed_array(std::initializer_list\u003cT\u003e list);\n\n// (5) construct from raw pointer (device pointer)\n__host__ array(T* device_ptr, std::size_t n);\n__device__ array(T* device_ptr, size_type n);\n__host__ managed_array(T* device_ptr, std::size_t n);\n__device__ managed_array(T* device_ptr, size_type n);\n```\n\nWhere:\n\n1.  Default constructor creates an empty array with null pointer.\n2.  Copy and move constructors for copying pointer and size.\n3.  
Constructors with size allocate memory on the GPU. Optionally, an initial value or [default initialization](https://en.cppreference.com/w/cpp/language/default_initialization.html) can be specified.\n4.  Constructors from ranges copy data from host containers to device memory. Copying from `array` and `managed_array` types is not allowed to avoid unintended device-to-device copies. Use the `to\u003c\u003e()` method for an explicit device-to-device copy instead.\n5.  Constructors from raw device pointers wrap existing device memory.\n\nFor nested ranges, nested `managed_array` is deduced: `std::vector\u003cstd::vector\u003cT\u003e\u003e -\u003e managed_array\u003cmanaged_array\u003cT\u003e\u003e`.\n\n#### Exporters\n\n```cpp\n// (1) Copy data to host container\ntemplate \u003ctypename Container\u003e\n__host__ Container to() const;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Container\u003e\n__host__ Container\u003cT\u003e to() const;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Container\u003e\n__host__ Container\u003cContainer\u003c...\u003e\u003e to() const; // nested ranges deduction only for managed_array\n\n// (2) Copy data to gpu-ptr array\ntemplate \u003ctypename U\u003e\n__host__ array\u003cU\u003e to\u003carray\u003cU\u003e\u003e() const;\ntemplate \u003ctypename U\u003e\n__host__ managed_array\u003cU\u003e to\u003cmanaged_array\u003cU\u003e\u003e() const;\n\n// (3) Static cast to host container\ntemplate \u003ctypename Container\u003e\n__host__ explicit operator Container() const;\n```\n\nWhere:\n\n1.  `to\u003cContainer\u003e()` copies data from the device to a host container (e.g., `std::vector\u003cT\u003e`, `std::list\u003cT\u003e`). The range value type can be deduced automatically (e.g., `to\u003cstd::vector\u003e()`). For nested ranges, nested containers are deduced only for `managed_array` (e.g., `managed_array\u003cmanaged_array\u003cU\u003e\u003e::to\u003cstd::vector\u003e -\u003e std::vector\u003cstd::vector\u003cU\u003e\u003e`).\n2.  
`to\u003carray\u003cU\u003e\u003e()` and `to\u003cmanaged_array\u003cU\u003e\u003e()` copy data from the device array to another gpu-ptr array type with element type `U`.\n3.  Explicit conversion operator to a host container, equivalent to `to\u003cContainer\u003e()`; conversion to gpu-ptr array types is not supported.\n\n#### Range interface\n\nMember types:\n\n```cpp\narray::size_type;\narray::value_type;\narray::reference;\narray::const_reference;\narray::iterator;\narray::const_iterator;\narray::pointer;\narray::const_pointer;\n\nmanaged_array::size_type;\nmanaged_array::value_type;\nmanaged_array::reference;\nmanaged_array::const_reference;\nmanaged_array::iterator;\nmanaged_array::const_iterator;\nmanaged_array::pointer;\nmanaged_array::const_pointer;\n```\n\nMember functions:\n\n```cpp\n__host__ __device__ reference operator[](size_type index) noexcept;\n__host__ __device__ const_reference operator[](size_type index) const noexcept;\n__host__ __device__ iterator begin() noexcept;\n__host__ __device__ const_iterator begin() const noexcept;\n__host__ __device__ iterator end() noexcept;\n__host__ __device__ const_iterator end() const noexcept;\n__host__ __device__ const_iterator cbegin() const noexcept;\n__host__ __device__ const_iterator cend() const noexcept;\n__host__ __device__ std::reverse_iterator\u003citerator\u003e rbegin() noexcept;\n__host__ __device__ std::reverse_iterator\u003cconst_iterator\u003e rbegin() const noexcept;\n__host__ __device__ std::reverse_iterator\u003citerator\u003e rend() noexcept;\n__host__ __device__ std::reverse_iterator\u003cconst_iterator\u003e rend() const noexcept;\n__host__ __device__ reference front() noexcept;\n__host__ __device__ const_reference front() const noexcept;\n__host__ __device__ reference back() noexcept;\n__host__ __device__ const_reference back() const noexcept;\n__host__ __device__ pointer data() noexcept;\n__host__ __device__ const_pointer data() const noexcept;\n__host__ __device__ size_type size() const 
noexcept;\n__host__ __device__ bool empty() const noexcept;\n```\n\nConcepts:\n\n```cpp\nstd::ranges::range\u003carray\u003cT\u003e\u003e;\nstd::ranges::borrowed_range\u003carray\u003cT\u003e\u003e;  // only for device code\nstd::ranges::view\u003carray\u003cT\u003e\u003e;\nstd::ranges::output_range\u003carray\u003cT\u003e, T\u003e;\nstd::ranges::input_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::forward_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::bidirectional_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::random_access_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::sized_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::contiguous_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::common_range\u003carray\u003cT\u003e\u003e;\nstd::ranges::viewable_range\u003carray\u003cT\u003e\u003e;\n\nstd::ranges::range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::borrowed_range\u003cmanaged_array\u003cT\u003e\u003e;  // only for device code\nstd::ranges::view\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::output_range\u003cmanaged_array\u003cT\u003e, T\u003e;\nstd::ranges::input_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::forward_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::bidirectional_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::random_access_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::sized_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::contiguous_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::common_range\u003cmanaged_array\u003cT\u003e\u003e;\nstd::ranges::viewable_range\u003cmanaged_array\u003cT\u003e\u003e;\n```\n\n#### Smart pointer interface\n\n```cpp\n// (1) Reset pointer and size\n__host__ __device__ void reset();\n__host__ void reset(T* device_ptr, std::size_t n);\n__device__ void reset(T* device_ptr, size_type n);\n\n// (2) Boolean conversion\n__host__ __device__ explicit operator bool() const noexcept;\n\n// (3) Use count\n__host__ std::uint32_t 
use_count() const noexcept;\n```\n\nWhere:\n\n1.  If host code calls `reset(...)`, the current device memory is freed and the new device pointer and size are set. If device code calls `reset(...)`, it only sets the internal pointer and size without freeing memory.\n2.  Bool conversion operator to check if the internal pointer is not null.\n3.  Returns the current use count of the internal pointer. Note that this is only valid in host code.\n\nNote: The device-side `reset` function does not affect memory management on the host side. It only changes the internal pointer and size on the device side.\n\n#### Memory management\n\nNote: Memory management functions are only available for `managed_array` since they operate on unified memory.\n\n```cpp\n// (1) Prefetch\n__host__ void prefetch(size_type start_idx, size_type len, int device_id = current_device_id, api::gpuStream_t stream = 0, bool recursive = true) const;\n__host__ void prefetch(int device_id = current_device_id, api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (2) Prefetch to host memory\n__host__ void prefetch_to_cpu(size_type start_idx, size_type len, api::gpuStream_t stream = 0, bool recursive = true) const;\n__host__ void prefetch_to_cpu(api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (3) Memory advice\n__host__ void mem_advise(size_type n, size_type len, api::gpuMemoryAdvise advise, int device_id = current_device_id, bool recursive = true) const;\n__host__ void mem_advise(api::gpuMemoryAdvise advise, int device_id = current_device_id, bool recursive = true) const;\n\n// (4) Memory advice to host memory\n__host__ void mem_advise_to_cpu(size_type n, size_type len, api::gpuMemoryAdvise advise, bool recursive = true) const;\n__host__ void mem_advise_to_cpu(api::gpuMemoryAdvise advise, bool recursive = true) const;\n```\n\nWhere:\n\n1.  Wrapper for `cudaMemPrefetchAsync/hipMemPrefetchAsync` to prefetch unified memory to the specified device. 
The former overload prefetches a memory range, while the latter overload prefetches the entire memory. If `recursive` is true and the value type of the array has `prefetch(...)` function, prefetch is called recursively for nested or member arrays.\n2.  Host memory prefetching overloads with similar behavior to (1).\n3.  Wrapper for `cudaMemAdvise/hipMemAdvise` to set memory advice for unified memory. The former overload sets advice for a memory range, while the latter overload sets advice for the entire memory. If `recursive` is true and the value type of the array has `mem_advise(...)` function, mem_advise is called recursively for nested or member arrays.\n4.  Host memory advice overloads with similar behavior to (3).\n\n### `value` / `managed_value`\n\n```cpp\ntemplate \u003ctypename T\u003e\nrequires std::is_trivially_copyable_v\u003cT\u003e\nclass value;\n\ntemplate \u003ctypename T\u003e\nclass managed_value;\n```\n\nThe `value` and `managed_value` classes provide smart pointer-like wrappers for managing single values on the GPU. They allow seamless integration and exporting to/from host values. The managed variant uses unified memory for easy access from both host and device. 
The non-managed variant allocates memory on the device using `cudaMalloc/hipMalloc` and `cudaMemcpy/hipMemcpy` for data transfer, which requires the type `T` to be trivially copyable for safety.\n\n#### Constructors\n\n```cpp\n// (1) default constructor\nvalue();\nmanaged_value();\n\n// (2) copy/move constructors\n__host__ __device__ value(const value\u0026 other);\n__host__ __device__ value(value\u0026\u0026 other) noexcept;\n__host__ __device__ managed_value(const managed_value\u0026 other);\n__host__ __device__ managed_value(managed_value\u0026\u0026 other) noexcept;\n\n// (3) construct with initial value\n__host__ explicit value(const T\u0026 init_value);\n__host__ explicit managed_value(const T\u0026 init_value);\n__host__ explicit value(default_init_tag default_init);\n__host__ explicit managed_value(default_init_tag default_init);\n\n// (4) Construct the element in-place from arguments\ntemplate \u003ctypename... Args\u003e\n__host__ explicit value(Args\u0026\u0026... args);\ntemplate \u003ctypename... Args\u003e\n__host__ explicit managed_value(Args\u0026\u0026... args);\n\n// (5) construct from raw pointer (device pointer)\n__host__ __device__ value(T* device_ptr);\n__host__ __device__ managed_value(T* device_ptr);\n```\n\nWhere:\n\n1.  Default constructor creates an empty value with a null pointer.\n2.  Copy and move constructors for copying the pointer.\n3.  Constructors with an initial value or [default initialization](https://en.cppreference.com/w/cpp/language/default_initialization.html).\n4.  Constructors that forward arguments to construct the element in-place. The arguments are perfectly forwarded to the constructor of `T`.\n5.  Constructors from raw device pointers wrap existing device memory.\n\nNote: The device-side `reset` function does not affect memory management on the host side. 
It only changes the internal pointer on the device side.\n\n#### Smart pointer interface\n\nMember types:\n\n```cpp\nvalue::element_type;\nmanaged_value::element_type;\n```\n\nMember functions:\n\n```cpp\n// (1) Operators for `value`\n__device__ T\u0026 operator*() const noexcept;\n__host__ T operator*() const;\n__device__ T* operator-\u003e() const noexcept;\n__host__ proxy_object operator-\u003e() const;\n\n// (2) Operators for `managed_value`\n__host__ __device__ T\u0026 operator*() const noexcept;\n__host__ __device__ T* operator-\u003e() const noexcept;\n\n// (3) Get raw pointer\n__host__ __device__ T* get() const noexcept;\n\n// (4) Reset pointer\n__host__ __device__ void reset(T* device_ptr = nullptr);\n\n// (5) Boolean conversion\n__host__ __device__ explicit operator bool() const noexcept;\n\n// (6) Use count\n__host__ std::uint32_t use_count() const noexcept;\n```\n\nWhere:\n\n1.  Dereference and member access operators for `value`. Note that dereferencing to a reference is only available in device code; in host code, the dereference operator returns a copy of the value and the member access operator returns a proxy object that accesses a copy of the value.\n2.  Dereference and member access operators for `managed_value`, available in both host and device code.\n3.  Get the raw device pointer.\n4.  If host code calls `reset(...)`, the current device memory is freed and the new device pointer is set. If device code calls `reset(...)`, it only sets the internal pointer without freeing memory.\n5.  Bool conversion operator to check if the internal pointer is not null.\n6.  Returns the current use count of the internal pointer. Note that this is only valid in host code.\n\nNote: The device-side `reset` function does not affect the memory management on the host side. 
It only changes the internal pointer on the device side.\n\n#### Memory management\n\nNote: Memory management functions are only available for `managed_value` since they use unified memory.\n\n```cpp\n// (1) Prefetch\n__host__ void prefetch(int device_id = current_device_id, api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (2) Prefetch to host memory\n__host__ void prefetch_to_cpu(api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (3) Memory advice\n__host__ void mem_advise(api::gpuMemoryAdvise advise, int device_id = current_device_id, bool recursive = true) const;\n\n// (4) Memory advice to host memory\n__host__ void mem_advise_to_cpu(api::gpuMemoryAdvise advise, bool recursive = true) const;\n```\n\nWhere:\n\n1.  Wrapper for `cudaMemPrefetchAsync/hipMemPrefetchAsync` to prefetch unified memory to specified device. If `recursive` is true and the value type has `prefetch(...)` function, prefetch is called recursively for member arrays.\n2.  Host memory prefetching overload with similar behavior to (1).\n3.  Wrapper for `cudaMemAdvise/hipMemAdvise` to set memory advice for unified memory. If `recursive` is true and the value type has `mem_advise(...)` function, mem_advise is called recursively for member arrays.\n4.  Host memory advice overload with similar behavior to (3).\n\n### `structure_of_arrays` / `managed_structure_of_arrays`\n\n```cpp\ntemplate \u003ctypename... Ts\u003e\nusing structure_of_arrays\u003cTs...\u003e = structure_of_arrays\u003cstd::tuple\u003cTs...\u003e\u003e;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Tuple = std::tuple, typename... Ts, typename SizeType = size_type_default\u003e\nclass structure_of_arrays\u003cTuple\u003cTs...\u003e, SizeType\u003e;\n\n\ntemplate \u003ctypename... 
Ts\u003e\nusing managed_structure_of_arrays\u003cTs...\u003e = managed_structure_of_arrays\u003cstd::tuple\u003cTs...\u003e\u003e;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Tuple = std::tuple, typename... Ts, typename SizeType = size_type_default\u003e\nclass managed_structure_of_arrays\u003cTuple\u003cTs...\u003e, SizeType\u003e;\n```\n\nThe `structure_of_arrays` and `managed_structure_of_arrays` classes provide smart pointer-like wrappers for managing a Structure-of-Arrays (SoA) memory layout on the GPU. They allow for optimized memory access patterns by storing each member of a structure in a separate contiguous array. The index access interface allows retrieval of the entire structure at a given index. These classes are useful for maximizing coalesced memory access on GPUs. They support C++20 ranges and iterator concepts.\n\nThe value type of `structure_of_arrays` must be a tuple-derived class template that inherits from `std::tuple\u003cTs...\u003e`, or `std::tuple\u003cTs...\u003e` itself. An example definition of such a tuple-derived class is as follows:\n\n```cpp\ntemplate \u003ctypename... Ts\u003e\nrequires (sizeof...(Ts) == 3)\nstruct CustomTuple : public std::tuple\u003cTs...\u003e\n{\n    using std::tuple\u003cTs...\u003e::tuple;\n    __host__ __device__ auto\u0026 get_a() { return std::get\u003c0\u003e(*this); }\n    __host__ __device__ auto\u0026 get_b() { return std::get\u003c1\u003e(*this); }\n    __host__ __device__ auto\u0026 get_c() { return std::get\u003c2\u003e(*this); }\n};\n```\n\nThe template parameters `Ts...` correspond to the member types of the tuple-derived class. 
All parameters must be value types (i.e., not reference types), since the members are stored in separate arrays and element access returns a tuple of references to the individual members.\n\n#### Constructors\n\n```cpp\n// (1) default constructor\nstructure_of_arrays();\nmanaged_structure_of_arrays();\n\n// (2) copy/move constructors\n__host__ __device__ structure_of_arrays(const structure_of_arrays\u0026 other);\n__host__ __device__ structure_of_arrays(structure_of_arrays\u0026\u0026 other) noexcept;\n__host__ __device__ managed_structure_of_arrays(const managed_structure_of_arrays\u0026 other);\n__host__ __device__ managed_structure_of_arrays(managed_structure_of_arrays\u0026\u0026 other) noexcept;\n\n// (3) construct with size\n__host__ explicit structure_of_arrays(std::size_t n);\n__host__ structure_of_arrays(std::size_t n, const Tuple\u003cTs...\u003e\u0026 init_value);\n__host__ structure_of_arrays(std::size_t n, default_init_tag default_init);\n__host__ explicit managed_structure_of_arrays(std::size_t n);\n__host__ managed_structure_of_arrays(std::size_t n, const Tuple\u003cTs...\u003e\u0026 init_value);\n__host__ managed_structure_of_arrays(std::size_t n, default_init_tag default_init);\n\n// (4) construct from range of tuple-derived class\ntemplate \u003cstd::ranges::input_range Range\u003e\n__host__ explicit structure_of_arrays(const Range\u0026 range);\ntemplate \u003cstd::ranges::input_range Range\u003e\n__host__ explicit managed_structure_of_arrays(const Range\u0026 range);\n__host__ structure_of_arrays(std::initializer_list\u003cTuple\u003cTs...\u003e\u003e list);\n__host__ managed_structure_of_arrays(std::initializer_list\u003cTuple\u003cTs...\u003e\u003e list);\n\n// (5) construct from multiple ranges\ntemplate \u003cstd::ranges::input_range... Ranges\u003e\n__host__ explicit structure_of_arrays(const Ranges\u0026... ranges);\n__host__ explicit structure_of_arrays(std::initializer_list\u003cTs\u003e... lists);\ntemplate \u003cstd::ranges::input_range... 
Ranges\u003e\n__host__ explicit managed_structure_of_arrays(const Ranges\u0026... ranges);\n__host__ explicit managed_structure_of_arrays(std::initializer_list\u003cTs\u003e... lists);\n```\n\nWhere:\n\n1.  The default constructor creates an empty `structure_of_arrays` with null pointers.\n2.  Copy and move constructors copy the pointers and size.\n3.  Constructors with size allocate memory on the GPU. Optionally, an initial value or [default initialization](https://en.cppreference.com/w/cpp/language/default_initialization.html) can be specified.\n4.  Constructors from ranges of the tuple-derived class copy data from host containers to device memory. Copying from `structure_of_arrays` and `managed_structure_of_arrays` types is not allowed to avoid unintended device-to-device copies.\n5.  Constructors from multiple ranges copy data from each host container to the corresponding member array on the device.\n\n#### Exporters\n\n```cpp\n// (1) Copy data to host container\ntemplate \u003ctypename Container\u003e\n__host__ Container to() const;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Container\u003e\n__host__ Container\u003cTuple\u003cTs...\u003e\u003e to() const;\n\n// (2) Static cast to host container\ntemplate \u003ctypename Container\u003e\n__host__ explicit operator Container() const;\n```\n\nWhere:\n\n1.  `to\u003cContainer\u003e()` copies data from the device to a host container (e.g., `std::vector\u003cTuple\u003cTs...\u003e\u003e`, `std::list\u003cTuple\u003cTs...\u003e\u003e`). The range value type can be deduced automatically (e.g., `to\u003cstd::vector\u003e() -\u003e std::vector\u003cTuple\u003cTs...\u003e\u003e`).\n2.  
Explicit conversion operator to host container, equivalent to `to\u003cContainer\u003e()`.\n\n#### Range interface\n\nMember types:\n\n```cpp\nstructure_of_arrays::size_type;\ntemplate \u003cstd::size_t N\u003e\nstructure_of_arrays::element_type;\n\nmanaged_structure_of_arrays::size_type;\ntemplate \u003cstd::size_t N\u003e\nmanaged_structure_of_arrays::element_type;\n```\n\nMember functions:\n\n```cpp\nusing value = Tuple\u003cTs...\u003e;\nusing reference = Tuple\u003cTs\u0026...\u003e;\nusing const_reference = Tuple\u003cconst Ts\u0026...\u003e;\nusing iterator = ...;\nusing const_iterator = ...;\n\n__host__ __device__ reference operator[](size_type index) \u0026;\n__host__ __device__ const_reference operator[](size_type index) const\u0026;\n__host__ __device__ value operator[](size_type index) \u0026\u0026;\n__host__ __device__ iterator begin() noexcept;\n__host__ __device__ const_iterator begin() const noexcept;\n__host__ __device__ iterator end() noexcept;\n__host__ __device__ const_iterator end() const noexcept;\ntemplate \u003cstd::size_t N\u003e\n__host__ __device__ Ts[N]* data() noexcept;\ntemplate \u003cstd::size_t N\u003e\n__host__ __device__ const Ts[N]* data() const noexcept;\n__host__ __device__ size_type size() const noexcept;\n__host__ __device__ bool empty() const noexcept;\n```\n\nConcepts:\n\n```cpp\nusing soa_type = structure_of_arrays\u003cTuple\u003cTs...\u003e\u003e;\nusing managed_soa_type = managed_structure_of_arrays\u003cTuple\u003cTs...\u003e\u003e;\n\nstd::ranges::range\u003csoa_type\u003e;\nstd::ranges::borrowed_range\u003csoa_type\u003e;  // only for device code\nstd::ranges::view\u003csoa_type\u003e;\nstd::ranges::output_range\u003csoa_type, T\u003e; // since 
C++23\nstd::ranges::input_range\u003csoa_type\u003e;\nstd::ranges::forward_range\u003csoa_type\u003e;\nstd::ranges::bidirectional_range\u003csoa_type\u003e;\nstd::ranges::random_access_range\u003csoa_type\u003e;\nstd::ranges::sized_range\u003csoa_type\u003e;\nstd::ranges::common_range\u003csoa_type\u003e;\nstd::ranges::viewable_range\u003csoa_type\u003e;\n\nstd::ranges::range\u003cmanaged_soa_type\u003e;\nstd::ranges::borrowed_range\u003cmanaged_soa_type\u003e;  // only for device code\nstd::ranges::view\u003cmanaged_soa_type\u003e;\nstd::ranges::output_range\u003cmanaged_soa_type, T\u003e; // since C++23\nstd::ranges::input_range\u003cmanaged_soa_type\u003e;\nstd::ranges::forward_range\u003cmanaged_soa_type\u003e;\nstd::ranges::bidirectional_range\u003cmanaged_soa_type\u003e;\nstd::ranges::random_access_range\u003cmanaged_soa_type\u003e;\nstd::ranges::sized_range\u003cmanaged_soa_type\u003e;\nstd::ranges::common_range\u003cmanaged_soa_type\u003e;\nstd::ranges::viewable_range\u003cmanaged_soa_type\u003e;\n```\n\nNote: When you define your own tuple-derived class, you may need to specialize `std::common_type` and `std::basic_common_reference` to satisfy some range concepts. For example:\n\n```cpp\ntemplate \u003cclass... TTypes, class... UTypes\u003e\nrequires requires { typename CustomTuple\u003cstd::common_type_t\u003cTTypes, UTypes\u003e...\u003e; }\nstruct std::common_type\u003cCustomTuple\u003cTTypes...\u003e, CustomTuple\u003cUTypes...\u003e\u003e\n{\n    using type = CustomTuple\u003cstd::common_type_t\u003cTTypes, UTypes\u003e...\u003e;\n};\n\ntemplate \u003cclass... TTypes, class... 
UTypes, template \u003cclass\u003e class TQual, template \u003cclass\u003e class UQual\u003e\nrequires requires { typename CustomTuple\u003cstd::common_reference_t\u003cTQual\u003cTTypes\u003e, UQual\u003cUTypes\u003e\u003e...\u003e; }\nstruct std::basic_common_reference\u003cCustomTuple\u003cTTypes...\u003e, CustomTuple\u003cUTypes...\u003e, TQual, UQual\u003e\n{\n    using type = CustomTuple\u003cstd::common_reference_t\u003cTQual\u003cTTypes\u003e, UQual\u003cUTypes\u003e\u003e...\u003e;\n};\n```\n\n#### Smart pointer interface\n\n```cpp\n// (1) Reset pointer and size\n__host__ void reset();\ntemplate \u003cstd::size_t N\u003e\n__device__ void reset(pointer\u003cN\u003e device_ptr);\ntemplate \u003cstd::size_t N\u003e\n__device__ void reset(const array\u003cTs[N]\u003e\u0026 device_array);\ntemplate \u003cstd::size_t N\u003e\n__device__ void reset(const managed_array\u003cTs[N]\u003e\u0026 device_array);\n\n// (2) Boolean conversion\n__host__ __device__ explicit operator bool() const noexcept;\n\n// (3) Use count\n__host__ std::uint32_t use_count() const noexcept;\n```\n\nWhere:\n\n1.  If host code calls `reset()`, the current device memory is freed and the internal pointers and size are reset. If device code calls `reset\u003cN\u003e(...)`, it only sets the internal pointer of the `N`-th array without freeing memory. The overloads taking `array` and `managed_array` set the internal pointer from the given device array and check size consistency with `assert()`.\n2.  Bool conversion operator to check if the internal pointer is not null.\n3.  Returns the current use count of the internal pointer. Note that this is only valid in host code.\n\nNote: The device-side `reset` function does not affect the memory management on the host side. 
It only changes the internal pointer on the device side.\n\n#### Memory management\n\nNote: Memory management functions are only available for `managed_structure_of_arrays` since they use unified memory.\n\n```cpp\n// (1) Prefetch\n__host__ void prefetch(size_type start_idx, size_type len, int device_id = current_device_id, api::gpuStream_t stream = 0, bool recursive = true) const;\n__host__ void prefetch(int device_id = current_device_id, api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (2) Prefetch to host memory\n__host__ void prefetch_to_cpu(size_type start_idx, size_type len, api::gpuStream_t stream = 0, bool recursive = true) const;\n__host__ void prefetch_to_cpu(api::gpuStream_t stream = 0, bool recursive = true) const;\n\n// (3) Memory advice\n__host__ void mem_advise(size_type n, size_type len, api::gpuMemoryAdvise advise, int device_id = current_device_id, bool recursive = true) const;\n__host__ void mem_advise(api::gpuMemoryAdvise advise, int device_id = current_device_id, bool recursive = true) const;\n\n// (4) Memory advice to host memory\n__host__ void mem_advise_to_cpu(size_type n, size_type len, api::gpuMemoryAdvise advise, bool recursive = true) const;\n__host__ void mem_advise_to_cpu(api::gpuMemoryAdvise advise, bool recursive = true) const;\n```\n\nWhere:\n\n1.  Wrapper for `cudaMemPrefetchAsync/hipMemPrefetchAsync` to prefetch unified memory to the specified device. The former overload prefetches a memory range, while the latter overload prefetches the entire memory. If `recursive` is true and the value type of the array has a `prefetch(...)` function, `prefetch` is called recursively for nested or member arrays.\n2.  Host memory prefetching overloads with similar behavior to (1).\n3.  Wrapper for `cudaMemAdvise/hipMemAdvise` to set memory advice for unified memory. The former overload sets advice for a memory range, while the latter overload sets advice for the entire memory. 
If `recursive` is true and the value type of the array has a `mem_advise(...)` function, `mem_advise` is called recursively for nested or member arrays.\n4.  Host memory advice overloads with similar behavior to (3).\n\n### `jagged_array`\n\n```cpp\ntemplate \u003ctypename T, typename SizeType = size_type_default\u003e\nclass jagged_array : public managed_array\u003cT, SizeType\u003e;\ntemplate \u003ctemplate \u003ctypename...\u003e typename Tuple = std::tuple, typename... Ts, typename SizeType = size_type_default\u003e\nclass jagged_array : public managed_structure_of_arrays\u003cTuple\u003cTs...\u003e, SizeType\u003e;\n```\n\nThe `jagged_array` class provides a wrapper for managing multi-dimensional arrays with varying row lengths (jagged arrays) on the GPU. It derives from the base array type, which can be either `managed_array\u003cT\u003e` or `managed_structure_of_arrays\u003cTuple\u003cTs...\u003e\u003e`, to reuse their memory management and range interfaces. The jagged array stores additional offsets to handle varying row sizes, allowing efficient access to elements using multi-dimensional indices.\n\nNote that the only internal storage types currently supported are `managed_array` and `managed_structure_of_arrays`.\n\n#### Constructors\n\n```cpp\n// (1) default constructor\njagged_array();\n\n// (2) construct from sizes\ntemplate \u003cstd::ranges::input_range SizeRange\u003e\n__host__ explicit jagged_array(const SizeRange\u0026 sizes);\n__host__ explicit jagged_array(std::initializer_list\u003csize_type\u003e sizes);\n\n// (3) construct from sizes and base array (for managed_array)\ntemplate \u003cstd::ranges::input_range SizeRange\u003e\n__host__ jagged_array(const SizeRange\u0026 sizes, const managed_array\u003cT, SizeType\u003e\u0026 base_array);\n__host__ jagged_array(std::initializer_list\u003csize_type\u003e sizes, const managed_array\u003cT, SizeType\u003e\u0026 base_array);\n\n// (4) construct from sizes and base array (for managed_structure_of_arrays)\ntemplate 
\u003cstd::ranges::input_range SizeRange\u003e\n__host__ jagged_array(const SizeRange\u0026 sizes, const managed_structure_of_arrays\u003cTuple\u003cTs...\u003e, SizeType\u003e\u0026 base_array);\n__host__ jagged_array(std::initializer_list\u003csize_type\u003e sizes, const managed_structure_of_arrays\u003cTuple\u003cTs...\u003e, SizeType\u003e\u0026 base_array);\n\n// (5) construct from sizes and flat host container\ntemplate \u003cstd::ranges::input_range SizeRange, std::ranges::input_range Container\u003e\n__host__ jagged_array(const SizeRange\u0026 sizes, const Container\u0026 range);\n__host__ jagged_array(std::initializer_list\u003csize_type\u003e sizes, const Container\u0026 range);\n\n// (6) construct from nested host container\ntemplate \u003cstd::ranges::input_range NestedContainer\u003e\n__host__ jagged_array(const NestedContainer\u0026 nested_range);\n__host__ jagged_array(std::initializer_list\u003cstd::initializer_list\u003cT\u003e\u003e nested_list); // for managed_array\n__host__ jagged_array(std::initializer_list\u003cstd::initializer_list\u003cTuple\u003cTs...\u003e\u003e\u003e nested_list); // for managed_structure_of_arrays\n```\n\nWhere:\n\n1.  Default constructor creates an empty jagged_array with null pointers.\n2.  Constructors from sizes allocate memory on the GPU for the jagged array based on the provided row sizes. The sizes can be provided as a range or an initializer list.\n3.  Constructors from sizes and base array for `managed_array` type. The base array should contain the concatenated elements of all rows. The data is not copied; it is shared with the provided base array.\n4.  Constructors from sizes and base array for `managed_structure_of_arrays` type. The base array should contain the concatenated elements of all rows in SoA layout. The data is not copied; it is shared with the provided base array.\n5.  Constructors from sizes and flat host container copy data from the provided host container to the jagged array on the device. 
The host container should contain the concatenated elements of all rows.\n6.  Constructors from a nested host container copy data from the provided nested host container to each row of the jagged array on the device.\n\n#### Exporters\n\nInherited from the base array type (`managed_array` or `managed_structure_of_arrays`).\n\n#### Range interface\n\nInherited from the base array type (`managed_array` or `managed_structure_of_arrays`).\n\nAdditional member functions:\n\n```cpp\n// (1) Range interface for each row\n__host__ __device__ std::ranges::subrange row(size_type i) noexcept;\n__host__ __device__ std::ranges::subrange row(size_type i) const noexcept;\n__host__ __device__ auto begin(size_type i) noexcept;\n__host__ __device__ auto begin(size_type i) const noexcept;\n__host__ __device__ auto end(size_type i) noexcept;\n__host__ __device__ auto end(size_type i) const noexcept;\n__host__ __device__ auto data(size_type i) noexcept;        // if base is managed_array\n__host__ __device__ auto data(size_type i) const noexcept;  // if base is managed_array\n__host__ __device__ size_type size(size_type i) const noexcept;\n__host__ __device__ size_type num_rows() const noexcept;\n\n// (2) Indexing operator with multi-dimensional indices\n__host__ __device__ decltype(auto) operator[](std::array\u003csize_type, 2\u003e idx) \u0026;\n__host__ __device__ decltype(auto) operator[](std::array\u003csize_type, 2\u003e idx) const\u0026;\n__host__ __device__ decltype(auto) operator[](std::array\u003csize_type, 2\u003e idx) \u0026\u0026;\n__host__ __device__ decltype(auto) operator[](size_type i, size_type j) \u0026;      // for C++23\n__host__ __device__ decltype(auto) operator[](size_type i, size_type j) const\u0026; // for C++23\n__host__ __device__ decltype(auto) operator[](size_type i, size_type j) \u0026\u0026;     // for C++23\n```\n\n#### Smart pointer interface\n\nInherited from the base array type (`managed_array` or `managed_structure_of_arrays`).\n\n#### Memory 
management\n\nInherited from the base array type (`managed_array` or `managed_structure_of_arrays`).\n\n### Grid-stride range adapter\n\nThe following range adapter closure objects in the `views` namespace are provided for grid-stride loops in GPU kernels.\n\n```cpp\nviews::block_thread_stride;\nviews::cluster_thread_stride;   // [*]\nviews::grid_thread_stride;\nviews::cluster_block_stride;    // [*]\nviews::grid_block_stride;       // [*]\nviews::grid_cluster_stride;     // [*]\n\n// [*] Currently available only with CUDA backends.\n```\n\nEach adapter produces a view that starts at the N-th element of the original range and advances by a stride of M, where N is the index of the thread/block/cluster within the block/cluster/grid and M is the number of threads/blocks/clusters.\n\n#### `views::block_thread_stride`\n\nThe stride access based on the **thread index** and the number of threads in the **block**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::block_thread_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto block = cooperative_groups::this_thread_block();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(block.thread_rank()); i \u003c array.size(); i += block.num_threads())\n{\n    // loop body\n    array[i] = ...;\n}\n```\n\n#### `views::cluster_thread_stride`\n\nThe stride access based on the **thread index** and the number of threads in the **cluster**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::cluster_thread_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto cluster = cooperative_groups::this_cluster();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(cluster.thread_rank()); i \u003c array.size(); i += cluster.num_threads())\n{\n    // loop body\n    array[i] = ...;\n}\n```\n\n#### `views::grid_thread_stride`\n\nThe stride access based on the **thread index** and the 
number of threads in the **grid**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::grid_thread_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto grid = cooperative_groups::this_grid();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(grid.thread_rank()); i \u003c array.size(); i += grid.num_threads())\n{\n    // loop body\n    array[i] = ...;\n}\n```\n\n#### `views::cluster_block_stride`\n\nThe stride access based on the **block index** and the number of blocks in the **cluster**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::cluster_block_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto cluster = cooperative_groups::this_cluster();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(cluster.block_rank()); i \u003c array.size(); i += cluster.num_blocks())\n{\n    // loop body\n    array[i] = ...;\n}\n```\n\n#### `views::grid_block_stride`\n\nThe stride access based on the **block index** and the number of blocks in the **grid**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::grid_block_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto grid = cooperative_groups::this_grid();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(grid.block_rank()); i \u003c array.size(); i += grid.num_blocks())\n{\n    // loop body\n    array[i] = ...;\n}\n```\n\n#### `views::grid_cluster_stride`\n\nThe stride access based on the **cluster index** and the number of clusters in the **grid**.\n\nExample usage:\n\n```cpp\nfor (auto\u0026 v : array | views::grid_cluster_stride)\n{\n    // loop body\n    v = ...;\n}\n```\n\nwhich is equivalent to:\n\n```cpp\nconst auto grid = cooperative_groups::this_grid();\nfor (auto i = static_cast\u003cdecltype(array.size())\u003e(grid.cluster_rank()); i \u003c array.size(); i += grid.num_clusters())\n{\n    // loop body\n    
array[i] = ...;\n}\n```\n\n### Utilities\n\n#### CUDA/HIP API wrappers\n\nThe `gpu_ptr::api` namespace provides wrappers for commonly used CUDA and HIP API functions and types. The API functions are prefixed with `gpu` instead of `cuda` or `hip` to avoid name conflicts. See the definitions in the [gpu_runtime_api.hpp](include/gpu_runtime_api.hpp) file for details.\n\n#### Macros\n\n**Backend selection:**\n\nDefine `ENABLE_HIP` to use the HIP backend. Otherwise, the CUDA backend is used by default.\n\n**Default size type selection:**\n\nDefine `GPU_USE_32BIT_SIZE_TYPE_DEFAULT` to use `std::uint32_t` as the default size type for array-like classes. Otherwise, `std::size_t` is used by default.\n\n**API error checking:**\n\nThe `GPU_CHECK_ERROR()` function-like macro checks CUDA/HIP API errors. If an error occurs, it throws a `std::runtime_error` with the error message. Example usage:\n\n```cpp\nGPU_CHECK_ERROR(gpu_ptr::api::gpuGetDevice(\u0026device_id));\n```\n\n**Device and host compilation macros:**\n\nThe gpu-ptr library defines the `GPU_DEVICE_COMPILE`, `GPU_OVERLOAD_DEVICE`, and `GPU_OVERLOAD_HOST` macros depending on whether host or device code is being compiled. The `GPU_DEVICE_COMPILE` macro is defined when compiling device code. The `GPU_OVERLOAD_DEVICE` and `GPU_OVERLOAD_HOST` macros handle the differences in behavior between CUDA and HIP for [overloading based on host and device code](https://llvm.org/docs/CompileCudaWithLLVM.html#overloading-based-on-host-and-device-attributes). 
nvcc does not allow overloading based on the `__host__` and `__device__` attributes with the same function signature, while hipcc allows it.\n\nExample usage:\n\n```cpp\n__host__ __device__ void func()\n{\n#ifdef GPU_DEVICE_COMPILE\n    // Device code\n#else\n    // Host code\n#endif\n}\n\n#ifdef GPU_OVERLOAD_HOST\n__host__ void foo()\n{\n    // Host code\n}\n#endif\n#ifdef GPU_OVERLOAD_DEVICE\n__device__ int foo()\n{\n    // Device code\n}\n#endif\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyosh-matsuda%2Fgpu-ptr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyosh-matsuda%2Fgpu-ptr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyosh-matsuda%2Fgpu-ptr/lists"}