{"id":50664340,"url":"https://github.com/yosh-matsuda/gpu-array","last_synced_at":"2026-06-08T05:01:12.942Z","repository":{"id":231977506,"uuid":"778811411","full_name":"yosh-matsuda/gpu-array","owner":"yosh-matsuda","description":"Maximum GPU performance with Modern C++ syntax. RAII and Range-based abstraction to GPU memory management and data layouts, enabling code safety and performance optimization with zero overhead.","archived":false,"fork":false,"pushed_at":"2026-06-07T07:35:12.000Z","size":155,"stargazers_count":3,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-07T09:19:44.633Z","etag":null,"topics":["cpp","cpp20","cuda","gpu","header-only","hip"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yosh-matsuda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-03-28T13:03:07.000Z","updated_at":"2026-06-07T07:35:15.000Z","dependencies_parsed_at":"2024-05-03T06:42:55.660Z","dependency_job_id":"d8ed9d69-d25f-4494-a82d-7a3ff28ea13f","html_url":"https://github.com/yosh-matsuda/gpu-array","commit_stats":null,"previous_names":["yosh-matsuda/gpu-ptr","yosh-matsuda/gpu-array"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/yosh-matsuda/gpu-array","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-array","tags_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-array/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-array/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-array/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yosh-matsuda","download_url":"https://codeload.github.com/yosh-matsuda/gpu-array/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yosh-matsuda%2Fgpu-array/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34048682,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","cpp20","cuda","gpu","header-only","hip"],"created_at":"2026-06-08T05:00:51.364Z","updated_at":"2026-06-08T05:01:12.911Z","avatar_url":"https://github.com/yosh-matsuda.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gpu-array: Make GPU programming more modern C++ friendly\n\ngpu-array is a header-only C++20 library that brings RAII and Range-based abstractions to GPU memory management and data layouts, enabling code safety and performance optimizations with zero overhead. 
By abstracting away raw pointers and memory layouts, gpu-array allows developers to focus on algorithm logic rather than resource bookkeeping.\n\nMaximum GPU performance with Modern C++ syntax.\n\n[![Tests](https://github.com/yosh-matsuda/gpu-array/actions/workflows/tests.yml/badge.svg)](https://github.com/yosh-matsuda/gpu-array/actions/workflows/tests.yml)\n\n## Features\n\n*   Smart pointer-like wrappers:\n    *   Full RAII (Resource Acquisition Is Initialization) support for GPU memory management, ensuring automatic cleanup.\n*   Performance-Oriented Memory Layouts:\n    *   AoS to SoA Conversion: Converting Array-of-Structures (AoS) to Structure-of-Arrays (SoA) to ensure coalesced memory access for maximum GPU throughput. AoS stores data as contiguous structures, while SoA separates each field into its own array for better memory access patterns.\n    *   Jagged Array Wrappers: Manage multi-dimensional data with varying row lengths using a single, efficient 1-D memory allocation and optimized for cache locality and performance.\n*   C++20 Integration:\n    *   Compatible with modern standards, including ranges and iterator concepts even for GPU kernel code.\n    *   Range adapters for grid-stride access patterns (e.g., block-thread, grid-thread, grid-block, etc.).\n    *   GPU-ready range views for indexing (`views::enumerate`) and lock-step traversal (`views::zip`).\n*   Dual backend:\n    *   Support for NVIDIA CUDA and AMD HIP.\n*   Header-only library and no external dependencies.\n\n### Requirements\n\ngpu-array requires a C++20 compiler and either the CUDA or HIP development toolkit.\n\n\u003cdetails\u003e\n\u003csummary\u003eSupported and tested toolkit/compiler combinations\u003c/summary\u003e\n\nThe following toolkit/compiler combinations are supported and tested:\n\n| Backend | Toolkit | Tested Compiler |\n| --- | --- | --- |\n| CUDA | 12.6.3 | GCC 13, Clang 16-18 |\n| CUDA | 12.8.1 | GCC 13-14, Clang 16-19 |\n| CUDA | 12.9.1 | **Not Supported** |\n| 
CUDA | 13.0.2 | GCC 13-15, Clang 16-20 |\n| CUDA | 13.1.1 | GCC 13-15, Clang 16-21 |\n| CUDA | 13.2.0 | GCC 13-15, Clang 16-21 |\n| ROCm/HIP NVIDIA | 6.2.4 + CUDA 12.8.1 | Clang 18 |\n| ROCm/HIP NVIDIA | 6.4.4 + CUDA 12.8.1 | Clang 18 |\n| ROCm/HIP NVIDIA | 7.0.3 + CUDA 12.8.1 | Clang 18 |\n| ROCm/HIP NVIDIA | 7.1.1 + CUDA 12.8.1 | Clang 18 |\n| ROCm/HIP NVIDIA | 7.2.4 + CUDA 13.2.0 | Clang 18 |\n| ROCm/HIP AMD | 6.2.4 | AMD Clang 18 |\n| ROCm/HIP AMD | 6.4.4 | AMD Clang 19 |\n| ROCm/HIP AMD | 7.0.3 | AMD Clang 20 |\n| ROCm/HIP AMD | 7.1.1 | AMD Clang 21 |\n| ROCm/HIP AMD | 7.2.4 | AMD Clang 22 |\n\n\u003c/details\u003e\n\nCUDA 12.9.1 is not supported because `nvcc` 12.9 is known to segfault while compiling gpu-array tests.\n\nWith ROCm/HIP, do not put `std::ranges` concept constraints directly on `__global__`\nfunction templates. Current ROCm compilers can reject otherwise valid constrained\nkernel launches during overload resolution. Check such constraints in an ordinary\nhost wrapper and launch an unconstrained `__global__` kernel from that wrapper.\n\nThe practical C++ compiler floor is GCC 13 or Clang 16. CUDA's official host\ncompiler tables allow older compiler majors, but gpu-array relies on C++20\nranges and related library support, so older compilers are outside the supported\nrange.\n\n## Quick Start\n\n### Installation\n\nAs a header-only library, you can simply copy the `include` directory to your project. 
If you are using CMake, you can add the following lines to your `CMakeLists.txt`:\n\n```cmake\nadd_subdirectory(path/to/gpu-array)\ntarget_link_libraries(your_target PRIVATE gpu_array::gpu_array)\n```\n\nAlternatively, you can use CMake's `FetchContent` module instead of manually downloading the library:\n\n```cmake\ninclude(FetchContent)\nFetchContent_Declare(\n    gpu_array\n    GIT_REPOSITORY https://github.com/yosh-matsuda/gpu-array.git\n    GIT_TAG v0.4.0\n)\nFetchContent_MakeAvailable(gpu_array)\ntarget_link_libraries(your_target PRIVATE gpu_array::gpu_array)\n```\n\n### Example: Device memory management with smart pointers\n\ngpu-array provides several smart pointer-like classes to manage GPU memory, including `array` and `managed_array` for arrays with range concepts, and `value` and `managed_value` for single value pointers.  \nThese classes automatically handle memory allocation and deallocation on the GPU. The `managed_` variants use unified memory, allowing seamless access from both host and device.\n\n```cpp\n#include \u003ccooperative_groups.h\u003e\n#include \u003cgpu_array.hpp\u003e\n#include \u003ciostream\u003e\n\nusing namespace gpu_array;\n\n// Example kernel: initialize all elements\ntemplate \u003cstd::ranges::input_range T\u003e\n__global__ void kernel(T array)\n{\n    for (auto\u0026 v : array | views::grid_thread_stride)\n        v += 1;\n}\n\nvoid example()\n{\n    // Allocate managed (or unmanaged) memory for 1024 integers\n    auto array = managed_array\u003cint\u003e(1024);\n\n    // Launch kernel to set values\n    kernel\u003c\u003c\u003c1, 128\u003e\u003e\u003e(array);\n\n    // Wrapper for cudaDeviceSynchronize/hipDeviceSynchronize\n    api::gpuDeviceSynchronize();\n\n    // Print results\n    for (const auto\u0026 v: array) std::cout \u003c\u003c v \u003c\u003c \" \";\n}\n```\n\ngpu-array also provides safe initialization for memory allocation. 
For memory accessible only from the GPU (`array` and `value`), it checks at compile time that the type is safe for `memcpy` and initializes the allocated memory using the specified method. For unified memory accessible from both the GPU and CPU (`managed_array` and `managed_value`), it constructs each element by calling its constructor.\n\n### Example: Conversion from host to device memory and vice versa\n\nThe array and value classes can be easily converted from and to C++ containers (e.g., `std::vector`, `std::array`). The data is copied from host to device during construction.\n\n```cpp\n#include \u003cgpu_array.hpp\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_array;\n\nvoid example()\n{\n    // Create vector on host\n    auto vec = std::vector\u003cint\u003e(100);\n    for (auto i = 0; auto\u0026 v: vec) v = i++;\n\n    // Convert from host vector to device array\n    auto array = managed_array(vec);\n\n    // Call kernel to perform operations on GPU\n    // ...\n\n    // Convert from device array to host vector\n    vec = array.to\u003cstd::vector\u003e();\n}\n```\n\n### Example: Grid-stride range adapters\n\nThe kernel code can utilize C++20 range views for grid-stride access patterns (the so-called grid-stride loop). gpu-array provides several [Range Adaptor Closure Objects](https://en.cppreference.com/w/cpp/named_req/RangeAdaptorClosureObject.html) such as `views::block_thread_stride`, `views::grid_thread_stride`, and `views::grid_block_stride` to facilitate this without any overhead. 
The following example demonstrates how to achieve memory coalescing when initializing nested arrays using grid-stride access.\n\n```cpp\n#include \u003cgpu_array.hpp\u003e\n#include \u003ciostream\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_array;\n\n// Example kernel: initialize nested array\ntemplate \u003cstd::ranges::input_range T\u003e\nrequires std::ranges::input_range\u003cstd::ranges::range_value_t\u003cT\u003e\u003e\n__global__ void kernel_example(T array)\n{\n    for (auto\u0026 a : array | views::grid_block_stride)\n    {\n        for (auto\u0026 v : a | views::block_thread_stride)\n        {\n            v *= 2;\n        }\n    }\n\n    /* The above code is equivalent to the following:\n    using namespace cooperative_groups;\n    const auto grid = this_grid();\n    const auto block = this_thread_block();\n    for (auto i = grid.block_rank(); i \u003c array.size(); i += grid.num_blocks())\n    {\n        for (auto j = block.thread_rank(); j \u003c array[i].size(); j += block.size())\n        {\n            array[i][j] *= 2;\n        }\n    }\n    */\n}\n\nvoid example()\n{\n    // Create nested vector on host\n    auto vec_vec = std::vector(256, std::vector\u003cint\u003e(1024, 1));\n\n    // Convert from nested host vector to nested device array\n    auto nested_array = managed_array(vec_vec);\n\n    // Launch kernel to initialize nested array\n    kernel_example\u003c\u003c\u003c32, 256\u003e\u003e\u003e(nested_array);\n    api::gpuDeviceSynchronize();\n\n    // Print results\n    for (const auto\u0026 inner_array : nested_array)\n    {\n        for (const auto\u0026 v : inner_array) std::cout \u003c\u003c v \u003c\u003c \" \";\n        std::cout \u003c\u003c std::endl;\n    }\n}\n```\n\nThe view adapters can also be composed with `views::enumerate` and `views::zip`. Use `views::enumerate` when the original index is part of the computation, and `views::zip` when multiple ranges should be traversed in lock step. 
Apply stride adapters after these sized views to distribute the resulting elements across GPU threads.\n\n```cpp\n// Use indices inside a grid-stride loop\n__global__ void kernel_with_index(managed_array\u003cint\u003e array)\n{\n    for (auto\u0026\u0026 [i, v] : array | views::enumerate | views::grid_thread_stride)\n    {\n        v += static_cast\u003cint\u003e(i);\n    }\n}\n\n// Traverse two arrays together; iteration stops at the shorter size\n__global__ void add_kernel(managed_array\u003cint\u003e lhs, managed_array\u003cint\u003e rhs)\n{\n    for (auto\u0026\u0026 [x, y] : views::zip(lhs, rhs) | views::grid_thread_stride)\n    {\n        x += y;\n    }\n}\n```\n\n### Example: AoS and SoA\n\ngpu-array supports both Array of structures (AoS) and Structure of arrays (SoA) for memory layout optimization via the `array` and `structure_of_arrays` classes, respectively. The memory layout comparison between `array` (AoS) and `structure_of_arrays` (SoA) is as follows:\n\n![array of structure vs. structure_of_arrays](https://github.com/user-attachments/assets/219085eb-80c7-44e5-9e3b-6607bd8174bf)\n\nIn either case, gpu-array provides a structure retrieval interface via array indices. Thus, `structure_of_arrays\u003ctuple-derived\u003e` can be used as a drop-in replacement for `array\u003ctuple-derived\u003e` to optimize the memory layout without altering your algorithm's implementation.\n\n```cpp\n#include \u003cgpu_array.hpp\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_array;\n\n// gpu_array::tuple is a lightweight std::tuple-like type for GPU device code.\n// gpu_array::tuple (or std::tuple) or its derived struct can be used as structure type\n// The below example shows a tuple-derived struct with three members and their accessors\ntemplate \u003ctypename... 
Ts\u003e\nrequires (sizeof...(Ts) == 3)\nstruct CustomTuple : public tuple\u003cTs...\u003e\n{\n    using tuple\u003cTs...\u003e::tuple;\n    using tuple\u003cTs...\u003e::operator=;\n    __host__ __device__ auto\u0026 get_a() { return get\u003c0\u003e(*this); }\n    __host__ __device__ auto\u0026 get_b() { return get\u003c1\u003e(*this); }\n    __host__ __device__ auto\u0026 get_c() { return get\u003c2\u003e(*this); }\n};\nusing Struct = CustomTuple\u003cint, float, double\u003e;\n\n// Example kernel: process both AoS and SoA\ntemplate \u003ctypename T\u003e\n__global__ void kernel_example(T array)\n{\n    for (auto\u0026\u0026 v : array | views::grid_thread_stride)\n    {\n        // Access structure members for both AoS and SoA\n        v.get_a() *= 2;\n        v.get_b() *= 2.0f;\n        v.get_c() *= 2.0;\n    }\n}\n\nvoid example()\n{\n    // Create vector of structures\n    auto vec = std::vector\u003cStruct\u003e(100, {1, 2.0f, 3.0});\n\n    // Array of structures (AoS): single array for entire structure\n    auto aos = managed_array\u003cStruct\u003e(vec);\n    kernel_example\u003c\u003c\u003c1, 32\u003e\u003e\u003e(aos);\n    api::gpuDeviceSynchronize();\n\n    // Structure of arrays (SoA): multiple arrays for each member internally\n    auto soa = managed_structure_of_arrays\u003cStruct\u003e(vec);\n    kernel_example\u003c\u003c\u003c1, 32\u003e\u003e\u003e(soa);\n    api::gpuDeviceSynchronize();\n}\n```\n\n\u003e [!TIP]\n\u003e Which is better, AoS or SoA? It depends on the access pattern to the structure members within a warp, the size of the structure, and the number of available registers. If all threads in a warp access the same member of the structure, SoA is generally better for maximizing coalesced memory access. However, if each thread accesses entire structures or different members, AoS may be more efficient due to better cache locality. 
Benchmarking both layouts with representative workloads is recommended to determine the optimal choice for your specific use case. Either way, **gpu-array makes it easy to switch between AoS and SoA without changing the access interface**.\n\n### Example: Jagged array\n\ngpu-array provides the `jagged_array` class to manage multi-dimensional arrays with varying row lengths, using a **single memory allocation to maximize coalesced access**.\nThis behaves like a wrapper for `managed_array` or `managed_structure_of_arrays` with multi-dimensional indexing. A `jagged_array` is constructed either from a 1-D array together with the row sizes or from a multi-dimensional container (e.g., `std::vector\u003cstd::vector\u003cT\u003e\u003e`).\n\nThe logical and physical data layout of `jagged_array` is as follows:\n\n![data layout of jagged array](https://github.com/user-attachments/assets/7773537d-7259-4d2c-a695-8572906a6057)\n\n```cpp\n#include \u003cgpu_array.hpp\u003e\n#include \u003ciostream\u003e\n#include \u003cvector\u003e\n\nusing namespace gpu_array;\n\n// Example kernel: modify all elements\ntemplate \u003cstd::ranges::input_range T\u003e\n__global__ void kernel(T array)\n{\n    for (auto\u0026 v : array | views::grid_thread_stride)\n        v *= 2;\n}\n\nauto vec = std::vector\u003cint\u003e{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};\nauto vec_vec = std::vector\u003cstd::vector\u003cint\u003e\u003e{{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}};\n\nvoid example()\n{\n    // Create jagged array from nested std::vector\n    auto jarray = jagged_array\u003cmanaged_array\u003cint\u003e\u003e(vec_vec);\n    // Equivalent to the above:\n    // auto jarray = jagged_array\u003cmanaged_array\u003cint\u003e\u003e({1, 2, 3, 4}, vec);\n\n    // Launch kernel to re-set values\n    kernel\u003c\u003c\u003c1, 32\u003e\u003e\u003e(jarray);\n    api::gpuDeviceSynchronize();\n\n    // Access each row and each element\n    for (std::size_t i = 0; i \u003c jarray.num_rows(); ++i)\n    {\n        for (std::size_t j = 0; j \u003c 
jarray.size(i); ++j)\n        {\n            std::cout \u003c\u003c jarray[{i, j}] \u003c\u003c \" \";\n        }\n        std::cout \u003c\u003c std::endl;\n    }\n}\n```\n\n### Zero-overhead\n\ngpu-array is designed to have zero-overhead compared to traditional raw pointer usage. The following equivalent kernels are covered by a CUDA-only, release-like PTX/ptxas regression test. For these representative patterns, the normalized PTX instructions and ptxas resource usage match the raw-pointer baseline, so the tested abstractions add no extra loop-body instructions, register pressure, stack usage, or spills.\n\n```cpp\n// Traditional raw pointer kernel\n__global__ void func0(int* data, std::uint32_t size)\n{\n    const auto block = cooperative_groups::this_thread_block();\n    for (std::uint32_t i = block.thread_rank(); i \u003c size; i += block.size())\n    {\n        data[i] = 1;\n    }\n}\n\n// Using managed_array\n__global__ void func1(managed_array\u003cint, std::uint32_t\u003e arr)\n{\n    const auto block = cooperative_groups::this_thread_block();\n    for (std::uint32_t i = block.thread_rank(); i \u003c arr.size(); i += block.size())\n    {\n        arr[i] = 1;\n    }\n}\n\n// Using managed_array with range adapters\n__global__ void func3(managed_array\u003cint, std::uint32_t\u003e arr)\n{\n    for (auto\u0026 v : arr | views::block_thread_stride)\n    {\n        v = 1;\n    }\n}\n```\n\n\u003cdetails\u003e\n\n\u003csummary\u003eA representative PTX assembly body generated for the above kernels is as follows:\u003c/summary\u003e\n\n```text\n// Function Definition: _Z5func3... 
(Mangled C++ name for a template function)\n// It accepts a 'managed_array' struct by value, which is 24 bytes in size.\n.visible .entry _Z5func3IN7gpu_array13managed_arrayIijEEEvT_(\n    .param .align 8 .b8 _Z5func3IN7gpu_array13managed_arrayIijEEEvT__param_0[24]\n)\n{\n    // Register Declarations\n    .reg .pred  %p\u003c3\u003e;   // Predicate registers for logic\n    .reg .b32  %r\u003c17\u003e;  // 32-bit integer registers\n    .reg .b64  %rd\u003c6\u003e;  // 64-bit registers for pointers/addresses\n\n    // --- Extracting values from the Struct Parameter ---\n    // The struct is laid out in memory:\n    // Offset 0: Array Size (32-bit)\n    // Offset 8: Base Pointer (64-bit)\n    ld.param.u64  %rd1, [_Z5func3IN7gpu_array13managed_arrayIijEEEvT__param_0+8]; // Load data pointer\n    ld.param.u32  %r5, [_Z5func3IN7gpu_array13managed_arrayIijEEEvT__param_0];    // Load array size N\n\n    // --- Calculate 1D Thread Index (%r16) ---\n    // Formula: idx = (tid.z * ntid.y + tid.y) * ntid.x + tid.x\n    mov.u32  %r2, %ntid.y;              // %r2 = blockDim.y\n    mov.u32  %r9, %tid.z;               // %r9 = threadIdx.z\n    mov.u32  %r10, %tid.y;              // %r10 = threadIdx.y\n    mad.lo.s32  %r11, %r2, %r9, %r10;   // %r11 = (blockDim.y * threadIdx.z) + threadIdx.y\n\n    mov.u32  %r3, %ntid.x;              // %r3 = blockDim.x\n    mov.u32  %r12, %tid.x;              // %r12 = threadIdx.x\n    mad.lo.s32  %r16, %r11, %r3, %r12;  // %r16 = (%r11 * blockDim.x) + threadIdx.x\n\n    // --- Initial Boundary Check ---\n    setp.ge.u32  %p1, %r16, %r5;        // if (idx \u003e= N)\n    @%p1 bra  $L__BB2_3;                // Exit if initial index is out of bounds\n\n    // --- Calculate Stride (Total Threads in Block) ---\n    // stride = blockDim.x * blockDim.y * blockDim.z\n    mul.lo.s32  %r13, %r3, %r2;         // %r13 = blockDim.x * blockDim.y\n    mov.u32  %r14, %ntid.z;             // %r14 = blockDim.z\n    mul.lo.s32  %r6, %r13, %r14;        // %r6 = total 
threads (stride)\n\n    // Convert generic pointer to global memory pointer\n    cvta.to.global.u64  %rd3, %rd1;     // %rd3 = global(data)\n\n// --- Main Loop: Block-Stride Writing ---\n$L__BB2_2:\n    // Memory address calculation: addr = base + (idx * 4)\n    mul.wide.u32  %rd4, %r16, 4;        // Calculate 64-bit byte offset\n    add.s64  %rd5, %rd3, %rd4;          // Add offset to base pointer\n\n    // Store integer 1 into memory\n    mov.u32  %r15, 1;                   // Value = 1\n    st.global.u32  [%rd5], %r15;        // data[idx] = 1;\n\n    // Increment index by stride\n    add.s32  %r16, %r16, %r6;           // idx += stride;\n\n    // Loop Condition\n    setp.lt.u32  %p2, %r16, %r5;        // if (idx \u003c N)\n    @%p2 bra  $L__BB2_2;                // Repeat if within bounds\n\n$L__BB2_3:\n    ret;                                // Kernel end\n}\n```\n\n\u003c/details\u003e\n\nThe automated `zero_overhead_ptx` CTest also checks `views::grid_thread_stride`, `views::enumerate`, `views::zip`, `views::zip | views::enumerate`, and `value` / `managed_value` dereference. It compares strict normalized opcode sequences where possible, and uses non-prologue opcode profiles when raw pointers and wrapper objects have different parameter-unpacking prologues.\n\n### Tips\n\nTo reduce the number of registers used by the kernel, consider setting `size_type` to `std::uint32_t` instead of `default_size_type (= std::size_t)` when declaring GPU pointer types. For example, use `managed_array\u003cT, std::uint32_t\u003e` when the number of elements is less than 2\u003csup\u003e32\u003c/sup\u003e. To change `default_size_type` to `std::uint32_t`, define the `GPU_USE_32BIT_SIZE_TYPE_DEFAULT` macro before including `gpu_array.hpp`.\n\n### Backend selection\n\nDefine the `ENABLE_HIP` macro to use the HIP backend. By default, the CUDA backend is used. 
You can define this in your CMakeLists.txt or compiler flags.\n\n## Reference\n\nFull API reference is available in [docs/reference.md](docs/reference.md).\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyosh-matsuda%2Fgpu-array","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyosh-matsuda%2Fgpu-array","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyosh-matsuda%2Fgpu-array/lists"}