{"id":15634618,"url":"https://github.com/patwie/cuda-design-patterns","last_synced_at":"2025-04-14T00:37:47.241Z","repository":{"id":81506076,"uuid":"157919970","full_name":"PatWie/cuda-design-patterns","owner":"PatWie","description":"Some CUDA design patterns and a bit of template magic for CUDA","archived":false,"fork":false,"pushed_at":"2023-06-03T16:58:52.000Z","size":98,"stargazers_count":150,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-27T14:52:22.989Z","etag":null,"topics":["bazel","cpp11","cuda","cuda-development","cuda-device","cuda-kernels","cuda-utils","gpu","template-metaprogramming"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PatWie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-16T20:51:37.000Z","updated_at":"2025-03-23T03:47:57.000Z","dependencies_parsed_at":"2024-10-23T03:17:42.190Z","dependency_job_id":null,"html_url":"https://github.com/PatWie/cuda-design-patterns","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PatWie%2Fcuda-design-patterns","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PatWie%2Fcuda-design-patterns/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PatWie%2Fcuda-design-patterns/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PatWie%2Fcuda-design-patte
rns/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PatWie","download_url":"https://codeload.github.com/PatWie/cuda-design-patterns/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248803806,"owners_count":21164122,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bazel","cpp11","cuda","cuda-development","cuda-device","cuda-kernels","cuda-utils","gpu","template-metaprogramming"],"created_at":"2024-10-03T10:54:25.382Z","updated_at":"2025-04-14T00:37:47.221Z","avatar_url":"https://github.com/PatWie.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CUDA Design Patterns\n\nSome best practices I have collected over the last years while writing CUDA kernels. These functions\ndo not dictate how to use CUDA; they just simplify your workflow. I am not a big fan of libraries that rename things via wrappers. 
All code below adds real benefits to CUDA programming.\n\n## CUDA Boilerplate Code\n\n[EXAMPLE](./src/multiply/multiply_gpu.cu.cc)\n\n**Description:**\nAvoid plain CUDA kernel functions and instead pack them into a struct.\n\n\n```cpp\ntemplate \u003ctypename ValueT\u003e\nstruct MyKernel : public cuda::Kernel {\n  void Launch(cudaStream_t stream = 0) {\n    cuda::Run\u003c\u003c\u003c1, 1, 0, stream\u003e\u003e\u003e(*this);\n  }\n  __device__ __forceinline__ void operator()() const override {\n    printf(\"hi from device code with value %f\\n\", val);\n  }\n\n  ValueT val;\n};\n\nMyKernel\u003cfloat\u003e kernel;\nkernel.val = 42.f;\nkernel.Launch();\n```\n\n**Reasons:**\n\n- This allows much better organization of the parameters used. We recommend\nwriting them at the end of the struct, so that they are always visible\nwhile writing the CUDA kernel itself.\n- These structs can contain or compute the launch configuration (grid, block, shm size) depending on the parameters.\n- Multiple kernel launches require less code, as we do not need to type out all parameters over and over again for a second or third launch.\n\n\n## Functors\n\n[EXAMPLE](./src/multiply.cc)\n\n**Description:**\nUse templated `structs` to switch seamlessly between CPU and GPU code:\n\n```cpp\nMultiply\u003cfloat, CpuDevice\u003e::Apply(A, B, 2, 2, C); // run CPU\nMultiply\u003cfloat, GpuDevice\u003e::Apply(A, B, 2, 2, C); // run GPU\nMultiply\u003cfloat\u003e::Apply(A, B, 2, 2, C); // run GPU if available else on CPU\n```\n\n**Reasons:**\n\n- Switching between different devices is straightforward.\n- Understanding unit tests which compare and verify the output becomes easier.\n\n## Shared Memory\n\n[EXAMPLE](./src/sharedmemory.cu.cc)\n\nUse\n\n```cpp\ncuda::SharedMemory shm;\nfloat* floats_5 = shm.ref\u003cfloat\u003e(5);\nint* ints_3 = shm.ref\u003cint\u003e(3);\n```\n\ninstead of\n\n```cpp\nextern __shared__ char* shm[];\nfloat* val1 = 
reinterpret_cast\u003cfloat*\u003e(\u0026shm[0]); // 5 floats\nint* val2 = reinterpret_cast\u003cint*\u003e(\u0026shm[5]); // 3 ints\n```\n\n\n**Reasons:**\n\n- The number of values of specific data types to read should be on the same line as the declaration. This way, adding additional shared memory becomes easier during development.\n\n## CUDA Kernel Dispatcher\n\n[EXAMPLE](./src/tune.cu.cc)\n\nLike in the *CUDA Boilerplate Code* example we pack our kernels into structs. For different hyper-parameters we use template specialization.\n\nGiven a generic CUDA kernel and a specialization\n\n```cpp\ntemplate \u003ctypename ValueT, int BLOCK_DIM_X\u003e\nstruct MyKernel : public cuda::Kernel {};\n\ntemplate \u003ctypename ValueT\u003e\nstruct MyKernel\u003cValueT, 4\u003e : public cuda::Kernel {};\n```\n\nwe use the kernel dispatcher\n\n```cpp\nMyKernel\u003cfloat, 4\u003e kernelA;\nMyKernel\u003cfloat, 8\u003e kernelB;\n\ncuda::KernelDispatcher\u003cint\u003e dispatcher(true);\ndispatcher.Register\u003cMyKernel\u003cfloat, 4\u003e\u003e(3); // for length up to 3 (inclusive) start MyKernel\u003cfloat, 4\u003e\ndispatcher.Register\u003cMyKernel\u003cfloat, 8\u003e\u003e(6); // for length up to 6 (inclusive) start MyKernel\u003cfloat, 8\u003e\n                                            // as `dispatcher(true)` this kernel will handle all\n                                            // larger values as well\nint i = 4;         // a runtime value\ndispatcher.Run(i); // triggers `kernelB`\n```\n\nThe dispatcher can also handle multi-dimensional values and an initializer\n\n```cpp\nstruct Initializer {\n  template \u003ctypename T\u003e\n  void operator()(T* el) {\n    el-\u003eval = 42.f;\n  }\n};\nInitializer init;\ncuda::KernelDispatcher\u003cstd::tuple\u003cint, int\u003e\u003e disp(true);\ndisp.Register\u003cExpertKernel2D\u003cfloat, 4, 3\u003e\u003e(std::make_tuple(4, 3), init);\ndisp.Register\u003cExpertKernel2D\u003cfloat, 8, 4\u003e\u003e(std::make_tuple(9, 4), 
init);\n```\n\n**Reasons:**\n\n- Changing the block dimensions affects performance. A templated CUDA kernel can execute special implementations for different hyper-parameters.\n- A switch statement dispatching run-time variables into a templated instantiation requires code duplication, which can be avoided by the dispatcher.\n\n## CUDA Index Calculation\n\n[EXAMPLE](./src/deprecated_examples.cu_old)\n\nDo not compute indices by hand; where appropriate, use\n\n```cpp\n// or even ...\n// Used 8 registers, 368 bytes cmem[0]\n__global__ void readme_alternative2(float *src, float *dst,\n                                    int B, int H, int W, int C,\n                                    int b, int h, int w, int c) {\n  auto src_T = NdArray(src, B, H, W, C);\n  auto dst_T = NdArray(dst, B, H, W, C);\n  dst_T(b, h, w, c + 1) = src_T(b, h, w, c);\n\n  // Unflatten the index.\n  auto index = NdIndex\u003c4\u003e(B, H, W, C);\n  size_t flattened_index = index(b, h, w, c);\n\n  int b_=0, h_=0, w_=0, c_=0;\n  index.unflatten(flattened_index, b_, h_, w_, c_);\n}\n```\n\ninstead of\n\n```cpp\n// spot the bug\n// Used 6 registers, 368 bytes cmem[0]\n__global__ void readme_normal(float *src, float *dst,\n                              int B, int H, int W, int C,\n                              int b, int h, int w, int c) {\n  const int pos1 = b * (H * W * C) + h * (W * c) + w * (C) + c;\n  const int pos2 = b * (H * W * C) + h * (W * C) + w * (C) + (c + 1);\n  dst[pos2] = src[pos1];\n}\n```\n\n**Reasons**:\n\n- It is time-consuming and not worthwhile to concern yourself with index calculations. 
When writing CUDA code, you usually have many other vital things to ponder.\n- Each additional character increases the chance of a bug!\n- **I'm sick and tired of manually typing the indices.**\n- NdArray can have a positive impact on the number of used registers.\n\n**Cons:**\n\n- The compiler might not be able to optimize the `NdArray` overhead \"away\".\n- NdArray can have a negative impact on the number of used registers.\n\n## CMake Setup\n\n**Description:**\nUse CMake to configure which targets should be built. By default set `TEST_CUDA=ON` and `WITH_CUDA=OFF`.\nThe workflow (for this repository) is:\n\n```bash\nmkdir build \u0026\u0026 cd build\ncmake -DCMAKE_BUILD_TYPE=Release ..\n# or more specific\ncmake -DCMAKE_BUILD_TYPE=Release -DTEST_CUDA=ON -DCUDA_ARCH=\"52 60\" ..\nmake\nmake test\n```\n\n**Reasons:**\n\n-  Most CIs do not have a CUDA runtime installed. Whenever `WITH_CUDA=ON` is activated, the test code for CUDA will also be built.\n-  FindCUDA might be more robust than a custom makefile.\n\n## Benchmark Kernels\n\n[EXAMPLE](./src/benchmark-multiply.cu.cc)\n\n**Description:**\nLike in the *CUDA Boilerplate Code* example we pack our kernels into structs. 
We might want to benchmark different template arguments.\n\n```cpp\ncuda::KernelBenchmark\u003cint\u003e bench;\nbench.Case\u003cmultiply_kernels::Multiply\u003cfloat, 4\u003e\u003e(init);\nbench.Case\u003cmultiply_kernels::Multiply\u003cfloat, 6\u003e\u003e(init);\nbench.Case\u003cmultiply_kernels::Multiply\u003cfloat, 8\u003e\u003e(init);\nbench.Case\u003cmultiply_kernels::Multiply\u003cfloat, 16\u003e\u003e(init);\nbench.Case\u003cmultiply_kernels::Multiply\u003cfloat, 32\u003e\u003e(init);\nbench.Start();\n```\n\nwill give the output:\n\n```\nUsing Device Number: 0\n  Device name: GeForce GTX 970\n  Memory Clock Rate (KHz): 3505000\n  Memory Bus Width (bits): 256\n  Peak Memory Bandwidth (GB/s): 224.320000\n\ntime 500.000000 - 1000.000000, iters: 5 - 100\n - multiply_kernels::Multiply\u003cfloat, 4\u003e    took     2.826743 ms stats(iters: 100, var:     0.067757, stddev:     0.260302)\n - multiply_kernels::Multiply\u003cfloat, 6\u003e    took     1.245100 ms stats(iters: 100, var:     0.019352, stddev:     0.139112)\n - multiply_kernels::Multiply\u003cfloat, 8\u003e    took     0.574468 ms stats(iters: 100, var:     0.000003, stddev:     0.001616)\n - multiply_kernels::Multiply\u003cfloat, 16\u003e   took     0.502195 ms stats(iters: 100, var:     0.000002, stddev:     0.001380)\n - multiply_kernels::Multiply\u003cfloat, 32\u003e   took     0.510635 ms stats(iters: 100, var:     0.000001, stddev:     0.001121)\n\n```\n\n## Tools\n- [online CUDA calculator](http://cuda.patwie.com/) instead of the NVIDIA Excel-sheet\n- [nvprof2json](https://github.com/PatWie/nvprof2json) to visualize NVIDIA profiling outputs in Google Chrome Browser (no dependencies compared to NVIDIA 
nvvp)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatwie%2Fcuda-design-patterns","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpatwie%2Fcuda-design-patterns","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatwie%2Fcuda-design-patterns/lists"}