{"id":18726543,"url":"https://github.com/chrischoy/pytorch-custom-cuda-tutorial","last_synced_at":"2025-04-06T09:10:46.429Z","repository":{"id":83216975,"uuid":"96352183","full_name":"chrischoy/pytorch-custom-cuda-tutorial","owner":"chrischoy","description":"Tutorial for building a custom CUDA function for Pytorch","archived":false,"fork":false,"pushed_at":"2019-01-25T22:20:17.000Z","size":17,"stargazers_count":512,"open_issues_count":2,"forks_count":51,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-03-30T08:11:16.510Z","etag":null,"topics":["python","pytorch","pytorch-backend","tutorial","wrapper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrischoy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-07-05T19:02:57.000Z","updated_at":"2025-03-03T12:38:47.000Z","dependencies_parsed_at":"2023-03-01T21:00:54.123Z","dependency_job_id":null,"html_url":"https://github.com/chrischoy/pytorch-custom-cuda-tutorial","commit_stats":{"total_commits":18,"total_committers":1,"mean_commits":18.0,"dds":0.0,"last_synced_commit":"f63acd4c7695ad652e33063fdf252a3084e164ac"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrischoy%2Fpytorch-custom-cuda-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrischoy%2Fpytorch-custom-cuda-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrischoy%2
Fpytorch-custom-cuda-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrischoy%2Fpytorch-custom-cuda-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chrischoy","download_url":"https://codeload.github.com/chrischoy/pytorch-custom-cuda-tutorial/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247457803,"owners_count":20941906,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","pytorch","pytorch-backend","tutorial","wrapper"],"created_at":"2024-11-07T14:14:46.066Z","updated_at":"2025-04-06T09:10:46.414Z","avatar_url":"https://github.com/chrischoy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pytorch Custom CUDA kernel Tutorial\n\nThis repository contains a tutorial code for making a custom CUDA function for\npytorch. The code is based on the pytorch [C extension\nexample](https://github.com/pytorch/extension-ffi).\n\n**Disclaimer**\n\n- 2019/01/02: I wrote **[another up-to-date tutorial](https://github.com/chrischoy/MakePytorchPlusPlus)** on how to make a pytorch C++/CUDA extension with a Makefile. Associate git page is on **[https://github.com/chrischoy/MakePytorchPlusPlus](https://github.com/chrischoy/MakePytorchPlusPlus)**\n- 2018/12/09: Pytorch CFFI is now deprecated in favor of [C++ extension](https://pytorch.org/tutorials/advanced/cpp_extension.html) from pytorch v1.0.\n\n`This tutorial was written when pytorch did not support broadcasting sum. 
Now that it does, you probably don't need to write your own broadcasting sum function, but you can still follow the tutorial to build your own custom layer with a custom CUDA kernel.`\n\nIn this repository, we will build a simple CUDA-based broadcasting sum\nfunction. When this tutorial was written, pytorch did not support [broadcasting\nsum](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html), so we\nhad to expand a tensor manually, for example with `expand_as`, which makes a new\ntensor and takes additional memory and computation.\n\nFor example,\n\n```python\nimport torch\n\na = torch.randn(3, 5)\nb = torch.randn(3, 1)\n# Without broadcasting support, the following line gives an error\n# a += b\n\n# Expand b to have the same dimension as a\nb_like_a = b.expand_as(a)\na += b_like_a\n```\n\nIn this tutorial, we will build a function that can compute `a += b` without\nexplicitly expanding `b`.\n\n```python\nmathutil.broadcast_sum(a, b, *map(int, a.size()))\n```\n\n## Make a CUDA kernel\n\nFirst, let's make a CUDA kernel that adds `b` to `a` without making a copy of the tensor `b`.\n\n```cuda\n__global__ void broadcast_sum_kernel(float *a, float *b, int x, int y, int size)\n{\n    int i = (blockIdx.x + blockIdx.y * gridDim.x) * blockDim.x + threadIdx.x;\n    if(i \u003e= size) return;\n    int j = i % x; i = i / x;\n    int k = i % y;\n    a[IDX2D(j, k, y)] += b[k];\n}\n```\n\n## Make a C wrapper\n\nOnce you have made a CUDA kernel, you have to wrap it with C code. However, we are not using the pytorch backend yet. 
Note that the inputs are already device pointers.\n\n\n```c++\nvoid broadcast_sum_cuda(float *a, float *b, int x, int y, cudaStream_t stream)\n{\n    int size = x * y;\n    cudaError_t err;\n\n    broadcast_sum_kernel\u003c\u003c\u003ccuda_gridsize(size), BLOCK, 0, stream\u003e\u003e\u003e(a, b, x, y, size);\n\n    err = cudaGetLastError();\n    if (cudaSuccess != err)\n    {\n        fprintf(stderr, \"CUDA kernel failed : %s\\n\", cudaGetErrorString(err));\n        exit(-1);\n    }\n}\n```\n\n## Connect Pytorch backends with the C Wrapper\n\nNext, we have to connect the pytorch backend with our C wrapper. You can expose the device pointer using the function `THCudaTensor_data`. The pointers `a` and `b` are device pointers (on GPU).\n\n\n```c++\nextern THCState *state;\n\nint broadcast_sum(THCudaTensor *a_tensor, THCudaTensor *b_tensor, int x, int y)\n{\n    float *a = THCudaTensor_data(state, a_tensor);\n    float *b = THCudaTensor_data(state, b_tensor);\n    cudaStream_t stream = THCState_getCurrentStream(state);\n\n    broadcast_sum_cuda(a, b, x, y, stream);\n\n    return 1;\n}\n```\n\n## Make a python wrapper\n\nNow that we built the cuda function and a pytorch function, we need to expose the function to python so that we can use the function in python.\n\nWe will first build a shared library using `nvcc`.\n\n```shell\nnvcc ... 
-o build/mathutil_cuda_kernel.so src/mathutil_cuda_kernel.cu\n```\n\nThen, we will use the pytorch `torch.utils.ffi.create_extension` function, which automatically adds the appropriate headers and builds a python-loadable shared library.\n\n```python\nfrom torch.utils.ffi import create_extension\n\n...\n\nffi = create_extension(\n    'mathutils',\n    headers=[...],\n    sources=[...],\n    ...\n)\n\nffi.build()\n```\n\n\n## Test!\n\nFinally, we can test our function by building it.\nI left many details out of this readme, but the repository contains a working example.\n\n```shell\ngit clone https://github.com/chrischoy/pytorch-cffi-tutorial\ncd pytorch-cffi-tutorial\nmake\n```\n\n## Note\n\nThe function only takes `THCudaTensor`, which is `torch.FloatTensor().cuda()` in python.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrischoy%2Fpytorch-custom-cuda-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrischoy%2Fpytorch-custom-cuda-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrischoy%2Fpytorch-custom-cuda-tutorial/lists"}