{"id":21902523,"url":"https://github.com/dendenxu/diff-gaussian-rasterization","last_synced_at":"2025-04-06T00:06:55.304Z","repository":{"id":226405682,"uuid":"740767692","full_name":"dendenxu/diff-gaussian-rasterization","owner":"dendenxu","description":"Improved 3DGS rasterizer.","archived":false,"fork":false,"pushed_at":"2025-02-26T21:43:37.000Z","size":435,"stargazers_count":108,"open_issues_count":2,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-29T23:07:30.130Z","etag":null,"topics":["3dgs","4dgs","nerf","rasterization"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dendenxu.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"license.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-09T02:48:55.000Z","updated_at":"2025-03-21T11:02:45.000Z","dependencies_parsed_at":"2024-03-30T13:29:06.970Z","dependency_job_id":"57ed231c-871c-44c7-b8a4-0b67c6ad3ce4","html_url":"https://github.com/dendenxu/diff-gaussian-rasterization","commit_stats":{"total_commits":127,"total_committers":9,"mean_commits":14.11111111111111,"dds":"0.36220472440944884","last_synced_commit":"4e58369e21e4a287ef61fcafd2f0d0081dcbe62c"},"previous_names":["dendenxu/diff-gaussian-rasterization"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dendenxu%2Fdiff-gaussian-rasterization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dendenxu%2Fdiff-gaussian-rasterization/tags","releases_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/dendenxu%2Fdiff-gaussian-rasterization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dendenxu%2Fdiff-gaussian-rasterization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dendenxu","download_url":"https://codeload.github.com/dendenxu/diff-gaussian-rasterization/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247415967,"owners_count":20935388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["3dgs","4dgs","nerf","rasterization"],"created_at":"2024-11-28T15:19:30.123Z","updated_at":"2025-04-06T00:06:55.279Z","avatar_url":"https://github.com/dendenxu.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Differential Gaussian Rasterization Improved\n\n## Faster Backward Pass\n\nThis is only faster when a large number of semi-transparent or almost transparent Gaussians are rendered, since it may add a small overhead to regular rendering.\n\nThe original backward implementation uses `atomicAdd` on global CUDA memory.\n\nWe further accelerate this process by using the `__shared__` memory of a thread block to store the temporarily accumulated gradients, just as the original does for the Gaussian properties.\n\nNo API change is required for this functionality, and you can check what we changed in [backward.cu](cuda_rasterizer/backward.cu#417).\n\nThe change can be summarized in this pseudo-code:\n\n```c++\n__global__ void __launch_bounds__(BLOCK_X 
* BLOCK_Y)\nrenderCUDA(...) {\n\n    __shared__ float3 s_dL_dmean2D[BLOCK_SIZE]; // allocate shared memory\n    s_dL_dmean2D[block.thread_rank()].x = 0.0f; // fill shared memory with zeros\n    s_dL_dmean2D[block.thread_rank()].y = 0.0f;\n\n    for (int j = 0; !done \u0026\u0026 j \u003c min(BLOCK_SIZE, toDo); j++) { // iterate over the Gaussians that influence this pixel\n        // Compute gradients\n        ...\n\n        // Update gradients w.r.t. 2D mean position of the Gaussian\n        atomicAdd(\u0026s_dL_dmean2D[j].x, dL_dG * dG_ddelx * ddelx_dx);\n        atomicAdd(\u0026s_dL_dmean2D[j].y, dL_dG * dG_ddely * ddely_dy);\n    }\n\n    atomicAdd(\u0026dL_dmean2D[global_id].x, s_dL_dmean2D[block.thread_rank()].x);\n    atomicAdd(\u0026dL_dmean2D[global_id].y, s_dL_dmean2D[block.thread_rank()].y);\n}\n```\n\nIn an effort to make this process even faster, we've also implemented a warp-reduction based version of the backward pass on top of the `__shared__` memory optimization.\n\nGradients are first summed within a 32-thread warp using:\n\n```c++\n__device__ float warpReduceSum(float value) {\n    auto warp = cg::coalesced_threads();\n    for (int offset = warp.size() / 2; offset \u003e 0; offset /= 2) {\n        value += warp.shfl_down(value, offset);\n    }\n    return value;\n}\n```\n\nThe warp sums are then aggregated into `__shared__` memory:\n\n```c++\n...\n\t\t\t// Use a single thread from each warp to perform block level reduction\n\t\t\tif (block.thread_rank() % warp.size() == 0) {\n\t\t\t\tfor (int ch = 0; ch \u003c C; ch++) {\n\t\t\t\t\tatomicAdd(\u0026(s_dL_dcolors[ch * BLOCK_SIZE + j]), w_dL_dcolors[ch]);\n\t\t\t\t}\n\t\t\t\tatomicAdd(\u0026(s_dL_ddepths[j]), w_dL_ddepths);\n\t\t\t\tatomicAdd(\u0026s_dL_dmean2D[j].x, w_dL_dmean2D.x);\n\t\t\t\tatomicAdd(\u0026s_dL_dmean2D[j].y, w_dL_dmean2D.y);\n\t\t\t\tatomicAdd(\u0026s_dL_dconic2D[j].x, w_dL_dconic2D.x);\n\t\t\t\tatomicAdd(\u0026s_dL_dconic2D[j].y, w_dL_dconic2D.y);\n\t\t\t\tatomicAdd(\u0026s_dL_dconic2D[j].w, 
w_dL_dconic2D.w);\n\t\t\t\tatomicAdd(\u0026(s_dL_dopacity[j]), w_dL_dopacity);\n\t\t\t}\n...\n```\n\nThis shaves another 2-3 ms off the backward pass at the start of training, but curiously the speedup does not persist through the whole training process.\n\nThus, by default, only the `__shared__` memory optimization is enabled.\n\nNote: this seems slower... See: https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell\n\n## Tile-Based Culling\n\nFollowing [StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering](https://github.com/r4dl/StopThePop-Rasterization), we borrow its tile-based culling scheme to reduce the computational cost during training and rendering.\n\nThis section of code is directly adapted from their repository.\n\n```c++\n...\n    constexpr float alpha_threshold = 1.0f / 255.0f;\n    const float opacity_power_threshold = log(conic_opacity[idx].w / alpha_threshold);\n    glm::vec2 max_pos;\n    const glm::vec2 tile_min = {x * BLOCK_X, y * BLOCK_Y};\n    const glm::vec2 tile_max = {(x + 1) * BLOCK_X - 1, (y + 1) * BLOCK_Y - 1};\n    float max_opac_factor = max_contrib_power_rect_gaussian_float\u003cBLOCK_X-1, BLOCK_Y-1\u003e(conic_opacity[idx], points_xy[idx], tile_min, tile_max, max_pos);\n\n    if (max_opac_factor \u003e opacity_power_threshold) {\n        continue;\n    }\n...\n```\n\nNote: this seems slower...\n\n## Tile-Mask Rendering\n\n**Note: this API hasn't been fully tested yet.**\n\nWe additionally provide an interface for adding a tile mask to the Gaussian rasterizer.\n\nIt turns out that the tile-based rasterization pipeline can easily be masked to produce a patch-like rendering result (to simulate a NeRF-like ray sampling approach).\n\nTo implement this as efficiently as possible, we:\n\n1. Mark points that are not to be rendered as early as possible in the `preprocessCUDA` kernel.\n2. 
Make all subsequent operations faster by not including masked-out tiles in the sorting and `renderCUDA` kernel.\n\nThe tile mask can be defined as:\n\n```python\nimport torch\nfrom diff_gauss import GaussianRasterizationSettings, GaussianRasterizer\nraster_settings = GaussianRasterizationSettings(...)\nrasterizer = GaussianRasterizer(raster_settings=raster_settings)\n\n# One boolean entry per tile, rounding the tile counts up\nBLOCK_X, BLOCK_Y = 16, 16\ntile_height, tile_width = (raster_settings.image_height + BLOCK_Y - 1) // BLOCK_Y, (raster_settings.image_width + BLOCK_X - 1) // BLOCK_X\ntile_mask = torch.ones((tile_height, tile_width), dtype=torch.bool, device='cuda')\n\nrendered_image, rendered_depth, rendered_alpha, radii = rasterizer(\n    means3D = means3D,\n    means2D = means2D,\n    shs = shs,\n    colors_precomp = colors_precomp,\n    opacities = opacity,\n    scales = scales,\n    rotations = rotations,\n    cov3D_precomp = cov3D_precomp,\n    tile_mask = tile_mask,\n)\n```\n\n## Fixed `ImageState` Buffer Size\n\nIn the [original implementation](https://github.com/graphdeco-inria/diff-gaussian-rasterization), the size of the `ranges` member of the struct `ImageState` was too large (same as the number of pixels).\n\nIn reality, only `number of tiles` entries of `ranges` are needed, as `ranges` stores the start and end indices of the Gaussian splats in the `GeometryState` buffer.\n\nWe fix this by simply replacing the memory allocation of `ImageState` with:\n\n```c++\nCudaRasterizer::ImageState CudaRasterizer::ImageState::fromChunk(char*\u0026 chunk, size_t N, size_t M)\n{\n\tImageState img;\n\tobtain(chunk, img.n_contrib, N, 128);\n\tobtain(chunk, img.ranges, M, 128);\n\treturn img;\n}\n```\n\n## Fixed Culling\n\nThe [original repository](https://github.com/graphdeco-inria/diff-gaussian-rasterization)'s implementation for view-space culling wasn't effective (no points were culled).\n\nWe fixed that with an improved OpenGL-like culling function:\n\n```c++\n__forceinline__ __device__ bool in_frustum(int idx,\n\tconst 
float* orig_points,\n\tconst float* viewmatrix,\n\tconst float* projmatrix,\n\tbool prefiltered,\n\tfloat3\u0026 p_view, // reference\n\tconst float padding = 0.01f, // padding in ndc space // TODO: add api for changing this\n\tconst float xy_padding = 0.5f // padding in ndc space // TODO: add api for changing this\n\t)\n{\n\tfloat3 p_orig = { orig_points[3 * idx], orig_points[3 * idx + 1], orig_points[3 * idx + 2] };\n\tp_view = transformPoint4x3(p_orig, viewmatrix); // write this outside\n\tif (prefiltered) return true;\n\n\t// Bring points to screen space\n\tfloat4 p_hom = transformPoint4x4(p_orig, projmatrix);\n\tfloat p_w = 1.0f / (p_hom.w + 0.0000001f);\n\tfloat3 p_proj = { p_hom.x * p_w, p_hom.y * p_w, p_hom.z * p_w };\n\n\treturn (p_proj.z \u003e -1 - padding) \u0026\u0026 (p_proj.z \u003c 1 + padding) \u0026\u0026 (p_proj.x \u003e -1 - xy_padding) \u0026\u0026 (p_proj.x \u003c 1. + xy_padding) \u0026\u0026 (p_proj.y \u003e -1 - xy_padding) \u0026\u0026 (p_proj.y \u003c 1. + xy_padding);\n}\n```\n\n## Depth \u0026 Alpha Backward\n\n**Note: this functionality is directly copied from the [slothfulxtx repository](https://github.com/slothfulxtx/diff-gaussian-rasterization).**\n\nIn addition to the RGB image, we also support rendering a depth map and an alpha map (both forward and backward passes), unlike the [original repository](https://github.com/graphdeco-inria/diff-gaussian-rasterization).\n\nWe renamed the package to **diff_gauss** to avoid a dependency conflict with the original version. 
You can install our repo by executing the following command lines\n\nHere's an example of using our modified differential Gaussian rasterization repo:\n```python\nfrom diff_gauss import GaussianRasterizationSettings, GaussianRasterizer\nraster_settings = GaussianRasterizationSettings(...)\nrasterizer = GaussianRasterizer(raster_settings=raster_settings)\n\nrendered_image, rendered_depth, rendered_alpha, radii = rasterizer(\n    means3D = means3D,\n    means2D = means2D,\n    shs = shs,\n    colors_precomp = colors_precomp,\n    opacities = opacity,\n    scales = scales,\n    rotations = rotations,\n    cov3D_precomp = cov3D_precomp\n)\n```\n\nDetails: By default, the depth is calculated as 'median depth', where the depth value of each pixel covered by a 3D Gaussian is set to the depth of that Gaussian's center. Thus, numerical errors arise when the scales of the 3D Gaussians are large. However, thanks to the densification scheme, most 3D Gaussians are small. Currently, we ignore the numerical error in the depth maps.\n\n## Differential Gaussian Rasterization\n\n**Note: this is the original readme for the [original diff-gaussian-rasterization repository](https://github.com/graphdeco-inria/diff-gaussian-rasterization).**\n\nUsed as the rasterization engine for the paper \"3D Gaussian Splatting for Real-Time Rendering of Radiance Fields\". 
If you can make use of it in your own research, please be so kind to cite us.\n\n\u003csection class=\"section\" id=\"BibTeX\"\u003e\n  \u003cdiv class=\"container is-max-desktop content\"\u003e\n    \u003ch2 class=\"title\"\u003eBibTeX\u003c/h2\u003e\n    \u003cpre\u003e\u003ccode\u003e@Article{kerbl3Dgaussians,\n      author       = {Kerbl, Bernhard and Kopanas, Georgios and Leimk{\\\"u}hler, Thomas and Drettakis, George},\n      title        = {3D Gaussian Splatting for Real-Time Radiance Field Rendering},\n      journal      = {ACM Transactions on Graphics},\n      number       = {4},\n      volume       = {42},\n      month        = {July},\n      year         = {2023},\n      url          = {https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/}\n}\u003c/code\u003e\u003c/pre\u003e\n  \u003c/div\u003e\n\u003c/section\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdendenxu%2Fdiff-gaussian-rasterization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdendenxu%2Fdiff-gaussian-rasterization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdendenxu%2Fdiff-gaussian-rasterization/lists"}