{"id":26578769,"url":"https://github.com/williamzhang20/cuda-practice","last_synced_at":"2025-03-23T05:16:33.475Z","repository":{"id":283920758,"uuid":"937705775","full_name":"WilliamZhang20/Cuda-Practice","owner":"WilliamZhang20","description":"Exercises in CUDA","archived":false,"fork":false,"pushed_at":"2025-03-23T03:09:55.000Z","size":100,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T03:26:03.174Z","etag":null,"topics":["cuda","n-body-problem"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/WilliamZhang20.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-23T17:54:06.000Z","updated_at":"2025-03-23T03:09:58.000Z","dependencies_parsed_at":"2025-03-23T03:26:06.417Z","dependency_job_id":"72adc3b1-242f-4b62-911f-26d9dbf48fbd","html_url":"https://github.com/WilliamZhang20/Cuda-Practice","commit_stats":null,"previous_names":["williamzhang20/cuda-practice"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WilliamZhang20%2FCuda-Practice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WilliamZhang20%2FCuda-Practice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WilliamZhang20%2FCuda-Practice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/WilliamZhang20%2FCuda-Practice/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/WilliamZhang20","download_url":"https://codeload.github.com/WilliamZhang20/Cuda-Practice/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245056907,"owners_count":20553856,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","n-body-problem"],"created_at":"2025-03-23T05:16:32.693Z","updated_at":"2025-03-23T05:16:33.456Z","avatar_url":"https://github.com/WilliamZhang20.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CUDA Practice\n\nA collection of exercises for learning \u0026 practicing parallel algorithms in CUDA.\n\nImplemented in the course \"Getting Started with Accelerated Computing in CUDA C/C++\".\n\nContains a final project to engineer a GPU-accelerated simulation of gravitationally-induced motion between many bodies, aka the N-Body Problem.\n\nSee section below on [the final project](#the-final-project).\n\n## Contents:\n\nSection 1: Introduction to CUDA\n- Heat Conduction\n- Matrix Multiplication\n- Adding two vectors with strides\n\nSection 2: Unified Memory\n- Adding 2 very large vectors\n- Fast implementation of SAXPY\n\nSection 3: Streaming \u0026 Visual Profiling\n- Initializing memory with CUDA streams\n- Solving the n-body problem with 4096 \u0026 65536 bodies.\n\n## The Final Project\n\nThe final assignment was to simulate the n-body problem on a GPU, covering force, velocity, \u0026 position computations, by accelerating a given CPU-only code with the CUDA device.\n\nThe overall mathematical procedure was 
simple - first, the forces are calculated in 3D space using Newton's gravitational laws from each mass to every other mass.\n\nOnce that is done, we use the calculated forces to obtain the new positions for all masses.\n\nThe development was iterative \u0026 profile-driven, with each version profiled in Nsight Systems to identify bottlenecks \u0026 memory usage patterns.\n\nIn my first successful version, only the force calculations were parallelized. While this \"passed\" the threshold, it was inefficient: each of the 10 iterations involved a transfer between host \u0026 device.\n\nIn my second version, I restricted memory transfers to before the first iteration and after the last one, and the interaction computation was entirely parallelized. This meant that position integration had to be made into another kernel, which took the device address of the array as an argument.\n\nIn my third version, I tried incorporating streams to accelerate further, since they were introduced in the same section. However, that is clearly not the right solution, since it adds unnecessary overhead by implementing the same logic across several sections of memory when they all access the same data and there is no branch divergence.\n\nHere is what the NSys profile looks like for the best version by far:\n- The green at the start and pink at the end are memory transfers.\n- The big blue blocks in the middle are the force calculations - they can still be accelerated with streams.\n- Between the big blue blocks, one can notice a few ticks. These are the position integration steps. 
\n\n![image](https://github.com/user-attachments/assets/804c3d50-5cb5-47e2-b913-24e233d151a9)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliamzhang20%2Fcuda-practice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwilliamzhang20%2Fcuda-practice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliamzhang20%2Fcuda-practice/lists"}