{"id":19871810,"url":"https://github.com/mattdean1/cuda","last_synced_at":"2026-03-27T03:38:40.682Z","repository":{"id":77088297,"uuid":"121798637","full_name":"mattdean1/cuda","owner":"mattdean1","description":"An implementation of parallel exclusive scan in CUDA","archived":false,"fork":false,"pushed_at":"2018-02-23T15:09:16.000Z","size":33,"stargazers_count":62,"open_issues_count":0,"forks_count":23,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-14T05:07:18.131Z","etag":null,"topics":["cpp","nvidia-cuda","parallel-programming"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mattdean1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-02-16T20:43:03.000Z","updated_at":"2025-02-25T17:22:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"c8a1d1d2-641c-4e6a-9ba3-74ccb1b413ee","html_url":"https://github.com/mattdean1/cuda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mattdean1/cuda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mattdean1%2Fcuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mattdean1%2Fcuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mattdean1%2Fcuda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mattdean1%2Fcuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mattdean1","download_url":
"https://codeload.github.com/mattdean1/cuda/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mattdean1%2Fcuda/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263475937,"owners_count":23472489,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","nvidia-cuda","parallel-programming"],"created_at":"2024-11-12T16:13:38.305Z","updated_at":"2026-03-27T03:38:35.634Z","avatar_url":"https://github.com/mattdean1.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parallel Prefix Sum (Scan) with CUDA \n\nMy implementation of parallel exclusive scan in CUDA, following [this NVIDIA paper](http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/scan/doc/scan.pdf).\n\n\u003eParallel prefix sum, also known as parallel Scan, is a useful building block for many\nparallel algorithms including sorting and building data structures. In this document\nwe introduce Scan and describe step-by-step how it can be implemented efficiently\nin NVIDIA CUDA. We start with a basic naïve algorithm and proceed through\nmore advanced techniques to obtain best performance. We then explain how to\nscan arrays of arbitrary size that cannot be processed with a single block of threads. 
\n\nThis implementation can handle very large, arbitrary-length vectors thanks to the [recursively defined scan function](https://github.com/mattdean1/cuda/blob/master/parallel-scan/scan.cu#L105).\n\nPerformance is further improved by a memory bank-conflict avoidance optimization (BCAO).\n\n---\n\nSee the [timings](https://github.com/mattdean1/cuda/blob/master/parallel-scan/Submission.cu#L616) for a performance comparison between:\n  1. Sequential scan run on the CPU\n  2. Parallel scan run on the GPU\n  3. Parallel scan with BCAO\n  \nFor a vector of 10 million entries:\n\n\t  CPU      : 20749 ms\n\t  GPU      : 7.860768 ms\n\t  GPU BCAO : 4.304064 ms\n    \n    Intel Core i5-4670K @ 3.4 GHz, NVIDIA GeForce GTX 760\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmattdean1%2Fcuda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmattdean1%2Fcuda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmattdean1%2Fcuda/lists"}