https://github.com/tornikeo/sample-openmp-in-cuda
Sample of using OpenMP and CUDA: single GPU, multiple CPU
- Host: GitHub
- URL: https://github.com/tornikeo/sample-openmp-in-cuda
- Owner: tornikeo
- Created: 2025-03-14T10:57:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-17T10:49:11.000Z (over 1 year ago)
- Last Synced: 2025-07-14T10:22:52.147Z (12 months ago)
- Topics: cuda, meson, openmp
- Language: Cuda
- Homepage:
- Size: 87.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Minimal OpenMP + CUDA sample in C++
This sample shows how to use OpenMP with CUDA: multiple CPU threads batch their requests so that they query a single GPU at the same time. In short:
1. Parallel CPU threads create jobs
2. Main thread concatenates the jobs and sends them to the GPU
3. GPU execution finishes
4. Main thread splits the job results back out to the threads
5. Parallel CPU threads finalize their jobs
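The five steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual source: the names `row_sum_kernel`, `run_batched_row_sums`, `kRowLen`, and `CUDA_CHECK` are hypothetical, each job is assumed to be a row of 32 integers (as in the sample output further down), and the kernel simply sums each row with one GPU thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int kRowLen = 32;  // assumed job size: 32 ints per row

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "CUDA error: %s\n",                 \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// One GPU thread per row; each thread sums kRowLen consecutive ints.
__global__ void row_sum_kernel(const int* in, int* out, int n_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    int sum = 0;
    for (int i = 0; i < kRowLen; ++i) sum += in[row * kRowLen + i];
    out[row] = sum;
}

// Runs steps 1-5 for n_jobs jobs; returns one sum per job (-1 if a check failed).
std::vector<int> run_batched_row_sums(int n_jobs) {
    if (n_jobs <= 0) return {};
    std::vector<std::vector<int>> jobs(n_jobs);
    std::vector<int> sums(n_jobs, -1);
    std::vector<int> ok(n_jobs, 0);  // int, not bool: vector<bool> packs bits

    #pragma omp parallel
    {
        // Step 1: CPU threads create jobs in parallel (job j is a row of j's).
        #pragma omp for
        for (int j = 0; j < n_jobs; ++j) jobs[j].assign(kRowLen, j);
        // The implicit barrier after "omp for" guarantees all jobs exist here.

        #pragma omp master
        {
            // Step 2: the main thread concatenates all jobs into one buffer.
            std::vector<int> flat;
            flat.reserve(jobs.size() * kRowLen);
            for (const auto& job : jobs)
                flat.insert(flat.end(), job.begin(), job.end());

            int* d_in = nullptr;
            int* d_out = nullptr;
            CUDA_CHECK(cudaMalloc(&d_in, flat.size() * sizeof(int)));
            CUDA_CHECK(cudaMalloc(&d_out, n_jobs * sizeof(int)));
            CUDA_CHECK(cudaMemcpy(d_in, flat.data(), flat.size() * sizeof(int),
                                  cudaMemcpyHostToDevice));

            const int block = 128;
            const int grid = (n_jobs + block - 1) / block;
            row_sum_kernel<<<grid, block>>>(d_in, d_out, n_jobs);
            CUDA_CHECK(cudaGetLastError());

            // Steps 3-4: this blocking copy waits for the kernel to finish,
            // then places result j where job j's owner will read it.
            CUDA_CHECK(cudaMemcpy(sums.data(), d_out, n_jobs * sizeof(int),
                                  cudaMemcpyDeviceToHost));
            CUDA_CHECK(cudaFree(d_in));
            CUDA_CHECK(cudaFree(d_out));
        }
        #pragma omp barrier  // "master" has no implied barrier, so add one

        // Step 5: CPU threads finalize (here: verify) their jobs in parallel.
        #pragma omp for
        for (int j = 0; j < n_jobs; ++j) ok[j] = (sums[j] == j * kRowLen);
    }

    for (int j = 0; j < n_jobs; ++j)
        if (!ok[j]) sums[j] = -1;
    return sums;
}
```

Under these assumptions, `run_batched_row_sums(8)` yields the sums 0, 32, 64, ..., 224. The work-sharing `omp for` loops are used instead of indexing by thread ID so that the sketch stays correct even if the runtime provides fewer threads than jobs; without OpenMP enabled the pragmas are ignored and the same code runs serially.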
Refer to the [starter repo](https://github.com/tornikeo/minimal-vscode-cuda-meson) for instructions on setting this up with VS Code + Meson.
# Compile and run this
```sh
meson setup builddir
meson compile -C builddir
meson test -C builddir --verbose
```
The test run should print output similar to the following (the thread order varies between runs):
```sh
# ...
Start
Device Number: 0
Device name: NVIDIA GeForce GTX 1050 Ti with Max-Q Design
Memory Clock Rate (KHz): 3504000
Memory Bus Width (bits): 128
Peak Memory Bandwidth (GB/s): 112.128000
Thread 0 is working...
Thread 7 is working...
Thread 2 is working...
Thread 5 is working...
Thread 6 is working...
Thread 3 is working...
Thread 1 is working...
Thread 4 is working...
Numbers going into GPU:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
Launching GPU...
We expect the kernel to sum the values in each row...
Thread 0 verified sum: 0 ✅
Thread 1 verified sum: 32 ✅
Thread 2 verified sum: 64 ✅
Thread 7 verified sum: 224 ✅
Thread 4 verified sum: 128 ✅
Thread 6 verified sum: 192 ✅
Thread 5 verified sum: 160 ✅
Thread 3 verified sum: 96 ✅
# ...
```
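The `Peak Memory Bandwidth` figure in the output above is consistent with the common estimate of twice the memory clock (for double data rate) times the bus width in bytes: 2 × 3,504,000 kHz × (128 / 8) bytes / 10^6 = 112.128 GB/s. The sketch below reproduces that arithmetic. It assumes the sample uses this formula (the printed numbers match it, but the source is not shown here), and the helper names `peak_bandwidth_gbs` and `print_device_bandwidth` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Theoretical peak bandwidth in GB/s from the memory clock (kHz) and the
// bus width (bits). The factor 2.0 accounts for double data rate memory.
inline double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * (bus_width_bits / 8) / 1.0e6;
}

// Queries the two attributes for `device` and prints the estimate.
// Returns false if either query fails.
bool print_device_bandwidth(int device) {
    int clock_khz = 0;
    int bus_bits = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate,
                               device) != cudaSuccess) return false;
    if (cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth,
                               device) != cudaSuccess) return false;
    std::printf("Memory Clock Rate (KHz): %d\n", clock_khz);
    std::printf("Memory Bus Width (bits): %d\n", bus_bits);
    std::printf("Peak Memory Bandwidth (GB/s): %f\n",
                peak_bandwidth_gbs(clock_khz, bus_bits));
    return true;
}
```

This is a theoretical upper bound derived from hardware attributes, not a measured throughput.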
# Prerequisites
- cudatoolkit, cudatoolkit-dev (e.g. from micromamba or conda)
- g++-11 (build-essential)