An open API service indexing awesome lists of open source software.

https://github.com/pc2/omp-offloading


https://github.com/pc2/omp-offloading

Last synced: about 1 year ago
JSON representation

Awesome Lists containing this project

README

          

# Introduction

The directories in this repository contain code examples for the course of
OpenMP GPU-offloading at Paderborn Center for Parallel Computing (PC²),
Paderborn University. The sub-directories are generally organized as:

* src: source code
* docs: documentation
* tests: some tests

Some highlights of the codes in this repository:

* The performance of our `saxpy` implemented by using OpenMP GPU-offloading is
as good as `cublasSaxpy` in CUBLAS. See `case 7` in `05_saxpy/src/asaxpy.c`
for details.

* The GPU shared memory has not been standardized in OpenMP API Specification
(Version 5.0 Nov. 2018). To optimize the performance of matrix multiplication
by using OpenMP GPU-offloading, i) `case 6` in `10_matMul/src/matMulAB.c`
implements a register blocking algorithm and ii) `case 8` in the same source
code file implements a common GPU-based tiled algorithm by blocking the local
shared memory in a very tricky manner and the OpenMP code resembles CUDA.

# List of Projects

* 00_build_OpenMP_offload

Documentation and scripts for building GCC as well as Clang/LLVM with OpenMP
support for Nvidia GPU offloading.

* 01_accelQuery

`accelQuery` searches accelerator(s) on a heterogeneous computer.
Accelerator(s), if found, will be enumerated with some basic info.

* 02_dataTransRate

`dataTransRate` gives the data transfer rate (in MB/sec) from `src` to `dst`.

The possible situations are:

* h2h: `src` = host and `dst` = host
* h2a: `src` = host and `dst` = accel
* a2a: `src` = accel and `dst` = accel

NOTE:

* A bug in Clang 9.0.1 has been fixed in Clang 11.
* The data transfer rata for `a2a` is still lower than our expectation.

* 03_taskwait

`taskwait` checks the `taskwait` construct for the deferred target task.

NOTE:

* Asynchronous offloading hasn't been implemented in the GCC 9.2 compiler.
* Asynchronous offloading is available in Clang 11.

* 04_scalarAddition

`scalarAddition` adds two integers on host and accelerator, and also compares
the performance.

* 05_saxpy

`saxpy` performs the `saxpy` operation on host as well as accelerator.
The performance (in MB/s) for different implementations is also compared.

* 08_distThreads

`distThreads` demonstrates the organization of threads and teams in a league
on GPU.

* 09_matAdd

`matAdd` performs matrix addition (A +=B) in single-precision on GPU. The
performance (in GB/s) for different implementations is compared and the
numerical results are also verified.

* 10_matMul

`matMul` performs matrix multiplication in single-precision on GPU. The
performance (in GFLOPS) for different implementations is compared and the
numerical results are also verified.