MPI+Kokkos/OpenACC/OpenMP4.5/stdpar implementation of vlp4d
https://github.com/yasahi-hpc/vlp4d_mpi

# About

The vlp4d code solves the Vlasov-Poisson equations in 4D (2D in space, 2D in velocity). From the numerical point of view, vlp4d is based on a semi-Lagrangian scheme. The Vlasov solver uses a directional Strang splitting, and the Poisson equation is treated with 2D Fourier transforms. For the sake of simplicity, all directions are, for the moment, handled with periodic boundary conditions. As a major update from [the previous version](https://github.com/yasahi-hpc/vlp4d), we parallelized the code with MPI and upgraded the interpolation scheme from Lagrange to spline.

The Vlasov solver is based on the following advection operators:
- Halo exchange on the local distribution function (P2P communications)
- Compute the spline coefficients along (x, y)
- 2D advection along (x, y) for a half time step
- Poisson solver -> compute the electric fields Ex and Ey
- Compute the spline coefficients along (vx, vy)
- 4D advection along the (x, y, vx, vy) directions for a full time step

Detailed descriptions of the test cases can be found in
- [Crouseilles & al. J. Comput. Phys., 228, pp. 1429-1446, (2009).](http://people.rennes.inria.fr/Nicolas.Crouseilles/loss4D.pdf)
Section 5.3.1 Two-dimensional Landau damping -> SLD10
- [Crouseilles & al. Communications in Nonlinear Science and Numerical Simulation, pp 94-99, 13, (2008).](http://people.rennes.inria.fr/Nicolas.Crouseilles/cgls2.pdf)
Sections 2 and 3, Two-stream instability and Beam focusing problem -> TSI20
- [Crouseilles & al. Beam Dynamics Newsletter no 41 (2006).](http://icfa-bd.kek.jp/Newsletter41.pdf )
Section 3.3, Beam focusing problem.

For questions or comments, please contact the authors listed in the AUTHORS file.

# HPC
From the viewpoint of high performance computing (HPC), the code is parallelized with MPI + "X", where "X" is one of mixed OpenMP3.0/OpenACC, mixed OpenMP3.0/OpenMP4.5, Kokkos, and C++ parallel algorithms (experimental). We have investigated optimization strategies applicable to a kinetic plasma simulation code that makes use of the MPI + "X" implementations listed above. The details are presented at the [P3HPC workshop 2021](https://p3hpc.org/workshop/2021/). Our previous results for the [non-MPI version](https://github.com/yasahi-hpc/vlp4d) can be found in
- [Yuuichi Asahi, Guillaume Latu, Virginie Grandgirard, and Julien Bigot, "Performance Portable Implementation of a Kinetic Plasma Simulation Mini-app"](https://link.springer.com/chapter/10.1007/978-3-030-49943-3_6), in [Accelerator Programming Using Directives](https://link.springer.com/book/10.1007/978-3-030-49943-3) or in [Proceedings of Sixth Workshop on Accelerator Programming Using Directives (WACCPD), IEEE, 2019](https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_waccpd104s2-file1.pdf).

# Test environments
We have tested the code on the following environments.
- NVIDIA Tesla P100 on Tsubame3.0 (Tokyo Tech, Japan)
  - Compilers: cuda/10.2.48 + openmpi3.1.4 (Kokkos), pgi19.1 + openmpi3.1.4 (OpenACC)
- NVIDIA Tesla V100 on Marconi100 (Cineca, Italy)
  - Compilers: cuda/10.2 + spectrum_mpi10.3.1 (Kokkos), NVIDIA HPC SDK 20.11-0 (OpenACC)
- Intel Skylake on JFRS-1 (IFERC-CSC, Japan)
  - Compilers: Intel compiler 18.0.2
- Fujitsu A64FX on Flow (Nagoya Univ., Japan)
  - Compilers: Fujitsu compiler 1.2.27

# Usage
## Compile
First, you need to clone the repository in your environment:
```
git clone https://github.com/yasahi-hpc/vlp4d_mpi.git
```
Depending on your configuration, you may have to modify the Makefile.
You can add your own configuration in the same way, for example:
```
ifneq (,$(findstring p100,$(DEVICES)))
CXX = mpicxx
CXXFLAGS = -O3 -ta=nvidia:cc60 -Minfo=accel -Mcudalib=cufft,cublas -std=c++11 -DENABLE_OPENACC -DLAYOUT_LEFT
LDFLAGS = -Mcudalib=cufft,cublas -ta=nvidia:cc60 -acc
TARGET = vlp4d.tsubame3.0_p100_openacc
endif
```
Before compiling, you need to load the appropriate modules for an MPI + CUDA/OpenACC/OpenMP4.5 environment.
CUDA-aware MPI is necessary for this application.
For the CPU versions, it is also necessary to make sure that [fftw](http://www.fftw.org) is available in your configuration.
The OpenMP4.5 and stdpar versions are experimental and do not appear in the workshop paper.
For the OpenMP4.5 and stdpar versions, we have only tested with ```nvc++``` from the Nvidia HPC SDK.
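As an illustration, on Marconi100 a setup along the following lines might be used (module names are taken from the "Test environments" section above and are site-specific; adjust them to your system). The last command is an OpenMPI-specific way to check whether the MPI library was built with CUDA support:
```
# Illustrative module setup; module names and versions are site-specific
module load cuda/10.2 spectrum_mpi/10.3.1

# For OpenMPI-based MPI libraries, CUDA-aware support can be checked with
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```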

### OpenACC version
```
export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "bdw", "knl", "skx", "a64fx"
cd src_openacc
make
```

### OpenMP4.5 version
This is an experimental version (it does not appear in the workshop paper).
```
export DEVICE=device_name # choose the device_name from "v100", "a100"
cd src_openmp4.5
make
```

### Kokkos version
First of all, you need to install Kokkos in your environment. Instructions can be found at https://github.com/kokkos/kokkos. In the following example, it is assumed that Kokkos is located at "your_kokkos_path".
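If Kokkos is not available yet, a minimal sketch of obtaining it for this Makefile-based build (which uses the Kokkos source tree directly) could look like the following:
```
# Clone the Kokkos source tree; this directory plays the role of "your_kokkos_path"
git clone https://github.com/kokkos/kokkos.git
export KOKKOS_PATH=$PWD/kokkos
```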

```
export KOKKOS_PATH=your_kokkos_path # set your_kokkos_path
export OMPI_CXX=$KOKKOS_PATH/bin/nvcc_wrapper # Assuming OpenMPI as the MPI library (GPU only)
export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "bdw", "skx", "a64fx"
cd src_kokkos
make
```

### C++ parallel algorithm (stdpar) version
This is an experimental version (it does not appear in the workshop paper). Performance tests have been made on an A100 GPU. Further optimizations are needed for this version.
```
export DEVICE=device_name # choose the device_name from "p100", "v100", "a100", "icelake"
cd src_stdpar
make
```
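For reference, GPU offload of C++ parallel algorithms with ```nvc++``` is typically enabled with flags along the following lines (a sketch only; the file name is a placeholder and the actual flags are defined in the Makefile under ```src_stdpar```):
```
# Illustrative nvc++ invocation for stdpar offload to an A100 (cc80)
nvc++ -O3 -std=c++17 -stdpar=gpu -gpu=cc80 -c example.cpp
```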

## Run
Depending on your configuration, you may have to modify ```job.sh``` in ```wk``` and ```sub_*.sh``` in ```wk/batch_scripts```.

```
cd wk
./job.sh
```

## Experiment workflow
In order to evaluate the impact of optimizations, one has to compile the code on each environment.
Here, ```device_name``` is the name of the device used in the Makefile and job scripts. We assume that the installation has already been completed successfully. The impact of optimizations can be evaluated by comparing the standard output of the different versions.
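One simple way to compare two optimization levels is to grep the timing lines (see "Expected result" below) out of the standard output of each run; the output file names here are hypothetical:
```
# Hypothetical file names; use whatever standard output files your batch system produces
grep -E "MainLoop|advec4D|splinecoeff" run_STEP0.out run_STEP1.out
```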

### OpenACC version
```
export DEVICE=device_name
export OPTIMIZATION=STEP1 # choose from STEP0-2 for CPUs and choose from STEP0-1 for GPUs
cd src_openacc
make
cd ../wk
./job.sh
```

### OpenMP4.5 version
This is an experimental version (it does not appear in the workshop paper).
It seems important to map GPUs to MPI processes before calling ```MPI_Init```; a minimal sketch of such a wrapper is shown after the build commands below. See [wrapper.sh](https://github.com/yasahi-hpc/vlp4d_mpi/blob/master/wk/batch_scripts/wrapper.sh) and [sub_Wisteria_A100_omp4.5.sh](https://github.com/yasahi-hpc/vlp4d_mpi/blob/master/wk/batch_scripts/sub_Wisteria_A100_omp4.5.sh).
```
export DEVICE=device_name
export OPTIMIZATION=STEP1 # choose from STEP0-1 for GPUs
cd src_openmp4.5
make
cd ../wk
./job.sh
```
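One way to map GPUs before ```MPI_Init``` is a small wrapper script around the executable. The sketch below assumes an OpenMPI-style launcher (```OMPI_COMM_WORLD_LOCAL_RANK``` is set by OpenMPI's ```mpirun```; other MPI implementations use different variable names, and the script actually used by this repository is the ```wrapper.sh``` linked above):
```
#!/bin/bash
# Bind each MPI process to a single GPU before the application (and MPI_Init) starts.
# Usage (illustrative): mpirun -np 4 ./wrapper.sh ./your_executable args...
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
```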

### Kokkos version
```
export DEVICE=device_name
export OMPI_CXX=$KOKKOS_PATH/bin/nvcc_wrapper # Only for the OpenMPI + GPU case
export OPTIMIZATION=STEP1 # choose from STEP0-2 for CPUs and choose from STEP0-1 for GPUs
cd src_kokkos
make
cd ../wk
./job.sh
```

### C++ parallel algorithm (stdpar) version
This is an experimental version (it does not appear in the workshop paper).
As with the OpenMP4.5 version, it is recommended to map GPUs to MPI processes before calling ```MPI_Init```. See [sub_Wisteria_A100_stdpar.sh](https://github.com/yasahi-hpc/vlp4d_mpi/blob/master/wk/batch_scripts/sub_Wisteria_A100_stdpar.sh).

```
export DEVICE=device_name
cd src_stdpar
make
cd ../wk
./job.sh
```

### Expected result
If the code works correctly, the timings are printed at the bottom of the standard output file (ASCII format).
They look like the following (though the columns are not aligned in the actual output):

| Kernel | Total elapsed time [s] | Number of calls |
| ---- | ---- | ---- |
| total | 4.57123 | 1 |
| MainLoop | 4.56027 | 40 |
| pack | 0.14395 | 40 |
| comm | 0.705633 | 40 |
| unpack | 0.0556184 | 40 |
| advec2D | 0.258894 | 40 |
| advec4D | 1.38773 | 40 |
| field | 0.0474864 | 80 |
| all\_reduce | 0.116563 | 80 |
| Fourier | 0.0296469 | 80 |
| diag | 0.0992476 | 40 |
| splinecoeff\_xy | 0.805345 | 40 |
| splinecoeff\_vxvy | 0.907955 | 40 |

Each row gives the kernel name, the total elapsed time in seconds, and the number of calls.
The elapsed time ```s``` of a given kernel for a single iteration can be computed as
```
total elapsed time [s] / number of calls
```
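For example, for the ```MainLoop``` entry in the sample output above:
```
# Per-iteration elapsed time of MainLoop from the sample timings above
echo "scale=6; 4.56027 / 40" | bc   # roughly 0.114 s per iteration
```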

The Flops and memory bandwidth are computed by the following formulas
```
Flops = Nf/s,
Bytes/s = Nb/s
```
where ```f``` and ```b``` denote the amount of floating point operations and memory accesses per grid point, respectively, ```N``` represents the total number of grid points, and ```s``` is the elapsed time of a given kernel for a single iteration. The values of ```f``` and ```b``` presented in Table V of the paper (Section VI) are analytical estimates from the source code.
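As a purely illustrative example with hypothetical numbers (the actual per-grid-point estimates ```f``` and ```b``` are those of Table V of the paper), the Flops of a kernel could be evaluated as follows:
```
# All values below are placeholders for illustration only
N=$((128*128*128*128))   # total number of grid points, e.g. a 128^4 grid
f=100                    # floating point operations per grid point (placeholder)
s=0.114                  # elapsed time of the kernel for a single iteration [s]
echo "$N $f $s" | awk '{printf "GFlops = %.2f\n", $1*$2/($3*1e9)}'
```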