{"id":17493650,"url":"https://github.com/luraess/parallel-gpu-workshop-juliacon21","last_synced_at":"2025-04-15T11:54:22.645Z","repository":{"id":48585794,"uuid":"383486691","full_name":"luraess/parallel-gpu-workshop-JuliaCon21","owner":"luraess","description":"Solving differential equations in parallel on GPUs - JuliaCon 2021 workshop","archived":false,"fork":false,"pushed_at":"2023-11-14T11:57:32.000Z","size":6349,"stargazers_count":94,"open_issues_count":1,"forks_count":13,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-28T19:45:17.189Z","etag":null,"topics":["gpu","julia","juliacon","pde-solver","workshop"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luraess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-06T13:55:39.000Z","updated_at":"2025-03-04T20:10:36.000Z","dependencies_parsed_at":"2022-08-30T09:00:57.378Z","dependency_job_id":null,"html_url":"https://github.com/luraess/parallel-gpu-workshop-JuliaCon21","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luraess%2Fparallel-gpu-workshop-JuliaCon21","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luraess%2Fparallel-gpu-workshop-JuliaCon21/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luraess%2Fparallel-gpu-workshop-JuliaCon21/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luraess%2Fparallel-gpu-workshop-JuliaCon21/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/luraess","download_url":"https://codeload.github.com/luraess/parallel-gpu-workshop-JuliaCon21/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249067773,"owners_count":21207395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu","julia","juliacon","pde-solver","workshop"],"created_at":"2024-10-19T12:08:57.229Z","updated_at":"2025-04-15T11:54:22.602Z","avatar_url":"https://github.com/luraess.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Solving differential equations in parallel on GPUs\n\n[![Build Status](https://github.com/luraess/parallel-gpu-workshop-JuliaCon21/workflows/CI/badge.svg)](https://github.com/luraess/parallel-gpu-workshop-JuliaCon21/actions)\n\n[**JuliaCon 2021 workshop | Fri, July 23, 10am-1pm ET (16:00-19:00 CEST)**](https://pretalx.com/juliacon2021/talk/review/NDHRHLYN7JZUV88PLFWPL99P93NQVU8K)\n\n[**👀 watch the workshop LIVE recording**](https://www.youtube.com/watch?v=DvlM0w6lYEY)\n\n👉 **Useful notes:**\n- 💡 The [Getting started](#getting-started) will help you set up.\n- ❓ Further interests in solving PDEs with Julia on GPUs\n    - Check out this [online geo-HPC tutorial](https://github.com/luraess/geo-hpc-course)\n    - Visit [EGU21's short course repo](https://github.com/luraess/julia-parallel-course-EGU21)\n\n---\n\nThis workshop covers trendy areas in modern numerical computing with examples from geoscientific applications. 
The physical processes governing natural systems' evolution are often mathematically described as systems of differential equations or partial differential equations (PDEs). Fast and accurate solutions require numerical implementations to leverage modern parallel hardware.\n\n\n# Content\n* [Objectives](#objectives)\n* [About this repository](#about-this-repository)\n* [Getting started](#getting-started) _(not discussed during the workshop)_\n* 👉 [**Workshop material**](#workshop-material)\n* [Extras](#extras) _(optional if time permits)_\n* [Further reading](#further-reading)\n\n\n# Objectives\nThe goal of this workshop is to offer an interactive, hands-on introduction to solving systems of differential equations in parallel on GPUs using the [ParallelStencil.jl] and [ImplicitGlobalGrid.jl] Julia packages. [ParallelStencil.jl] lets you write architecture-agnostic, parallel, high-performance GPU and CPU code for stencil computations, and [ImplicitGlobalGrid.jl] renders its distributed parallelization almost trivial. The resulting codes are fast, short and readable \\[[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19]\\].\n\nWe will use these two Julia packages to design and implement an iterative nonlinear diffusion solver. We will, in a second step, turn the serial CPU solver into a parallel application to run on multiple CPU threads and on GPUs. We will, in a third step, do distributed memory computing, enhancing the nonlinear diffusion solver to execute on multiple CPUs and GPUs. The nonlinear diffusion solver we will work on can be applied to solve the shallow ice approximation (SIA) equations, with applications that predict ice flow dynamics over mountainous Greenland topography (see figure below).\n\n![Greenland ice cap](docs/greenland_1.png)\n\n👉 Visit [EGU21's short course repo](https://github.com/luraess/julia-parallel-course-EGU21) if interested.\n\n**The workshop consists of 3 parts:**\n1. 
[**Part 1**](#part-1---fast-iterative-solvers) - You will learn about accelerating iterative solvers.\n2. [**Part 2**](#part-2---parallel-cpu-and-gpu-computing) - You will port the iterative solver to parallel CPU and GPU execution.\n3. [**Part 3**](#part-3---distributed-computing-on-multiple-cpus-and-gpus) - You will learn how to implement multi-CPU and multi-GPU execution.\n\nBy the end of this workshop, you will:\n- Have high-performance nonlinear PDE GPU solvers;\n- Have a Julia code that achieves performance similar to that of legacy codes (C/CUDA C + MPI);\n- Be able to leverage the computing power of modern GPU-accelerated servers and supercomputers.\n\n⤴️ [_back to content_](#content)\n\n\n# About this repository\nThe workshop repository contains the following folders and items:\n- the [docs](docs) folder contains documentation linked in the [README](README.md);\n- the [scripts](scripts) folder contains the scripts this workshop is about 🎉\n- the [extras](extras) folder contains supporting workshop material (not discussed live during the workshop);\n- the [`Project.toml`](Project.toml) file is the Julia project file, tracking the used packages and enabling a reproducible environment.\n\n\u003e 👉 This repository is an interactive and dynamic source of information related to the workshop.\n\u003e- Check out the [**Discussion**](https://github.com/luraess/parallel-gpu-workshop-JuliaCon21/discussions) tab if you have general comments, ideas to share or for Q\u0026A.\n\u003e- File an [**Issue**](https://github.com/luraess/parallel-gpu-workshop-JuliaCon21/issues) if you encounter any technical problems with the distributed codes.\n\u003e- Interact in an open-minded, respectful and inclusive manner.\n\n⤴️ [_back to content_](#content)\n\n\n# Getting started\n**TL;DR** (assuming you have Julia v1.6 installed).  
Run at shell:\n```\ngit clone https://github.com/luraess/parallel-gpu-workshop-JuliaCon21.git\ncd parallel-gpu-workshop-JuliaCon21/scripts\njulia --project -t4\n\njulia\u003e using Pkg; Pkg.instantiate()\n```\n\n\u003e ⚠️ The workshop will not cover the Getting started steps. These are meant to provide directions to the participant willing to actively try out the examples during the workshop or for Julia newcomers. **It is warmly recommended to perform the Getting started steps before the beginning of the workshop.**\n\nThe detailed steps in the dropdown hereafter will get you started with:\n1. Installing Julia v1.6;\n2. Running the scripts from the workshop repository.\n\n\u003cdetails\u003e\n\u003csummary\u003eCLICK HERE for the getting started steps 🚀\u003c/summary\u003e\n\u003cbr\u003e\n\n## Installing Julia v1.6 (or later)\nCheck you have an active internet connexion and [download Julia v1.6](https://julialang.org/downloads/) for your platform following the install directions provided under **\\[help\\]** if needed.\n\nAlternatively, open a terminal and download the binaries (select the one for your platform):\n```sh\nwget https://julialang-s3.julialang.org/bin/winnt/x64/1.6/julia-1.6.1-win64.exe # Windows\nwget https://julialang-s3.julialang.org/bin/mac/x64/1.6/julia-1.6.1-mac64.dmg # macOS\nwget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.1-linux-x86_64.tar.gz # Linux x86\n```\nThen add Julia to `PATH` (usually done in your `.bashrc`, `.profile`, or `config` file).\n\n### Terminal + external editor\nEnsure you have a text editor with syntax highlighting support for Julia. From within the terminal, type\n```sh\njulia\n```\nto make sure that the Julia REPL (aka terminal) starts.  Exit with `Ctrl-d`.\n\n### VS Code\nIf you'd enjoy a more IDE type of environment, [check out VS Code](https://code.visualstudio.com). 
Follow the [installation directions](https://github.com/julia-vscode/julia-vscode#getting-started) for the [Julia VS Code extension](https://www.julia-vscode.org).\n\n## Running the scripts\nTo get started with the workshop,\n1. Clone (or download the ZIP archive of) the workshop repository ([help here](https://docs.github.com/en/github/creating-cloning-and-archiving-repositories/cloning-a-repository))\n```sh\ngit clone https://github.com/luraess/parallel-gpu-workshop-JuliaCon21.git\n```\n2. Navigate to `parallel-gpu-workshop-JuliaCon21`\n```sh\ncd parallel-gpu-workshop-JuliaCon21\n```\n3. From the terminal, launch Julia with the `--project` flag to read in project environment information such as the packages used\n```sh\njulia --project\n```\n4. Alternatively, from VS Code, follow the [instructions from the documentation](https://www.julia-vscode.org/docs/stable/gettingstarted/) to get started.\n\n### Packages installation\nNow that you have launched Julia, you should be in the [Julia REPL]. You need to ensure all the packages you need are installed before using them. To do so, enter the [Pkg mode](https://docs.julialang.org/en/v1/stdlib/REPL/#Pkg-mode) by typing `]`. Then, `instantiate` the project, which should trigger the download of the packages (`st` lists the package status). Exit the Pkg mode with `Ctrl-c`:\n```julia-repl\njulia\u003e ]\n\n(parallel-gpu-workshop-JuliaCon21) pkg\u003e st\nStatus `~/parallel-gpu-workshop-JuliaCon21/Project.toml`\n    # [...]\n\n(parallel-gpu-workshop-JuliaCon21) pkg\u003e instantiate\n   Updating registry at `~/.julia/registries/General`\n   Updating git-repo `https://github.com/JuliaRegistries/General.git`\n   # [...]\n\njulia\u003e\n```\nTo test your install, go to the [scripts](../scripts) folder and run the [`diffusion_2D_expl.jl`](/scripts/diffusion_2D_expl.jl) code. 
You can execute shell commands from within the [Julia REPL] by first typing `;`:\n```julia-repl\njulia\u003e ;\n\nshell\u003e cd scripts/\n\njulia\u003e include(\"diffusion_2D_expl.jl\")\n```\nRunning this the first time will (pre-)compile the various installed packages and will take some time. Subsequent runs, by executing `include(\"diffusion_2D_expl.jl\")`, should take around 2s.\n\nYou should then see a figure displayed showing the nonlinear diffusion of a quantity `H` after `nt=666` steps:\n\n![](docs/diff2D_expl.png)\n\n## Multi-threading on CPUs\nOn the CPU, multi-threading is made accessible via [Base.Threads]. To make use of threads, Julia needs to be launched with\n```sh\njulia --project -t auto\n```\nwhich will launch Julia with as many threads as there are cores on your machine (including hyper-threaded cores).  Alternatively set\nthe environment variable [JULIA_NUM_THREADS], e.g. `export JULIA_NUM_THREADS=2` to enable 2 threads.\n\n## Running on GPUs\nThe [CUDA.jl] module lets you launch compute kernels on Nvidia GPUs natively from within [Julia]. [JuliaGPU] provides further reading and [introductory material](https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/) about GPU ecosystems within Julia. If you have an Nvidia CUDA capable GPU device, also export the following environment variable prior to installing the [CUDA.jl] package:\n```sh\nexport JULIA_CUDA_USE_BINARYBUILDER=false\n```\n\n## Julia MPI\nThe following steps permit you to install [MPI.jl] on your machine and test it:\n1. Since Julia MPI is a dependency of this Julia project, [MPI.jl] should have been added upon executing the `instantiate` command from within the package manager ([see here](#packages-installation)).\n\n2. Install `mpiexecjl`:\n```julia-repl\njulia\u003e using MPI\n\njulia\u003e MPI.install_mpiexecjl()\n[ Info: Installing `mpiexecjl` to `HOME/.julia/bin`...\n[ Info: Done!\n```\n3. 
Then, one should add `HOME/.julia/bin` to PATH in order to launch the Julia MPI wrapper `mpiexecjl`.\n\n4. Running a Julia MPI code `\u003cmy_script.jl\u003e` on `np` processes:\n```sh\n$ mpiexecjl -n np julia --project \u003cmy_script.jl\u003e\n```\n\n5. To test the Julia MPI installation, launch the [`hello_mpi.jl`](extras/hello_mpi.jl) using the Julia MPI wrapper `mpiexecjl` (located in `~/.julia/bin`) on 4 processes:\n```sh\n$ mpiexecjl -n 4 julia --project extras/hello_mpi.jl\n$ Hello world, I am 0 of 3\n$ Hello world, I am 1 of 3\n$ Hello world, I am 2 of 3\n$ Hello world, I am 3 of 3\n```\n\u003e 💡 Note: On MacOS, you may encounter this issue (https://github.com/JuliaParallel/MPI.jl/issues/407). To fix it, define following `ENV` variable:\n```sh\n$ export MPICH_INTERFACE_HOSTNAME=localhost\n```\n\u003e and add `-host localhost` to the execution script:\n```sh\n$ mpiexecjl -n 4 -host localhost julia --project extras/hello_mpi.jl\n```\n\n\u003cbr\u003e\n\u003c/details\u003e\n\n\u003e 👉 **Note: This workshop was developed and tested on Julia v1.6. It should work with any Julia version ≥v1.6**. 
The install configurations were tested on a MacBook Pro running macOS 10.15.7, a Linux GPU server running Ubuntu 20.04 LTS and a Linux GPU server running CentOS 8.\n\n\n# Workshop material\nThis section lists the material discussed within this 3h workshop:\n* [Part 1 - GPU computing and iterative solvers](#part-1---fast-iterative-solvers)\n    * [Diffusion equation](#diffusion-equation)\n    * [Iterative solvers](#iterative-solvers)\n    * [Performance considerations](#performance-considerations)\n* [Part 2 - Parallel CPU and GPU computing](#part-2---parallel-cpu-and-gpu-computing)\n    * [Parallel CPU implementation](#parallel-cpu-implementation)\n    * [GPU implementation](#gpu-implementation)\n    * [XPU implementation](#xpu-implementation)\n    * [Performance and scaling](#performance-and-scaling)\n* [Part 3 - Distributed computing on multiple CPUs and GPUs](#part-3---distributed-computing-on-multiple-cpus-and-gpus)\n    * [Distributed memory and fake parallelisation](#distributed-memory-and-fake-parallelisation)\n    * [Distributed Julia computing using MPI](#distributed-julia-computing-using-mpi)\n    * [Multi-XPU implementations in 2D](#multi-xpu-implementations-in-2d)\n    * [Advanced features](#advanced-features)\n\n💡 In this workshop we will implement a 2D nonlinear diffusion equation on GPUs in Julia using the finite-difference method and an iterative solving approach.\n\n⤴️ [_back to content_](#content)\n\n## Part 1 - Fast iterative solvers\nIn this first part of the workshop we will implement an efficient implicit iterative and matrix-free solver to solve the time-dependent nonlinear diffusion equation in 2D.\n\n### Diffusion equation\nLet's start with a 2D nonlinear diffusion example to implement both an explicit and iterative implicit PDE solver:\n\n  dH/dt = ∇.(H^3 ∇H)\n\nThe diffusion of a quantity `H` over time `t` can be described as (1a, 1b) a diffusive flux, (1c) a flux balance and (1d) an update rule:\n```md\nqHx   = -H^3*dH/dx         
(1a)\nqHy   = -H^3*dH/dy         (1b)\ndHdt  = -dqHx/dx -dqHy/dy  (1c)\ndH/dt = dHdt               (1d)\n```\nThe [`diffusion_2D_expl.jl`](scripts/diffusion_2D_expl.jl) code implements an iterative and explicit solution of eq. (1) for an initial Gaussian profile:\n```md\nH0 = exp(-(x-lx/2.0)^2 -(y-ly/2.0)^2)\n```\n\n![](docs/diffusion_2D_expl.gif)\n\n\u003e 💡 The animation above was generated using the [`diffusion_2D_expl_gif.jl`](extras/diffusion_2D_expl_gif.jl) script located in [extras](extras).\n\nA simple way to solve nonlinear diffusion, BUT:\n- given the explicit nature of the scheme we have a restrictive limitation on the maximal allowed time step (subject to the CFL stability condition):\n  ```md\n  dt = minimum(min(dx, dy)^2 ./ inn(H).^npow ./ 4.1)\n  ```\n- there might be loss of accuracy since we use an explicit scheme for a nonlinear problem.\n\nSo now you may ask: can we use an implicit algorithm to ensure nonlinear accuracy, side-step the CFL-condition, control the (physically motivated) time steps `dt` _**and**_ keep it \"matrix-free\" ?\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Iterative solvers\nThe [`diffusion_2D_impl.jl`](scripts/diffusion_2D_impl.jl) code implements an iterative, implicit solution of eq. (1). **How ?** We include the physical time derivative `dH/dt=(H-Hold)/dt` in the previous rate of change `dHdt` to define the residual `ResH`\n```md\nResH = -(H-Hold)/dt -dqHx/dx -dqHy/dy = 0\n```\nThis is the backward Euler time-stepping scheme.  We iterate until the values of `ResH` (the residual of the eq. (1)) drop below a defined tolerance level `tol`.\n\nHow do we iterate?  Let's define a \"new\" time, a pseudo-time `tau` and set\n```md\ndH/dtau = ResH\n```\nIf we evolve `H` forward in pseudo-time until it reaches a steady state, we get a `H` which solves the equation `ResH=0`.  
This is known as a *Picard* or fixed-point iteration.\n\nRunning the implementation [`diffusion_2D_impl.jl`](scripts/diffusion_2D_impl.jl) gives:\n\n![](docs/diff2D_impl.png)\n\nIt works, but the \"naive\" _Picard_ iteration count seems to be pretty high (`niter\u003e800`). An efficient way to circumvent this is to add \"damping\" (`damp`) to the (pseudo-time) rate-of-change `dH/dtau`, analogous to adding friction, which enables faster convergence \\[[4][Frankel50]\\]\n```md\ndH/dtau = ResH + damp * (dH/dtau)_prev\n```\nwhere `(dH/dtau)_prev` is the `dH/dtau` value from the previous pseudo-time iteration.\nThe [`diffusion_2D_damp.jl`](scripts/diffusion_2D_damp.jl) code implements a damped iterative implicit solution of eq. (1). The iteration count drops to `niter\u003c200`.\n\n![](docs/diff2D_damp.png)\n\nThis second order pseudo-transient approach enables the iteration count to scale close to _O(N)_ and not _O(N^2)_, so that the total number of iterations `niter` normalised by the number of grid points `nx` stays constant, or even decays, with increasing `nx`:\n\n![](docs/iter_scale.png)\n\n\u003e The [`diffusion_2D_damp_perf_gpu_iters.jl`](extras/diffusion_2D_perf_tests/diffusion_2D_damp_perf_gpu_iters.jl) code used for the scaling test, along with the testing and visualisation routines, can be found in [extras/diffusion_2D_perf_tests](extras/diffusion_2D_perf_tests).\n\nSo far so good, we have a fast implicit iterative solver. But why bother with implicit, wasn't explicit good enough? Let's compare the difference between the explicit and the damped implicit results using the [`compare_expl_impl.jl`](scripts/compare_expl_impl.jl) script, choosing the \"explicit\" physical time step for both the explicit and implicit code:\n\n![](docs/diff2D_expl_impl.png)\n\nWe see that the explicit approach leads to a less sharp front by ~0.2% (when normalised by the implicit solution).  
(Arguably, this 2D nonlinear diffusion problem is solved similarly well by either method, but stiffer problems *need* an implicit time stepper.)\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Performance considerations\nEfficient algorithms should minimise the time to solution. For iterative algorithms this means:\n1. Keep the iteration count as low as possible\n2. Ensure fast iterations\n\nWe just achieved (1.) with the implicit damped approach. Let's fix (2.).\n\nMany-core processors such as GPUs are throughput-oriented systems that use their massive parallelism to hide latency. On the scientific application side, most algorithms require only a few operations or flops compared to the amount of numbers or bytes accessed from main memory, and thus are significantly memory bound; the Flop/s metric is no longer the most adequate for reporting performance. This situation motivated the development of a memory throughput-based performance evaluation metric, `T_eff`, to evaluate the performance of iterative stencil-based solvers \\[[1][JuliaCon20a]\\].\n\nThe effective memory access, `A_eff` [GB], is the sum of twice the memory footprint of the unknown fields, `D_u`, (fields that depend on their own history and that need to be updated every iteration, i.e. one read \u0026 one write) and once the memory footprint of the known fields, `D_k`, (fields that do not change every iteration, i.e. just one read). The effective memory access divided by the execution time per iteration, `t_it` [sec], defines the effective memory throughput, `T_eff` [GB/s].\n\n```md\nA_eff = 2 D_u + D_k\nT_eff = A_eff/t_it\n```\n\nThe theoretical upper bound of `T_eff` is `T_peak`, the hardware's peak memory throughput. Defining the `T_eff` metric, we assume that 1) we evaluate an iterative stencil-based solver, 2) the problem size is much larger than the cache sizes and 3) we do not use time blocking (a reasonable assumption for real-world applications). 
An important concept is not to include fields within the effective memory access that do not depend on their own history (e.g. fluxes); such fields can be re-computed on the fly or stored on-chip.\n\nFor more details, check out the [performance related section](https://github.com/omlins/ParallelStencil.jl#performance-metric) from [ParallelStencil.jl].\n\nFor the 2D time-dependent diffusion equation, we thus have `D_u=2` and `D_k=1`:\n```md\nA_eff = (2 x 2 + 1) x 8 x nx x ny / 1e9 [GB]\n```\nLet's implement this measure in the following scripts.\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n## Part 2 - Parallel CPU and GPU computing\nIn this second part of the workshop, we will port the [`diffusion_2D_damp.jl`](scripts/diffusion_2D_damp.jl) script, implemented using Julia CPU array broadcasting, to high-performance parallel CPU and GPU implementations.\n```julia\n# [...] skipped lines\nqHx    .= .-av_xi(H).^npow.*diff(H[:,2:end-1], dims=1)/dx  # flux\nqHy    .= .-av_yi(H).^npow.*diff(H[2:end-1,:], dims=2)/dy  # flux\nResH   .= .-(inn(H) .- inn(Hold))/dt .+\n           (.-diff(qHx, dims=1)/dx .-diff(qHy, dims=2)/dy) # residual of the PDE\ndHdtau .= ResH .+ damp*dHdtau                              # damped rate of change\ndtau   .= (1.0./(min(dx, dy)^2 ./inn(H).^npow./4.1) .+ 1.0/dt).^-1  # time step (obeys ~CFL condition)\nH[2:end-1,2:end-1] .= inn(H) .+ dtau.*dHdtau               # update rule, sets the BC as H[1]=H[end]=0\n# [...] 
skipped lines\n```\nIn the first step towards this goal we:\n- use single-line broadcasting statements to avoid having to use work-arrays (to make it prettier \u0026 shorter we use macros)\n- introduce a `H2` array to avoid race conditions (once parallelized)\n- use non-allocating `diff` operators: `LazyArrays: Diff`\n- add accurate timing of the main loop and `T_eff` reporting\n\nThis results in the [`diffusion_2D_damp_perf.jl`](scripts/diffusion_2D_damp_perf.jl) code:\n```julia\nusing LazyArrays\nusing LazyArrays: Diff\n# [...] skipped lines\nmacro qHx()  esc(:( .-av_xi(H).^npow.*Diff(H[:,2:end-1], dims=1)/dx )) end\nmacro qHy()  esc(:( .-av_yi(H).^npow.*Diff(H[2:end-1,:], dims=2)/dy )) end\nmacro dtau() esc(:( (1.0./(min(dx, dy)^2 ./inn(H).^npow./4.1) .+ 1.0/dt).^-1  )) end\n# [...] skipped lines\nif (it==1 \u0026\u0026 iter==0) t_tic = Base.time(); niter = 0 end\ndHdtau .= .-(inn(H) .- inn(Hold))/dt .+\n           (.-Diff(@qHx(), dims=1)/dx .-Diff(@qHy(), dims=2)/dy) .+\n           damp*dHdtau                              # damped rate of change\nH2[2:end-1,2:end-1] .= inn(H) .+ @dtau().*dHdtau    # update rule, sets the BC as H[1]=H[end]=0\nH, H2 = H2, H                                       # pointer swap\n# [...] skipped lines\nt_toc = Base.time() - t_tic\nA_eff = (2*2+1)/1e9*nx*ny*sizeof(Float64)  # Effective main memory access per iteration [GB]\nt_it  = t_toc/niter                        # Execution time per iteration [s]\nT_eff = A_eff/t_it                         # Effective memory throughput [GB/s]\n# [...] 
skipped lines\n```\nRunning [`diffusion_2D_damp_perf.jl`](scripts/diffusion_2D_damp_perf.jl) with `nx = ny = 512`, starting Julia with `-O3 --check-bounds=no`, produces the following output on an Intel Quad-Core i5-4460 CPU @3.20GHz processor (`T_peak = 17 GB/s` measured with [`memcopy3D.jl`](extras/memcopy3D.jl)):\n```julia-repl\nTime = 21.523 sec, T_eff = 0.39 GB/s (niter = 804)\n```\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Parallel CPU implementation\nThe next step is to modify the diffusion code [`diffusion_2D_damp_perf.jl`](scripts/diffusion_2D_damp_perf.jl) by transforming the isolated physics calculations (see end of previous section) into spatial loops over `ix` and `iy`, resulting in the [`diffusion_2D_damp_perf_loop.jl`](scripts/diffusion_2D_damp_perf_loop.jl) code:\n```julia\n# [...] skipped lines\nmacro qHx(ix,iy)  esc(:( -(0.5*(H[$ix,$iy+1]+H[$ix+1,$iy+1]))^npow * (H[$ix+1,$iy+1]-H[$ix,$iy+1])/dx )) end\nmacro qHy(ix,iy)  esc(:( -(0.5*(H[$ix+1,$iy]+H[$ix+1,$iy+1]))^npow * (H[$ix+1,$iy+1]-H[$ix+1,$iy])/dy )) end\nmacro dtau(ix,iy) esc(:(  (1.0/(min(dx,dy)^2 / H[$ix+1,$iy+1]^npow/4.1) + 1.0/dt)^-1  )) end\n# [...] skipped lines\nfor iy=1:size(dHdtau,2)\n    for ix=1:size(dHdtau,1)\n        dHdtau[ix,iy] = -(H[ix+1, iy+1] - Hold[ix+1, iy+1])/dt +\n                         (-(@qHx(ix+1,iy)-@qHx(ix,iy))/dx -(@qHy(ix,iy+1)-@qHy(ix,iy))/dy) +\n                         damp*dHdtau[ix,iy]                        # damped rate of change\n        H2[ix+1,iy+1] = H[ix+1,iy+1] + @dtau(ix,iy)*dHdtau[ix,iy]  # update rule, sets the BC as H[1]=H[end]=0\n    end\nend\nH, H2 = H2, H  # pointer swap\n# [...] 
skipped lines\n```\n\u003e 💡 Note that macros can now take `ix` and `iy` as arguments and that both calculations for `dHdtau`and `H` update are done within a single loop (loop fusion).\n\nRunning [`diffusion_2D_damp_perf_loop.jl`](scripts/diffusion_2D_damp_perf_loop.jl) with `nx = ny = 512` produces following output:\n```julia-repl\nTime = 4.774 sec, T_eff = 1.80 GB/s (niter = 804)\n```\n(We're not sure why the performance so drastically improved over [`diffusion_2D_damp_perf.jl`](scripts/diffusion_2D_damp_perf.jl), as broadcasting was just replaced by loops.)\n\nThe next step is to wrap these physics calculations into functions (later called kernels on the GPU) and define them before the main function of the script, resulting in the [`diffusion_2D_damp_perf_loop_fun.jl`](scripts/diffusion_2D_damp_perf_loop_fun.jl) code:\n```julia\n# [...] skipped lines\nmacro qHx(ix,iy)  esc(:( -(0.5*(H[$ix,$iy+1]+H[$ix+1,$iy+1]))*(0.5*(H[$ix,$iy+1]+H[$ix+1,$iy+1]))*(0.5*(H[$ix,$iy+1]+H[$ix+1,$iy+1])) * (H[$ix+1,$iy+1]-H[$ix,$iy+1])*_dx )) end\nmacro qHy(ix,iy)  esc(:( -(0.5*(H[$ix+1,$iy]+H[$ix+1,$iy+1]))*(0.5*(H[$ix+1,$iy]+H[$ix+1,$iy+1]))*(0.5*(H[$ix+1,$iy]+H[$ix+1,$iy+1])) * (H[$ix+1,$iy+1]-H[$ix+1,$iy])*_dy )) end\nmacro dtau(ix,iy) esc(:(  (1.0/(min_dxy2 / (H[$ix+1,$iy+1]*H[$ix+1,$iy+1]*H[$ix+1,$iy+1]) / 4.1) + _dt)^-1  )) end\n\nfunction compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n    Threads.@threads for iy=1:size(dHdtau,2)\n    # for iy=1:size(dHdtau,2)\n        for ix=1:size(dHdtau,1)\n            dHdtau[ix,iy] = -(H[ix+1, iy+1] - Hold[ix+1, iy+1])*_dt +\n                             (-(@qHx(ix+1,iy)-@qHx(ix,iy))*_dx -(@qHy(ix,iy+1)-@qHy(ix,iy))*_dy) +\n                             damp*dHdtau[ix,iy]                        # damped rate of change\n            H2[ix+1,iy+1] = H[ix+1,iy+1] + @dtau(ix,iy)*dHdtau[ix,iy]  # update rule, sets the BC as H[1]=H[end]=0\n        end\n    end\n    return\nend\n# [...] 
skipped lines\n_dx, _dy, _dt = 1.0/dx, 1.0/dy, 1.0/dt\nmin_dxy2 = min(dx,dy)^2\ncompute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n# [...] skipped lines\n```\n\u003e 💡 Note that the outer loop (over `iy`) can be parallelized using multi-threading capabilities of the CPU accessible via `Threads.@threads` (see [Getting started](#getting-started) for more infos).\n\nRunning [`diffusion_2D_damp_perf_loop_fun.jl`](scripts/diffusion_2D_damp_perf_loop_fun.jl) with `nx = ny = 512` on 4 cores produces following output:\n```julia-repl\nTime = 0.961 sec, T_eff = 8.80 GB/s (niter = 804)\n```\n(Here again: some performance improvement stems from using multi-threading but there are additional gains compared to  [`diffusion_2D_damp_perf_loop.jl`](scripts/diffusion_2D_damp_perf_loop.jl) of unknown origin...)\n\n\nSince the performance increases and gets closer to hardware limit (memory copy values), some details start to become performance limiters, namely:\n- divisions instead of multiplications\n- arithmetic operations such as power `H^npow`\n\nThese details will become even more important on the GPU.\n\nWe are now ready to move to the GPU !\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### GPU implementation\nSo we now have a cool iterative and implicit nonlinear diffusion solver in less than 100 lines of code 🎉. Good enough for mid-resolution calculations. What if we need higher resolution and faster time to solution ? GPU computing makes it possible to go way beyond the so far achieved T_eff of 8.8 GB/s with my consumer electronics CPU. 
Let's slightly modify the [`diffusion_2D_damp_perf_loop_fun.jl`](scripts/diffusion_2D_damp_perf_loop_fun.jl) code to enable GPU execution.\n\nThe main idea of the applied GPU parallelisation is to calculate each grid point concurrently by a different GPU thread (instead of the more serial CPU execution) as depicted hereafter:\n\n![](docs/cpu_gpu.png)\n\nThe main change is to replace the (multi-threaded) loops by a vectorised GPU index:\n```julia\nix = (blockIdx().x-1) * blockDim().x + threadIdx().x\niy = (blockIdx().y-1) * blockDim().y + threadIdx().y\n```\nspecific to GPU execution. Each `ix` and `iy` are then executed concurrently by a different GPU thread. Also, whether a grid point has to participate in the calculation or not can no longer be defined by the loop range, but needs to be handled locally to each thread by e.g. an `if`-condition, resulting in the following [`diffusion_2D_damp_perf_gpu.jl`](scripts/diffusion_2D_damp_perf_gpu.jl) GPU code:\n```julia\nusing CUDA\n# [...] skipped lines\nfunction compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x\n    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y\n    if (ix\u003c=size(dHdtau,1) \u0026\u0026 iy\u003c=size(dHdtau,2))\n        dHdtau[ix,iy] = -(H[ix+1, iy+1] - Hold[ix+1, iy+1])*_dt +\n                         (-(@qHx(ix+1,iy)-@qHx(ix,iy))*_dx -(@qHy(ix,iy+1)-@qHy(ix,iy))*_dy) +\n                         damp*dHdtau[ix,iy]                        # damped rate of change\n        H2[ix+1,iy+1] = H[ix+1,iy+1] + @dtau(ix,iy)*dHdtau[ix,iy]  # update rule, sets the BC as H[1]=H[end]=0\n    end\n    return\nend\n# [...] skipped lines\nBLOCKX = 32\nBLOCKY = 8\nGRIDX  = 16*16\nGRIDY  = 32*32\nnx, ny = BLOCKX*GRIDX, BLOCKY*GRIDY # number of grid points\n# [...] 
skipped lines\nResH   = CUDA.zeros(Float64, nx-2, ny-2) # normal grid, without boundary points\ndHdtau = CUDA.zeros(Float64, nx-2, ny-2) # normal grid, without boundary points\n# [...] skipped lines\nH      = CuArray(exp.(.-(xc.-lx/2).^2 .-(yc.-ly/2)'.^2))\n# [...] skipped lines\ncuthreads = (BLOCKX, BLOCKY, 1)\ncublocks  = (GRIDX,  GRIDY,  1)\n# [...] skipped lines\n@cuda blocks=cublocks threads=cuthreads compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\nsynchronize()\n# [...] skipped lines\n```\n\u003e 💡 We use `@cuda blocks=cublocks threads=cuthreads` to launch the GPU kernel on the appropriate number of threads, i.e. \"parallel workers\". The number of grid points `nx` and `ny` must now be chosen according to the number of parallel workers. Also, note that we need to run higher resolution in order to saturate the GPU memory bandwidth and get relevant performance measures.\n\n\u003e ⚠ Default precision in `CUDA.jl` is `Float32`, so we have to enforce `Float64` here.\n\nRunning [`diffusion_2D_damp_perf_gpu.jl`](scripts/diffusion_2D_damp_perf_gpu.jl) with `nx = ny = 8192` produces the following output on an Nvidia Tesla V100 PCIe (16GB) GPU (`T_peak = 840 GB/s` measured with [`memcopy3D.jl`](extras/memcopy3D.jl)):\n```julia-repl\nTime = 10.088 sec, T_eff = 770.00 GB/s (niter = 2904)\n```\nSo - that rocks 🚀\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### XPU implementation\nLet's do a rapid recap:\n\nSo far we have two performant codes, one CPU-based, the other GPU-based, to solve the nonlinear and implicit diffusion equation in 2D. **Wouldn't it be great to have a single code that enables both ?**\n\n**The answer is [ParallelStencil.jl]** which enables a backend independent syntax implementing parallel stencil kernels to execute on XPUs. 
The [`diffusion_2D_damp_perf_xpu2.jl`](scripts/diffusion_2D_damp_perf_xpu2.jl) code uses [ParallelStencil.jl] to combine [`diffusion_2D_damp_perf_gpu.jl`](scripts/diffusion_2D_damp_perf_gpu.jl) and [`diffusion_2D_damp_perf_loop_fun.jl`](scripts/diffusion_2D_damp_perf_loop_fun.jl) into a single code. The backend is chosen with the `USE_GPU` flag. Using the `@parallel_indices` macro makes it possible to write code that avoids storing the fluxes to main memory:\n```julia\nconst USE_GPU = true\nusing ParallelStencil\nusing ParallelStencil.FiniteDifferences2D\n@static if USE_GPU\n    @init_parallel_stencil(CUDA, Float64, 2)\nelse\n    @init_parallel_stencil(Threads, Float64, 2)\nend\n# [...] skipped lines\n@parallel_indices (ix,iy) function compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n    if (ix\u003c=size(dHdtau,1) \u0026\u0026 iy\u003c=size(dHdtau,2))\n        dHdtau[ix,iy] = -(H[ix+1, iy+1] - Hold[ix+1, iy+1])*_dt +\n                         (-(@qHx(ix+1,iy)-@qHx(ix,iy))*_dx -(@qHy(ix,iy+1)-@qHy(ix,iy))*_dy) +\n                         damp*dHdtau[ix,iy]                        # damped rate of change\n        H2[ix+1,iy+1] = H[ix+1,iy+1] + @dtau(ix,iy)*dHdtau[ix,iy]  # update rule, sets the BC implicitly as H[1]=H[end]=0\n    end\n    return\nend\n# [...] skipped lines\n@parallel cublocks cuthreads compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n# [...] skipped lines\n```\n\u003e 💡 Note that [ParallelStencil.jl] currently supports `Threads.@threads` and `CUDA.jl` as backends.\n\nRunning [`diffusion_2D_damp_perf_xpu2.jl`](scripts/diffusion_2D_damp_perf_xpu2.jl) with `nx = ny = 8192` on an Nvidia Tesla V100 PCIe (16GB) GPU produces the following output:\n```julia-repl\nTime = 10.094 sec, T_eff = 770.00 GB/s (niter = 2904)\n```\nThat's excellent! [ParallelStencil.jl] with the [CUDA.jl] backend shows the same performance as the pure CUDA [`diffusion_2D_damp_perf_gpu.jl`](scripts/diffusion_2D_damp_perf_gpu.jl) code. 
The [`diffusion_2D_damp_perf_xpu2.jl`](scripts/diffusion_2D_damp_perf_xpu2.jl) XPU code uses manually precomputed optimal grid and thread block sizes passed to the `@parallel` launch macro (type `?@parallel` for more information). Alternatively, ParallelStencil provides some \"comfort features\" for launching kernels, e.g. automatically and dynamically computing optimal grid and thread block sizes, which currently results in a performance difference of about 1% (see [`diffusion_2D_damp_perf_gpu.jl`](scripts/diffusion_2D_damp_perf_gpu.jl); note that in the next release of ParallelStencil this performance difference will become entirely negligible). The \"default\" implementation using ParallelStencil would allow for using macros exposed by the `FiniteDifferences2D` module for a math-close notation in the kernels:\n```julia\n# [...] skipped lines\n@parallel function compute_flux!(qHx, qHy, H, _dx, _dy)\n    @all(qHx) = -@av_xi(H)*@av_xi(H)*@av_xi(H)*@d_xi(H)*_dx\n    @all(qHy) = -@av_yi(H)*@av_yi(H)*@av_yi(H)*@d_yi(H)*_dy\n    return\nend\n\nmacro dtau() esc(:( (1.0/(min_dxy2 / (@inn(H)*@inn(H)*@inn(H)) / 4.1) + _dt)^-1 )) end\n@parallel function compute_update!(dHdtau, H, Hold, qHx, qHy, _dt, damp, min_dxy2, _dx, _dy)\n    @all(dHdtau) = -(@inn(H) - @inn(Hold))*_dt - @d_xa(qHx)*_dx - @d_ya(qHy)*_dy + damp*@all(dHdtau)\n    @inn(H)      =   @inn(H) + @dtau()*@all(dHdtau)\n    return\nend\n```\nRunning [`diffusion_2D_damp_xpu.jl`](scripts/diffusion_2D_damp_xpu.jl) with `nx = ny = 8192` on an Nvidia Tesla V100 PCIe (16GB) GPU produces the following output:\n```julia-repl\nTime = 21.420 sec, T_eff = 360.00 GB/s (niter = 2904)\n```\nThe performance is significantly lower in this case, as writing fluxes to main memory could not be avoided using the more comfortable syntax. 
Note, however, that in many applications we do not face this issue and the performance of applications written using the `@parallel` macro is on par with those written using the `@parallel_indices` macro.\n\n\u003e 💡 Future versions of ParallelStencil will enable comfortable syntax using the `@parallel` macro for computing fields such as these fluxes on the fly or for storing them on-chip.\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Performance and scaling\nWe have developed 6 scripts, 3 CPU-based and 4 GPU-based, which we can now use to run a scaling test and report `T_eff` as a function of the number of grid points `nx = [64, 128, 256, 512, 1024, 2048, 4096]`, including the additional `[..., 8192, 16384]` values on the GPU:\n\n![](docs/perf_cpu.png)\n\n![](docs/perf_gpu.png)\n\nNote that `T_peak` of the Nvidia Tesla V100 GPU is 840 GB/s. Our GPU code thus achieves an effective memory throughput which is 92% of the peak memory throughput. The codes used for the performance tests and the testing routine can be found in [extras/diffusion_2D_perf_tests](extras/diffusion_2D_perf_tests).\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n## Part 3 - Distributed computing on multiple CPUs and GPUs\nIn this last part of the workshop, we will explore multi-XPU capabilities. This will enable our codes to run on multiple CPUs and GPUs in order to scale on modern multi-GPU nodes, clusters and supercomputers. Also, we will experiment with basic concepts of the distributed memory computing approach using Julia's MPI wrapper [MPI.jl]. In the proposed approach, each MPI process handles one CPU thread. In the MPI GPU case (multi-GPUs), each MPI process handles one GPU. 
The [Getting started](#getting-started) section contains useful information in the [Julia MPI](#getting-started) section to get you set up.\n\n### Distributed memory and fake parallelisation\nAs a first step, we will look at the [`diffusion_1D_2procs.jl`](scripts/diffusion_1D_2procs.jl) code that solves the linear diffusion equation using a \"fake-parallelisation\" approach. We split the calculation into two distinct left and right domains, which requires left and right `H` arrays, `HL` and `HR`, respectively:\n```julia\n# Compute physics locally\nHL[2:end-1] .= HL[2:end-1] .+ dt*λ*diff(diff(HL)/dx)/dx\nHR[2:end-1] .= HR[2:end-1] .+ dt*λ*diff(diff(HR)/dx)/dx\n# Update boundaries (later MPI)\nHL[end] = HR[2]\nHR[1]   = HL[end-1]\n# Global picture\nH .= [HL[1:end-1]; HR[2:end]]\n```\nWe see that a correct boundary update is the critical part for a successful implementation. In our approach, we need an overlap of 2 cells in order to avoid any artefacts at the transition between the left and right domains.\n\nThe next step is to generalise the \"2 processes\" concept to \"n processes\", keeping the \"fake-parallelisation\" approach. The [`diffusion_1D_nprocs.jl`](scripts/diffusion_1D_nprocs.jl) code contains this modification:\n```julia\nfor ip = 1:np # compute physics locally\n    H[2:end-1,ip] .= H[2:end-1,ip] .+ dt*λ*diff(diff(H[:,ip])/dxg)/dxg\nend\nfor ip = 1:np-1 # update boundaries\n    H[end,ip  ] = H[    2,ip+1]\n    H[  1,ip+1] = H[end-1,ip  ]\nend\nfor ip = 1:np # global picture\n    i1 = 1 + (ip-1)*(nx-2)\n    Hg[i1:i1+nx-2] .= H[1:end-1,ip]\nend\n```\nThe array `H` now contains `n` local domains, where each domain belongs to one fake process, namely the fake process indicated by the second index of `H` (`ip`). The `# update boundaries` steps are adapted accordingly. All the physical calculations happen on the local chunks of the arrays. We only need \"global\" knowledge in the definition of the initial condition, in order to e.g. 
initialise the Gaussian distribution using global and not local coordinates.\n\nSo far, so good. We are now ready to write a script that truly distributes calculations on different processors using [MPI.jl].\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Distributed Julia computing using MPI\nAs a next step, let's look at the minimal requirements for writing an MPI-parallel code in Julia. We will solve the following linear diffusion physics:\n```julia\nfor it = 1:nt\n    qHx        .= .-λ*diff(H)/dx\n    H[2:end-1] .= H[2:end-1] .- dt*diff(qHx)/dx\nend\n```\non multiple processors. The [`diffusion_1D_mpi.jl`](scripts/diffusion_1D_mpi.jl) code implements the following steps:\n1. Initialise MPI and set-up a Cartesian communicator\n2. Implement a boundary exchange routine\n3. Finalise MPI\n4. Create a \"global\" initial condition\n\nTo (1.) initialise MPI and prepare the Cartesian communicator, we define:\n```julia\nMPI.Init()\ndims        = [0]\ncomm        = MPI.COMM_WORLD\nnprocs      = MPI.Comm_size(comm)\nMPI.Dims_create!(nprocs, dims)\ncomm_cart   = MPI.Cart_create(comm, dims, [0], 1)\nme          = MPI.Comm_rank(comm_cart)\ncoords      = MPI.Cart_coords(comm_cart)\nneighbors_x = MPI.Cart_shift(comm_cart, 0, 1)\n```\nwhere `me` represents the process ID unique to each MPI process.\n\nThen, we need to (2.) implement a boundary exchange routine. 
For conciseness, we will use blocking messages here:\n```julia\n@views function update_halo(A, neighbors_x, comm)\n    if neighbors_x[1] != MPI.MPI_PROC_NULL\n        sendbuf = A[2]\n        recvbuf = zeros(1)\n        MPI.Send(sendbuf,  neighbors_x[1], 0, comm)\n        MPI.Recv!(recvbuf, neighbors_x[1], 1, comm)\n        A[1] = recvbuf[1]\n    end\n    if neighbors_x[2] != MPI.MPI_PROC_NULL\n        sendbuf = A[end-1]\n        recvbuf = zeros(1)\n        MPI.Send(sendbuf,  neighbors_x[2], 1, comm)\n        MPI.Recv!(recvbuf, neighbors_x[2], 0, comm)\n        A[end] = recvbuf[1]\n    end\n    return\nend\n```\nIn a nutshell, we store the boundary values we want to exchange in a send buffer `sendbuf` and initialise a receive buffer `recvbuf`; then, we send the content of `sendbuf` to the neighbor and receive its boundary value into `recvbuf` (using `MPI.Send()` and `MPI.Recv!()`); finally, we assign to the boundary the values from the receive buffer.\n\nLast, we need to (3.) finalise MPI prior to returning from the main function:\n```julia\nMPI.Finalize()\n```\nThe remaining step is to (4.) create an initial Gaussian distribution of `H` that spans correctly over all local domains. This can be achieved as follows:\n```julia\nx0    = coords[1]*(nx-2)*dx\nxc    = [x0 + ix*dx - dx/2 - 0.5*lx  for ix=1:nx]\nH     = exp.(.-xc.^2)\n```\nwhere `x0` represents the first global x-coordinate on every process and `xc` represents the local chunk of the global coordinates on each local process.\n\nRunning the [`diffusion_1D_mpi.jl`](scripts/diffusion_1D_mpi.jl) code\n```sh\nmpiexecjl -n 4 julia --project diffusion_1D_mpi.jl\n```\nwill generate one output file for each MPI process. 
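\n\nHow the overlapping local chunks map onto one global array can be checked with a minimal serial sketch (no MPI involved). It emulates the local coordinate construction of step (4.) for several processes and assembles the chunks into a global vector. The values of `lx`, `nx` and `np`, the loop variable `coord` (standing in for `coords[1]`), the names `nx_g` and `xc_g`, and the choice `dx = lx/nx_g` are assumptions made for illustration and are not taken from the workshop scripts:\n```julia\n# Serial emulation (no MPI): rebuild the global coordinate vector from local chunks.\nlx, nx, np = 10.0, 8, 4          # hypothetical domain length, local grid size, process count\nnx_g = np*(nx-2) + 2             # global grid size, assuming a 2-cell overlap between neighbours\ndx   = lx/nx_g                   # assumed grid spacing\nxc_g = zeros(nx_g)\nfor coord = 0:np-1               # `coord` plays the role of coords[1] on each process\n    x0 = coord*(nx-2)*dx         # first global x-coordinate of this chunk\n    xc = [x0 + ix*dx - dx/2 - 0.5*lx for ix = 1:nx]\n    i1 = 1 + coord*(nx-2)        # global index of the first local cell\n    xc_g[i1:i1+nx-1] .= xc       # the 2 overlapping cells are written twice with equal values\nend\n# The assembled vector matches coordinates computed directly on the global grid\n@assert xc_g ≈ [ix*dx - dx/2 - 0.5*lx for ix = 1:nx_g]\n```\nThe assertion holds because the local index `ix` on the process with coordinate `coord` corresponds to the global index `coord*(nx-2) + ix`, so neighbouring chunks agree on their two shared cells.\n\n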
Use the [`vizme1D_mpi.jl`](scripts/vizme1D_mpi.jl) script to reconstruct the global `H` array from the local results and visualise it.\n\nYay 🎉 - we just made a Julia parallel MPI diffusion solver in _only_ 70 lines of code.\n\nHold on, the [`diffusion_2D_mpi.jl`](scripts/diffusion_2D_mpi.jl) code implements a 2D version of the [`diffusion_1D_mpi.jl`](scripts/diffusion_1D_mpi.jl) code. Nothing is really new there, but it may be interesting to see how boundary update routines are defined in 2D as one now needs to exchange vectors instead of single values. Running the [`diffusion_2D_mpi.jl`](scripts/diffusion_2D_mpi.jl) code will generate one output file per MPI process and the [`vizme2D_mpi.jl`](scripts/vizme2D_mpi.jl) script can then be used for visualisation.\n\n_Note: The presented concise Julia MPI scripts are inspired by [this 2D python script](https://github.com/omlins/adios2-tutorial/blob/main/example/mpi_diffusion2D.py)._\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Multi-XPU implementations in 2D\nLet's do a quick recap: so far, we explored the concept of distributed memory parallelisation with simple \"fake-parallel\" codes. We then demystified the usage of MPI in Julia within a 1D and 2D diffusion solver using [MPI.jl], a Cartesian topology and blocking messages. This code would already execute on many processors and could be launched on a cluster.\n\nThe remaining steps are to:\n- use non-blocking communication\n- use multiple GPUs\n- prevent MPI communication from becoming the performance killer\n\nWe address these steps using [ImplicitGlobalGrid.jl] along with [ParallelStencil.jl]. As the final act of this workshop, we will take the high-performance XPU [`diffusion_2D_damp_perf_xpu.jl`](scripts/diffusion_2D_damp_perf_xpu.jl) code from [Part 2](#xpu-implementation):\n```julia\n# [...] skipped lines\ndamp   = 1-35/nx      # damping (this is a tuning parameter, dependent on e.g. grid resolution)\ndx, dy = lx/nx, ly/ny # grid size\n# [...] 
skipped lines\nH      = Data.Array(exp.(.-(xc.-lx/2).^2 .-(yc.-ly/2)'.^2))\n# [...] skipped lines\n@parallel compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n# [...] skipped lines\nerr = norm(ResH)/length(ResH)\n# [...] skipped lines\n```\nand add the few required [ImplicitGlobalGrid.jl] functions in order to have a multi-XPU code ready to scale on GPU supercomputers.\n\nAppreciate the few minor changes -**10 new lines only**- (not including those for visualisation) required to turn the single-XPU code into a multi-XPU code [`diffusion_2D_damp_perf_multixpu.jl`](scripts/diffusion_2D_damp_perf_multixpu.jl):\n```julia\n# [...] skipped lines\nusing ImplicitGlobalGrid, Plots, Printf, LinearAlgebra\nimport MPI\n# [...] skipped lines\nnorm_g(A) = (sum2_l = sum(A.^2); sqrt(MPI.Allreduce(sum2_l, MPI.SUM, MPI.COMM_WORLD)))\n# [...] skipped lines\nme, dims = init_global_grid(nx, ny, 1)  # Initialization of MPI and more...\n@static if USE_GPU select_device() end  # select one GPU per MPI local rank (if \u003e1 GPU per node)\ndx, dy = lx/nx_g(), ly/ny_g()           # grid size\ndamp   = 1-35/nx_g()                    # damping (this is a tuning parameter, dependent on e.g. grid resolution)\n# [...] skipped lines\nH     .= Data.Array([exp(-(x_g(ix,dx,H)+dx/2 -lx/2)^2 -(y_g(iy,dy,H)+dy/2 -ly/2)^2) for ix=1:size(H,1), iy=1:size(H,2)])\n# [...] skipped lines\nlen_ResH_g = ((nx-2-2)*dims[1]+2)*((ny-2-2)*dims[2]+2)\n# [...] skipped lines\n@hide_communication (8, 4) begin # communication/computation overlap\n    @parallel compute_update!(H2, dHdtau, H, Hold, _dt, damp, min_dxy2, _dx, _dy)\n    H, H2 = H2, H\n    update_halo!(H)\nend\n# [...] skipped lines\nerr = norm_g(ResH)/len_ResH_g\n# [...] skipped lines\nfinalize_global_grid()\n# [...] 
skipped lines\n```\nRunning the [`diffusion_2D_damp_perf_multixpu.jl`](scripts/diffusion_2D_damp_perf_multixpu.jl) code with `do_visu = true` generates the following gif (here `4096x4096` grid points on 4 GPUs):\n\n![](docs/diffusion_2D_multixpu.gif)\n\nSo, here we are:\n\n**We have a Julia GPU MPI code to resolve nonlinear diffusion processes in 2D using a second order accelerated iterative scheme and we can run it on GPU supercomputers** 🎉.\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n### Advanced features\nThis last section provides directions and details on more advanced features.\n\n#### CUDA-aware MPI\n[ImplicitGlobalGrid.jl] supports CUDA-aware MPI upon exporting the ENV variable `IGG_CUDAAWARE_MPI=1`; MPI can then access device (GPU) pointers and exchange data directly between GPUs using Remote Direct Memory Access (RDMA), bypassing extra buffer copies to the CPU for enhanced performance.\n\n#### Hiding communication\n[ParallelStencil.jl] exposes a communication-hiding feature accessible through the [`@hide_communication`](https://github.com/luraess/parallel-gpu-workshop-JuliaCon21/blob/cb0ca90d5f8fc46d4cf57edc2b26ff6b7cddd353/scripts/diffusion_2D_damp_perf_multixpu.jl#L90) macro. This macro allows defining a boundary width applied to each local domain in order to split the computation such that:\n1. The boundary cells are first computed\n2. The communication (boundary exchange procedure) can start\n3. 
The remaining inner points are computed while boundary exchange is on-going.\n\nThe [`diffusion_2D_damp_perf_multixpu_prof.jl`](extras/diffusion_2D_damp_perf_multixpu_prof.jl) code implements CUDA profiling features to visualise the MPI communication overlapping with computation, adding `CUDA.@profile while err\u003etol \u0026\u0026 iter\u003citMax` to the inner loop.\n\nRunning the code as:\n```sh\nmpirun -np 4 nvprof --profile-from-start off --export-profile diffusion_2D.%q{OMPI_COMM_WORLD_RANK}.prof -f julia --project -O3 --check-bounds=no diffusion_2D_damp_perf_multixpu_prof.jl\n```\nwill generate profiler files that can then be loaded and visualised in Nvidia's visual profiler (`nvvp`):\n\n![](docs/profiling.png)\n\n\u003e Note that CUDA-aware MPI was used on 4 Nvidia Tesla V100 GPUs, connected with NVLink (i.e. MPI communication is using NVLink here)\n\nThe MPI communication (purple bars) nicely overlaps the `compute_update!()` kernel execution. Further information can be found [here](https://github.com/omlins/ParallelStencil.jl#seamless-interoperability-with-communication-packages-and-hiding-communication).\n\n#### 3D examples\nThe approach and tools presented in this workshop are not restricted to 2D calculations. If you are interested in 3D examples, check out the [miniapps](https://github.com/omlins/ParallelStencil.jl#concise-singlemulti-xpu-miniapps) section from the [ParallelStencil.jl] README. The [miniapps](https://github.com/omlins/ParallelStencil.jl#concise-singlemulti-xpu-miniapps) section also provides additional information:\n- about the `T_eff` metric;\n- on how to run MPI GPU applications on different hardware.\n\n⤴️ [_back to workshop material_](#workshop-material)\n\n# Further reading\n\\[1\\] [Omlin, S., Räss, L., Kwasniewski, G., Malvoisin, B., \u0026 Podladchikov, Y. Y. (2020). Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia. JuliaCon Conference, virtual.][JuliaCon20a]\n\n\\[2\\] [Räss, L., Reuber, G., \u0026 Omlin, S. (2020). 
Multi-Physics 3-D Inversion on GPU Supercomputers with Julia. JuliaCon Conference, virtual.][JuliaCon20b]\n\n\\[3\\] [Räss, L., Omlin, S., \u0026 Podladchikov, Y. Y. (2019). Porting a Massively Parallel Multi-GPU Application to Julia: a 3-D Nonlinear Multi-Physics Flow Solver. JuliaCon Conference, Baltimore, USA.][JuliaCon19]\n\n\\[4\\] [Frankel, S. P. (1950). Convergence rates of iterative treatments of partial differential equations, Mathe. Tables Other Aids Comput., 4, 65–75.][Frankel50]\n\n⤴️ [_back to content_](#content)\n\n\n[Julia]: https://julialang.org\n[Julia language]: https://docs.julialang.org/en/v1/\n[Julia REPL]: https://docs.julialang.org/en/v1/stdlib/REPL/\n[Base.Threads]: https://docs.julialang.org/en/v1/base/multi-threading/\n[JULIA_NUM_THREADS]:https://docs.julialang.org/en/v1.0.0/manual/environment-variables/#JULIA_NUM_THREADS-1\n[CUDA.jl]: https://github.com/JuliaGPU/CUDA.jl\n[JuliaGPU]: https://juliagpu.org\n[ParallelStencil.jl]: https://github.com/omlins/ParallelStencil.jl\n[ImplicitGlobalGrid.jl]: https://github.com/eth-cscs/ImplicitGlobalGrid.jl\n[MPI.jl]: https://juliaparallel.github.io/MPI.jl/stable/examples/01-hello/\n\n[JuliaCon20a]: https://www.youtube.com/watch?v=vPsfZUqI4_0\n[JuliaCon20b]: https://www.youtube.com/watch?v=1t1AKnnGRqA\n[JuliaCon19]: https://www.youtube.com/watch?v=b90qqbYJ58Q\n[Frankel50]: https://doi.org/10.2307/2002770\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluraess%2Fparallel-gpu-workshop-juliacon21","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluraess%2Fparallel-gpu-workshop-juliacon21","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluraess%2Fparallel-gpu-workshop-juliacon21/lists"}