https://github.com/uscbiostats/r-parallel-benchmark

Using Rcpp with OpenMP (parfor and SIMD)

benchmark hpc openmp parallel-computing rcpp rstats simd


Host: GitHub
URL: https://github.com/uscbiostats/r-parallel-benchmark
Owner: USCbiostats
Created: 2020-05-26T01:27:28.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-07-29T22:20:48.000Z (over 5 years ago)
Last Synced: 2025-07-14T12:37:57.963Z (6 months ago)
Topics: benchmark, hpc, openmp, parallel-computing, rcpp, rstats, simd
Language: C++
Homepage:
Size: 53.7 KB
Stars: 6
Watchers: 4
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

README

---
output: github_document
---

# Rcpp + OpenMP

This repository shows how much speed can be gained by using
[OpenMP](https://openmp.org), in particular the `omp simd` and `parallel for`
directives.

The test consists of computing the pairwise distances between the rows of a matrix
of size `N` by `M`. The equivalent function in R is `dist()`, which we
reimplement here using [`Rcpp`](https://cran.r-project.org/package=Rcpp) (check out
the benchmark computing the matrix product [here](matrix.md)).

The file [norm.cpp](norm.cpp) contains the C++ source code for the dist functions.
The compiled functions are:

- `dist_omp_simd` Using the pragma directives `parallel for` and `simd`.

- `dist_omp_simd_ptr` Same as above, but instead of creating a copy of the input matrix, it uses a `const double *` (a pointer) to access the data.

- `dist_omp` Using the pragma `parallel for`.

- `dist_simd` Using the pragma `simd`.

- `dist_for` Plain for-loops, no directives.

- `dist_for_arma2` Using Armadillo with vectorized functions.

- `dist_for_arma1` Armadillo implementation with for-loops.
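
To make the combination of directives concrete, here is a minimal standalone sketch of a distance kernel that uses both `parallel for` and `simd`. It is not the code in `norm.cpp`: the function name `pairwise_dist`, the use of `std::vector`, and the row-major input layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from norm.cpp): Euclidean distances between
// the n rows of a matrix stored in row-major order (n rows, m columns).
// Returns an n-by-n symmetric matrix, also in row-major order.
std::vector<double> pairwise_dist(const std::vector<double>& x,
                                  std::size_t n, std::size_t m) {
    assert(x.size() == n * m);

    std::vector<double> out(n * n, 0.0);
    const double* px = x.data();
    double* po = out.data();
    const long nn = static_cast<long>(n);
    const long mm = static_cast<long>(m);

    // Rows are distributed across threads. Each (i, j) pair with j < i is
    // handled by exactly one thread, so no two threads write the same cell.
    #pragma omp parallel for
    for (long i = 0; i < nn; ++i) {
        for (long j = 0; j < i; ++j) {
            double acc = 0.0;
            // The inner loop walks two contiguous rows, which lets the
            // compiler vectorize the sum of squared differences.
            #pragma omp simd reduction(+:acc)
            for (long k = 0; k < mm; ++k) {
                const double d = px[i * mm + k] - px[j * mm + k];
                acc += d * d;
            }
            const double dist = std::sqrt(acc);
            po[i * nn + j] = dist;
            po[j * nn + i] = dist;
        }
    }
    return out;
}
```

Compiled without OpenMP support (for example, without `-fopenmp` in GCC), the pragmas are ignored and the function runs serially with the same results up to floating-point rounding.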

## Speed benchmark

```{r execution, cache=TRUE}
# Notice that the -fopenmp flag is already included in the norm.cpp file
Sys.setenv("PKG_CXXFLAGS" = "-O2 -mavx2 -march=core-avx2 -mtune=core-avx2 -DARMA_USE_OPENMP")
Rcpp::sourceCpp("norm.cpp")

library(microbenchmark)

set.seed(718243)
N <- 500
M <- 1000
x <- matrix(runif(N * M), nrow = N)
xt <- t(x) # transposed copy, passed to the pointer-based version

(ans_bm <- microbenchmark(
  `SIMD + parfor`       = dist_omp_simd(x, N, M, 2),
  `SIMD + parfor (ptr)` = dist_omp_simd_ptr(xt, N, M, 2),
  `parfor`              = dist_omp(x, N, M, 2),
  `SIMD`                = dist_simd(x, N, M),
  `serial`              = dist_for(x, N, M),
  `arma sugar`          = dist_for_arma2(x, N, M),
  `arma`                = dist_for_arma1(x, N, M),
  R                     = as.matrix(dist(x)),
  times                 = 10,
  unit                  = "relative"
))
```

For reference, the elapsed times in milliseconds for R and SIMD + parfor are:

```{r print-as-ms, echo=FALSE}
library(microbenchmark)
print(ans_bm[ans_bm$expr %in% c("R", "SIMD + parfor"),], unit = "ms")
```

Overall, on my machine, the SIMD + parfor combination outperforms all the others (note
that when it comes to computing matrix products, [Armadillo is the fastest](matrix.md)).

Let's check whether the results are equivalent. At the very least, we should
observe only small differences (if any) because of floating-point precision:

```{r Comparing-results, cache=TRUE}
Rcpp::sourceCpp("norm.cpp")

ans0  <- as.matrix(dist(x))
ans_a <- dist_omp_simd(x, N, M)
ans_b <- dist_omp(x, N, M)
ans_c <- dist_simd(x, N, M)
ans_d <- dist_for(x, N, M)
ans_e <- dist_omp_simd_ptr(t(x), N, M)

range(ans0 - ans_b)
range(ans_a - ans_b)
range(ans_b - ans_c)
range(ans_c - ans_d)
range(ans_d - ans_e)
```
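
One plausible source of such small differences is summation order: a vectorized reduction accumulates several partial sums and combines them at the end, while a serial loop adds elements one by one. The standalone sketch below imitates this with four partial sums. The function names are hypothetical, and the four-lane split is an illustration rather than a description of what the compiler actually emits for `norm.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial accumulation: elements are added strictly left to right.
double sum_sequential(const std::vector<double>& v) {
    double s = 0.0;
    for (double e : v) s += e;
    return s;
}

// Four independent partial sums combined at the end, imitating how a
// 4-lane SIMD reduction may reorder the additions.
double sum_four_lanes(const std::vector<double>& v) {
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    const std::size_t n4 = v.size() - v.size() % 4;
    for (std::size_t i = 0; i < n4; i += 4)
        for (std::size_t l = 0; l < 4; ++l)
            lane[l] += v[i + l];
    double s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    // Leftover elements that do not fill a complete group of four.
    for (std::size_t i = n4; i < v.size(); ++i) s += v[i];
    return s;
}

// Relative difference between two finite values (0 when both are 0).
double rel_diff(double a, double b) {
    assert(std::isfinite(a) && std::isfinite(b));
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return scale == 0.0 ? 0.0 : std::fabs(a - b) / scale;
}
```

Both functions compute the same mathematical sum, so `rel_diff(sum_sequential(v), sum_four_lanes(v))` should be at most a small multiple of machine epsilon for well-behaved inputs, which is the kind of discrepancy the `range()` calls above are meant to reveal.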

The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html), which supports [AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2). AVX2 registers are 256 bits wide, so we can vectorize four double-precision operations at a time (256/64 = 4), on top of multi-threading. One important consideration is that, for this to work, we had to copy the R matrix into a double vector so that the elements of each row were contiguous, which is important for SIMD.
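
The need for a copy follows from R's column-major storage: element `(i, k)` of an `N` by `M` matrix sits at offset `i + k * N`, so consecutive elements of one row are `N` doubles apart. A minimal sketch of such a copy into row-major order is shown below. The helper name `to_row_major` is hypothetical and is not taken from `norm.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy an n-by-m matrix from column-major order (the
// layout R uses) to row-major order, so each row occupies m contiguous doubles.
std::vector<double> to_row_major(const std::vector<double>& col_major,
                                 std::size_t n, std::size_t m) {
    assert(col_major.size() == n * m);
    std::vector<double> row_major(n * m);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < m; ++k)
            row_major[i * m + k] = col_major[i + k * n];
    return row_major;
}
```

The column-major storage of `t(x)` is the same sequence of doubles as the row-major storage of `x`, which is presumably why the pointer-based variant in the benchmark receives the transposed matrix and can skip the copy.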

Finally, the [`microbenchmark`](https://cran.r-project.org/package=microbenchmark) R package provides a boxplot method that gives a nice visual comparison of all the methods:

```{r viz, dependson='execution'}
op <- par(mai = par("mai") * c(2,1,1,1))
boxplot(ans_bm, las = 2, xlab = "")
par(op)
```

## Session info

The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html).

```{r}
sessionInfo()
```
