An open API service indexing awesome lists of open source software.

https://github.com/uscbiostats/r-parallel-benchmark

Using Rcpp with OpenMP (parfor and SIMD)
https://github.com/uscbiostats/r-parallel-benchmark

benchmark hpc openmp parallel-computing rcpp rstats simd

Last synced: 5 months ago
JSON representation

Using Rcpp with OpenMP (parfor and SIMD)

Awesome Lists containing this project

README

          

---
output: github_document
---

# Rcpp + OpenMP

This repository shows how much speed gains can be obtained from using
[OpenMP](https://openmp.org), and in particular, the `omp simd` and `parallel for`
instructions.

The test consists on computing the pair-wise distances between rows in a matrix
of size `N` by `M`. The equivalent function in R is `dist()`, and here we
redefined it using [`Rcpp`](https://cran.r-project.org/package=Rcpp) (checkout
the benchmark computing matrix product [here](matrix.md)).

The file [norm.cpp](norm.cpp) contains the C++ source code for the dist functions.
The compiles function are:

- `dist_omp_simd` Using the pragma directives `parallel for` and `simd`.

- `dist_omp_simd_ptr` Same as above, but instead of creating a copy of the input matrix, it uses a `const double *` (a pointer) to access the data.

- `dist_omp` Using the pragma `parallel for`.

- `dist_simd` Using the pragma `simd`.

- `dist_for` no directives.

- `dist_for_arma2` Using Armadillo with vectorized functions.

- `dist_for_arma1` Armadillo implementation with for-loops.

## Speed benchmark

```{r execution, cache=TRUE}
# Notice that the -fopenmp flag is already included in the norm.cpp file
Sys.setenv("PKG_CXXFLAGS" = "-O2 -mavx2 -march=core-avx2 -mtune=core-avx2 -DARMA_USE_OPENMP")
Rcpp::sourceCpp("norm.cpp")

library(microbenchmark)
set.seed(718243)
N <- 500
M <- 1000
x <- matrix(runif(N * M), nrow = N)
xt <- t(x)

(ans_bm <- microbenchmark(
`SIMD + parfor` = dist_omp_simd(x, N, M, 2),
`SIMD + parfor (ptr)`= dist_omp_simd_ptr(xt, N, M, 2),
`parfor` = dist_omp(x, N, M, 2),
`SIMD` = dist_simd(x, N, M),
`serial` = dist_for(x, N, M),
`arma sugar` = dist_for_arma2(x,N,M),
`arma` = dist_for_arma1(x,N,M),
R = as.matrix(dist(x)),
times = 10,
unit = "relative"
))
```

As a reference, the elapsed time in ms for R and SIMD + parfor is

```{r print-as-ms, echo=FALSE}
library(microbenchmark)
print(ans_bm[ans_bm$expr %in% c("R", "SIMD + parfor"),], unit = "ms")
```

Overall, in my machine, the SIMD+parfor combo outperforms all the others (notice
that when it comes to compute matrix products, [Armadillo is the fastest](matrix.md)).
Let's see if the results are equivalent. At the very least, we should only
observe small differences (if any) b/c of precision:

```{r Comparing-results, cache=TRUE}
Rcpp::sourceCpp("norm.cpp")
ans0 <- as.matrix(dist(x))
ans_a <- dist_omp_simd(x, N, M)
ans_b <- dist_omp(x, N, M)
ans_c <- dist_simd(x, N, M)
ans_d <- dist_for(x, N, M)
ans_e <- dist_omp_simd_ptr(t(x), N, M)
range(ans0 - ans_b)
range(ans_a - ans_b)
range(ans_b - ans_c)
range(ans_c - ans_d)
range(ans_d - ans_e)
```

The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html) which works with [AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2), i.e. we can literally vectorize 4 double precision operations at a time (512/64 = 4, on top of multi-threading). One important thing to consider is that for this to work we had to generate a copy of the R matrix into a double vector so that elements were contiguous (which is important for SIMD).

Finally, the [`microbenchmark`](https://cran.r-project.org/package=microbenchmark) R package offers a nice viz with boxplot comparing all the methods:

```{r viz, dependson='execution'}
op <- par(mai = par("mai") * c(2,1,1,1))
boxplot(ans_bm, las = 2, xlab = "")
par(op)
```

## Session info

The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html)

```{r}
sessionInfo()
```