https://github.com/uscbiostats/r-parallel-benchmark
Using Rcpp with OpenMP (parfor and SIMD)
https://github.com/uscbiostats/r-parallel-benchmark
benchmark hpc openmp parallel-computing rcpp rstats simd
Last synced: 5 months ago
JSON representation
Using Rcpp with OpenMP (parfor and SIMD)
- Host: GitHub
- URL: https://github.com/uscbiostats/r-parallel-benchmark
- Owner: USCbiostats
- Created: 2020-05-26T01:27:28.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-07-29T22:20:48.000Z (over 5 years ago)
- Last Synced: 2025-07-14T12:37:57.963Z (6 months ago)
- Topics: benchmark, hpc, openmp, parallel-computing, rcpp, rstats, simd
- Language: C++
- Homepage:
- Size: 53.7 KB
- Stars: 6
- Watchers: 4
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
README
---
output: github_document
---
# Rcpp + OpenMP
This repository shows how much speed gains can be obtained from using
[OpenMP](https://openmp.org), and in particular, the `omp simd` and `parallel for`
instructions.
The test consists on computing the pair-wise distances between rows in a matrix
of size `N` by `M`. The equivalent function in R is `dist()`, and here we
redefined it using [`Rcpp`](https://cran.r-project.org/package=Rcpp) (checkout
the benchmark computing matrix product [here](matrix.md)).
The file [norm.cpp](norm.cpp) contains the C++ source code for the dist functions.
The compiles function are:
- `dist_omp_simd` Using the pragma directives `parallel for` and `simd`.
- `dist_omp_simd_ptr` Same as above, but instead of creating a copy of the input matrix, it uses a `const double *` (a pointer) to access the data.
- `dist_omp` Using the pragma `parallel for`.
- `dist_simd` Using the pragma `simd`.
- `dist_for` no directives.
- `dist_for_arma2` Using Armadillo with vectorized functions.
- `dist_for_arma1` Armadillo implementation with for-loops.
## Speed benchmark
```{r execution, cache=TRUE}
# Notice that the -fopenmp flag is already included in the norm.cpp file
Sys.setenv("PKG_CXXFLAGS" = "-O2 -mavx2 -march=core-avx2 -mtune=core-avx2 -DARMA_USE_OPENMP")
Rcpp::sourceCpp("norm.cpp")
library(microbenchmark)
set.seed(718243)
N <- 500
M <- 1000
x <- matrix(runif(N * M), nrow = N)
xt <- t(x)
(ans_bm <- microbenchmark(
`SIMD + parfor` = dist_omp_simd(x, N, M, 2),
`SIMD + parfor (ptr)`= dist_omp_simd_ptr(xt, N, M, 2),
`parfor` = dist_omp(x, N, M, 2),
`SIMD` = dist_simd(x, N, M),
`serial` = dist_for(x, N, M),
`arma sugar` = dist_for_arma2(x,N,M),
`arma` = dist_for_arma1(x,N,M),
R = as.matrix(dist(x)),
times = 10,
unit = "relative"
))
```
As a reference, the elapsed time in ms for R and SIMD + parfor is
```{r print-as-ms, echo=FALSE}
library(microbenchmark)
print(ans_bm[ans_bm$expr %in% c("R", "SIMD + parfor"),], unit = "ms")
```
Overall, in my machine, the SIMD+parfor combo outperforms all the others (notice
that when it comes to compute matrix products, [Armadillo is the fastest](matrix.md)).
Let's see if the results are equivalent. At the very least, we should only
observe small differences (if any) b/c of precision:
```{r Comparing-results, cache=TRUE}
Rcpp::sourceCpp("norm.cpp")
ans0 <- as.matrix(dist(x))
ans_a <- dist_omp_simd(x, N, M)
ans_b <- dist_omp(x, N, M)
ans_c <- dist_simd(x, N, M)
ans_d <- dist_for(x, N, M)
ans_e <- dist_omp_simd_ptr(t(x), N, M)
range(ans0 - ans_b)
range(ans_a - ans_b)
range(ans_b - ans_c)
range(ans_c - ans_d)
range(ans_d - ans_e)
```
The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html) which works with [AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2), i.e. we can literally vectorize 4 double precision operations at a time (512/64 = 4, on top of multi-threading). One important thing to consider is that for this to work we had to generate a copy of the R matrix into a double vector so that elements were contiguous (which is important for SIMD).
Finally, the [`microbenchmark`](https://cran.r-project.org/package=microbenchmark) R package offers a nice viz with boxplot comparing all the methods:
```{r viz, dependson='execution'}
op <- par(mai = par("mai") * c(2,1,1,1))
boxplot(ans_bm, las = 2, xlab = "")
par(op)
```
## Session info
The programs were compiled on a machine with an
[Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz processor](https://ark.intel.com/content/www/us/en/ark/products/95443/intel-core-i5-7200u-processor-3m-cache-up-to-3-10-ghz.html)
```{r}
sessionInfo()
```