---
title: "sparkworker: R Worker for Apache Spark"
output:
github_document:
fig_width: 9
fig_height: 5
---

`sparkworker` provides support for executing arbitrary distributed R code from
`sparklyr`. As with any other `sparklyr` extension, load `sparkworker` and
`sparklyr`, then connect to Apache Spark:

```{r}
library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
```
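
The `version = "2.1.0"` argument assumes a local Spark 2.1.0 installation is
available; if one is missing, `sparklyr` can install it first:

```{r eval=FALSE}
# One-time setup: download and install a local Spark 2.1.0 distribution.
spark_install(version = "2.1.0")
```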

To execute arbitrary functions, use `spark_apply` as follows:

```{r}
spark_apply(iris_tbl, function(row) {
  # Add gamma-distributed noise to each Petal_Width value.
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
```
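
Note that `sdf_copy_to` translates the dots in the `iris` column names (for
example, `Petal.Width`) into underscores, which is why the closure references
`Petal_Width`. One quick way to confirm the renamed columns:

```{r eval=FALSE}
# Peek at the copied table; columns like Petal.Width appear as Petal_Width.
head(iris_tbl)
```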

We can estimate π using `dplyr` and `spark_apply`: the fraction of points drawn
uniformly from the square `[-1, 1] x [-1, 1]` that land inside the unit circle
approaches π / 4, so multiplying that fraction by 4 approximates π:

```{r message=FALSE}
library(dplyr)

sdf_len(sc, 10000) %>%
  # For each of the 10000 rows, test whether a random point in the
  # square [-1, 1] x [-1, 1] falls inside the unit circle.
  spark_apply(function() sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
  filter(id) %>%
  count() %>%
  collect() * 4 / 10000
```
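
For intuition, the same Monte Carlo estimate can be sketched in plain R without
Spark (this vectorized form is an illustration, not part of `sparkworker`):

```{r}
# Draw 10000 random points in the square and count those inside the circle.
n <- 10000
inside <- sum(runif(n, -1, 1) ^ 2 + runif(n, -1, 1) ^ 2 < 1)
4 * inside / n  # approaches pi as n grows
```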

Notice that `spark_log` shows `sparklyr` performing the following operations:

1. The `Gateway` receives a request to execute a custom `RDD` of type `WorkerRDD`.
2. The `WorkerRDD` is evaluated on the worker node, which initializes a new
`sparklyr` backend tracked as `Worker` in the logs.
3. The backend initializes an `RScript` process that connects back to the
backend, retrieves the data, applies the closure, and updates the result.

```{r}
spark_log(sc, filter = "sparklyr:", n = 30)
```

Finally, we disconnect:

```{r}
spark_disconnect(sc)
```