---
title: "sparkworker: R Worker for Apache Spark"
output:
github_document:
fig_width: 9
fig_height: 5
---

`sparkworker` provides support for executing arbitrary distributed R code from
`sparklyr`. As with any other `sparklyr` extension, load `sparkworker` and
`sparklyr`, then connect to Apache Spark:

```{r}
library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
```
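
The `version = "2.1.0"` argument assumes a local Spark 2.1.0 installation is
available; if one is missing, `sparklyr` can install it first:

```{r eval=FALSE}
# One-time setup: download and install a local Spark 2.1.0 distribution.
spark_install(version = "2.1.0")
```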

To execute arbitrary functions, use `spark_apply` as follows:

```{r}
spark_apply(iris_tbl, function(row) {
  # Add gamma-distributed noise to each Petal_Width value.
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
```
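
Note that `sdf_copy_to` translates the dots in the `iris` column names (for
example, `Petal.Width`) into underscores, which is why the closure references
`Petal_Width`. One quick way to confirm the renamed columns:

```{r eval=FALSE}
# Peek at the copied table; columns like Petal.Width appear as Petal_Width.
head(iris_tbl)
```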

We can estimate π using `dplyr` and `spark_apply`: the fraction of points drawn
uniformly from the square `[-1, 1] x [-1, 1]` that land inside the unit circle
approaches π / 4, so multiplying that fraction by 4 approximates π:

```{r message=FALSE}
library(dplyr)

sdf_len(sc, 10000) %>%
  # For each of the 10000 rows, test whether a random point in the
  # square [-1, 1] x [-1, 1] falls inside the unit circle.
  spark_apply(function() sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
  filter(id) %>%
  count() %>%
  collect() * 4 / 10000
```
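
For intuition, the same Monte Carlo estimate can be sketched in plain R without
Spark (this vectorized form is an illustration, not part of `sparkworker`):

```{r}
# Draw 10000 random points in the square and count those inside the circle.
n <- 10000
inside <- sum(runif(n, -1, 1) ^ 2 + runif(n, -1, 1) ^ 2 < 1)
4 * inside / n  # approaches pi as n grows
```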

Notice that `spark_log` shows `sparklyr` performing the following operations:

1. The `Gateway` receives a request to execute a custom `RDD` of type `WorkerRDD`.
2. The `WorkerRDD` is evaluated on the worker node, which initializes a new
`sparklyr` backend tracked as `Worker` in the logs.
3. The backend initializes an `RScript` process that connects back to the
backend, retrieves the data, applies the closure, and updates the result.

```{r}
spark_log(sc, filter = "sparklyr:", n = 30)
```

Finally, we disconnect:

```{r}
spark_disconnect(sc)
```