Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
sparkworker: R Worker for Apache Spark
https://github.com/javierluraschi/sparkworker
- Host: GitHub
- URL: https://github.com/javierluraschi/sparkworker
- Owner: javierluraschi
- License: apache-2.0
- Created: 2017-02-22T01:44:34.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2017-07-15T00:01:33.000Z (over 7 years ago)
- Last Synced: 2024-12-19T04:51:40.080Z (about 1 month ago)
- Language: Scala
- Size: 384 KB
- Stars: 6
- Watchers: 7
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
title: "sparkworker: R Worker for Apache Spark"
output:
  github_document:
    fig_width: 9
    fig_height: 5
---

`sparkworker` provides support for executing arbitrary distributed R code. As with
any other `sparklyr` extension, load `sparkworker` and `sparklyr`, then connect to
Apache Spark:

```{r}
library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
```

To execute arbitrary functions, use `spark_apply` as follows:
```{r}
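# Add gamma-distributed noise to each petal width. Note that `sdf_copy_to`
# renames iris's dotted columns (Petal.Width) to underscores (Petal_Width).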
spark_apply(iris_tbl, function(row) {
row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
row
})
```

We can also estimate π using `dplyr` and `spark_apply`. The idea is Monte Carlo
sampling: a point drawn uniformly from [-1, 1]² falls inside the unit circle with
probability π / 4, so four times the observed fraction approximates π.
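For intuition, here is the same estimate in plain R, without Spark (an
illustrative sketch, not part of the package):

```{r}
# Draw 10,000 points uniformly in [-1, 1] x [-1, 1] and count how many
# fall inside the unit circle; four times that fraction approximates pi.
n <- 10000
x <- runif(n, min = -1, max = 1)
y <- runif(n, min = -1, max = 1)
inside <- sum(x ^ 2 + y ^ 2 < 1)
4 * inside / n
```

Distributed with `spark_apply`, each of 10,000 rows performs the same test once: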
```{r message=FALSE}
library(dplyr)

sdf_len(sc, 10000) %>%
spark_apply(function() sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
filter(id) %>% count() %>% collect() * 4 / 10000
```

Notice that `spark_log` shows `sparklyr` performing the following operations:
1. The `Gateway` receives a request to execute a custom `RDD` of type `WorkerRDD`.
2. The `WorkerRDD` is evaluated on the worker node which initializes a new
`sparklyr` backend tracked as `Worker` in the logs.
3. The backend initializes an `RScript` process that connects back to the
   backend, retrieves the data, applies the closure, and returns the result
   (a schematic sketch of this step follows the log output below).

```{r}
spark_log(sc, filter = "sparklyr:", n = 30)
```
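As a rough mental model of step 3, here is a plain-R schematic of what the
worker-side `RScript` process might do for one partition. This is an
illustrative sketch only; the real implementation talks to the backend over a
socket, and the names here are hypothetical:

```{r eval=FALSE}
# Hypothetical sketch of the worker step; not the actual sparkworker code.
worker_run_partition <- function(serialized_closure, partition_rows) {
  f <- unserialize(serialized_closure)  # closure shipped from the driver
  result <- lapply(partition_rows, f)   # apply the user function to each row
  serialize(result, connection = NULL)  # raw bytes sent back to the backend
}
```

Finally, we disconnect: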
```{r}
spark_disconnect(sc)
```