https://github.com/Emaasit/sparkredshift

An R package to read data from Amazon Redshift into Spark DataFrames

Host: GitHub
URL: https://github.com/Emaasit/sparkredshift
Owner: Emaasit
License: gpl-3.0
Created: 2016-10-05T19:09:55.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2016-10-06T06:07:06.000Z (over 8 years ago)
Last Synced: 2024-10-20T05:37:35.652Z (8 months ago)
Language: R
Homepage: http://www.danielemaasit.com/sparkredshift/
Size: 24.4 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

awesome-sparklyr - sparkredshift: An R package to read data from Amazon Redshift into Spark DataFrames

README

---
title: "sparkredshift"
author: "Daniel Emaasit"
date: "`r Sys.Date()`"
output: md_document
---

```{r setup, include=FALSE, eval = FALSE}

knitr::opts_chunk$set(echo = TRUE)

```

# Read data from Amazon Redshift into Spark DataFrames

## What is sparkredshift?

sparkredshift is an extension for sparklyr that reads data from Amazon Redshift into Spark DataFrames. It relies on the Spark package spark-redshift to load the data.

## Installation

sparkredshift requires the sparklyr package to run.

### Install sparklyr

I recommend the latest stable version of sparklyr available on CRAN:

```{r eval = FALSE}

install.packages("sparklyr")

```

### Install sparkredshift

Install the development version of sparkredshift from this GitHub repo using devtools:

```{r eval = FALSE}

library(devtools)

devtools::install_github("emaasit/sparkredshift")

```

## Connecting to Spark

If Spark is not already installed, use the following sparklyr command to install your preferred version of Spark:

```{r eval = FALSE}

library(sparklyr)

spark_install(version = "2.0.0")

```

The call to `library(sparkredshift)` will make the sparkredshift functions available on the R search path and will also ensure that the dependencies required by the package are included when we connect to Spark.

```{r eval = FALSE}

library(sparkredshift) 

```
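For readers curious how loading the package can affect a later Spark connection, here is a minimal sketch of the hook that sparklyr extensions typically define. It is not taken from the sparkredshift source: the Maven coordinates and the version `2.0.1` are assumptions chosen for illustration, while `spark_dependency()` and `register_extension()` are part of sparklyr's extension API.

```r
library(sparklyr)

# Hypothetical reconstruction of an extension's dependency hook.
# The Maven coordinates below are an assumption, not read from sparkredshift.
spark_dependencies <- function(spark_version, scala_version, ...) {
  spark_dependency(
    packages = sprintf("com.databricks:spark-redshift_%s:2.0.1", scala_version)
  )
}

# Inside a package, .onLoad registers the extension so that spark_connect()
# collects the dependencies declared above.
.onLoad <- function(libname, pkgname) {
  register_extension(pkgname)
}

# Inspect what would be requested for a Spark 2.0.0 / Scala 2.11 connection.
deps <- spark_dependencies(spark_version = "2.0.0", scala_version = "2.11")
deps$packages
```

Because the hook only declares packages, nothing is downloaded until a connection is opened with `spark_connect()`.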

We can create a Spark connection as follows:

```{r eval = FALSE}

sc <- spark_connect(master = "local")

```

## Reading Redshift files

sparkredshift provides the function `spark_read_redshift` to read Redshift data files into Spark DataFrames, using the spark-redshift package mentioned above. Here's an example:

```{r eval = FALSE}

mtcars_file <- system.file("extdata", "mtcars.redshift", package = "sparkredshift")

mtcars_df <- spark_read_redshift(sc, path = mtcars_file, table = "redshift_table")

mtcars_df

```

The resulting pointer to a Spark table can be further used in dplyr statements.

```{r eval = FALSE}

library(dplyr)

mtcars_df %>% group_by(cyl) %>%

  summarise(count = n(), avg.mpg = mean(mpg), avg.displacement = mean(disp), avg.horsepower = mean(hp))

```
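To sanity-check the aggregation without a Spark connection, the same dplyr verbs can be run on R's built-in `mtcars` data frame. This sketch assumes the Spark table holds the same rows as the built-in dataset; the object name `local_summary` and the underscore-style column names are illustrative only.

```r
library(dplyr)

# Local check on the built-in mtcars data; assumes the Spark table mirrors it.
local_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    count = n(),
    avg_mpg = mean(mpg),
    avg_disp = mean(disp),
    avg_hp = mean(hp)
  )

local_summary
```

If the Spark result differs from this local summary, the table read from Redshift probably does not contain the same data.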

## Logs & Disconnect

Look at the Spark log from R:

```{r eval = FALSE}

spark_log(sc, n = 100)

```

Now we disconnect from Spark:

```{r eval = FALSE}

spark_disconnect(sc)

```

## Acknowledgements

Thanks to RStudio for the sparklyr package, which provides the functionality for creating extensions.
