Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Emaasit/sparkredshift
An R package to read data from Amazon Redshift into Spark DataFrames
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/Emaasit/sparkredshift
- Owner: Emaasit
- License: gpl-3.0
- Created: 2016-10-05T19:09:55.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2016-10-06T06:07:06.000Z (about 8 years ago)
- Last Synced: 2024-04-26T21:32:48.727Z (7 months ago)
- Language: R
- Homepage: http://www.danielemaasit.com/sparkredshift/
- Size: 24.4 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- awesome-sparklyr - sparkredshift: An R package to read data from Amazon Redshift into Spark DataFrames
README
---
title: "sparkredshift"
author: "Daniel Emaasit"
date: "`r Sys.Date()`"
output: md_document
---
```{r setup, include=FALSE, eval = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Read data from Amazon Redshift into Spark DataFrames
## What is sparkredshift?
sparkredshift is an extension for sparklyr that reads data from Amazon Redshift into Spark DataFrames. It relies on the Spark package spark-redshift to load the Redshift data.
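As a rough illustration of how a sparklyr extension can pull in a Spark package such as spark-redshift, here is a minimal sketch built on sparklyr's extension hooks `spark_dependency()` and `register_extension()`. It is not taken from the sparkredshift source, and the Maven coordinates and version shown are assumptions chosen for illustration only.

```r
# Sketch only: how an extension package might declare its Spark dependency.
# The coordinates "com.databricks:spark-redshift_<scala>:2.0.1" are an
# assumption, not necessarily what sparkredshift itself uses.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    packages = sprintf("com.databricks:spark-redshift_%s:2.0.1", scala_version)
  )
}

# Runs when the extension package is loaded and registers it with sparklyr,
# so that spark_connect() picks up the dependencies declared above.
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}
```

With hooks like these in place, loading the extension before calling `spark_connect()` would be enough for sparklyr to request the declared package when the connection is created.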
## Installation
sparkredshift requires the sparklyr package to run.
### Install sparklyr
I recommend the latest stable version of sparklyr available on CRAN:
```{r eval = FALSE}
install.packages("sparklyr")
```
### Install sparkredshift
Install the development version of sparkredshift from this GitHub repo using devtools:
```{r eval = FALSE}
library(devtools)
devtools::install_github("emaasit/sparkredshift")
```
## Connecting to Spark
If Spark is not already installed, use the following sparklyr command to install your preferred version of Spark:
```{r eval = FALSE}
library(sparklyr)
spark_install(version = "2.0.0")
```
The call to `library(sparkredshift)` will make the sparkredshift functions available on the R search path and will also ensure that the dependencies required by the package are included when we connect to Spark.
```{r eval = FALSE}
library(sparkredshift)
```
We can create a Spark connection as follows:
```{r eval = FALSE}
sc <- spark_connect(master = "local")
```
## Reading Redshift files
sparkredshift provides the function `spark_read_redshift` to read Redshift data files into Spark DataFrames. It uses a Spark package called spark-redshift. Here's an example:
```{r eval = FALSE}
mtcars_file <- system.file("extdata", "mtcars.redshift", package = "sparkredshift")
mtcars_df <- spark_read_redshift(sc, path = mtcars_file, table = "redshift_table")
mtcars_df
```
The resulting pointer to a Spark table can be further used in dplyr statements.
```{r eval = FALSE}
library(dplyr)
mtcars_df %>%
  group_by(cyl) %>%
  summarise(count = n(), avg.mpg = mean(mpg), avg.displacement = mean(disp), avg.horsepower = mean(hp))
```
## Logs & Disconnect
Look at the Spark log from R:
```{r eval = FALSE}
spark_log(sc, n = 100)
```
Now we disconnect from Spark:
```{r eval = FALSE}
spark_disconnect(sc)
```
## Acknowledgements
Thanks to RStudio for the sparklyr package, which provides the functionality to create extensions.