https://github.com/dominodatalab/workshop-sparklyr-introduction
- Host: GitHub
- URL: https://github.com/dominodatalab/workshop-sparklyr-introduction
- Owner: dominodatalab
- Created: 2023-03-30T06:45:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-04-11T13:18:48.000Z (over 1 year ago)
- Last Synced: 2023-08-07T03:05:39.734Z (over 1 year ago)
- Language: R
- Size: 18.6 KB
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Introduction to Spark (via sparklyr)
This repository contains materials used for the Domino Field Data Science "Introduction to Spark (via sparklyr)" module.
* *hello_world.R* - A simple "Hello World" type of script, which connects to on-demand Spark in Domino and loads an R data.frame into Spark
* *csv_example.R* - Reading a tabular data file into a Spark DataFrame
* *sql_example.R* - Running SQL queries against Spark tables via DBI
* *dplyr_example.R* - Using dplyr with Spark
* *ft_example.R* - Feature Transformations example
* *ml_example.R* - Simple logistic regression model fitting and scoring example

### Setup instructions
#### Compute environments
This training uses two custom compute environments (CEs). The definition of the CE for the RStudio workspace is as follows:
```
# SparklyR:Spark3.3.1-Workspace
FROM quay.io/domino/compute-environment-images:ubuntu20-py3.9-r4.2-spark3.3.1-hadoop3.3.4-domino5.5

# Install and configure sparklyr
RUN sudo R -e "remotes::install_version('sparklyr', version = '1.8.1', dependencies= T)"
# Disable Hive
RUN sudo bash -c "echo ' sparklyr.connect.enablehivesupport: false' >> /usr/local/lib/R/site-library/sparklyr/conf/config-template.yml"

# Additional packages
RUN sudo R -e "remotes::install_version('dbplot', version = '0.3.3', dependencies= T)"
```

This compute environment also needs the following Pluggable Workspace Tools configuration for running RStudio:
```
rstudio:
  title: "RStudio"
  iconUrl: "/assets/images/workspace-logos/Rstudio.svg"
  start: [ "/opt/domino/workspaces/rstudio/start" ]
  httpProxy:
    port: 8888
    requireSubdomain: false
```

The second CE is used as a Spark cluster environment. Its definition is:
```
# SparklyR:Spark3.3.1
FROM quay.io/domino/cluster-environment-images:spark3.3.1-hadoop3.3.2-py3.9-domino5.5

USER root
ENV R_LIBS_SITE=/usr/local/lib/R/site-library
ENV R_VERSION=4.2.3

RUN apt-get update && \
    apt-get install -y --no-install-recommends gnupg2 software-properties-common

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-key '95C0FAF38DB3CCAD0C080A7BDC78B2DDEABC47B7'
RUN add-apt-repository "deb https://cloud.r-project.org/bin/linux/debian $(lsb_release -sc)-cran40/" && \
    apt-get update
RUN apt-get install -y --no-install-recommends dirmngr software-properties-common && \
    apt-get install -y --no-install-recommends r-recommended=${R_VERSION}-* r-base=${R_VERSION}-* r-base-dev=${R_VERSION}-* && \
    apt-mark hold r-* && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN apt show r-base

RUN R -q -e "install.packages('remotes')" && \
    R -e "remotes::install_version('sparklyr', version = '1.8.1', dependencies= T)"
```
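
With both environments built, a first session in the spirit of *hello_world.R* might look like the sketch below. It is not taken from the repository: the `master` value is an assumption (local mode is used so the snippet runs anywhere, whereas an on-demand cluster supplies its own master URL), and `mtcars` merely stands in for any R data.frame. The Hive setting applies, per session, the same option that the workspace image appends to the sparklyr config template.

```r
library(sparklyr)
library(dplyr)

# Start from sparklyr's default configuration and turn Hive support off,
# mirroring the option appended to config-template.yml in the workspace image.
config <- spark_config()
config[["sparklyr.connect.enablehivesupport"]] <- FALSE

# ASSUMPTION: "local" is a placeholder master; replace it with the master URL
# of the Spark cluster attached to your workspace.
sc <- spark_connect(master = "local", config = config)

# Copy an R data.frame into Spark; mtcars is only a stand-in dataset.
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# sdf_nrow() counts the rows on the Spark side.
n_rows <- sdf_nrow(cars_tbl)
print(n_rows)

spark_disconnect(sc)
```

Setting the option in the session config rather than in the image is useful when you want to test the connection before rebuilding an environment.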
For additional information on running on-demand Spark clusters in Domino, please see the [Domino documentation](https://docs.dominodatalab.com/en/latest/user_guide/482ec5/on-demand-spark/).

#### Additional configuration
Copy the diabetes.csv file to the project's local dataset so that it is visible to all Spark worker nodes:
```
cp /mnt/code/data/diabetes.csv /mnt/data//diabetes.csv
```
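
Once the file sits in the project dataset, the flow covered by *csv_example.R*, *sql_example.R*, and *dplyr_example.R* can be sketched roughly as below. This is an illustration, not the repository's code: `read_and_count()` is a hypothetical helper, the table name `diabetes` is chosen here, and the CSV is assumed to have a header row. The path is passed in by the caller, so supply the location you copied the file to.

```r
library(sparklyr)
library(dplyr)
library(DBI)

# Hypothetical helper: load a CSV into Spark and count its rows two ways.
read_and_count <- function(sc, csv_path) {
  # ASSUMPTION: the file has a header row; Spark infers the column types.
  diabetes_tbl <- spark_read_csv(
    sc,
    name = "diabetes",
    path = csv_path,
    header = TRUE,
    infer_schema = TRUE
  )

  # SQL through DBI: the sparklyr connection also acts as a DBI connection.
  n_sql <- dbGetQuery(sc, "SELECT COUNT(*) AS n FROM diabetes")$n

  # The same count with dplyr; collect() brings the result back into R.
  n_dplyr <- diabetes_tbl %>%
    summarise(n = n()) %>%
    collect() %>%
    pull(n)

  c(sql = as.numeric(n_sql), dplyr = as.numeric(n_dplyr))
}
```

Call it with an open connection and the dataset path, for example `read_and_count(sc, csv_path)`. The two counts should agree, which is a quick check that the registered table and the dplyr reference point at the same data.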