https://github.com/flaviostutz/spark-scala-jupyter
Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master
- Host: GitHub
- URL: https://github.com/flaviostutz/spark-scala-jupyter
- Owner: flaviostutz
- License: MIT
- Created: 2020-04-14T00:41:53.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-04-25T00:06:17.000Z (about 5 years ago)
- Last Synced: 2025-02-26T18:23:50.439Z (3 months ago)
- Topics: hdfs, hdfs-cluster, hdfs-docker, jupyter, jupyter-notebook, scala, scala-spark, spark, spark-sql
- Language: Jupyter Notebook
- Size: 1.17 MB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# spark-scala-jupyter
Jupyter notebook server prepared for running Spark with Scala kernels on a remote Spark master
See a **[complete example](/example)** of running the Spark/Scala notebook with custom jars, SBT packaging, a clustered HDFS, Scala data visualizations in notebooks, and more.
Watch a [demonstration video here](https://youtu.be/ASMh0yHJZls).
## Usage
* Create a `docker-compose.yml`:
```yml
version: "3.5"

services:

  spark-scala-jupyter:
    image: flaviostutz/spark-scala-jupyter
    ports:
      - 8888:8888
      - 6006:6006
    # volumes:
    #   - /notebooks:/notebooks
    environment:
      - JUPYTER_TOKEN=flaviostutz
      - SPARK_MASTER=spark://spark-master:7077

  # SPARK SERVICES

  spark-master:
    image: bde2020/spark-master:2.4.5-hadoop2.7
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
      - SPARK_PUBLIC_DNS=localhost

  spark-worker:
    image: bde2020/spark-worker:2.4.5-hadoop2.7
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_PUBLIC_DNS=localhost
```
* Run `docker-compose up -d`
* Open Jupyter at http://localhost:8888
* Create a new notebook with the Scala kernel (Apache Toree) and the following contents:
```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

println("Initializing Spark context...")
val conf = new SparkConf().setAppName("Example App")
val spark: SparkSession = SparkSession.builder.config(conf).getOrCreate()

println("************")
println("Hello, world!")
// Parallelize the range itself; Array(1 to 1000) would create a single-element RDD
val rdd = spark.sparkContext.parallelize(1 to 1000)
println(rdd.count())
println("************")
```
* Open the Spark Master UI at http://localhost:8080
* View the running app (submitted from the Jupyter notebook execution!)
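Since the image also targets Spark SQL, the same session can run SQL queries against the remote cluster. A minimal sketch, reusing the `spark` session created in the notebook above (the view name is illustrative):

```scala
// Build a small DataFrame on the driver and register it as a temp view
val df = spark.range(1, 1001).toDF("n")
df.createOrReplaceTempView("numbers")

// Executes as a distributed job on the remote master, like the RDD example
spark.sql("SELECT COUNT(*) AS total, SUM(n) AS sum FROM numbers").show()
```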
### ENVs
* JUPYTER_TOKEN - password/token for accessing the notebook server. Defaults to ''
* SPARK_MASTER - Spark master location for submitting Spark jobs. Ex.: 'spark://spark-master:7077'. Defaults to 'local[*]'
* HDFS_URL - HDFS namenode URL to be used by Spark. Ex.: hdfs://namenode1:8020. Defaults to ''
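For example, to point the notebook server at both the remote master and an HDFS namenode, the `spark-scala-jupyter` service from the compose file above could be extended like this (the namenode address is the illustrative one from the `HDFS_URL` example):

```yml
  spark-scala-jupyter:
    image: flaviostutz/spark-scala-jupyter
    ports:
      - 8888:8888
    environment:
      - JUPYTER_TOKEN=flaviostutz
      - SPARK_MASTER=spark://spark-master:7077
      # illustrative namenode address (see HDFS_URL above)
      - HDFS_URL=hdfs://namenode1:8020
```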
## Some Jupyter Visualizations with Scala
* With Vegas - https://github.com/vegas-viz/Vegas
```scala
%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive
%AddDeps org.vegas-viz vegas-spark_2.11 0.3.11

implicit val render = vegas.render.ShowHTML(kernel.display.content("text/html", _))

import vegas._
import vegas.data.External._

Vegas("A simple bar chart with embedded data.").
  withData(Seq(
    Map("a" -> "A", "b" -> 28), Map("a" -> "B", "b" -> 55), Map("a" -> "C", "b" -> 43),
    Map("a" -> "D", "b" -> 91), Map("a" -> "E", "b" -> 81), Map("a" -> "F", "b" -> 53),
    Map("a" -> "G", "b" -> 19), Map("a" -> "H", "b" -> 87), Map("a" -> "I", "b" -> 52)
  )).
  encodeX("a", Ordinal).
  encodeY("b", Quantitative).
  mark(Bar).
  show
```
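The snippet above embeds its data inline. Since the `%AddDeps` lines also pull in `vegas-spark`, a Spark DataFrame can be plotted directly. A hedged sketch using the Vegas project's Spark integration (the DataFrame contents here are illustrative):

```scala
import vegas._
import vegas.sparkExt._  // Spark integration from vegas-spark: adds withDataFrame

// Illustrative DataFrame with the same columns as the inline example
val df = spark.createDataFrame(Seq(("A", 28), ("B", 55), ("C", 43))).toDF("a", "b")

Vegas("A bar chart from a Spark DataFrame").
  withDataFrame(df).
  encodeX("a", Ordinal).
  encodeY("b", Quantitative).
  mark(Bar).
  show
```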