https://github.com/michelderu/jupyter-spark-cassandra
Complete environment that allows you to use Jupyter with PySpark in combination with Cassandra and Spark.
- Host: GitHub
- URL: https://github.com/michelderu/jupyter-spark-cassandra
- Owner: michelderu
- Created: 2021-03-09T19:05:15.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-03-09T19:21:08.000Z (over 4 years ago)
- Last Synced: 2025-01-20T08:49:25.150Z (5 months ago)
- Language: Jupyter Notebook
- Size: 71.3 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Jupyter notebook with Spark master, 2 workers and a Cassandra database
This repo contains a working environment that allows you to use PySpark in combination with Cassandra and Spark.

## Build a specific version of bitnami-spark
We need to match the Python and Spark versions between the Spark and Jupyter containers.
- `jupyter/pyspark-notebook:29edefbcb06a` is a Jupyter container with Python 3.8.8 and Spark 3.0.2
- `bitnami-spark` will be modified to include Python 3.8.8 (instead of 3.6); it already includes Spark 3.0.2
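To double-check that the two images really line up, you can query the versions directly. This is a sketch: the image tag and the `custom-bitnami-spark` name come from this repo, but the exact in-container commands (`python` vs `python3`, `spark-submit` on the PATH) may vary by image.

```sh
# Python and Spark versions baked into the Jupyter image
docker run --rm jupyter/pyspark-notebook:29edefbcb06a python --version
docker run --rm jupyter/pyspark-notebook:29edefbcb06a spark-submit --version

# Python version in the custom Bitnami image (after building it below)
docker run --rm custom-bitnami-spark python3 --version
```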
First build the custom `bitnami-spark` image with:
```sh
cd ./bitnami-docker-spark-custom/3/debian-10
docker build -t custom-bitnami-spark .
```

## Start up the environment
```sh
docker-compose up
```
Wait until Cassandra, Spark-Master, the two Spark-Workers and Jupyter have started, then fire up a notebook.

## Vermont notebook
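Inside the notebook, the first step is usually building a `SparkSession` against the compose services. A minimal sketch, assuming the docker-compose service names are `spark` and `cassandra` and using the DataStax connector version that matches Spark 3.0.x / Scala 2.12 (all of these are assumptions; adjust to your compose file):

```python
from pyspark.sql import SparkSession

# Hypothetical setup: "spark" and "cassandra" are assumed to be the
# docker-compose service hostnames on the shared network.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .appName("jupyter-cassandra")
    # Spark-Cassandra connector for Spark 3.0.x / Scala 2.12 (assumed version)
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.1")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Read a Cassandra table as a DataFrame; keyspace and table are placeholders.
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="demo", table="payments")
      .load())
df.show(5)
```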
The Vermont notebook and its data are based on: https://levelup.gitconnected.com/using-docker-and-pyspark-134cd4cab867
Link to dataset: https://data.vermont.gov/Finance/Vermont-Vendor-Payments/786x-sbp3
Place the CSV in `/jupyter/data`.
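Once the file is in place, the notebook can load it along these lines. This is a sketch: `spark` is an existing `SparkSession`, and both the in-container path and the exact file name are assumptions (the `/jupyter/data` host folder is typically mounted into the Jupyter container; check your docker-compose volume mapping).

```python
# Assumes an existing SparkSession `spark` and that /jupyter/data on the
# host is mounted at this path inside the Jupyter container (assumption).
payments = (spark.read
            .option("header", True)       # first row holds column names
            .option("inferSchema", True)  # derive column types from the data
            .csv("/home/jovyan/data/Vermont_Vendor_Payments.csv"))
payments.printSchema()
```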