https://github.com/tansudasli/spark-sandbox
Apache Spark sandbox on GCP and Amazon EMR.
- Host: GitHub
- URL: https://github.com/tansudasli/spark-sandbox
- Owner: tansudasli
- Created: 2019-04-12T17:25:52.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-03-04T06:32:09.000Z (over 5 years ago)
- Last Synced: 2025-03-01T04:27:07.881Z (8 months ago)
- Topics: apache-spark, aws-emr, gcp-dataproc, python
- Language: Jupyter Notebook
- Size: 3.89 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# spark-sandbox
See the wiki for a brief introduction to [streams](https://github.com/tansudasli/spark-sandbox/wiki/Streaming-Fundamentals).
## How to run Spark
Spark can be run in many ways. I ran it on AWS EMR, Dataproc on GCP, a Compute Engine instance on GCP, and serverless Databricks; the PySpark job code itself stays much the same everywhere (see the sketch after this list).
- [x] standalone, on GCP
here, do steps 1-5 on GCP and step 6 on your local machine
also, open ports 7077 and 4040 in the GCP instance firewall
- [ ] standalone, on your local machine
- [x] standalone, on your local machine w/ anaconda and pyspark package installation (step-6)
- [x] master-slave, on GCP
here, do steps 1-5 and 8 on GCP and step 6 on your local machine
also, open ports 7077, 8080, and 8081 in the GCP instance firewall
- [x] master-slave, on **AWS EMR**
look for details under the */aws-emr-jupiter-notebooks/README.md* file. You pay whether you use the cluster or not, but a few cost optimizations are available:
* EMR instances are discounted about 50% compared to equivalent EC2 instances
* Leverage spot instances for further cost reduction.
* Use transient clusters for job-style workloads.
- [x] master-slave, on **GCP Dataproc**
very similar to AWS, but more robust, faster, and an overall better experience on GCP.
* Do not forget to add the Jupyter optional component when creating the cluster, and open port 8123 in the firewall!
* Leverage preemptible instances for further cost reduction.
* Use transient clusters for job-style workloads.
- [ ] master-slave, on your local machine
- [x] Serverless Databricks on Azure
look for details under the */databricks-jupiter-notebooks/README.md* file. You only pay when you process.
- [ ] Serverless Databricks on AWS
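Wherever the cluster runs, the PySpark application code itself is largely deployment-agnostic; only the master URL and the submission step change. A minimal sketch of such a job, with illustrative names and paths (not code from this repo):

```python
from pyspark.sql import SparkSession

def main():
    # When submitted with spark-submit, the master is normally supplied by the
    # environment (YARN on EMR/Dataproc, the Databricks runtime, or --master on
    # the CLI), so the application only names itself and asks for a session.
    spark = SparkSession.builder.appName("spark-sandbox-job").getOrCreate()

    # Illustrative work: count the lines of an input file.
    lines = spark.read.text("ml-100k/u.data")   # hypothetical input path
    print("line count:", lines.count())

    spark.stop()

if __name__ == "__main__":
    main()
```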
## How to start
**Basically**,
* copy your dataset
* get your spark cluster
* do your things in Jupyter
* submit your job to the cluster
0- create a GCP (Ubuntu 18.04) instance in the GCP console, then connect to that server via SSH
* `gcloud compute --project .... ssh --zone .... ....`
1- download the *tansudasli/spark-sandbox* files to the GCP instance via
`git clone https://github.com/tansudasli/spark-sandbox.git`, then
`cd spark-sandbox`
2- then give run permission to the install_apache_spark.sh file
`chmod +x install_apache_spark.sh`
3- and run the script below to install Python 3.x, Java 8, and Spark 2.4
`./install_apache_spark.sh`
if you face connection or download issues, delete the partially downloaded folders and run it again.
4- test pyspark in *standalone, on GCP*
`pyspark` for Python or `spark-shell` for Scala
and then, in the shell, type `sc.version`
* at this stage you can access the Spark UI over `IP:4040` to see jobs, storage, etc.
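For a slightly deeper sanity check than `sc.version`, a few lines in the `pyspark` shell (where `sc`, the SparkContext, already exists) are enough; a minimal sketch:

```python
# inside the pyspark shell; `sc` is pre-defined
print(sc.version)                                 # Spark version, e.g. 2.4.x

# distribute a tiny dataset and run trivial actions to confirm jobs execute
rdd = sc.parallelize(range(100))
print(rdd.sum())                                  # 4950
print(rdd.filter(lambda x: x % 2 == 0).count())   # 50
```

These jobs also show up in the `IP:4040` UI mentioned above.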
5- download the MovieLens sample data set.
`sudo apt install unzip`
`wget http://files.grouplens.org/datasets/movielens/ml-100k.zip`
`unzip ml-100k.zip`
for the latest & largest data set, you may use the *http://files.grouplens.org/datasets/movielens/ml-latest.zip* URL.
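To confirm Spark can read the extracted data, a PySpark rating histogram looks roughly like the sketch below (the tab-separated layout of *ml-100k/u.data* is user id, item id, rating, timestamp; this is a minimal sketch, not necessarily the exact code in *ratings-histogram.ipynb*):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ratings-histogram")
sc = SparkContext(conf=conf)

# u.data is tab-separated: user id, item id, rating, timestamp
lines = sc.textFile("ml-100k/u.data")
ratings = lines.map(lambda line: line.split("\t")[2])

# countByValue() returns a dict of rating -> number of occurrences
histogram = ratings.countByValue()
for rating, count in sorted(histogram.items()):
    print(rating, count)

sc.stop()
```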
6- you may want to write PySpark and other stuff on your local machine without installing Spark. There are many ways to do that, but I prefer Anaconda!
- [ ] install everything separately on local (VSCode or another IDE, python3, pip3) plus jupyter-notebook and pyspark
- [x] don't install pyspark; you will see some import errors in your IDE and you can't test code interactively, but that's OK. Run your code on the GCP instance where you installed Spark.
- [x] install Anaconda on local (w/ the conda package manager), then leverage Jupyter notebooks and install pyspark
For Anaconda:
`brew cask install anaconda` installs Anaconda w/ brew on Mac (on newer Homebrew versions the syntax is `brew install --cask anaconda`)
`echo 'export PATH="/usr/local/anaconda3/bin/:$PATH"' >> ~/.zshrc` change this to _.profile_ if not using a zsh terminal!
`cd ~/anaconda3/bin`
`conda update -n base -c defaults conda`
`conda create --name apache-spark python=3`
`conda activate apache-spark`
`conda install -c conda-forge pyspark`
* in VSCode, do not forget to switch the Python interpreter to the Anaconda Python version!
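To verify the conda-installed pyspark works on the local machine, a short script run inside the `apache-spark` environment is enough; a minimal sketch using a local master, so no cluster is needed:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark inside this Python process, using all local cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-smoke-test")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
df.show()
print("Spark version:", spark.version)

spark.stop()
```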
8- you may want to test *master-slave, on GCP*; then run the commands below.
`./sbin/start-master.sh`
`./sbin/start-slave.sh spark://IP:7077`
`pyspark --master spark://IP:7077`
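The same standalone cluster can also be targeted from a Python script rather than the interactive shell; a minimal sketch, assuming the master runs at spark://IP:7077 as started above (replace IP with the master's address):

```python
from pyspark.sql import SparkSession

# point the application at the standalone master started above
spark = (SparkSession.builder
         .master("spark://IP:7077")   # replace IP with the master's address
         .appName("standalone-cluster-test")
         .getOrCreate())

# a trivial job to confirm tasks run on the worker
count = spark.sparkContext.parallelize(range(1000)).count()
print("records counted on the cluster:", count)

spark.stop()
```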
9- to run a Jupyter notebook non-interactively on the GCP instance,
`sudo pip3 install runipy`
`runipy spark-sandbox/ratings-histogram.ipynb`
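As an alternative to runipy, a notebook can also be executed programmatically with nbformat and nbconvert's ExecutePreprocessor; this is not what this repo uses, just a sketch assuming jupyter and nbconvert are installed:

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# load the notebook, execute every cell, and save the executed copy
nb = nbformat.read("spark-sandbox/ratings-histogram.ipynb", as_version=4)
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "spark-sandbox/"}})
nbformat.write(nb, "ratings-histogram.executed.ipynb")
```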