# Apache Spark Cluster
### 1. Pull the image
```
$ sudo docker pull seedotech/spark:2.4.0
```
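To confirm the pull succeeded, you can list the image locally (a standard Docker check, not part of the original steps):
```sh
# Verify the image is available locally
$ sudo docker images seedotech/spark
```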
### 2. Create a Spark network
```
$ sudo docker network create --driver=bridge spark
```
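You can verify the bridge network was created and later inspect which containers are attached to it (standard Docker commands):
```sh
# Verify the bridge network exists
$ sudo docker network ls --filter name=spark

# Show the network's subnet and attached containers
$ sudo docker network inspect spark
```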
### 3. Create and run docker containers
```sh
# The 1st argument (CREATE_SPARK_MASTER) controls whether a Spark master is created (default: YES)
# The 2nd argument (SPARK_MASTER_IP) is the Spark master IP (the IP of the host machine running the Spark master container; default: spark-master)
# The 3rd argument (SPARK_WORKER_NUMBER) is the number of workers to create (default: 2)
# The default Spark cluster has 3 nodes: 1 master and 2 workers
$ ./start_containers.sh

# To create 1 Spark worker on another machine (assuming the Spark master's IP is 192.168.1.10), run:
$ ./start_containers.sh NO 192.168.1.10 1
```
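After the script finishes, you can check that the expected containers are up (a generic check; the spark-master/spark-worker1 names follow the naming used elsewhere in this README):
```sh
# List running Spark containers with their status and published ports
$ sudo docker ps --filter name=spark --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```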
### 4. Get into the Spark master container
```
$ sudo docker exec -it spark-master bash
```
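From inside the container you can smoke-test the cluster by attaching an interactive shell to the master. This assumes the image puts the Spark binaries on the PATH; adjust the path if not:
```sh
# Inside the spark-master container: open a Spark shell against the cluster
$ spark-shell --master spark://spark-master:7077

# Then run a trivial job at the Spark shell prompt:
# scala> sc.parallelize(1 to 100).sum()
```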
### 5. Monitor the Spark cluster
Access http://spark-master:8080 to monitor the Spark cluster.
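Note that the spark-master hostname only resolves inside the Docker network. From the host, either publish port 8080 when starting the container or look up the container's IP first (a workaround sketch, not from the original steps):
```sh
# Find the master container's IP on its Docker network
$ sudo docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' spark-master

# Then browse to http://<that-ip>:8080
```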
### 6. Run a Spark application with Hadoop
#### 6.1. Create a shared network so the Hadoop cluster and Spark cluster can connect to each other
```
$ sudo docker network create --driver=bridge spark-hadoop
```
#### 6.2. Create a Hadoop cluster that uses the spark-hadoop network
```
# Example:
$ sudo docker run -itd \
--net=spark-hadoop \
-p 50070:50070 \
-p 8088:8088 \
-e HADOOP_SLAVE_NUMBER=$HADOOP_SLAVE_NUMBER \
--name hadoop-master \
--hostname hadoop-master \
seedotech/hadoop:2.9.2 &> /dev/null
```
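Before submitting jobs, it can help to confirm HDFS is up inside the Hadoop master. This assumes the seedotech/hadoop image has the hdfs client on its PATH:
```sh
# Check HDFS health and the number of live datanodes
$ sudo docker exec -it hadoop-master hdfs dfsadmin -report
```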
#### 6.3. Create a Spark cluster that uses the spark-hadoop network
```
# Example:
$ sudo docker run -itd \
--net=spark-hadoop \
-p 6066:6066 -p 7077:7077 -p 8080:8080 \
-e IS_SPARK_MASTER=YES \
--name spark-master \
--hostname spark-master \
seedotech/spark:2.4.0 &> /dev/null
```
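If you already created the clusters on separate networks (steps 2-3), you can attach the running containers to the shared network instead of recreating them (standard Docker, shown here as an optional alternative):
```sh
# Attach existing containers to the shared network
$ sudo docker network connect spark-hadoop spark-master
$ sudo docker network connect spark-hadoop hadoop-master

# Confirm both masters now appear on the network
$ sudo docker network inspect spark-hadoop
```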
#### 6.4. Execute a Spark application from another machine
```
$ spark-submit --class com.seedotech.spark.SparkJavaApp \
    --master spark://spark-master:7077 \
    --deploy-mode cluster \
    hdfs://hadoop-master:9000/apps/spark/spark-java-1.0.jar \
    hdfs://hadoop-master:9000/apps/spark/demo.txt \
    hdfs://hadoop-master:9000/apps/spark/out
```
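The submit above assumes the jar and input file already exist in HDFS. One way to stage them (a sketch using the file names from the command above, assuming the hdfs client is available inside hadoop-master):
```sh
# Copy the files into the hadoop-master container, then into HDFS
$ sudo docker cp spark-java-1.0.jar hadoop-master:/tmp/
$ sudo docker cp demo.txt hadoop-master:/tmp/
$ sudo docker exec -it hadoop-master bash -c \
    "hdfs dfs -mkdir -p /apps/spark && \
     hdfs dfs -put /tmp/spark-java-1.0.jar /apps/spark/ && \
     hdfs dfs -put /tmp/demo.txt /apps/spark/"
```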
#### 6.5. Debug the Spark application
Use the container logs to trace bugs/issues:
```
$ sudo docker logs -f spark-master
$ sudo docker logs -f spark-worker1
```
**NOTE**: The job may fail the second time you run it because the output folder **hdfs://hadoop-master:9000/apps/spark/out** already exists. Remove it before running again (use http://localhost:50070/explorer.html to upload/delete folders/files).
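Instead of the web explorer, you can also delete the output folder from the command line (again assuming the hdfs client inside hadoop-master):
```sh
# Remove the previous output directory so the job can run again
$ sudo docker exec -it hadoop-master hdfs dfs -rm -r /apps/spark/out
```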