Running Spark On Kubernetes Example
- Host: GitHub
- URL: https://github.com/justkode/spark-on-k8s-example
- Owner: JustKode
- Created: 2023-03-24T16:17:02.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-03-27T17:05:06.000Z (almost 3 years ago)
- Last Synced: 2025-04-24T00:59:44.235Z (9 months ago)
- Topics: kubernetes, pyspark, spark
- Language: Python
- Homepage:
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Spark On Kubernetes Example
An example of running Spark on Kubernetes. You need to install the programs below first; a quick version check follows the list.
- Docker
- Minikube
- kubectl
- Spark >= 3.0
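A quick way to confirm everything is installed (the exact version output will differ on your machine):
```bash
$ docker --version
$ minikube version
$ kubectl version --client
$ cd $SPARK_DIR && ./bin/spark-submit --version   # $SPARK_DIR is your Spark directory
```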
### PySpark Image Build & Push
Before you run a Spark application on Kubernetes, you need to build a Spark image that can run in the Kubernetes cluster.
For this purpose, you can use `./bin/docker-image-tool.sh`. Its options are described below.
- `-f file`: (Optional) Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark. For Java 17, use `-f kubernetes/dockerfiles/spark/Dockerfile.java17`
- `-p file`: (Optional) Dockerfile to build for PySpark Jobs. Builds Python dependencies and ships with Spark. Skips building PySpark docker image if not specified.
- `-R file`: (Optional) Dockerfile to build for SparkR Jobs. Builds R dependencies and ships with Spark. Skips building SparkR docker image if not specified.
- `-r repo`: Repository address.
- `-t tag`: Tag to apply to the built image, or to identify the image to be pushed.
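In general, the tool is run once with `build` and, if you use a remote registry, once more with `push`; with Minikube's Docker daemon (as in this example) `build` alone is enough. A sketch of the two forms, with `<repo>` and `<tag>` as placeholders:
```bash
$ ./bin/docker-image-tool.sh -r <repo> -t <tag> build
$ ./bin/docker-image-tool.sh -r <repo> -t <tag> push
```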
First, build the Spark base images:
```bash
$ cd $SPARK_DIR # spark directory
$ eval $(minikube docker-env)  # point the Docker CLI at Minikube's Docker daemon
$ ./bin/docker-image-tool.sh -r k8s -t 1.0 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build  # builds k8s/spark:1.0 and k8s/spark-py:1.0
```
Second, build the application image on top of the Spark base image:
```bash
$ cd /python  # application directory containing the Dockerfile
$ eval $(minikube docker-env)
$ docker build -t pyspark-on-k8s:1.0 .
$ minikube image load pyspark-on-k8s:1.0  # make the image available inside the Minikube cluster
```
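The application Dockerfile itself is not shown in this README. A minimal sketch of what it could contain, assuming it extends the `k8s/spark-py:1.0` base image built above and copies the example scripts to `/python`, matching the `local:///python/...` paths passed to `spark-submit` below:
```bash
# Hypothetical Dockerfile (illustration only; the repo's actual Dockerfile may differ)
cat > Dockerfile <<'EOF'
FROM k8s/spark-py:1.0
COPY rdd_example.py dataframe_example.py /python/
EOF
```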
You can list all built Docker images with `docker image ls`:
```bash
$ docker image ls
REPOSITORY       TAG   IMAGE ID       CREATED              SIZE
pyspark-on-k8s   1.0   00a4af077a09   About a minute ago   938MB
k8s/spark-py     1.0   985cf805549a   13 days ago          938MB
k8s/spark        1.0   bd8ba88688d4   13 days ago          601MB
...
```
### How To Run Spark Application
Before running a Spark application, you need to create a **Kubernetes service account and a ClusterRoleBinding.** Because Spark on Kubernetes works by having the **driver pod launch the executor pods**, the driver pod must be authorized to manage pods, which the ClusterRoleBinding below grants.
```bash
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```
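To confirm the binding took effect before submitting a job, you can ask the API server whether the new service account may create pods (`--as` impersonates the account):
```bash
$ kubectl auth can-i create pods --as=system:serviceaccount:default:spark
# prints "yes" once the ClusterRoleBinding is in place
```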
With `spark-submit`, you can **run a Spark application on the Kubernetes cluster.**
```bash
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
# In another terminal
$ kubectl create namespace spark-job  # note: the jobs below run in the default namespace, where the spark service account lives
# rdd_example
$ ./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name rdd-example \
--conf spark.kubernetes.container.image=pyspark-on-k8s:1.0 \
--conf spark.kubernetes.driver.pod.name=rdd-example-pod \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--verbose \
"local:///python/rdd_example.py"
$ kubectl logs rdd-example-pod # log check
# dataframe_example
$ ./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name dataframe-example \
--conf spark.kubernetes.container.image=pyspark-on-k8s:1.0 \
--conf spark.kubernetes.driver.pod.name=dataframe-example-pod \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--verbose \
"local:///python/dataframe_example.py"
$ kubectl logs dataframe-example-pod # log check
```
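In cluster mode the driver pod is not removed automatically after the job finishes (it stays in `Completed` state so its logs remain available), so you may want to watch progress and clean up afterwards:
```bash
$ kubectl get pods -w                                        # watch driver and executor pods
$ kubectl logs -f rdd-example-pod                            # stream driver logs while the job runs
$ kubectl delete pod rdd-example-pod dataframe-example-pod   # remove finished driver pods
```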