Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kayvansol/sparkonkubernetes
Spark On Kubernetes via helm chart
https://github.com/kayvansol/sparkonkubernetes
apache apache-spark bitnami docker-compose helm-charts java kubernetes pyspark python scala spark
Last synced: 27 days ago
JSON representation
Spark On Kubernetes via helm chart
- Host: GitHub
- URL: https://github.com/kayvansol/sparkonkubernetes
- Owner: kayvansol
- Created: 2024-03-24T18:21:51.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-04-10T12:33:09.000Z (7 months ago)
- Last Synced: 2024-10-12T00:03:14.857Z (27 days ago)
- Topics: apache, apache-spark, bitnami, docker-compose, helm-charts, java, kubernetes, pyspark, python, scala, spark
- Language: Python
- Homepage:
- Size: 1.57 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Spark On Kubernetes
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/logo.png?raw=true)Spark On Kubernetes via **helm chart**
The control-plane & worker nodes addresses are :
```
192.168.56.115
192.168.56.116
192.168.56.117
```
![alt text](https://raw.githubusercontent.com/kayvansol/Ingress/main/pics/vmnet.png?raw=true)Kubernetes cluster **nodes** :
![alt text](https://raw.githubusercontent.com/kayvansol/Ingress/main/pics/nodes.png?raw=true)
you can install helm via the link [helm](https://helm.sh/docs/intro/install) :
***
The Steps :
1) Install spark via helm chart **(bitnami)** :
```
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release oci://registry-1.docker.io/bitnamicharts/spark
$ helm upgrade kayvan-release bitnami/spark --set worker.replicaCount=5
```
the installed **6 pods** :![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Pods.png?raw=true)
and **Services** (headless for statefull) :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Services.png?raw=true)
and the **spark master ui** is :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Master.png?raw=true)
***
2) type the below commands on kubernetes kube-apiserver :
```
kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
./examples/jars/spark-examples_2.12-3.4.1.jar 1000```
or```
kubectl exec -it kayvan-release-spark-master-0 -- /bin/bash./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
./examples/jars/spark-examples_2.12-3.4.1.jar 1000./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
./examples/src/main/python/pi.py 1000./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \
./examples/src/main/python/wordcount.py //filepath```
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Command.png?raw=true)
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/logo2.png?raw=true)
the exact **scala & python** code of spark-examples_2.12-3.4.1.jar , pi.py & wordcount.py :
[examples/src/main/scala/org/apache/spark/examples/SparkPi.scala](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala)
[examples/src/main/python/pi.py](https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py)
[examples/src/main/python/wordcount.py](https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py)
***
3) The final **result** is 🍹 :
for **scala** :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Result.png?raw=true)
for **python** :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ResultPy.png?raw=true)
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Completed.png?raw=true)
***
The other **python** Programm :
1) Copy **People.csv** (large file) inside spark worker pods :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy0.png?raw=true)
```
kubectl cp people.csv kayvan-release-spark-worker-{x}:/opt/bitnami/spark
```Notes:
- you can download the file from [link](https://www.datablist.com/learn/csv/download-sample-csv-files)
- you can also use a **nfs share folder** for read large csv file from it instead of copying it inside pods.2) Write some python codes inside **readcsv.py** please :
```python
from pyspark.sql import SparkSession
#from pyspark.sql.functions import sum
from pyspark.context import SparkContextspark = SparkSession\
.builder\
.appName("Mahla")\
.getOrCreate()
sc = spark.sparkContext
path = "people.csv"
df = spark.read.options(delimiter=",", header=True).csv(path)
df.show()
#df.groupBy("Job Title").sum().show()
df.createOrReplaceTempView("Peopletable")
df2 = spark.sql("select Sex, count(1) countsex, sum(Index) sex_sum " \
"from peopletable group by Sex")
df2.show()#df.select(sum(df.Index)).show()
```
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy1.png?raw=true)3) copy readcsv.py file inside spark **master** pod :
```
kubectl cp readcsv.py kayvan-release-spark-master-0:/opt/bitnami/spark
```4) run the code :
```
kubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit --class org.apache.spark.examples.SparkPi
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077
readcsv.py
```5) showing some data :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy2.png?raw=true)
6) the next result data:
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy3.png?raw=true)
7) the time consuming for processing :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy4.png?raw=true)
***
The other **python** Programm on **Docker Desktop** :docker-compose.yml :
```yaml
version: '3.6'services:
spark:
container_name: spark
image: bitnami/spark:latest
environment:
- SPARK_MODE=master
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
- SPARK_USER=root
- PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
ports:
- 127.0.0.1:8081:8080spark-worker:
image: bitnami/spark:latest
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark:7077
- SPARK_WORKER_MEMORY=2G
- SPARK_WORKER_CORES=2
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
- SPARK_USER=root
- PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
```
```
docker-compose up --scale spark-worker=2
```![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition4.png?raw=true)
copy required files to containers :
for e.g.
```bash
docker cp file.csv spark-worker-1:/opt/bitnami/spark
```python code on master :
```python
from pyspark.sql import SparkSessionspark = SparkSession.builder.appName("Writingjson").getOrCreate()
df = spark.read.option("header", True).csv("csv/file.csv").coalesce(2)
df.show()
df.write.partitionBy('name').mode('overwrite').format('json').save('file_name.json')
```run the code on spark master docker container :
```bash
./bin/spark-submit --master spark://4f28330ce077:7077 csv/ctp.py
```
showing some data :![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition2.png?raw=true)
and the seperated json files based on name partitioning :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition1.png?raw=true)
data for name=kayvan :
![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition3.png?raw=true)