{"id":14982356,"url":"https://github.com/kayvansol/sparkonkubernetes","last_synced_at":"2026-02-14T15:02:13.371Z","repository":{"id":229498473,"uuid":"776892089","full_name":"kayvansol/SparkOnKubernetes","owner":"kayvansol","description":"Spark On Kubernetes via helm chart","archived":false,"fork":false,"pushed_at":"2024-04-10T12:33:09.000Z","size":1643,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-06T22:40:54.610Z","etag":null,"topics":["apache","apache-spark","bitnami","docker-compose","helm-charts","java","kubernetes","pyspark","python","scala","spark"],"latest_commit_sha":null,"homepage":"https://medium.com/@kayvan.sol2/spark-on-kubernetes-d566158186c6","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kayvansol.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-24T18:21:51.000Z","updated_at":"2025-01-15T07:41:33.000Z","dependencies_parsed_at":"2024-04-10T12:52:32.382Z","dependency_job_id":null,"html_url":"https://github.com/kayvansol/SparkOnKubernetes","commit_stats":{"total_commits":48,"total_committers":2,"mean_commits":24.0,"dds":0.02083333333333337,"last_synced_commit":"35b61c65b1254f97e1954f4af3cf1c7d744d8806"},"previous_names":["kayvansol/sparkonkubernetes"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kayvansol/SparkOnKubernetes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayvansol%2FSparkOnKubernetes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/kayvansol%2FSparkOnKubernetes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayvansol%2FSparkOnKubernetes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayvansol%2FSparkOnKubernetes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kayvansol","download_url":"https://codeload.github.com/kayvansol/SparkOnKubernetes/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kayvansol%2FSparkOnKubernetes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29447768,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T14:10:32.461Z","status":"ssl_error","status_checked_at":"2026-02-14T14:09:49.945Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","apache-spark","bitnami","docker-compose","helm-charts","java","kubernetes","pyspark","python","scala","spark"],"created_at":"2024-09-24T14:05:15.223Z","updated_at":"2026-02-14T15:02:13.334Z","avatar_url":"https://github.com/kayvansol.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark On Kubernetes\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/logo.png?raw=true)\n\n\nSpark On Kubernetes via **helm chart**\n\nThe control-plane \u0026 worker 
node addresses are:\n```\n192.168.56.115\n192.168.56.116\n192.168.56.117\n```\n![alt text](https://raw.githubusercontent.com/kayvansol/Ingress/main/pics/vmnet.png?raw=true)\n\n\nKubernetes cluster **nodes**:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/Ingress/main/pics/nodes.png?raw=true)\n\nYou can install Helm by following the [Helm installation guide](https://helm.sh/docs/intro/install).\n\n***\nThe steps:\n1) Install Spark via the Helm chart **(bitnami)**:\n\n\u003cimg src=\"https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/bitnami.png\" width=\"500\" height=\"200\"\u003e\n   \n```\n$ helm repo add bitnami https://charts.bitnami.com/bitnami\n$ helm search repo bitnami\n$ helm install kayvan-release oci://registry-1.docker.io/bitnamicharts/spark\n$ helm upgrade kayvan-release bitnami/spark --set worker.replicaCount=5\n```\nThe **6 pods** installed (1 master and 5 workers):\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Pods.png?raw=true)\n\nand the **Services** (headless for the StatefulSets):\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Services.png?raw=true)\n\nand the **Spark master UI**:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Master.png?raw=true)\n\n***\n2) Run the commands below on the Kubernetes control-plane node (or any machine with kubectl access to the cluster):\n```\nkubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \\\n  --class org.apache.spark.examples.SparkPi \\\n  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \\\n  ./examples/jars/spark-examples_2.12-3.4.1.jar 1000\n\n```\nor\n\n```\nkubectl exec -it kayvan-release-spark-master-0 -- /bin/bash\n\n./bin/spark-submit \\\n  --class org.apache.spark.examples.SparkPi \\\n  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \\\n  ./examples/jars/spark-examples_2.12-3.4.1.jar 1000\n\n\n./bin/spark-submit \\\n  
--master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \\\n  ./examples/src/main/python/pi.py 1000\n\n\n./bin/spark-submit \\\n  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \\\n  ./examples/src/main/python/wordcount.py //filepath\n\n```\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Command.png?raw=true)\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/logo2.png?raw=true)\n\nThe **Scala \u0026 Python** source code for spark-examples_2.12-3.4.1.jar, pi.py \u0026 wordcount.py:\n\n[examples/src/main/scala/org/apache/spark/examples/SparkPi.scala](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala)\n\n[examples/src/main/python/pi.py](https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py)\n\n[examples/src/main/python/wordcount.py](https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py)\n\n***\n\n3) The final **result** 🍹:\n\nfor **Scala**:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Result.png?raw=true)\n\nfor **Python**:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ResultPy.png?raw=true)\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/Completed.png?raw=true)\n\n***\n\nAnother **Python** \u003cimg src=\"https://github.com/devicons/devicon/raw/master/icons/python/python-original.svg\" title=\"Python\" alt=\"Python\" width=\"20\" height=\"20\" style=\"max-width: 100%;\"\u003e program:\n\n1) Copy **people.csv** (a large file) into the Spark worker pods:\n\n![alt 
text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy0.png?raw=true)\n\n```\nkubectl cp people.csv kayvan-release-spark-worker-{x}:/opt/bitnami/spark\n```\n\nNotes:\n- You can download the file from this [link](https://www.datablist.com/learn/csv/download-sample-csv-files).\n- You can also read the large CSV file from an **NFS shared folder** instead of copying it into the pods.\n\n2) Write some Python code in **readcsv.py**:\n```python\nfrom pyspark.sql import SparkSession\n#from pyspark.sql.functions import sum\nfrom pyspark.context import SparkContext\n\nspark = SparkSession\\\n            .builder\\\n            .appName(\"Mahla\")\\\n            .getOrCreate()\n\nsc = spark.sparkContext\n\npath = \"people.csv\"\n\n# Read the CSV file; the first row holds the column names\ndf = spark.read.options(delimiter=\",\", header=True).csv(path)\n\ndf.show()\n\n#df.groupBy(\"Job Title\").sum().show()\n\n# Register the DataFrame as a temporary view so it can be queried with SQL\ndf.createOrReplaceTempView(\"Peopletable\")\n# Count the rows and sum the Index column for each Sex value\ndf2 = spark.sql(\"select Sex, count(1) countsex, sum(Index) sex_sum \" \\\n                \"from peopletable group by Sex\")\ndf2.show()\n\n#df.select(sum(df.Index)).show()\n```\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy1.png?raw=true)\n\n3) Copy the readcsv.py file into the Spark **master** pod:\n```\nkubectl cp readcsv.py kayvan-release-spark-master-0:/opt/bitnami/spark\n```\n\n4) Run the code:\n```\nkubectl exec -it kayvan-release-spark-master-0 -- ./bin/spark-submit \\\n  --master spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077 \\\n  readcsv.py\n```\n\n5) Some of the data:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy2.png?raw=true)\n\n6) The next result:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy3.png?raw=true)\n\n7) The processing time:\n\n![alt 
text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/ProgPy4.png?raw=true)\n\n***\nAnother **Python** \u003cimg src=\"https://github.com/devicons/devicon/raw/master/icons/python/python-original.svg\" title=\"Python\" alt=\"Python\" width=\"20\" height=\"20\" style=\"max-width: 100%;\"\u003e program on **Docker Desktop**:\n\ndocker-compose.yml:\n```yaml\nversion: '3.6'\n\nservices:\n\n  spark:\n    container_name: spark\n    image: bitnami/spark:latest\n    environment:\n      - SPARK_MODE=master\n      - SPARK_RPC_AUTHENTICATION_ENABLED=no\n      - SPARK_RPC_ENCRYPTION_ENABLED=no\n      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\n      - SPARK_SSL_ENABLED=no\n      - SPARK_USER=root\n      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3\n    ports:\n      - 127.0.0.1:8081:8080\n\n  spark-worker:\n    image: bitnami/spark:latest\n    environment:\n      - SPARK_MODE=worker\n      - SPARK_MASTER_URL=spark://spark:7077\n      - SPARK_WORKER_MEMORY=2G\n      - SPARK_WORKER_CORES=2\n      - SPARK_RPC_AUTHENTICATION_ENABLED=no\n      - SPARK_RPC_ENCRYPTION_ENABLED=no\n      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no\n      - SPARK_SSL_ENABLED=no\n      - SPARK_USER=root\n      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3\n```\n```\ndocker-compose up --scale spark-worker=2\n```\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition4.png?raw=true)\n\nCopy the required files to the containers:\n\nfor example:\n```bash\ndocker cp file.csv spark-worker-1:/opt/bitnami/spark\n```\n\nPython code on the master:\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.appName(\"Writingjson\").getOrCreate()\n\n# Read the CSV with a header row and reduce it to 2 partitions\ndf = spark.read.option(\"header\", True).csv(\"csv/file.csv\").coalesce(2)\n\ndf.show()\n\n# Write JSON output into one subdirectory per distinct name value\ndf.write.partitionBy('name').mode('overwrite').format('json').save('file_name.json')\n```\n\nRun the code on the Spark master Docker container:\n```bash\n./bin/spark-submit --master 
spark://4f28330ce077:7077 csv/ctp.py\n```\nSome of the data:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition2.png?raw=true)\n\nand the separate JSON files, partitioned by name:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition1.png?raw=true)\n\nData for name=kayvan:\n\n![alt text](https://raw.githubusercontent.com/kayvansol/SparkOnKubernetes/main/img/partition3.png?raw=true)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkayvansol%2Fsparkonkubernetes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkayvansol%2Fsparkonkubernetes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkayvansol%2Fsparkonkubernetes/lists"}