{"id":27890237,"url":"https://github.com/tibcosoftware/snappy-on-k8s","last_synced_at":"2025-10-13T00:47:49.172Z","repository":{"id":151968468,"uuid":"119957972","full_name":"TIBCOSoftware/snappy-on-k8s","owner":"TIBCOSoftware","description":"An Integrated and collaborative cloud environment for building and running Spark applications on PKS/Kubernetes","archived":false,"fork":false,"pushed_at":"2020-03-16T09:26:45.000Z","size":730,"stargazers_count":83,"open_issues_count":14,"forks_count":29,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-05-05T10:55:45.055Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIBCOSoftware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-02-02T08:59:06.000Z","updated_at":"2025-04-03T12:54:25.000Z","dependencies_parsed_at":"2023-06-10T18:45:21.424Z","dependency_job_id":null,"html_url":"https://github.com/TIBCOSoftware/snappy-on-k8s","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/TIBCOSoftware/snappy-on-k8s","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIBCOSoftware%2Fsnappy-on-k8s","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIBCOSoftware%2Fsnappy-on-k8s/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIBCOSoftware%2Fsnappy-on-k8s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIBCOSoftwa
re%2Fsnappy-on-k8s/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIBCOSoftware","download_url":"https://codeload.github.com/TIBCOSoftware/snappy-on-k8s/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIBCOSoftware%2Fsnappy-on-k8s/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013643,"owners_count":26085298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-05T10:45:56.362Z","updated_at":"2025-10-13T00:47:49.156Z","avatar_url":"https://github.com/TIBCOSoftware.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multi cloud Spark application service on PKS\n\n## Introduction\n[Kubernetes](https://kubernetes.io/) is an open source project designed specifically for container orchestration. \nKubernetes offers a number of key [features](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/), \nincluding multiple storage APIs, container health checks, manual or automatic scaling, rolling upgrades and \nservice discovery.\n[Pivotal Container Service - PKS](https://pivotal.io/platform/pivotal-container-service) is a solution to manage \nKubernetes clusters across private and public clouds. 
It leverages [BOSH](https://bosh.io/) to offer a uniform way to instantiate, \ndeploy, and manage highly available Kubernetes clusters on a cloud platform like GCP, VMWare vSphere or AWS. \n\nThis project provides a streamlined way of deploying, scaling and managing Spark applications. Spark 2.3 added support\nfor Kubernetes as a cluster manager. This project leverages [Helm charts](https://helm.sh/) to allow deployment of \ncommon Spark application recipes - using [Apache Zeppelin](https://zeppelin.apache.org/) and/or [Jupyter](https://jupyter.org/) \nfor interactive, collaborative workloads. It also automates logging of all events across batch jobs and Notebook driven\napplications to log events to shared storage for offline analysis.   \n\n\nHelm is a package manager for kubernetes and the most productive way to find, install, share, upgrade and use even the \nmost complex kubernetes applications. So, for instance, with a single command you can deploy additional components like \nHDFS or Elastic search for our Spark applications.  \n\nThis project is a collaborative effort between SnappyData and Pivotal. \n\n\n\n## Features\n- Full support for Spark 2.2 applications running on PKS 1.x on both Google cloud and on-prem VMWare cloud environments.\nThe project leverages [spark-on-k8s](https://github.com/apache-spark-on-k8s/spark) work.\n- Deploy batch spark jobs using kubernetes master as the cluster/resource manager\n- Helm chart to deploy Zeppelin, centralized logging, monitoring across apps (using History server)\n- Helm chart to deploy Jupyter,  centralized logging, monitoring across apps (using History server)\n- Use kubernetes persistent volumes for notebooks and event logging for collaboration and historical analysis\n- Spark applications can be Java, Scala or Python\n- Spark applications can dynamically scale\n\nWe use Helm charts to abstract the developer from having to understand kubernetes concepts and simply focus on \nconfiguration that matters. 
Think recipes that come with sensible defaults for common Spark workloads on PKS. \n\nWe showcase the use of cloud storage (e.g. Google Cloud Storage) to manage logs/events and show how the use of persistent \nvolumes within the charts makes the architecture portable across clouds. \n\n## Pre-requisites and assumptions:\n- We need a running kubernetes or PKS cluster. We only support Kubernetes 1.9 (or higher) and PKS 1.0.0 (or higher).\n\n **NOTE** If you already have access to a Kubernetes cluster, jump to the [next section](#steps-if-a-kubernetes-cluster-is-available).\n\n### Getting access to a PKS or Kubernetes cluster\nIf you would like to deploy on-prem you can either use Minikube (local developer machine) or get a PKS environment set up \nusing vSphere. \n\n#### Option (1) - PKS\n- PKS on vSphere: Follow these [instructions](https://docs.pivotal.io/runtimes/pks/1-0/vsphere.html) \n- PKS on GCP: Follow these [instructions](https://docs.pivotal.io/runtimes/pks/1-0/gcp.html)\n- Create a Kubernetes cluster using PKS CLI : Once PKS is set up you will need to create a k8s cluster as described \n[here](https://docs.pivotal.io/runtimes/pks/1-0/using.html)\n\n#### Option (2) - Kubernetes on Google Cloud Platform (GCP)\n- Log in to your Google account and go to the [Cloud console](https://console.cloud.google.com/) to launch a GKE cluster\n\n#### Option (3) - Minikube on your local machine\n- If either of the above options is difficult, you may set up a test cluster on your local machine using \n[minikube](https://kubernetes.io/docs/getting-started-guides/minikube/). We recommend using the latest release of minikube \nwith the DNS addon enabled.\n- If using Minikube, be aware that the default minikube configuration is not enough for running Spark applications. 
\nWe recommend 3 CPUs and 4g of memory to be able to start a simple Spark application with a single executor.\n\n### Steps if a Kubernetes cluster is available \n- If using PKS, you will need to install the PKS command line tool. See instructions \n[here](https://docs.pivotal.io/runtimes/pks/1-0/installing-pks-cli.html)\n- Install kubectl on your local development machine and configure access to the kubernetes/PKS cluster. See instructions for \nkubectl [here](https://kubernetes.io/docs/tasks/tools/install-kubectl/). If you are using Google cloud, you will find \ninstructions for setting up Google Cloud SDK ('gcloud') along with kubectl \n[here](https://kubernetes.io/docs/tasks/tools/install-kubectl/).\n- You must have appropriate permissions to list, create, edit and delete pods in your cluster. You can verify each \npermission, one verb at a time, by running `kubectl auth can-i \u003clist|create|edit|delete\u003e pods`.\n- The service account credentials used by the driver pods must be allowed to create pods, services and configmaps. For example, \nif you are using the `default` service account, assign the 'edit' role to it for the namespace 'spark' using the following command:\n\n```text\nkubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:default\n```\n\u003c!--- TODO Why is this required?\n- You must have Kubernetes DNS configured in your cluster.\n---\u003e\n\n\n### Setup Helm charts\n\n[Helm](https://github.com/kubernetes/helm/blob/master/README.md) consists of two parts: a client and a server (Tiller) that runs inside \nthe kube-system namespace. Tiller runs inside your Kubernetes cluster, and manages releases (installations) of your charts. \nTo install Helm follow the steps [here](https://docs.pivotal.io/runtimes/pks/1-0/configure-tiller-helm.html). 
The instructions\nare applicable for any kubernetes cluster (PKS or GKE or Minikube).\n\n\n### Quickstart\n\n#### Launch Spark and Notebook servers\n\nWe use the spark-umbrella chart to deploy Jupyter, Zeppelin, Spark Resource Staging Server, and Spark Shuffle Service \non Kubernetes. This chart is composed from individual sub-charts for each of the components. \nYou can read more about Helm umbrella charts \n[here](https://github.com/kubernetes/helm/blob/master/docs/charts_tips_and_tricks.md#complex-charts-with-many-dependencies)\n\nYou can configure the components in the umbrella chart's 'values.yaml' (see [spark-umbrella/values.yaml](charts/spark-umbrella/values.yaml)) or in\neach of the individual sub-chart's 'values.yaml' file. The umbrella chart's 'values.yaml' will override the ones in sub-charts.\n\n```text\n# fetch the chart repo ....\ngit clone https://github.com/SnappyDataInc/spark-on-k8s\n\n# Get the sub-charts required by the umbrella chart\ncd charts\nhelm dep up spark-umbrella\n\n# Now, install the chart in a namespace called 'spark'\nhelm install --name spark-all --namespace spark ./spark-umbrella/\n```\nThe above command will deploy the helm chart and will display instructions to access Zeppelin service and Spark UI.\n\n\u003e Note that this command will return quickly and kubernetes controllers will work in the background to achieve the state\nspecified in the chart. The command below can be used to access the notebook environment from any browser. \n\n```text\nkubectl get services --namespace spark -w\n# Note: this could take a while to complete. Use '-w' option to wait for state changes. 
\n```\n\nOnce everything is up and running you will see something like this:\n```text\nNAME                          TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                        AGE\nspark-all-jupyter-spark       LoadBalancer   10.63.246.130   35.184.71.164   8888:31540/TCP,4040:30922/TCP  9m\nspark-all-rss                 LoadBalancer   10.63.246.190   35.192.235.35   10000:31000/TCP                9m\nspark-all-zeppelin            LoadBalancer   10.63.254.150   35.192.68.147   8080:30522/TCP,4040:31236/TCP  9m\n```\n\u003e Access the zeppelin notebook environment using URL external-ip:8080 from any browser.\n\u003e Spark UI is accessible using URL external-ip:4040. \nNOTE that the Spark UI is only accessible after you have run at least one Spark job. Spark Driver (and hence UI) is lazily started. \nSimply navigate to \u003cZeppelin Home\u003e; Click 'Zeppelin Tutorial' and then 'Basic Features(Spark)'. Run the 'Load\ndata' paragraph followed by one or more SQL paragraphs. \n\n#### Launch the kubernetes dashboard\n\u003e You can launch the Kubernetes dashboard (If using GCP you can get to the dashboard from the GCP console) to inspect the \nvarious deployed objects, associated pods and even connect to a running container.\n```text\n# To launch the dashboard, do this ... We use a proxy to access the dashboard locally ...\nkubectl proxy\n\n# Goto URL localhost:8001/ui. The page will request a token .... \n# Get the token using ....\nkubectl config view | grep -A10 \"name: $(kubectl config current-context)\" | awk '$1==\"access-token:\"{print $2}' \n```\n\n#### Launch a Spark batch job\n\nThe spark distribution with support for kubernetes can be downloaded \n[here](https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.5.0)\n\nWe will use spark-submit from this distribution to deploy a batch job. Example below runs the built in SparkPi job. \nThe 'local://' URL will result in looking for the JAR in the launched container. 
The `spark.kubernetes.namespace` option \nindicates the namespace in which the Spark job will be executed.\n\n```text\n# Find your Kubernetes Master server IP and port number using 'kubectl cluster-info'. Substitute below. \nbin/spark-submit --master k8s://https://K8S-API-SERVER-IP:PORT --deploy-mode cluster --name spark-pi \\\n --class org.apache.spark.examples.SparkPi --conf spark.kubernetes.namespace=spark --conf spark.executor.instances=1 \\\n --conf spark.app.name=spark-pi --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1 \\\n --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1 \\\n  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar\n```\n\u003e If you face OAuth token expiry errors when you run spark-submit, it is likely because the token needs to be refreshed.\n The easiest way to fix this is to run any kubectl command, say, `kubectl version`, and then retry your submission.\n\n#### Stop/delete everything\nYou can delete everything using 'helm delete'. Note that any changes to notebooks, data, etc will be gone too. \n```text\nhelm delete --purge spark-all\n```\n\n### Quickstart along with History server\n\n\u003e [History server](https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact) purpose: \nOther cluster managers for Spark (Standalone, Mesos, Yarn) provide a UI so one can monitor \nacross all Spark applications or analyze the metrics after a Job completes. With Kubernetes, currently there is no such \ncentralized admin utility to monitor Spark applications. However, Spark can be configured to use a shared folder to log events.\nEach application logs its events into a sub-folder, and we can use the Spark history server to monitor and analyze all apps. \nThe History server provides a UI that is very similar to the Spark UI for individual apps (exposed by the Spark Driver). \nSpark can log using NFS, HDFS or GCS (Google Cloud Storage).  
\nWe walk through the setup for using the History server using our umbrella Helm chart. \n\n\n#### Setup shared storage \nWe need a shared persistent volume to manage state: Spark events from distributed applications(pods) and Notebooks (developed \nusing Zeppelin or Jupyter) that you want to preserve/share. Note containers only manage ephemeral state. You need to \nconfigure external persistence so your data survives pod failures or restarts.  \n\n\u003e We describe the steps to use Google cloud storage for Spark Events. We will describe the steps to setup NFS as a \npossible solution across cloud environments, in the future. \n\n#### Steps to setup storage using Google Cloud Storage (GCS)\nIn this example, we use Google Cloud Storage(GCS) to persist the events generated by Spark applications. You don't need \nthe steps below if you decide to use other schemes like 'hdfs' or 's3' storage.\n\nUsing Google cloud utilities (gsutil and gcloud ; should already be setup on your local laptop), we create a GCS bucket \nand associate it with your GCP project.  \n\n    ```\n    # Create a bucket using gsutil\n    # NOTE: Bucket names have to be globally unique. Pick a unique name if spark-history-server bucket exists.\n    gsutil mb -c nearline gs://spark-history-server-store\n    # Specify a account name\n    export ACCOUNT_NAME=sparkonk8s-test\n    # Change below to specify your Google cloud project name. Use 'gcloud config list' if you don't know. 
\n    export GCP_PROJECT_ID=your-gcp-project-id\n    # Create a service account and generate credentials\n    gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name \"${ACCOUNT_NAME}\"\n    gcloud iam service-accounts keys create \"${ACCOUNT_NAME}.json\" --iam-account \"${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com\"\n    # Grant admin rights to the bucket\n    gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member \"serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com\" --role roles/storage.admin\n    gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://spark-history-server-store\n    ```\nIn order for the history server to be able to read from the GCS bucket, we need to mount the json key file on the history \nserver pod. Copy the json file into the 'conf/secrets' directory of the umbrella chart.\n\n```text\ncp sparkonk8s-test.json spark-umbrella/conf/secrets/\n```\n\nBy default, the umbrella chart does not deploy the History server. We enable the History server deployment by modifying the\n'values.yaml' file. We also specify the GCS bucket path created above. The History server will read Spark events from this path.  
\n\n```text\nhistoryserver:\n  # whether to enable history server\n  enabled: true\n  historyServerConf:\n    # URI of the GCS bucket\n    eventsDir: \"gs://spark-history-server-store\"\n```\n\nNext, set the SPARK_HISTORY_OPTS so that history server uses json key file while accessing the GCS bucket\n```text\nenvironment:\n  SPARK_HISTORY_OPTS: -Dspark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/sparkonk8s-test.json\n```\n\nFinally, we configure Zeppelin to log events to the same GCS bucket\n\n```text\nzeppelin:\n\n  environment:\n    SPARK_SUBMIT_OPTIONS: \u003e-\n       --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1\n       --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1\n       --conf spark.executor.instances=2\n       --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/sparkonk8s-test.json\n\n  sparkEventLog:\n    enableHistoryEvents: true\n    # eventsLogDir should point to a URI of GCS bucket where history events will be dumped\n    eventLogDir: \"gs://spark-history-server-store\"\n```\n\n#### Launch Spark, Zeppelin and History server cluster\n\nFollow the Helm install command to launch everything as [described above](#launch-Spark-and-Notebook-servers). For Spark\nbatch job (Spark-submit) follow instructions below (you need additional configuration)\n\nYou can access the History server UI using URL History-server-external-IP:18080\n\u003e Note: When using GCS for logging the logs become visible only when the Spark application exits. You may have to \nrestart the Zeppelin interpreter to view the logs. Use the Zeppelin Spark Driver UI for current state.   \n\u003e The spark-submit logs should be immediately accessible from the history server\n\n##### Enable spark-submit to log spark history events\nThe spark-submit example below shows Spark job that logs historical events to the GCS bucket created in above steps. 
\nOnce job finishes, use the Spark history server UI to view the job execution details.\n\n  ```\n  bin/spark-submit \\\n      --master k8s://https://\u003ck8s-master-IP\u003e \\\n      --deploy-mode cluster \\\n      --name spark-pi \\\n      --conf spark.kubernetes.namespace=spark \\\n      --class org.apache.spark.examples.SparkPi \\\n      --conf spark.eventLog.enabled=true \\\n      --conf spark.eventLog.dir=gs://spark-history-server-store/ \\\n      --conf spark.executor.instances=2 \\\n      --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/etc/secrets/sparkonk8s-test.json \\\n      --conf spark.kubernetes.driver.secrets.history-secrets=/etc/secrets \\\n      --conf spark.kubernetes.executor.secrets.history-secrets=/etc/secrets \\\n      --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1 \\\n      --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1 \\\n      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar\n  ```\n\n#### Deleting the chart\nUse `helm delete` command to delete the chart\n```text\nhelm delete --purge spark-all\n```\n\n---\n\n\n## Want to dig more?\n\n### How does it work ?\n\n#### Kubernetes\nKubernetes follows a master/worker architecture. Master Node is the control plane which contains the components that \nmake global decisions about the cluster, apiserver exposes Kubernetes API, etcd is used as the backing store for all \ncluster data, controller-manager is responsible for running all controllers(Node Controller, Replication Controller, \nEndpoints Controller, Service Account \u0026 Token Controllers, etc). Worker Node is the server where containers are deployed, \nkubelet is an agent running on each worker node and ensures that all containers are running and stay healthy. 
kube-proxy \nenables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.\n\n![Kubernetes Architecture](kubernetes-how-does-it-work.1.png)\n\n#### Spark on Kubernetes\nWhen a Spark application is submitted to the Master Node of Kubernetes cluster, a driver pod will be created first. \nOnce the driver pod is up and running, it will communicate back to Master Node and asks for executor pods creation, \nonce the executor pods are created, they will communicate with driver pod and start accepting tasks.\n\n![spark on k8s Architecture](spark-on-kubernetes-how-does-it-work.2.png)\n\n(see [description](https://azureq.gitbooks.io/data-on-kubernetes/content/chapter1.html) from Qi Shao for more details)\n\n\n#### Composite Spark applications using Helm Charts\n\n![High Level Architecture](k8s-helm-spark-architecture-draw.io.png)\n\nThe above graphic shows how Helm is used to deploy the various Kubernetes objects and the interactions amongst them. \nA Helm chart is a bundle of information necessary to create an instance of a Kubernetes application. This information is \nin YAML files stored as a template. The YAML is a specification for a Kubernetes object like a Service or a Pod. \nThe config contains configuration information that can be merged into a packaged chart to create a releasable object. The\nconfiguration is obtained from the values.yaml file. A release is a running instance of a chart, combined with a specific config.\n\n\nWhen a client executes a 'helm install \u003cchart\u003e' the Helm client communicates with the Helm server (Tiller) running in the\nk8s cluster to combine the chart with configuration information to build a release. This release is deployed as k8s objects \nusing the k8s API. Helm keeps a track of subsequent releases - upgrading and uninstalling by interacting with k8s. 
Read \nmore about helm architecture [here](https://docs.helm.sh/architecture/).\n\n\nIn our case, when the umbrella chart is deployed, it launches the Notebook server pod(s) and the History server. We also \ncreate [LoadBalancer service](https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer) objects opening endpoints \nso the Notebook servers and the history server UI are accessible from outside Kubernetes. \nWhen a notebook Spark paragraph is executed, the notebook server launches an 'in-cluster client' Spark driver within the \nsame pod as the notebook server. The driver is automatically configured to use the k8s master as the cluster manager. \nK8s then launches the executor pods. All these Pods use the configured storage to write Spark events to a shared folder \nwhich is also accessible from the History server. \n\n\nWhen launching Spark batch jobs using spark-submit, the driver creates executors, which also run within Kubernetes\n pods, connects to them, and executes application code. When the application completes, the executor pods terminate \n and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s \n eventually garbage collected or manually cleaned up.\nNote that in the completed state, the driver pod does not use any computational or memory resources.\n\n\nThe driver and executor pod scheduling is handled by Kubernetes. It is possible to schedule the driver and executor pods \non a subset of available nodes through a node selector using the configuration property for it. It will be possible to \nuse more advanced scheduling hints like node/pod affinities in a future release.\n\n\n### Submitting Applications to Kubernetes (details)\n\nUse `spark-submit` to submit Spark batch jobs. The quickstart above provided an example of how to run the Pi Application\nthat was packaged within the docker image using the `local:///\u003cpath to JAR\u003e` URI. 
\n\n\nThe Spark master, specified either via passing the `--master` command line argument to `spark-submit` or by setting\n`spark.master` in the application's configuration, must be a URL with the format `k8s://\u003capi_server_url\u003e`. Prefixing the\nmaster string with `k8s://` will cause the Spark application to launch on the Kubernetes cluster, with the API server\nbeing contacted at `api_server_url`. If no HTTP protocol is specified in the URL, it defaults to `https`. For example,\nsetting the master to `k8s://example.com:443` is equivalent to setting it to `k8s://https://example.com:443`, but to\nconnect without TLS on a different port, the master would be set to `k8s://http://example.com:8443`.\n\nOne way to discover the apiserver URL is by executing `kubectl cluster-info`.\n\n    \u003e kubectl cluster-info\n    Kubernetes master is running at http://127.0.0.1:8080\n\nIn the above example, the specific Kubernetes cluster can be used with spark submit by specifying\n`--master k8s://http://127.0.0.1:8080` as an argument to spark-submit.\n\nNote that applications can currently only be executed in cluster mode, where the driver and its executors are running on\nthe cluster.\n\n#### Configuring Service Account\nWhen Kubernetes [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/) is enabled,\nthe `default` service account used by the driver may not have appropriate pod `edit` permissions\nfor launching executor pods. We recommend to add another service account, say `sparkjob`, with\nthe necessary privilege. For example:\n\n    kubectl create serviceaccount sparkjob\n    kubectl create clusterrolebinding spark-edit --clusterrole edit --serviceaccount spark:sparkjob \n\nIn the above command, `--serviceaccount` option accepts value of the format 'namespace:serviceAccount'. 
Here we have \nassigned the `edit` role to the `sparkjob` service account for the namespace called `spark`.\n\nOne can then modify the `global` section in the values.yaml file to specify the service account to use. \n\n    global:\n      serviceAccount: sparkjob\n\n### Dependency Management\n\nApplication dependencies that are being submitted from your machine need to be sent to a **resource staging server**\nthat the driver and executor can then communicate with to retrieve those dependencies. The umbrella chart described in \n[quickstart](#quickstart) deploys the resource staging server.  The command below shows how to use the resource staging\nserver to supply the spark-examples jar. This jar will be copied from your local machine to the resource staging server,\nwhich will make it available to the Spark driver and executors during job execution.\n\n\u003e Note: The spark distribution with support for kubernetes can be downloaded [here](https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.5.0)\nWe will use spark-submit from this distribution to deploy a batch job. The example below runs the built-in SparkPi job. 
\n\n    bin/spark-submit \\\n      --deploy-mode cluster \\\n      --class org.apache.spark.examples.SparkPi \\\n      --master k8s://https://\u003ck8s-master-IP\u003e \\\n      --conf spark.kubernetes.namespace=spark \\\n      --conf spark.executor.instances=5 \\\n      --conf spark.app.name=spark-pi \\\n      --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1 \\\n      --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1 \\\n      --conf spark.kubernetes.initcontainer.docker.image=snappydatainc/spark-init:v2.2.0-kubernetes-0.5.1 \\\n      --conf spark.kubernetes.resourceStagingServer.uri=http://\u003cURI of resource staging server as displayed on console while deploying it\u003e \\\n      examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar\n\n\u003e Note: The URL of the resource staging server can be found using the 'kubectl get svc' command. Use the externalIp:port\ncombination of the rss service to form the URL.\n\n#### Dependency Management Without The Resource Staging Server\n\nNote that this resource staging server is only required for submitting local dependencies. If your application's\ndependencies are all hosted in remote locations like HDFS or http servers, they may be referred to by their appropriate\nremote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images. Those dependencies\ncan be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH`\nenvironment variable in your Dockerfiles.\n\n### Dynamic Executor Scaling\nSpark provides a mechanism to dynamically adjust the number of executors your application uses based on the workload. \nThis means that your application can reduce the number of executors when there is no demand and request them again later\nwhen there is demand. 
This feature is particularly useful if multiple applications share resources in your Spark cluster.\n\nSpark on Kubernetes supports Dynamic Allocation. This mode requires running an external shuffle \nservice. This is typically a daemonset with a provisioned hostpath volume. This shuffle service may be shared by \nexecutors belonging to different SparkJobs. The umbrella chart described in [quickstart](#quickstart) deploys \nshuffle service daemonset.\n\nSpark application can target a particular shuffle service based on the labels assigned to the pods in the shuffle \ndaemonset. For example, the umbrella chart creates a shuffle service daemon set and has pods with labels \napp=spark-shuffle-service and spark-version=2.2.0, we can use those tags to target that particular shuffle service at \njob launch time. In order to run a job with dynamic allocation enabled, the command may then look like the following:\n\n```\n  bin/spark-submit \\\n    --deploy-mode cluster \\\n    --class org.apache.spark.examples.GroupByTest \\\n    --master k8s://https://\u003ck8s-master-IP\u003e \\\n    --conf spark.kubernetes.namespace=spark \\\n    --conf spark.app.name=group-by-test \\\n    --conf spark.local.dir=/tmp/spark-local \\\n    --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1 \\\n    --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1 \\\n    --conf spark.dynamicAllocation.enabled=true \\\n    --conf spark.shuffle.service.enabled=true \\\n    --conf spark.kubernetes.shuffle.namespace=default \\\n    --conf spark.kubernetes.shuffle.labels=\"app=spark-shuffle-service,spark-version=2.2.0\" \\\n    local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar 10 400000 2\n```\n\nIn order to enable dynamic executor scaling for Zeppelin notebooks, one may modify the 'values.yaml' and set \nSPARK_SUBMIT_OPTIONS accordingly. 
For example,\n\n```\n  SPARK_SUBMIT_OPTIONS: \u003e-\n     --conf spark.kubernetes.driver.docker.image=snappydatainc/spark-driver:v2.2.0-kubernetes-0.5.1\n     --conf spark.kubernetes.executor.docker.image=snappydatainc/spark-executor:v2.2.0-kubernetes-0.5.1\n     --conf spark.local.dir=/tmp/spark-local\n     --conf spark.driver.cores=\"300m\"\n     --conf spark.dynamicAllocation.enabled=true\n     --conf spark.shuffle.service.enabled=true\n     --conf spark.kubernetes.shuffle.namespace=spark\n     --conf spark.kubernetes.shuffle.labels=\"app=spark-shuffle-service,spark-version=2.2.0\"\n     --conf spark.dynamicAllocation.initialExecutors=0\n     --conf spark.dynamicAllocation.minExecutors=1\n     --conf spark.dynamicAllocation.maxExecutors=5\n```\n\n### Persistent Volume configuration for charts\n\nBy default, this chart dynamically provisions persistent volumes (PVs) for Jupyter and Zeppelin. These PVs can be used for notebook\nstorage. When the Helm chart is deleted, the volume claims and PVs are not deleted. This allows users to reuse the \npersistent volume claims (PVCs) if the chart is deployed again. 
A user can specify the name of an already created PVC in \nthe `persistence.existingClaim` field of the Zeppelin/Jupyter configuration when the chart is deployed again.\n\nFor example, if you deploy the umbrella chart as follows:\n```\n  helm install --name spark-all --namespace spark ./spark-umbrella\n```\nThis deployment creates two PVCs and dynamically provisions volumes for them.\n\n```\n  $ kubectl get pvc --namespace=spark\n  NAME                                     STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE\n  spark-all-jupyter                        Bound     pvc-4dc8dbc2-4931-11e8-86bc-42010a800173   6Gi        RWO            standard       2m\n  spark-all-zeppelin                       Bound     pvc-4dc9b4cd-4931-11e8-86bc-42010a800173   8Gi        RWO            standard       2m\n\n```\n\nWhen the deployment is deleted, the following message is shown, indicating that the PVCs have not been removed:\n\n```\n  $ helm delete --purge spark-all\n  These resources were kept due to the resource policy:\n  [PersistentVolumeClaim] spark-all-zeppelin\n  [PersistentVolumeClaim] spark-all-jupyter\n\n  release \"spark-all\" deleted\n```\n\nTo reuse these PVCs in a subsequent deployment, modify the `persistence.existingClaim` field in `values.yaml`.\n\nFor example:\n\n```\n  zeppelin:\n    persistence:\n      existingClaim: spark-all-zeppelin\n```\n\nSimilarly, for Jupyter:\n\n```\n  jupyter:\n    persistence:\n      existingClaim: spark-all-jupyter\n```\n\nDeploy the umbrella chart again and the same volumes will be bound again:\n```\n  helm install --name spark-all --namespace spark ./spark-umbrella\n```\n\nNote that if you do not specify the `persistence.existingClaim` fields and the PVCs already exist, the chart installation fails:\n\n```\n  $ helm install --name spark-all --namespace spark ./spark-umbrella/\n  Error: release spark-all failed: persistentvolumeclaims \"spark-all-jupyter\" already 
exists\n```\n\n\u003eNote: A user can specify a manually created persistent volume claim (PVC) in the `persistence.existingClaim` field. This is useful\nif one wants to use an existing PVC instead of having the chart provision a new volume dynamically.\n\n\n### Configuring sub-charts\n\nYou can configure the components in the umbrella chart's [values.yaml](charts/spark-umbrella/values.yaml). \n\nDetailed descriptions of the various attributes can be found in the README of each sub-chart, linked\nbelow.\n\n- [Zeppelin](charts/zeppelin-with-spark/README.md#chart-configuration)\n- [Jupyter](charts/jupyter-with-spark/README.md#configuration-properties-list)\n- [Spark History Server](charts/spark-hs/README.md#configuration)\n- [Shuffle Service](charts/spark-shuffle/README.md#configuration)\n- [Resource Staging Server](charts/spark-rss/README.md#configuration)\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftibcosoftware%2Fsnappy-on-k8s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftibcosoftware%2Fsnappy-on-k8s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftibcosoftware%2Fsnappy-on-k8s/lists"}