# Running Apache Spark jobs on AKS
[Apache Spark][apache-spark] is a fast engine for large-scale data processing. As of the [Spark 2.3.0 release][spark-latest-release], Apache Spark supports native integration with Kubernetes clusters. Azure Kubernetes Service (AKS) is a managed Kubernetes environment running in Azure. This document details preparing and running Apache Spark jobs on an Azure Kubernetes Service (AKS) cluster.
## Prerequisites
To complete the steps in this article, you need the following:
* An Azure VM running Ubuntu
* Basic understanding of Kubernetes and [Apache Spark][spark-quickstart].
* An [Azure Container Registry][acr-create].
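If you don't have a registry yet, one minimal way to create it (the resource group and registry names below are illustrative, not from the original doc):
```azurecli
# Names are illustrative; ACR names must be globally unique, lowercase alphanumeric.
az group create --name myAcrGroup --location eastus
az acr create --resource-group myAcrGroup --name mysparkacr123 --sku Basic
```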
* Azure CLI [installed][azure-cli] on your development system. See below:
```
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
```
* **JDK 8** installed on your system. See below:
```
sudo apt-get install openjdk-8-jdk-headless
```
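You can verify the JDK installation with:
```bash
java -version
```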
* SBT ([Scala Build Tool][sbt-install]) installed on your system.
```
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkps://keyserver.ubuntu.com:443 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
```
* Git command-line tools installed on your system.
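On Ubuntu, for example:
```bash
sudo apt-get install git
```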
* Docker installed on your system. Example:
```bash
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```

## Create an AKS cluster
Spark is used for large-scale data processing and requires that Kubernetes nodes are sized to meet the Spark resource requirements. We recommend a minimum size of `Standard_D3_v2` for your Azure Kubernetes Service (AKS) nodes.
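If you first want to see which VM sizes are available in your region (an optional check, not part of the original steps):
```azurecli
az vm list-sizes --location eastus --output table
```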
If you need an AKS cluster that meets this minimum recommendation, run the following commands.
Create a resource group for the cluster.
```azurecli
az group create --name mySparkCluster --location eastus
```

Create the AKS cluster with nodes that are of size `Standard_D3_v2`.
```azurecli
az aks create --resource-group mySparkCluster --name mySparkCluster --node-vm-size Standard_D3_v2
```

Connect to the AKS cluster.
```azurecli
az aks get-credentials --resource-group mySparkCluster --name mySparkCluster
```

Install kubectl:
```
sudo az aks install-cli
```

IMPORTANT: If you are using Azure Container Registry (ACR) to store container images, configure authentication between AKS and ACR. See the [ACR authentication documentation][acr-aks] for these steps:
```
ACR_NAME=   # replace with your ACR name
SERVICE_PRINCIPAL_NAME=${ACR_NAME}-acr-service-principal   # reconstructed; the original showed only "-acr-service-principal"

# Populate the ACR login server and resource id.
ACR_LOGIN_SERVER=$(az acr show --name $ACR_NAME --query loginServer --output tsv)
ACR_REGISTRY_ID=$(az acr show --name $ACR_NAME --query id --output tsv)

# Create acrpull role assignment with a scope of the ACR resource.
SP_PASSWD=$(az ad sp create-for-rbac --name http://$SERVICE_PRINCIPAL_NAME --role acrpull --scopes $ACR_REGISTRY_ID --query password --output tsv)
echo "Service principal password: $SP_PASSWD"
echo "Make sure this is set!!"
sleep 20

# Get the service principal client id.
CLIENT_ID=$(az ad sp show --id http://$SERVICE_PRINCIPAL_NAME --query appId --output tsv)

# Output used when creating Kubernetes secret.
echo "Service principal ID: $CLIENT_ID"
```
## Install Spark on your VM & build the Spark container
### Build the Spark source
Before running Spark jobs on an AKS cluster, you need to build the Spark source code and package it into a container image. The Spark source includes scripts that can be used to complete this process.
Clone the Spark project repository to your development system.
```bash
git clone https://github.com/apache/spark
```

Change into the directory of the cloned repository and save the path of the Spark source to a variable.
```bash
cd spark
sparkdir=$(pwd)
```

Run the following command to build the Spark source code with Kubernetes support. **NOTE: This may take quite a while to complete! (15-30 min or longer)**
```bash
./build/mvn -Pkubernetes -DskipTests clean package
```

The following commands create the Spark container image and push it to a container image registry. Replace `registry.example.com` with the name of your container registry and `v1` with the tag you prefer to use. If using Docker Hub, this value is the registry name. If using Azure Container Registry (ACR), this value is the ACR login server name.
```bash
REGISTRY_NAME=registry.example.com
REGISTRY_TAG=v1
```

```bash
./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG build
```

Push the container image to your container image registry.
```bash
./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG push
```

## Prepare a Spark job
Next, prepare a Spark job. A jar file is used to hold the Spark job and is needed when running the `spark-submit` command. The jar can be made accessible through a public URL or pre-packaged within a container image. For this example, you can either use a pre-created jar file, or you can create one from scratch. (If you have an existing jar, feel free to substitute.)
### Option 1: Pre-created jar file
A sample jar with a SparkPi job can be found at https://docs.azuredatabricks.net/_static/libs/SparkPi-assembly-0.1.jar. This jar file runs a sample Spark job that calculates the value of Pi.
In your bash shell, run:
```bash
jarUrl="https://docs.azuredatabricks.net/_static/libs/SparkPi-assembly-0.1.jar"
```

Variable `jarUrl` now contains the publicly accessible path to the jar file.
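Optionally, verify the jar is actually reachable before submitting (a convenience check, not part of the original steps):
```bash
# Expect an HTTP 200 (or a redirect) status line.
curl -sI "$jarUrl" | head -n 1
```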
Now jump ahead to [Submit a Spark job](#submit-a-spark-job).
### Option 2: Create a new jar file
In this example, a sample jar is created to calculate the value of Pi. This jar is then uploaded to Azure storage. If you have an existing jar, feel free to substitute it.
Create a directory where you would like to create the project for a Spark job.
```bash
mkdir myprojects
cd myprojects
```

Create a new Scala project from a template.
```bash
sbt new sbt/scala-seed.g8
```

When prompted, enter `SparkPi` for the project name.
```bash
name [Scala Seed Project]: SparkPi
```

Navigate to the newly created project directory.
```bash
cd sparkpi
```

Run the following commands to add an SBT plugin, which allows packaging the project as a jar file.
```bash
touch project/assembly.sbt
echo 'addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")' >> project/assembly.sbt
```

Run these commands to copy the sample code into the newly created project and add all necessary dependencies.
```bash
EXAMPLESDIR="src/main/scala/org/apache/spark/examples"
mkdir -p $EXAMPLESDIR
cp $sparkdir/examples/$EXAMPLESDIR/SparkPi.scala $EXAMPLESDIR/SparkPi.scala

cat <<EOT >> build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
EOT

sed -ie 's/scalaVersion.*/scalaVersion := "2.11.11"/' build.sbt
sed -ie 's/name.*/name := "SparkPi",/' build.sbt
```

To package the project into a jar, run the following command.
```bash
sbt assembly
```

After successful packaging, you should see output similar to the following.
```bash
[info] Packaging /Users/me/myprojects/sparkpi/target/scala-2.11/SparkPi-assembly-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 10 s, completed Mar 6, 2018 11:07:54 AM
```

#### Copy job to storage
Create an Azure storage account and container to hold the jar file.
```azurecli
RESOURCE_GROUP=sparkdemo
STORAGE_ACCT=sparkdemo$RANDOM
az group create --name $RESOURCE_GROUP --location eastus
az storage account create --resource-group $RESOURCE_GROUP --name $STORAGE_ACCT --sku Standard_LRS
export AZURE_STORAGE_CONNECTION_STRING=`az storage account show-connection-string --resource-group $RESOURCE_GROUP --name $STORAGE_ACCT -o tsv`
```

Upload the jar file to the Azure storage account with the following commands.
```bash
CONTAINER_NAME=jars
BLOB_NAME=SparkPi-assembly-0.1.0-SNAPSHOT.jar
FILE_TO_UPLOAD=target/scala-2.11/SparkPi-assembly-0.1.0-SNAPSHOT.jar

echo "Creating the container..."
az storage container create --name $CONTAINER_NAME
az storage container set-permission --name $CONTAINER_NAME --public-access blob

echo "Uploading the file..."
az storage blob upload --container-name $CONTAINER_NAME --file $FILE_TO_UPLOAD --name $BLOB_NAME

jarUrl=$(az storage blob url --container-name $CONTAINER_NAME --name $BLOB_NAME --output json | tr -d '"')
```

Variable `jarUrl` now contains the publicly accessible path to the jar file.
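You can optionally confirm the upload (also not part of the original steps; this relies on the `AZURE_STORAGE_CONNECTION_STRING` exported above):
```azurecli
az storage blob list --container-name $CONTAINER_NAME --output table
```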
## Submit a Spark job
### Prepare Kubernetes context
If you haven't already done so, ensure that you can reach your Kubernetes cluster:
```
kubectl get nodes
```

You next need to add a service account and role binding to the Kubernetes cluster:
```
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```

### Start Kubernetes proxy
Start `kubectl proxy` in a **separate command-line window** with the following code.

```bash
kubectl proxy
```

In the first window, navigate back to the root of the Spark repository.
```bash
cd $sparkdir
```

### Spark Docker image
You have two options: you can use the image you have just built, or you can use a prebuilt Docker image.

#### Option 1: Use your image
Run the following in bash to store the image info in variables:
```bash
REGISTRY_NAME= # example: larrysreg.azurecr.io
REGISTRY_TAG= # example: v1
```

#### Option 2: Use pre-made image
The Spark image can be found at https://cloud.docker.com/u/larryms/repository/docker/larryms/spark
Run the following in bash to store the image info in variables:
```bash
REGISTRY_NAME=larryms
REGISTRY_TAG=v1
```

Submit the job using `spark-submit`.
```bash
./bin/spark-submit \
--master k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=$REGISTRY_NAME/spark:$REGISTRY_TAG \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
$jarUrl
```

This operation starts the Spark job, which streams job status to your shell session. While the job is running, you can see the Spark driver pod and executor pods using the `kubectl get pods` command. Open a second terminal session to run these commands.
```console
$ kubectl get pods

NAME                                               READY     STATUS     RESTARTS   AGE
spark-pi-2232778d0f663768ab27edc35cb73040-driver   1/1       Running    0          16s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-1   0/1       Init:0/1   0          4s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-2   0/1       Init:0/1   0          4s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-3   0/1       Init:0/1   0          4s
```

While the job is running, you can also access the Spark UI. In a third terminal session, use the `kubectl port-forward` command to provide access to the Spark UI.
```bash
kubectl port-forward spark-pi-2232778d0f663768ab27edc35cb73040-driver 4040:4040
```

To access the Spark UI, open the address `127.0.0.1:4040` in a browser.
![Spark UI](media/aks-spark-job/spark-ui.png)
## Get job results and logs
After the job has finished, the driver pod will be in a "Completed" state. Get the name of the pod with the following command.
```bash
# Note: --show-all was removed in newer kubectl releases; completed pods are
# shown by default there, so plain "kubectl get pods" also works.
kubectl get pods --show-all
```

Output:
```bash
NAME                                               READY     STATUS      RESTARTS   AGE
spark-pi-2232778d0f663768ab27edc35cb73040-driver   0/1       Completed   0          1m
```

Use the `kubectl logs` command to get logs from the Spark driver pod. Replace the pod name with your driver pod's name.
```bash
kubectl logs spark-pi-2232778d0f663768ab27edc35cb73040-driver
```

Within these logs, you can see the result of the Spark job, which is the value of Pi.
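If the log is long, you can filter for just the result line (a convenience not in the original steps; the pod name is the illustrative one from above):
```bash
# Print only the result line from the driver log.
kubectl logs spark-pi-2232778d0f663768ab27edc35cb73040-driver | grep "Pi is roughly"
```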
```bash
Pi is roughly 3.152155760778804
```

## Package jar with container image
In the above example, the Spark jar file was uploaded to Azure storage or pulled from a website. Another option is to package the jar file into your own custom-built Docker images.
To do so, find the `Dockerfile` for the Spark image in the `$sparkdir/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/` directory. Add an `ADD` statement for the Spark job jar somewhere between the `WORKDIR` and `ENTRYPOINT` declarations.
Update the jar path to the location of the `SparkPi-assembly-0.1.0-SNAPSHOT.jar` file on your development system. You can also use your own custom jar file.
```bash
WORKDIR /opt/spark/work-dir

ADD /path/to/SparkPi-assembly-0.1.0-SNAPSHOT.jar SparkPi-assembly-0.1.0-SNAPSHOT.jar
ENTRYPOINT [ "/opt/entrypoint.sh" ]
```

Build and push the image with the included Spark scripts.
```bash
# Reuses the REGISTRY_NAME and REGISTRY_TAG variables set earlier.
./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG build
./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG push
```

When running the job, instead of indicating a remote jar URL, the `local://` scheme can be used with the path to the jar file in the Docker image.
```bash
./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=$REGISTRY_NAME/spark:$REGISTRY_TAG \
local:///opt/spark/work-dir/SparkPi-assembly-0.1.0-SNAPSHOT.jar
```
# FIN
> [!NOTE]
> From the Spark [documentation][spark-docs]: "The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints."

## Next steps
Check out the Spark documentation for more details.
> [Spark documentation][spark-docs]
[apache-spark]: https://spark.apache.org/
[docker-hub]: https://docs.docker.com/docker-hub/
[java-install]: https://aka.ms/azure-jdks
[sbt-install]: https://www.scala-sbt.org/1.0/docs/Setup.html
[spark-docs]: https://spark.apache.org/docs/latest/running-on-kubernetes.html
[spark-latest-release]: https://spark.apache.org/releases/spark-release-2-3-0.html
[spark-quickstart]: https://spark.apache.org/docs/latest/quick-start.html
[acr-aks]: https://docs.microsoft.com/azure/container-registry/container-registry-auth-aks
[acr-create]: https://docs.microsoft.com/azure/container-registry/container-registry-get-started-azure-cli
[aks-quickstart]: https://docs.microsoft.com/azure/aks/
[azure-cli]: https://docs.microsoft.com/cli/azure/?view=azure-cli-latest
[storage-account]: https://docs.microsoft.com/azure/storage/common/storage-azure-cli