Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample

Spark ETL example processing New York taxi rides public dataset on EKS
https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample

Last synced: 2 months ago
JSON representation

Spark ETL example processing New York taxi rides public dataset on EKS

Awesome Lists containing this project

README

        

# amazon-eks-spark-best-practices
Examples providing best practices for Apache Spark on Amazon EKS

## Pre-requisite

* Docker
* Eksctl [https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html)
* One of those:
* An ECR repository accessible from the EKS cluster you will deploy
* A Dockerhub account and public access for the EKS cluster you will deploy

## Preparing the required Docker images

Run the folowwing command to build respectively a spark base image and the application image

`cd spark-application`

`docker build -t /spark-eks:v3.1.2 .`

`docker push /spark-eks:v3.1.2`

## Running the demo steps

* Create the EKS cluster using eksctl

`eksctl create cluster -f kubernetes/eksctl.yaml`

* Deploy the Kubernetes autoscaler

`kubectl create -f kubernetes/cluster_autoscaler.yaml`

* Create an Amazon IAM Policy with the right permissions for the job

* Create two IAM role for service accounts with the previous Policy ARN
```
eksctl create iamserviceaccount \
--name spark \
--namespace spark \
--cluster spark-eks-best-practices \
--attach-policy-arn \
--approve --override-existing-serviceaccounts
```
```
eksctl create iamserviceaccount \
--name spark-fargate \
--namespace spark-fargate \
--cluster spark-eks-best-practices \
--attach-policy-arn \
--approve --override-existing-serviceaccounts
```

* Launch Spark jobs with self managed Amazon EKS Nodegroups or with AWS Fargate

`kubectl apply -f examples/spark-job-hostpath-volume.yaml`

`kubectl apply -f examples/spark-job-fargate.yaml`

* Monitor Kubernetes Nodes and Pods via the Kubernetes Dashboard

* Monitor the Spark job progress via the Spark UI. To do that I can forward the Spark UI port to localhost and access it via my browser
* Get the Spark driver Pod name
* Forward the 4040 port from the Spark driver Pod
* Access the Spark UI via this URL https://localhost:4040

`kubectl get pod -n=spark`

`kubectl port-forward -n=spark 4040:4040`