Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample
Spark ETL example processing New York taxi rides public dataset on EKS
- Host: GitHub
- URL: https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample
- Owner: aws-samples
- License: mit-0
- Created: 2019-08-23T17:58:15.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-01-05T17:42:51.000Z (almost 2 years ago)
- Last Synced: 2024-08-01T19:45:08.852Z (4 months ago)
- Language: Scala
- Size: 5.83 MB
- Stars: 41
- Watchers: 7
- Forks: 29
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-aws-research - Apache Spark on EKS
README
# amazon-eks-spark-best-practices
Examples providing best practices for Apache Spark on Amazon EKS

## Prerequisites
* Docker
* Eksctl [https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html)
* One of the following:
  * An ECR repository accessible from the EKS cluster you will deploy (one can be created as shown below)
  * A Docker Hub account and public access for the EKS cluster you will deploy
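For the ECR option, a repository can be created up front with the AWS CLI; the repository name below is only an example:

```
# Create an ECR repository to hold the Spark image (name is an example).
aws ecr create-repository --repository-name spark-eks
```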
## Preparing the required Docker images

Run the following commands to build the Spark base image and the application image:
`cd spark-application`
`docker build -t <repository>/spark-eks:v3.1.2 .`
`docker push <repository>/spark-eks:v3.1.2`
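If the image goes to ECR, Docker has to authenticate against the registry before the push. A typical login sequence, with the region and account ID as placeholders, looks like this:

```
# Authenticate Docker with ECR before pushing (replace both placeholders).
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
```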
## Running the demo steps

* Create the EKS cluster using eksctl
`eksctl create cluster -f kubernetes/eksctl.yaml`
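The authoritative cluster definition is the repository's kubernetes/eksctl.yaml. Purely as a sketch of its likely shape, a minimal eksctl config with one node group and one Fargate profile could look like this (region, instance type, and capacity are assumptions):

```
# Illustration only -- kubernetes/eksctl.yaml in the repository is authoritative.
cat > eksctl-sketch.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-eks-best-practices   # matches the cluster name used below
  region: us-east-1                # assumption: use your own region
nodeGroups:
  - name: spark-nodes
    instanceType: m5.xlarge        # assumption
    desiredCapacity: 3             # assumption
fargateProfiles:
  - name: spark-fargate
    selectors:
      - namespace: spark-fargate   # matches the spark-fargate namespace below
EOF
eksctl create cluster -f eksctl-sketch.yaml
```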
* Deploy the Kubernetes autoscaler
`kubectl create -f kubernetes/cluster_autoscaler.yaml`
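You can confirm the autoscaler came up before moving on; the deployment name depends on the manifest, but cluster-autoscaler in kube-system is the usual one:

```
# Check the autoscaler deployment (name assumed from the common manifest).
kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=20
```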
* Create an IAM policy with the right permissions for the job (a sketch of one possible policy follows)
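The README does not spell the policy out. As a hedged sketch only, a job that reads its input from and writes its output to an S3 bucket (the bucket name and the exact actions are assumptions) could be covered by a policy like this:

```
# Hypothetical policy -- bucket name and actions are assumptions, adjust to your job.
cat > spark-job-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
EOF
aws iam create-policy --policy-name spark-job-policy \
  --policy-document file://spark-job-policy.json
```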
* Create two IAM roles for service accounts with the policy ARN from the previous step
```
eksctl create iamserviceaccount \
--name spark \
--namespace spark \
--cluster spark-eks-best-practices \
--attach-policy-arn <policy-arn> \
--approve --override-existing-serviceaccounts
```
```
eksctl create iamserviceaccount \
--name spark-fargate \
--namespace spark-fargate \
--cluster spark-eks-best-practices \
--attach-policy-arn <policy-arn> \
--approve --override-existing-serviceaccounts
```

* Launch Spark jobs on self-managed Amazon EKS node groups or on AWS Fargate
`kubectl apply -f examples/spark-job-hostpath-volume.yaml`
`kubectl apply -f examples/spark-job-fargate.yaml`
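Once a job is submitted, the driver and executor Pods can be watched directly in each namespace:

```
# Watch the driver and executor Pods come up.
kubectl get pods -n spark -w
kubectl get pods -n spark-fargate -w
```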
* Monitor Kubernetes Nodes and Pods via the Kubernetes Dashboard
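The Dashboard is not installed by default; one common way to deploy it is the upstream manifest (the pinned version here is an example), then reach it through a local proxy:

```
# Deploy the Kubernetes Dashboard from the upstream manifest (version is an example).
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
# Serve the Kubernetes API locally so the Dashboard UI can be opened in a browser.
kubectl proxy
```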
* Monitor the Spark job progress via the Spark UI. To do that, you can forward the Spark UI port to localhost and access it in your browser:
  * Get the Spark driver Pod name:
`kubectl get pod -n=spark`
  * Forward port 4040 from the Spark driver Pod:
`kubectl port-forward -n=spark <driver-pod-name> 4040:4040`
  * Access the Spark UI at http://localhost:4040