Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample
Spark ETL example processing New York taxi rides public dataset on EKS
- Host: GitHub
- URL: https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample
- Owner: aws-samples
- License: mit-0
- Created: 2019-08-23T17:58:15.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2023-01-05T17:42:51.000Z (almost 2 years ago)
- Last Synced: 2024-08-01T19:45:08.852Z (4 months ago)
- Language: Scala
- Size: 5.83 MB
- Stars: 41
- Watchers: 7
- Forks: 29
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-aws-research - Apache Spark on EKS
README
# amazon-eks-spark-best-practices
Examples providing best practices for Apache Spark on Amazon EKS

## Prerequisites
* Docker
* Eksctl [https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html)
* One of the following:
  * An ECR repository accessible from the EKS cluster you will deploy (one can be created as shown below)
  * A Docker Hub account and public access for the EKS cluster you will deploy
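For the ECR option, a repository can be created up front with the AWS CLI; the repository name below is only an example:

```
# Create an ECR repository to hold the Spark image (name is an example).
aws ecr create-repository --repository-name spark-eks
```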
## Preparing the required Docker images

Run the following commands to build the Spark base image and the application image:
`cd spark-application`
`docker build -t <repository>/spark-eks:v3.1.2 .`
`docker push <repository>/spark-eks:v3.1.2`
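If the image goes to ECR, Docker has to authenticate against the registry before the push. A typical login sequence, with the region and account ID as placeholders, looks like this:

```
# Authenticate Docker with ECR before pushing (replace both placeholders).
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
```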
## Running the demo steps

* Create the EKS cluster using eksctl
`eksctl create cluster -f kubernetes/eksctl.yaml`
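The authoritative cluster definition is the repository's kubernetes/eksctl.yaml. Purely as a sketch of its likely shape, a minimal eksctl config with one node group and one Fargate profile could look like this (region, instance type, and capacity are assumptions):

```
# Illustration only -- kubernetes/eksctl.yaml in the repository is authoritative.
cat > eksctl-sketch.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-eks-best-practices   # matches the cluster name used below
  region: us-east-1                # assumption: use your own region
nodeGroups:
  - name: spark-nodes
    instanceType: m5.xlarge        # assumption
    desiredCapacity: 3             # assumption
fargateProfiles:
  - name: spark-fargate
    selectors:
      - namespace: spark-fargate   # matches the spark-fargate namespace below
EOF
eksctl create cluster -f eksctl-sketch.yaml
```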
* Deploy the Kubernetes autoscaler
`kubectl create -f kubernetes/cluster_autoscaler.yaml`
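You can confirm the autoscaler came up before moving on; the deployment name depends on the manifest, but cluster-autoscaler in kube-system is the usual one:

```
# Check the autoscaler deployment (name assumed from the common manifest).
kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=20
```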
* Create an IAM policy with the right permissions for the job (a sketch of one possible policy follows)
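The README does not spell the policy out. As a hedged sketch only, a job that reads its input from and writes its output to an S3 bucket (the bucket name and the exact actions are assumptions) could be covered by a policy like this:

```
# Hypothetical policy -- bucket name and actions are assumptions, adjust to your job.
cat > spark-job-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
EOF
aws iam create-policy --policy-name spark-job-policy \
  --policy-document file://spark-job-policy.json
```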
* Create two IAM roles for service accounts with the policy ARN from the previous step
```
eksctl create iamserviceaccount \
--name spark \
--namespace spark \
--cluster spark-eks-best-practices \
--attach-policy-arn <policy-arn> \
--approve --override-existing-serviceaccounts
```
```
eksctl create iamserviceaccount \
--name spark-fargate \
--namespace spark-fargate \
--cluster spark-eks-best-practices \
--attach-policy-arn <policy-arn> \
--approve --override-existing-serviceaccounts
```

* Launch Spark jobs on self-managed Amazon EKS node groups or on AWS Fargate
`kubectl apply -f examples/spark-job-hostpath-volume.yaml`
`kubectl apply -f examples/spark-job-fargate.yaml`
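Once a job is submitted, the driver and executor Pods can be watched directly in each namespace:

```
# Watch the driver and executor Pods come up.
kubectl get pods -n spark -w
kubectl get pods -n spark-fargate -w
```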
* Monitor Kubernetes Nodes and Pods via the Kubernetes Dashboard
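The Dashboard is not installed by default; one common way to deploy it is the upstream manifest (the pinned version here is an example), then reach it through a local proxy:

```
# Deploy the Kubernetes Dashboard from the upstream manifest (version is an example).
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
# Serve the Kubernetes API locally so the Dashboard UI can be opened in a browser.
kubectl proxy
```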
* Monitor the Spark job progress via the Spark UI. To do that, you can forward the Spark UI port to localhost and access it in your browser:
  * Get the Spark driver Pod name:
`kubectl get pod -n=spark`
  * Forward port 4040 from the Spark driver Pod:
`kubectl port-forward -n=spark <driver-pod-name> 4040:4040`
  * Access the Spark UI at http://localhost:4040