Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/vitalibo/spark-aws-orchestration

Deployment/Orchestration of Apache Spark applications on Amazon EMR.
https://github.com/vitalibo/spark-aws-orchestration

aws cloudformation emr spark step-functions

Last synced: 3 months ago
JSON representation

Deployment/Orchestration of Apache Spark applications on Amazon EMR.

Host: GitHub
URL: https://github.com/vitalibo/spark-aws-orchestration
Owner: vitalibo
Created: 2018-10-01T01:44:00.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-10-19T06:53:58.000Z (over 2 years ago)
Last Synced: 2023-03-03T10:12:55.545Z (almost 2 years ago)
Topics: aws, cloudformation, emr, spark, step-functions
Language: Java
Homepage:
Size: 126 KB
Stars: 5
Watchers: 3
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Orchestration Apache Spark on Amazon EMR

This solution help you automate the deployment of Apache Spark applications on Amazon EMR.
Orchestration Apache Spark implemented via integration with AWS CloudFormation and AWS Step Functions.
Using AWS CloudFormation custom resource you can easily deploy your Apache Spark application on Amazon EMR cluster.
It is useful for deploying your application on dev environment for testing purpose or organize CI/CD of Real-Time application.
Integration with AWS Step Functions allow serverless* orchestration Apache Spark application and organize flexible Apache Spark based ETL pipeline.

[![Build Status](https://travis-ci.org/vitalibo/spark-aws-orchestration.svg?branch=master)](https://travis-ci.org/vitalibo/spark-aws-orchestration)

### Usage

This guide provide samples of deployment and orchestration Apache Spark applications on Amazon EMR.

##### Deploy Apache Spark application via AWS CloudFormation stack

One of the problems that solved this solution is deployment Apache Spark application.
Integration with AWS CloudFormation implemented via [Custom Resource](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources.html) (high level diagram you can see below).

![CloudFormation_Diagram](https://fwtbbmf399.execute-api.us-east-1.amazonaws.com/Prod/svg?source=https://raw.githubusercontent.com/vitalibo/spark-aws-orchestration/master/README.md&name=cloudformation.svg)

SVG code

```
@cloudformation.svg

VPC

AWS CloudFormation
Template

AWS CloudFormation...

AWS CloudFormation

AWS Lambda

Amazon EMR

Update
AWS CloudFormation stack

Update...

Create/Update/Delete events

Create/Update/Delete...

Interactive with Apache Spark through Apache Livy

Interactive with Apache Spark...Viewer does not support full SVG 1.1
@cloudformation.svg
```

The solution include the following steps:

1. The developer define `SparkJob` custom resource in CloudFormation template, which includes Apache Livy host name and Apache Spark parameters (`ClassName`, `NumExecutors`, `ExecutorCores` etc.).
2. Whenever anything changes in resource, AWS CloudFormation sends a request (`Create`/`Update`/`Delete`) to Apache Spark Job submitter.
3. Apache Spark Job submitter is a AWS Lambda that launch in VPC and interactive with Apache Spark through Apache Livy that launched on Amazon EMR in same VPC.

Following sample demonstrate usage AWS CloudFormation custom resource.

```yaml
SparkPi:
Type: 'Custom::SparkJob'
Properties:
ServiceToken: !Fn::ImportValue 'spark-job-submitter'
LivyHost: !GetAtt
- EmrSparkCluster
- MasterPublicDNS
LivyPort: 8998
Parameters:
Name: 'Spark Pi'
ClassName: 'org.apache.spark.examples.SparkPi'
NumExecutors: 1
DriverMemory: '512m'
ExecutorMemory: '512m'
ExecutorCores: 1
Args:
- 10
```

##### Orchestration Apache Spark based ETL pipeline via AWS Step Functions

Another popular problem is manage ETL task.
Amazon EMR provide out of the box installation of [Apache Oozie](https://oozie.apache.org), but no integration with others AWS services and difficult configure, manage Apache Spark applications might not be suitable for some users.
Integration with AWS Step Functions allow simplify deployment and manage ETL tasks with moder UI and integrations with other AWS services (high level solution diagram you can see below).

![StepFunctions_Diagram](https://fwtbbmf399.execute-api.us-east-1.amazonaws.com/Prod/svg?source=https://raw.githubusercontent.com/vitalibo/spark-aws-orchestration/master/README.md&name=stepfunctions.svg)

SVG code

```
@stepfunctions.svg

Start

FetchAnOrder

RegionChoice

CreateOrederB

CreateOrderA

OrderOK

ProcessOrder

DatabaseError

UnservedRegion

NoOrderPossible

End

AWS Step Function

Interactive with Apache Spark
through Apache Livy

Interactive with Apache Spark...

Java Application

Apache Spark

Amazon EMR

Master Instance

Core Instances

...

Instance 1

Instance N

AWS Step Functions

Create new activity task

Pulls activity tasks

Pulls activity tasksViewer does not support full SVG 1.1
@stepfunctions.svg
```

The solution include the following steps:

1. The developer define `spark-job` activity task in your state machine template using Amazon State Language, which includes Apache Livy host name and Apache Spark parameters (`ClassName`, `NumExecutors`, `ExecutorCores` etc.).
2. Trigger the AWS Step Function state machine by other external events, like CloudWatch Event or S3 object creation.
3. AWS Step Function referenced on activity Arn create new activity task and pass Apache Spark parameters as input state.
4. On Amazon EMR master instance running activity worker that polls activity task and interactive with Apache Spark through Apache Livy based on input state.
Every N seconds application synchronize Apache Spark job state and send heartbeat to AWS Step Function.

Following sample demonstrate usage AWS Step Functions custom Apache Spark activity.

```json
{
"Comment": "Spark ETL State Machine",
"StartAt": "SparkPi",
"TimeoutSeconds": 300,
"States": {
"SparkPi": {
"Type": "Task",
"Resource": "arn:aws:states:us-east-1:123456789012:activity:example.spark-job",
"Parameters": {
"LivyHost.$": "$.SparkClusterPublicDNS",
"LivyPort": 8998,
"Parameters": {
"Name": "Spark Pi",
"ClassName": "org.apache.spark.examples.SparkPi",
"NumExecutors": 1,
"DriverMemory": "512m",
"ExecutorMemory": "512m",
"ExecutorCores": 1,
"Args": [
"10"
]
}
},
"End": true
}
}
}
```

### Links

- [Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy](https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/)
- [Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy](https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/)