{"id":19772041,"url":"https://github.com/danieldacosta/etl-spark-stepfunctions","last_synced_at":"2026-04-17T19:32:27.995Z","repository":{"id":112641551,"uuid":"418662168","full_name":"DanielDaCosta/etl-spark-stepfunctions","owner":"DanielDaCosta","description":"ETL pipeline using Spark on EMR cluster and Step functions for orchestrations.","archived":false,"fork":false,"pushed_at":"2021-10-18T21:16:59.000Z","size":94,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-28T11:30:44.871Z","etag":null,"topics":["aws","aws-step-functions","etl","spark"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielDaCosta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-18T20:44:37.000Z","updated_at":"2021-10-18T21:18:40.000Z","dependencies_parsed_at":"2023-06-02T06:15:31.024Z","dependency_job_id":null,"html_url":"https://github.com/DanielDaCosta/etl-spark-stepfunctions","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DanielDaCosta/etl-spark-stepfunctions","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2Fetl-spark-stepfunctions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2Fetl-spark-stepfunctions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2Fetl-spark-stepfunctions/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2Fetl-spark-stepfunctions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanielDaCosta","download_url":"https://codeload.github.com/DanielDaCosta/etl-spark-stepfunctions/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2Fetl-spark-stepfunctions/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31943354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-17T17:29:20.459Z","status":"ssl_error","status_checked_at":"2026-04-17T17:28:47.801Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-step-functions","etl","spark"],"created_at":"2024-11-12T05:05:12.647Z","updated_at":"2026-04-17T19:32:27.973Z","avatar_url":"https://github.com/DanielDaCosta.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# ETL Job - Spark \u0026 Step Functions\nTo preprocess the raw data from S3, we will use EMR as our compute resource and AWS Step Functions as our orchestrator. 
The pipeline is triggered daily through a CloudWatch event.\n\n## Let's Begin!\n\n## Workflow\n\nThe state machine workflow that we will be building is the following:\n\n![Architecture](Images/step_function_architecture.png)\n\n## Input\nThe first step of our pipeline is to create our EMR cluster. The pipeline can also reuse an already created cluster, so the input JSON controls both cluster creation and termination. To create a new cluster and terminate it at the end, the input JSON for the pipeline is:\n```json\n{\n    \"CreateCluster\": true,\n    \"TerminateCluster\": true\n}\n```\n\nIf you want to use an already created cluster, you must pass its cluster ID to the pipeline, for example:\n\n```json\n{\n    \"CreateCluster\": false,\n    \"TerminateCluster\": false,\n    \"ClusterId\": \"{YOUR_CLUSTER_ID}\"\n}\n```\n\n## Create Cluster\nIn this state, we will:\n- **Configure the cluster's hardware**: type and number of instances\n- **Set IAM permissions**\n- **Configure the subnet**\n- **Enable Termination Protection**: Since the ETL pipeline may take many hours, we enable Termination Protection to guard the workflow against an accidental EMR termination.\n```json\n\"Should_Create_Cluster\": {\n      \"Type\": \"Choice\",\n      \"Choices\": [\n        {\n          \"Variable\": \"$.CreateCluster\",\n          \"BooleanEquals\": true,\n          \"Next\": \"Create_A_Cluster\"\n        },\n        {\n          \"Variable\": \"$.CreateCluster\",\n          \"BooleanEquals\": false,\n          \"Next\": \"Enable_Termination_Protection\"\n        }\n      ],\n      \"Default\": \"Create_A_Cluster\"\n    },\n    \"Create_A_Cluster\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::elasticmapreduce:createCluster.sync\",\n      \"Parameters\": {\n        \"Name\": \"{YOUR CLUSTER NAME}\",\n        \"VisibleToAllUsers\": true,\n        \"ReleaseLabel\": \"emr-6.2.0\",\n        \"Applications\": [\n          {\n            \"Name\": \"spark\"\n          },\n          {\n            \"Name\": \"Hive\"\n          }\n        ],\n        
\"Configurations\": [\n          {\n            \"Classification\": \"spark\",\n            \"Properties\": {\n              \"maximizeResourceAllocation\": \"true\"\n            }\n          }\n        ],\n        \"ServiceRole\": \"EMR_DefaultRole\",\n        \"JobFlowRole\": \"ec2_defaultrole\",\n        \"LogUri\": \"{S3-BUCKET-FOR-LOGGING}\",\n        \"Instances\": {\n          \"Ec2SubnetId\": \"{YOUR_SUBNET}\",\n          \"KeepJobFlowAliveWhenNoSteps\": true,\n          \"InstanceFleets\": [\n            {\n              \"InstanceFleetType\": \"MASTER\",\n              \"TargetOnDemandCapacity\": 1,\n              \"InstanceTypeConfigs\": [\n                {\n                  \"InstanceType\": \"m5.2xlarge\"\n                }\n              ]\n            },\n            {\n              \"InstanceFleetType\": \"CORE\",\n              \"TargetOnDemandCapacity\": 1,\n              \"InstanceTypeConfigs\": [\n                {\n                  \"InstanceType\": \"c5.4xlarge\"\n                }\n              ]\n            },\n            {\n              \"InstanceFleetType\": \"TASK\",\n              \"TargetSpotCapacity\": 4,\n              \"InstanceTypeConfigs\": [\n                {\n                  \"InstanceType\": \"c5.12xlarge\"\n                }\n              ]\n            }\n          ]\n        }\n      },\n      \"ResultPath\": \"$.CreateClusterResult\",\n      \"Next\": \"Merge_Results\"\n    },\n    \"Merge_Results\": {\n      \"Type\": \"Pass\",\n      \"Parameters\": {\n        \"CreateCluster.$\": \"$.CreateCluster\",\n        \"TerminateCluster.$\": \"$.TerminateCluster\",\n        \"ClusterId.$\": \"$.CreateClusterResult.ClusterId\"\n      },\n      \"Next\": \"Enable_Termination_Protection\"\n    },\n    \"Enable_Termination_Protection\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::elasticmapreduce:setClusterTerminationProtection\",\n      \"Parameters\": {\n        \"ClusterId.$\": 
\"$.ClusterId\",\n        \"TerminationProtected\": true\n      },\n      \"ResultPath\": null,\n      \"Next\": \"Step_One\"\n    },\n```\n\n## Add Spark Step\nThere are two different ways of running a script using an EMR Step:\n- `command-runner.jar`: Runs commands on your cluster. You specify `command-runner.jar` without its full path.\n- `script-runner.jar`: Hosted on Amazon S3 at `s3://\u003cregion\u003e.elasticmapreduce/libs/script-runner/script-runner.jar`, where `\u003cregion\u003e` is the Region in which your Amazon EMR cluster resides. You can use it to run scripts saved locally or on Amazon S3 on your cluster.\n\nFor more information, see the [Amazon EMR Documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-commandrunner.html).\n\nIn this tutorial, we use `script-runner.jar` to run a bash script that calls `spark-submit`.\n\n```json\n\"Step_One\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::elasticmapreduce:addStep.sync\",\n      \"Parameters\": {\n        \"ClusterId.$\": \"$.ClusterId\",\n        \"Step\": {\n          \"Name\": \"The first step\",\n          \"ActionOnFailure\": \"CONTINUE\",\n          \"HadoopJarStep\": {\n            \"Jar\": \"s3://us-west-1.elasticmapreduce/libs/script-runner/script-runner.jar\",\n            \"Args\": [\n              \"s3://{PATH_TO_YOUR_BASH_SCRIPT}\"\n            ]\n          }\n        }\n      },\n      \"Catch\": [\n        {\n          \"ErrorEquals\": [\n            \"States.TaskFailed\"\n          ],\n          \"ResultPath\": \"$.err_mgs_17\",\n          \"Next\": \"Disable_Termination_Protection\"\n        }\n      ],\n      \"ResultPath\": null,\n      \"Next\": \"Disable_Termination_Protection\"\n    },\n```\n\nOur bash script:\n\n```bash\n#!/bin/sh\n\naws s3 cp s3://{YOUR_BUCKET}/preprocessing_script/ ./ --recursive\n\nspark-submit --master yarn --deploy-mode cluster preprocessing.py\n```\n\nThe `preprocessing.py` file is your ETL Python script.\n\n## 
Terminate Cluster\nFirst, we have to disable Termination Protection; only then can the cluster be terminated.\n\n```json\n\"Should_Terminate_Cluster\": {\n      \"Type\": \"Choice\",\n      \"Choices\": [\n        {\n          \"Variable\": \"$.TerminateCluster\",\n          \"BooleanEquals\": true,\n          \"Next\": \"Terminate_Cluster\"\n        },\n        {\n          \"Variable\": \"$.TerminateCluster\",\n          \"BooleanEquals\": false,\n          \"Next\": \"Wrapping_Up\"\n        }\n      ],\n      \"Default\": \"Wrapping_Up\"\n    },\n    \"Terminate_Cluster\": {\n      \"Type\": \"Task\",\n      \"Resource\": \"arn:aws:states:::elasticmapreduce:terminateCluster.sync\",\n      \"Parameters\": {\n        \"ClusterId.$\": \"$.ClusterId\"\n      },\n      \"Next\": \"Wrapping_Up\"\n    },\n    \"Wrapping_Up\": {\n      \"Type\": \"Pass\",\n      \"End\": true\n    }\n\n```\n\n# References\n- https://docs.aws.amazon.com/step-functions/latest/dg/connect-emr.html\n- https://docs.aws.amazon.com/emr/latest/APIReference/API_RunJobFlow.html\n- https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-commandrunner.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Fetl-spark-stepfunctions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanieldacosta%2Fetl-spark-stepfunctions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Fetl-spark-stepfunctions/lists"}