{"id":20162730,"url":"https://github.com/airscholar/emr-for-data-engineers","last_synced_at":"2025-04-10T00:36:11.143Z","repository":{"id":234048221,"uuid":"717891933","full_name":"airscholar/EMR-for-data-engineers","owner":"airscholar","description":"This project demonstrates the use of Amazon Elastic Map Reduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.","archived":false,"fork":false,"pushed_at":"2023-11-12T22:42:39.000Z","size":524,"stargazers_count":7,"open_issues_count":0,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T02:21:58.317Z","etag":null,"topics":["apache-spark","aws","aws-s3","emr-cluster"],"latest_commit_sha":null,"homepage":"https://youtu.be/ZFns7fvBCH4","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/airscholar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-11-12T22:41:51.000Z","updated_at":"2025-02-02T20:29:50.000Z","dependencies_parsed_at":"2024-04-18T02:57:14.990Z","dependency_job_id":"b94fb852-4e19-43cf-b440-c480155dfb4a","html_url":"https://github.com/airscholar/EMR-for-data-engineers","commit_stats":null,"previous_names":["airscholar/emr-for-data-engineers"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FEMR-for-data-engineers","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/airscholar%2FEMR-for-data-engineers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FEMR-for-data-engineers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/airscholar%2FEMR-for-data-engineers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/airscholar","download_url":"https://codeload.github.com/airscholar/EMR-for-data-engineers/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137950,"owners_count":21053771,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","aws","aws-s3","emr-cluster"],"created_at":"2024-11-14T00:26:39.627Z","updated_at":"2025-04-10T00:36:11.127Z","avatar_url":"https://github.com/airscholar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS EMR Data Processing for Data Engineers\n\n## Description\nThis project demonstrates the use of Amazon Elastic Map Reduce (EMR) for processing large datasets using Apache Spark. 
It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.\n\n## System Architecture\n![Architecture.png](assets%2FArchitecture.png)\n\n## Project Structure\n- `spark-etl.py`: The main Spark script used for ETL operations.\n- `commands.py`: Scripts for AWS EMR cluster setup and management.\n- `data/`: Directory containing the dataset used in the ETL process.\n\n## Spark Script\n`spark-etl.py` is a Python script that uses Apache Spark to perform ETL operations. It reads data from an input directory, adds a timestamp to it, and writes the result to an output directory in Parquet format.\n\n### Usage\nTo run the script, use the following command:\n```\nspark-submit spark-etl.py [s3-input-folder] [s3-output-folder]\n```\nReplace `[s3-input-folder]` with the path to the input data directory and `[s3-output-folder]` with the path where you want to save the output.\n\n## AWS Commands\nThe `commands.py` file contains detailed instructions and the scripts needed to set up and manage an AWS EMR cluster. This includes steps for creating an EMR cluster, configuring the necessary services, and submitting Spark jobs.\n\n## Data\nThe `data/` directory contains the dataset used for the ETL process. 
This dataset is a sample that represents the type of data the Spark script is designed to process.\n\n## Requirements\n- Apache Spark\n- AWS CLI\n- An AWS account with necessary permissions to create and manage EMR clusters\n\n## Watch the Video Tutorial\nFor a complete walkthrough and practical demonstration, check out the video here:\n\n[![EMR Masterclass](https://img.youtube.com/vi/ZFns7fvBCH4/0.jpg)](https://www.youtube.com/watch?v=ZFns7fvBCH4)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Femr-for-data-engineers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fairscholar%2Femr-for-data-engineers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fairscholar%2Femr-for-data-engineers/lists"}