https://github.com/wednesday-solutions/data-engineering-onboarding-starter
This repository contains a 10-step program to enter the world of Data Engineering.
- Host: GitHub
- URL: https://github.com/wednesday-solutions/data-engineering-onboarding-starter
- Owner: wednesday-solutions
- Created: 2023-07-25T13:48:36.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-10T07:06:28.000Z (about 1 year ago)
- Last Synced: 2025-03-27T11:38:34.135Z (6 months ago)
- Topics: aws, aws-glue, data, data-engg-learning, data-engineering, data-engineering-starter, data-template, dataengg, dataengg-template, etl, glue, spark, workflow
- Language: Python
- Homepage: https://wednesday.is/building-products/?utm_source=github&utm_medium=dataengg_template
- Size: 6.42 MB
- Stars: 14
- Watchers: 3
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
## README
# Data Engineering Onboarding Starter
This comprehensive starter kit offers an immersive introduction to data engineering, with a curated set of resources, tools, and best practices. It is designed to teach you the basics of data engineering and get you up and running with your first data engineering project.
---
[CD](https://github.com/wednesday-solutions/data-engg/actions/workflows/cd.yml) | [CI](https://github.com/wednesday-solutions/data-engg/actions/workflows/ci.yml)
---
## Prerequisites
1. [Python3 with PIP](https://www.python.org/downloads/)
2. [Install Java 8](https://www.oracle.com/in/java/technologies/downloads/#java8-mac)
3. [AWS CLI configured locally](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
4. [Docker](https://docs.docker.com/desktop/install/mac-install/) (Optional)

## Folder Structure
```
├── Makefile | -> Allows you to run commands for setup, test, lint, etc
├── README.md | -> Documentation for the project setup and usage
│
├── automation | -> Contains scripts to automate deployment and testing
│ └── deploy_glue_job.sh | -> Script to deploy or update glue job
│
├── examples | -> Contains example scripts to demonstrate pyspark features
│ ├── 01_pyspark_dataframe | -> Create a DataFrame by reading data from a source (CSV, Parquet, Database, etc)
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to read csv file and write to parquet
│ ├── 02_applying_filters | -> Apply filters on a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to apply filters on dataframe
│ ├── 03_transform_columns | -> Transform columns & manipulate data in a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to transform columns
│ ├── 04_remap_columns | -> Normalise columns in a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to normalise columns in a dataframe
│ ├── 05_complex_transformations | -> Perform complex transformations on a dataframe
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to perform some complex transformations
│ ├── 06_write_dataframe | -> Write a dataframe to a target
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to write dataframe to parquet or RDBMS Database
│ ├── 07_pyspark_in_glue_jobs | -> Examples of using PySpark in AWS Glue Jobs
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to run pyspark script in glue job
│ ├── 08_glue_dynamic_frame | -> Create a DynamicFrame by reading data from a data catalog
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to create a dynamic frame from a data catalog
│ ├── 09_apply_mappings | -> Apply mappings on a dynamic frame (change column names, data types, etc)
│ │ ├── README.md | -> Contains instructions to run the example
│ │ └── main.py | -> Example script to apply mappings on dynamic frame
│ └── 10_write_to_target | -> Write a dynamic frame to a target (CSV, Parquet, Database, etc)
│ ├── README.md | -> Contains instructions to run the example
│ └── main.py | -> Example script to write dynamic frame to parquet and store in S3
│
└── src | -> Contains all the source code for the onboarding exercise
├── data | -> Contains data files for the onboarding exercise
│ ├── customers.csv | -> Customer Dataset CSV file
│ ├── survey_results_public.csv | -> Stackoverflow Survey CSV file
│ └── survey_results_public.parquet | -> Stackoverflow Survey Parquet file
│
    └── scripts | -> Contains all the glue scripts for the exercises
├── a_stackoverflow_survey | -> A sample glue script to read, apply mappings, transform data
│ └── main.py
├── b_fix_this_script | -> A broken glue script for you to fix
│ ├── README.md
│ └── main.py
└── c_top_spotify_tracks | -> A task for you to complete. Best of luck!
        └── README.md
```
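Before diving into the individual folders, here is a condensed sketch of the DataFrame flow that examples 01 through 06 walk through: read a CSV into a DataFrame, filter it, transform and remap columns, and write the result to Parquet. The column names (`country`, `signup_date`) are illustrative assumptions, not the actual schema of the bundled datasets.

```python
# A condensed walk through examples 01-06. Column names are
# illustrative assumptions, not the schema of the bundled datasets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("onboarding-sketch").getOrCreate()

# 01: create a DataFrame by reading data from a CSV source
df = spark.read.csv("src/data/customers.csv", header=True, inferSchema=True)

# 02: apply a filter on the DataFrame
df = df.filter(F.col("country") == "India")

# 03/04: transform and remap (normalise) columns
df = (
    df.withColumn("signup_year", F.year(F.col("signup_date")))
      .withColumnRenamed("country", "customer_country")
)

# 06: write the DataFrame to a Parquet target
df.write.mode("overwrite").parquet("output/customers.parquet")

spark.stop()
```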
---
## Setup
**Step 1:** Clone this repository and install required packages
```bash
$ make install
```

**Step 2:** Clone AWS Glue Python Lib
AWS Glue libraries are not available via pip, so they need to be installed manually.
```bash
# Clone the master branch for Glue 4.0
$ git clone https://github.com/awslabs/aws-glue-libs.git
$ export AWS_GLUE_HOME=$(pwd)/aws-glue-libs
```

**Step 3:** Install Apache Maven
```bash
$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz
$ tar -xvf apache-maven-3.6.0-bin.tar.gz
$ ln -s apache-maven-3.6.0-bin maven
$ export MAVEN_HOME=$(pwd)/maven
```

**Step 4:** Install Apache Spark
```bash
$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-4.0/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz -o spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
$ tar -xvf spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
$ ln -s spark-3.3.0-amzn-1-bin-3.3.3-amzn-0 spark
$ export SPARK_HOME=$(pwd)/spark
```

**Step 5:** Export Paths
```bash
$ export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
```

Verify the installation by running:
`mvn --version`
`pyspark --version`
**Step 6:** Download Glue ETL .jar files
```bash
$ cd $AWS_GLUE_HOME
$ mvn install dependency:copy-dependencies
$ cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/
$ cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/
```

**After this step you should be able to execute `gluepyspark`, `gluepytest`, and `gluesparksubmit` from your shell.**
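With the local setup working, a typical Glue job script follows the shape below. This is a minimal sketch of the pattern that examples 08 through 10 walk through: create a DynamicFrame from the Data Catalog, apply mappings, and write the result to S3. The database, table, column, and bucket names are placeholders, not values from this repository.

```python
# Minimal sketch of a Glue job (examples 08-10). Database, table,
# column, and bucket names below are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 08: create a DynamicFrame from a Data Catalog table
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# 09: apply mappings (rename columns, cast data types)
dyf = ApplyMapping.apply(
    frame=dyf,
    mappings=[("old_name", "string", "new_name", "string")],
)

# 10: write the DynamicFrame to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()
```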
#### References:
- [Run Glue Jobs Locally | AWS Docs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html)
- [Setup AWS Glue locally with PySpark](https://medium.com/@divs.sheth/setup-aws-glue-locally-using-pycharm-ce-visual-studio-code-d948e5cf1b59)

#### Frequent Errors:
- `tools.jar` error: see this [YouTube walkthrough](https://www.youtube.com/watch?v=W8gsavSbOcw&ab_channel=JustAnotherDangHowToChannel)

## Run Locally
```
$ gluesparksubmit src/scripts/main.py
```

---
## Run Tests
**To run all test suites run:**
```bash
$ make test
```

**To generate an HTML coverage report run:**
```bash
$ python3 -m coverage html
```
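For orientation, a test suite driven by `make test` typically exercises transformations against a local SparkSession. The sketch below assumes pytest; `filter_adults` is a hypothetical helper used only to illustrate the pattern, not a function from this repository.

```python
# A minimal sketch of a PySpark test, assuming pytest and a local
# SparkSession. filter_adults is a hypothetical helper for illustration.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    session = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    yield session
    session.stop()


def filter_adults(df):
    # Hypothetical transformation under test: keep rows where age >= 18.
    return df.filter(df.age >= 18)


def test_filter_adults(spark):
    df = spark.createDataFrame([(17,), (21,)], ["age"])
    result = filter_adults(df).collect()
    assert [row.age for row in result] == [21]
```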