{"id":25615195,"url":"https://github.com/wednesday-solutions/data-engineering-onboarding-starter","last_synced_at":"2025-04-13T21:13:10.933Z","repository":{"id":195983015,"uuid":"670637193","full_name":"wednesday-solutions/Data-Engineering-Onboarding-Starter","owner":"wednesday-solutions","description":"This repository contains a 10 step program to enter the world of Data Engineering","archived":false,"fork":false,"pushed_at":"2024-07-10T07:06:28.000Z","size":6729,"stargazers_count":14,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-27T11:38:34.135Z","etag":null,"topics":["aws","aws-glue","data","data-engg-learning","data-engineering","data-engineering-starter","data-template","dataengg","dataengg-template","etl","glue","spark","workflow"],"latest_commit_sha":null,"homepage":"https://wednesday.is/building-products/?utm_source=github\u0026utm_medium=dataengg_template","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wednesday-solutions.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-25T13:48:36.000Z","updated_at":"2024-12-03T21:13:02.000Z","dependencies_parsed_at":"2024-07-10T09:07:59.314Z","dependency_job_id":null,"html_url":"https://github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter","commit_stats":null,"previous_names":["wednesday-solutions/data-engineering-onboarding-starter"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wednesday-solutions%2F
Data-Engineering-Onboarding-Starter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wednesday-solutions%2FData-Engineering-Onboarding-Starter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wednesday-solutions%2FData-Engineering-Onboarding-Starter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wednesday-solutions%2FData-Engineering-Onboarding-Starter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wednesday-solutions","download_url":"https://codeload.github.com/wednesday-solutions/Data-Engineering-Onboarding-Starter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248782260,"owners_count":21160717,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-glue","data","data-engg-learning","data-engineering","data-engineering-starter","data-template","dataengg","dataengg-template","etl","glue","spark","workflow"],"created_at":"2025-02-22T03:18:51.539Z","updated_at":"2025-04-13T21:13:10.908Z","avatar_url":"https://github.com/wednesday-solutions.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg align=\"left\" src=\"https://github-production-user-asset-6210df.s3.amazonaws.com/105773536/269245524-c4fefc57-ebfe-4f1b-87ba-e4e4fc2bc745.png\" width=\"480\" height=\"540\" /\u003e\n\n\u003cdiv\u003e\n  \u003ca href=\"https://www.wednesday.is?utm_source=gthb\u0026utm_medium=repo\u0026utm_campaign=data-engineering-onboarding\" align=\"left\" 
style=\"margin-left: 0;\"\u003e\n    \u003cimg src=\"https://uploads-ssl.webflow.com/5ee36ce1473112550f1e1739/5f5879492fafecdb3e5b0e75_wednesday_logo.svg\"\u003e\n  \u003c/a\u003e\n  \u003cp\u003e\n    \u003ch1 align=\"left\"\u003eData Engineering Onboarding Starter\n    \u003c/h1\u003e\n  \u003c/p\u003e\n\n  \u003cp\u003e\nAn immersive data engineering journey awaits you in this comprehensive starter kit, featuring a curated list of resources, tools, and best practices to help you get started with data engineering. This starter kit is designed to help you learn the basics of data engineering and get you up and running with your first data engineering project.\n  \u003c/p\u003e\n\n---\n\n  \u003cp\u003e\n    \u003ch4\u003e\n      Expert teams of digital product strategists, developers, and designers.\n    \u003c/h4\u003e\n  \u003c/p\u003e\n\n  \u003cdiv\u003e\n    \u003ca href=\"https://www.wednesday.is/contact-us?utm_source=gthb\u0026utm_medium=repo\u0026utm_campaign=data-engineering-onboarding\" target=\"_blank\"\u003e\n      \u003cimg src=\"https://uploads-ssl.webflow.com/5ee36ce1473112550f1e1739/5f6ae88b9005f9ed382fb2a5_button_get_in_touch.svg\" width=\"121\" height=\"34\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/wednesday-solutions/\" target=\"_blank\"\u003e\n      \u003cimg src=\"https://uploads-ssl.webflow.com/5ee36ce1473112550f1e1739/5f6ae88bb1958c3253756c39_button_follow_on_github.svg\" width=\"168\" height=\"34\"\u003e\n    \u003c/a\u003e\n  \u003c/div\u003e\n\n---\n\n[![Data Engineering - Deploy to AWS Glue](https://github.com/wednesday-solutions/data-engg/actions/workflows/cd.yml/badge.svg)](https://github.com/wednesday-solutions/data-engg/actions/workflows/cd.yml) [![Data Engineering CI](https://github.com/wednesday-solutions/data-engg/actions/workflows/ci.yml/badge.svg)](https://github.com/wednesday-solutions/data-engg/actions/workflows/ci.yml)\n\n---\n\n## Prerequisites\n\n1. 
[Python3 with PIP](https://www.python.org/downloads/)\n2. [Install Java 8](https://www.oracle.com/in/java/technologies/downloads/#java8-mac)\n3. [AWS CLI configured locally](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)\n4. [Docker](https://docs.docker.com/desktop/install/mac-install/) (Optional)\n\n## Folder Structure\n\n```\n├── Makefile                                   | -\u003e Allows you to run commands for setup, test, lint, etc\n├── README.md                                  | -\u003e Documentation for the project setup and usage\n│\n├── automation                                 | -\u003e Contains scripts to automate deployment and testing\n│   └── deploy_glue_job.sh                     | -\u003e Script to deploy or update glue job\n│\n├── examples                                   | -\u003e Contains example scripts to demonstrate pyspark features\n│   ├── 01_pyspark_dataframe                   | -\u003e Create a DataFrame by reading data from a source (CSV, Parquet Database, etc)\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to read csv file and write to parquet\n│   ├── 02_applying_filters                    | -\u003e Apply filters on a dataframe\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to apply filters on dataframe\n│   ├── 03_transform_columns                   | -\u003e Transform columns \u0026 manipulate data in a dataframe\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to transform columns\n│   ├── 04_remap_columns                       | -\u003e Normalise columns in a dataframe\n│   │   ├── README.md                        
     | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to normalise columns in a dataframe\n│   ├── 05_complex_transformations             | -\u003e Perform complex transformations on a dataframe\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to perform some complex transformations\n│   ├── 06_write_dataframe                     | -\u003e Write a dataframe to a target\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to write dataframe to parquet or RDBMS Database\n│   ├── 07_pyspark_in_glue_jobs                | -\u003e Examples of using PySpark in AWS Glue Jobs\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to run pyspark script in glue job\n│   ├── 08_glue_dynamic_frame                  | -\u003e Create a DynamicFrame by reading data from a data catalog\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to create a dynamic frame from a data catalog\n│   ├── 09_apply_mappings                      | -\u003e Apply mappings on a dynamic frame (change column names, data types, etc)\n│   │   ├── README.md                             | -\u003e Contains instructions to run the example\n│   │   └── main.py                               | -\u003e Example script to apply mappings on dynamic frame\n│   └── 10_write_to_target                     | -\u003e Write a dynamic frame to a target (CSV, Parquet, Database, etc)\n│       ├── README.md                             | -\u003e Contains 
instructions to run the example\n│       └── main.py                               | -\u003e Example script to write dynamic frame to parquet and store in S3\n│\n└── src                                        | -\u003e Contains all the source code for the onboarding exercise\n    ├── data                                   | -\u003e Contains data files for the onboarding exercise\n    │   ├── customers.csv                         | -\u003e Customer Dataset CSV file\n    │   ├── survey_results_public.csv             | -\u003e Stackoverflow Survey CSV file\n    │   └── survey_results_public.parquet         | -\u003e Stackoverflow Survey Parquet file\n    │\n    └── scripts                                | -\u003e Contains all the Glue scripts for the exercise\n        ├── a_stackoverflow_survey                | -\u003e A sample glue script to read, apply mappings, transform data\n        │   └── main.py\n        ├── b_fix_this_script                     | -\u003e A broken glue script for you to fix\n        │   ├── README.md\n        │   └── main.py\n        └── c_top_spotify_tracks                  | -\u003e A task for you to complete. Best of luck!\n            └── README.md\n\n```\n\n---\n\n## Setup\n\n**Step 1:** Clone this repository and install required packages\n\n```bash\n$ make install\n```\n\n**Step 2:** Clone the AWS Glue Python library\n\nAWS Glue libraries are not available via pip. 
Hence, we need to install them manually.\n\n```bash\n# Clone the master branch for Glue 4.0\n$ git clone https://github.com/awslabs/aws-glue-libs.git\n\n$ export AWS_GLUE_HOME=$(pwd)/aws-glue-libs\n```\n\n**Step 3:** Install Apache Maven\n\n```bash\n$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz\n\n$ tar -xvf apache-maven-3.6.0-bin.tar.gz\n\n$ ln -s apache-maven-3.6.0-bin maven\n\n$ export MAVEN_HOME=$(pwd)/maven\n```\n\n**Step 4:** Install Apache Spark\n\n```bash\n$ curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-4.0/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz -o spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz\n\n$ tar -xvf spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz\n\n$ ln -s spark-3.3.0-amzn-1-bin-3.3.3-amzn-0 spark\n\n$ export SPARK_HOME=$(pwd)/spark\n```\n\n**Step 5:** Export Paths\n\n```bash\n$ export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin\n```\n\nVerify the installation by running:\n\n`mvn --version`\n\n`pyspark --version`\n\n**Step 6:** Download Glue ETL .jar files\n\n```bash\n$ cd $AWS_GLUE_HOME\n\n$ mvn install dependency:copy-dependencies\n\n$ cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/\n\n$ cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/\n```\n\n**After this step you should be able to execute**\n**`gluepyspark`, `gluepytest`, `gluesparksubmit`**\n**from shell**\n\n#### References:\n\n- [Run Glue Jobs Locally | AWS Docs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html)\n- [Setup AWS glue locally with PySpark](https://medium.com/@divs.sheth/setup-aws-glue-locally-using-pycharm-ce-visual-studio-code-d948e5cf1b59)\n\n#### Frequent Errors:\n\n`tools.jar` error\nSolution: [YouTube](https://www.youtube.com/watch?v=W8gsavSbOcw\u0026ab_channel=JustAnotherDangHowToChannel)\n\n## Run Locally Using\n\n```\n$ gluesparksubmit src/scripts/main.py\n```\n\n---\n\n## Run Tests\n\n**To run all test suites, run:**\n\n```bash\n$ 
make test\n```\n\n**To generate an HTML coverage report, run:**\n\n```bash\n$ python3 -m coverage html\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwednesday-solutions%2Fdata-engineering-onboarding-starter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwednesday-solutions%2Fdata-engineering-onboarding-starter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwednesday-solutions%2Fdata-engineering-onboarding-starter/lists"}