https://github.com/wednesday-solutions/aws-glue-jupyter-notebook-starter
A starter repository for your next AWS Glue project. This comes with complete IaC, a CD pipeline and a reusable common SDK. Set up jupyter notebook for AWS Glue locally
- Host: GitHub
- URL: https://github.com/wednesday-solutions/aws-glue-jupyter-notebook-starter
- Owner: wednesday-solutions
- Created: 2023-08-08T10:14:14.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-06T10:12:30.000Z (about 2 years ago)
- Last Synced: 2025-04-13T21:13:11.463Z (6 months ago)
- Topics: aws, aws-glue, data-engineering, de, etl, glue, jupyter, jupyter-notbook
- Language: Jupyter Notebook
- Homepage: https://www.wednesday.is/?utm_source=github&utm_medium=aws-glue-jupyter-notebook-starter
- Size: 43 KB
- Stars: 6
- Watchers: 4
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# AWS Glue Jupyter Notebook Starter
## Philosophy
This project is an AWS Glue starter project. It simulates the Glue environment so that you can test your scripts locally.
It also comes with out-of-the-box IaC (Infrastructure as Code), so you can create the entire AWS Glue stack with a single command.
It comes with:
- An S3 bucket
- 2 Glue jobs
- A source crawler
- A target crawler
- Integration with Athena
- Required IAM roles and policies

## Folder Structure
```
├── Dockerfile |-> Custom Dockerfile to run Glue locally with support for .env
├── Makefile |-> Makefile that contains all automation
├── README.md
│
├── config |-> Contains configurable properties
│   └── properties.yml |-> make infra auto-populates bucket_name, which the CD script then reads
├── assets |-> This folder will contain all IaC related assets
│   ├── glue-template.yaml.j2 |-> Jinja2 template that creates the various assets needed for Glue and prefixes the stack name
│   └── output.yaml |-> Output CloudFormation YAML after replacing variables
│
├── data |-> Contains all data assets required for local execution
│   └── raw |-> Contains all of the raw data files
│       └── sample.csv |-> Raw sample CSV file
│
├── landing |-> Contains all data post transformation
│   ├── job1 |-> Contains landing data related to job1
│   │   └── output |-> Contains all of the parts associated with the output
│   └── job2 |-> Contains landing data related to job2
│       └── output |-> Contains all of the parts associated with the output
│
├── scripts |-> Contains all of the executables for this project
│   ├── convert-notebooks-to-scripts.sh |-> Recursively iterates job folders and converts notebooks to scripts
│   ├── create-glue-job.sh |-> Creates resources on AWS, copies the appropriate files to the S3 buckets and triggers the crawlers
│   ├── env-to-args.sh |-> Recursively iterates job folders and converts env files to job parameters
│   ├── run.sh |-> Runs the notebook locally
│   ├── tear-down-glue.sh |-> Brings down all created AWS resources
│   └── update-glue-job.sh |-> Updates the Glue job along with other resources via the CD pipeline
│
└── src |-> Contains all Glue job files and folders
    └── jobs |-> Contains all Glue jobs in individual folders
        ├── job1 |-> Contains all data for job1 including env, notebook and the script
        │   ├── notebook.ipynb |-> Notebook for job1
        │   └── script.py |-> Script generated from the notebook
        └── ...
            ├── ...
            └── ...
```
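
To make the `notebook.ipynb` to `script.py` relationship in the tree above concrete, here is a minimal sketch of what such a conversion could look like. It is not the contents of `convert-notebooks-to-scripts.sh`, which may well rely on a tool such as `jupyter nbconvert`. The function names `notebook_to_source` and `convert_jobs` are hypothetical, and the sketch assumes nbformat 4 notebooks, where code lives in a top-level `cells` list.

```python
import json
from pathlib import Path


def notebook_to_source(notebook: dict) -> str:
    """Join the code cells of an nbformat-4 notebook into one script body."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines
        text = source if isinstance(source, str) else "".join(source)
        if text.strip():
            chunks.append(text.rstrip())
    return "\n\n".join(chunks) + "\n"


def convert_jobs(jobs_dir: Path) -> list[Path]:
    """Write script.py next to every notebook.ipynb found under jobs_dir.

    The file names follow the folder structure shown above; the function
    itself is a hypothetical illustration, not the repository's script.
    """
    written = []
    for nb_path in sorted(jobs_dir.rglob("notebook.ipynb")):
        notebook = json.loads(nb_path.read_text(encoding="utf-8"))
        script_path = nb_path.with_name("script.py")
        script_path.write_text(notebook_to_source(notebook), encoding="utf-8")
        written.append(script_path)
    return written
```

Calling `convert_jobs(Path("src/jobs"))` from the repository root would regenerate every job's `script.py`. The sketch passes IPython magics (lines starting with `%`) through unchanged, so a notebook that uses them would need extra handling before the output is valid Python.
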
## Getting started

The Makefile contains the following commands:
- infra: IaC for setting things up on AWS. Try it out using
```
make infra name=aws-glue-jupyter-notebook-starter region=ap-south-1
```
- local: fire up the Docker container and get your Glue environment running on http://localhost:8888/lab
```
make local
```
- env-to-args: automation to convert environment variables into DefaultArguments for the respective jobs
```
make env-to-args
```
- notebooks-to-scripts: automation to convert your notebooks to scripts recursively across jobs
```
make notebooks-to-scripts
```
- update-infra: automation to update infra and scripts in the CD pipeline
```
run: make update-infra name=aws-glue-jupyter-notebook-starter region=ap-south-1
```
- teardown-infra: automation to delete the contents of S3 and tear down the created infra
```
make teardown-infra
```
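
The `env-to-args` step above turns environment variables into Glue `DefaultArguments`. As a rough illustration of that idea (not the contents of `env-to-args.sh`), the sketch below parses simple `KEY=VALUE` lines and prefixes each key with `--`, which is the form Glue job arguments take. The function names, the `.env` file name inside each job folder, and the sample keys are assumptions made for this example.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip("\"'")
    return values


def to_default_arguments(env: dict[str, str]) -> dict[str, str]:
    """Prefix each key with '--', as Glue expects for job arguments."""
    return {f"--{key}": value for key, value in env.items()}


def arguments_per_job(jobs_dir: Path) -> dict[str, dict[str, str]]:
    """Map each job folder name to the arguments derived from its .env file.

    Assumes (hypothetically) one `.env` file per job folder under jobs_dir.
    """
    result = {}
    for env_path in sorted(jobs_dir.rglob(".env")):
        env = parse_env(env_path.read_text(encoding="utf-8"))
        result[env_path.parent.name] = to_default_arguments(env)
    return result
```

For example, a hypothetical `.env` containing `SOURCE_PATH=s3://example-bucket/raw` would yield `{"--SOURCE_PATH": "s3://example-bucket/raw"}`, which a Glue script could then read with `getResolvedOptions(sys.argv, ["SOURCE_PATH"])`.
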