https://github.com/moritzkoerber/covid-19-data-engineering-pipeline

A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
apache-airflow apache-spark api aws aws-cdk aws-cloudformation aws-ecr aws-glue aws-lambda aws-redshift aws-s3 docker great-expectations pyspark spark

This repo is my playground for trying out various data engineering tools and techniques. The services, tools, and design used are not always the best choice and are sometimes unnecessarily cumbersome; this just reflects me exploring different things. At the moment, the pipeline processes Covid-19 data as follows:
![aws](https://user-images.githubusercontent.com/25953031/222958382-52ccbfe7-b8aa-4fe5-87f2-9767a1fa031f.png)
All infrastructure is templated in AWS CloudFormation or AWS CDK. All steps feature an alarm on failure. The stack can be deployed via GitHub Actions. I use Poetry to manage dependencies and the virtual environment.
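
To give a rough idea of what "an alarm on failure" for each step can look like in a CloudFormation template, here is a minimal sketch in plain Python that generates one `AWS::CloudWatch::Alarm` resource per step. It is not taken from this repo: the helper names (`failure_alarm`, `build_template`), the step and function names, the SNS topic ARN, and the assumption that every step is a Lambda function watched through its `Errors` metric are all hypothetical and only serve as illustration.

```python
import json


def failure_alarm(step_name: str, function_name: str, topic_arn: str) -> dict:
    """Build a CloudFormation alarm resource for one (assumed) Lambda step."""
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"{step_name} failed",
            # Assumption: the step is a Lambda function; other services
            # (Glue, Airflow, ...) would need a different metric or event rule.
            "Namespace": "AWS/Lambda",
            "MetricName": "Errors",
            "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            "Statistic": "Sum",
            "Period": 300,  # seconds
            "EvaluationPeriods": 1,
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "TreatMissingData": "notBreaching",
            # Hypothetical SNS topic that receives the notification.
            "AlarmActions": [topic_arn],
        },
    }


def build_template(steps: dict, topic_arn: str) -> dict:
    """Map each step's logical ID to a failure alarm in one template."""
    resources = {
        f"{logical_id}FailureAlarm": failure_alarm(logical_id, fn_name, topic_arn)
        for logical_id, fn_name in steps.items()
    }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


if __name__ == "__main__":
    # Illustrative step names only; the real pipeline's resources may differ.
    template = build_template(
        {"Ingest": "covid-ingest", "Validate": "covid-validate"},
        "arn:aws:sns:eu-central-1:123456789012:pipeline-alarms",
    )
    print(json.dumps(template, indent=2))
```

Generating the alarms from a single mapping keeps the "every step has an alarm" rule in one place; the same idea could be expressed with CDK constructs instead of raw template dictionaries.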