Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/moritzkoerber/covid-19-data-engineering-pipeline
A Covid-19 data pipeline on AWS featuring PySpark/Glue, Docker, Great Expectations, Airflow, and Redshift, templated in CloudFormation and CDK, deployable via GitHub Actions.
- Host: GitHub
- URL: https://github.com/moritzkoerber/covid-19-data-engineering-pipeline
- Owner: moritzkoerber
- License: mit
- Created: 2021-11-14T16:01:07.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-11-21T21:39:57.000Z (about 1 year ago)
- Last Synced: 2024-04-19T02:06:52.428Z (9 months ago)
- Topics: apache-airflow, apache-spark, api, aws, aws-cdk, aws-cloudformation, aws-ecr, aws-glue, aws-lambda, aws-redshift, aws-s3, docker, great-expectations, pyspark, spark
- Language: Python
- Size: 1.31 MB
- Stars: 22
- Watchers: 3
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
This repo is my playground for trying out various data engineering tools and techniques. The services, tools, and design choices used are not always the best fit and are sometimes unnecessarily cumbersome; this simply reflects me exploring different things. At the moment, the pipeline processes Covid-19 data as follows:
![aws](https://user-images.githubusercontent.com/25953031/222958382-52ccbfe7-b8aa-4fe5-87f2-9767a1fa031f.png)
All infrastructure is templated in AWS CloudFormation or AWS CDK. Every step has an alarm that fires on failure. The stack can be deployed via GitHub Actions. I use Poetry to manage the dependencies and the virtual environment.
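To illustrate what "an alarm on failure" could look like in a CloudFormation template, here is a minimal sketch that builds one `AWS::CloudWatch::Alarm` resource per step using only the Python standard library. It is not taken from this repo's templates. It assumes each step is a Lambda function (Glue or Airflow steps would need different metrics), and the function names, logical IDs, and SNS topic ARN are hypothetical placeholders.

```python
import json


def failure_alarm(function_name: str, topic_arn: str) -> dict:
    """Return a CloudFormation alarm resource that fires on any Lambda error.

    Assumption: the step is a Lambda function, so the built-in
    AWS/Lambda "Errors" metric is used as the failure signal.
    """
    return {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "AlarmDescription": f"{function_name} reported at least one error",
            "Namespace": "AWS/Lambda",
            "MetricName": "Errors",
            "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            "Statistic": "Sum",
            "Period": 300,  # seconds per evaluation window
            "EvaluationPeriods": 1,
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            # No invocations means no data points; do not treat that as a failure.
            "TreatMissingData": "notBreaching",
            "AlarmActions": [topic_arn],
        },
    }


def build_template(steps: dict[str, str], topic_arn: str) -> dict:
    """Build a template with one alarm per step (logical ID -> function name)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            f"{logical_id}FailureAlarm": failure_alarm(name, topic_arn)
            for logical_id, name in steps.items()
        },
    }


if __name__ == "__main__":
    # Hypothetical step name and SNS topic, for illustration only.
    template = build_template(
        {"Ingest": "covid-ingest"},
        "arn:aws:sns:eu-central-1:123456789012:pipeline-alerts",
    )
    print(json.dumps(template, indent=2))
```

Generating the alarm from a single helper keeps the threshold and notification target consistent across steps; the printed JSON can be inspected before being merged into a real template.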