Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Data Engineering Projects: Using AWS, Azure, Google Cloud Platform, Snowflake to set up real-time data pipelines
https://github.com/kevingastelum/mydataengineering
- Host: GitHub
- URL: https://github.com/kevingastelum/mydataengineering
- Owner: KevinGastelum
- Created: 2024-01-04T23:07:43.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-04-15T00:00:34.000Z (7 months ago)
- Last Synced: 2024-04-18T03:00:06.262Z (7 months ago)
- Language: Python
- Homepage:
- Size: 1.73 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# [Azure End to End Data Engineering Project](https://github.com/KevinGastelum/MyDataEngineering/tree/main/02._Azure_DataEngineeringProjects)
Created and deployed this Azure workflow to perform data extraction, cleaning, and visualization. Docker is used to containerize the entire pipeline (server, database, Python code) so it can be deployed locally or in the cloud.
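The README doesn't reproduce the DAG itself, but a minimal sketch of the kind of Airflow DAG that could drive this extract-and-clean workflow is shown below. The DAG id, task names, and Wikipedia URL are illustrative assumptions, not the repository's actual code.

```python
# Minimal Airflow 2.x DAG sketch: extract a Wikipedia page, then clean it.
# The dag_id, URL, and task logic are illustrative placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

WIKI_URL = "https://en.wikipedia.org/wiki/Data_engineering"  # placeholder source page


def extract() -> str:
    """Download the raw page HTML; the return value is pushed to XCom."""
    response = requests.get(WIKI_URL, timeout=30)
    response.raise_for_status()
    return response.text


def clean(ti) -> int:
    """Pull the HTML from the extract task and run cleaning logic on it."""
    raw_html = ti.xcom_pull(task_ids="extract")
    # Real cleaning would parse tables (e.g. pandas.read_html) and load Postgres.
    return len(raw_html)


with DAG(
    dag_id="wikipedia_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    extract_task >> clean_task
```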
## Technologies Used
- **Data Extraction**: Wikipedia Website
- **Workflow Automation**: Apache Airflow
- **Database Management**: PostgreSQL
- **Cloud Storage**: Azure Blob (see the upload sketch after this list)
- **Data Transformation**: Azure Data Factory
- **Query Service**: Azure Synapse
- **Data Warehousing**: Azure Databricks
- **Data Visualization**: Power BI
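For the Cloud Storage step, a hedged sketch of pushing a cleaned file into Azure Blob Storage with the official azure-storage-blob SDK might look like the following; the environment variable, container, and blob names are placeholders I've assumed, not the repository's actual configuration.

```python
# Sketch: upload a cleaned CSV to Azure Blob Storage with azure-storage-blob.
# Connection string, container, and blob names are assumed placeholders.
import os

from azure.storage.blob import BlobServiceClient


def upload_to_blob(local_path: str, container: str, blob_name: str) -> None:
    # In a real deployment the connection string would come from Key Vault
    # or pipeline configuration rather than a plain environment variable.
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob = service.get_blob_client(container=container, blob=blob_name)
    with open(local_path, "rb") as data:
        blob.upload_blob(data, overwrite=True)


upload_to_blob("cleaned_data.csv", "raw-data", "wikipedia/cleaned_data.csv")
```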
# [AWS End to End Data Engineering Project](https://github.com/KevinGastelum/MyDataEngineering/tree/main/01._AWS_DataEngineeringProject#aws-end-to-end-data-engineering-project)

## Reddit Real-Time Data Extraction --> Data Warehousing --> Data Modeling --> Data Pipeline
The purpose of this pipeline is to automate fetching/scraping data from Reddit posts. We use the Reddit API, Apache Airflow to trigger tasks that run once a day, Docker to run everything in a containerized local environment, and a PostgreSQL database to store the fetched data. After setting everything up locally, we want this pipeline running on cloud infrastructure, which provides additional security, storage, and processing capacity. I'll set up the pipeline on AWS to fully automate fetching, cleaning, and storing live data using AWS S3, AWS Lambda, AWS Glue, AWS Athena, and Amazon Redshift.
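The repository's actual task code isn't reproduced here, but a minimal sketch of the daily fetch-and-store step, assuming PRAW for the Reddit API and psycopg2 for Postgres, could look like this; credentials, the subreddit, and the table name are placeholders.

```python
# Sketch of the daily fetch task: pull top posts via PRAW, store in Postgres.
# Environment variables, subreddit, and table name are assumed placeholders.
import os

import praw
import psycopg2


def fetch_and_store(subreddit_name: str = "dataengineering", limit: int = 100) -> None:
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-pipeline/0.1",
    )
    # Collect the day's top posts as tuples ready for insertion.
    posts = [
        (p.id, p.title, p.score, p.num_comments, p.created_utc)
        for p in reddit.subreddit(subreddit_name).top(time_filter="day", limit=limit)
    ]
    conn = psycopg2.connect(os.environ["POSTGRES_DSN"])
    with conn, conn.cursor() as cur:
        cur.executemany(
            """INSERT INTO reddit_posts (id, title, score, num_comments, created_utc)
               VALUES (%s, %s, %s, %s, %s)
               ON CONFLICT (id) DO NOTHING""",
            posts,
        )
    conn.close()
```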
## Technologies Used
- **Data Extraction**: Reddit API
- **Workflow Automation**: Apache Airflow, Celery
- **Database Management**: PostgreSQL
- **Cloud Storage**: Amazon S3
- **Data Transformation**: AWS Glue, Lambda
- **Query Service**: Amazon Athena (see the boto3 sketch after this list)
- **Data Warehousing**: Amazon Redshift
- **Data Visualization**: Amazon QuickSight
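As a concrete example of the Query Service step, here is a hedged sketch of running an Athena query over the transformed data with boto3; the region, database, table, and results bucket are illustrative assumptions.

```python
# Sketch: run an Athena query with boto3 and poll until it completes.
# Region, database, table, and output bucket are assumed placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")


def run_query(sql: str, database: str, output_s3: str) -> str:
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    qid = start["QueryExecutionId"]
    # Simple polling loop; production code would add backoff and timeouts.
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)


if __name__ == "__main__":
    print(run_query(
        "SELECT title, score FROM reddit_posts ORDER BY score DESC LIMIT 10",
        database="reddit_db",
        output_s3="s3://my-athena-results/",
    ))
```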
## Objective

This project showcases my ability to integrate various technologies into a robust and scalable data pipeline, and demonstrates my expertise in handling big data and delivering efficient, reliable data solutions.
## [Walkthrough HERE](https://github.com/KevinGastelum/MyDataEngineering/tree/main/01._AWS_DataEngineeringProject#aws-end-to-end-data-engineering-project)