Personal Data Engineering Projects
https://github.com/alanchn31/data-engineering-projects
- Host: GitHub
- URL: https://github.com/alanchn31/data-engineering-projects
- Owner: alanchn31
- Created: 2020-04-20T10:47:33.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2023-02-08T00:44:31.000Z (almost 2 years ago)
- Last Synced: 2024-10-19T23:27:25.629Z (24 days ago)
- Topics: airflow, aws-redshift, cassandra, data-engineering, data-engineering-nanodegree, data-lake, data-modeling, data-warehouse, ingest-data, mongodb, postgres, scrapy, spark, star-schema
- Language: Jupyter Notebook
- Homepage:
- Size: 2.92 MB
- Stars: 843
- Watchers: 9
- Forks: 184
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
## Description
---
* This repo contains projects that apply data engineering principles.
* Notes taken during the course can be found in the folder `0. Back to Basics`

## Projects
---
1. Postgres ETL :heavy_check_mark:
* This project looks at data modelling for a fictitious music startup, Sparkify, applying a star schema to ingest data and simplify queries that answer business questions the product owner may have.
2. Cassandra ETL :heavy_check_mark:
* In the realm of big data, Cassandra helps ingest large amounts of data in a NoSQL context. This project adopts a query-centric approach to ingesting data into Cassandra tables, answering business questions about a music app.
3. Web Scraping using Scrapy, MongoDB ETL :heavy_check_mark:
* One way to store semi-structured data is as documents. MongoDB makes this possible, with related documents grouped in a collection. Each document contains fields of data which can be queried.
* In this project, data is scraped from a book-listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored as a document in the books collection in MongoDB.
4. Data Warehousing with AWS Redshift :heavy_check_mark:
* This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer business questions based on requirements.
5. Data Lake with Spark & AWS S3 :heavy_check_mark:
* This project creates a data lake in AWS S3 using Spark.
* Why create a data lake? A data lake provides a reliable store for large amounts of data, from unstructured to semi-structured and even structured data. In this project, we ingest JSON files, denormalize them into fact and dimension tables, and upload them to an AWS S3 data lake as Parquet files.
6. Data Pipelining with Airflow :heavy_check_mark:
* This project schedules data pipelines that perform ETL from JSON files in S3 to Redshift using Airflow.
* Why use Airflow? Airflow allows workflows to be defined as code, making them more maintainable, versionable, testable, and collaborative.
7. Capstone Project :heavy_check_mark:
* This project is the finale of Udacity's Data Engineering Nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.
* My project builds a movies data warehouse, which can be used to build a movie recommendation system as well as to predict box-office earnings. View the project here: [Movies Data Warehouse](https://github.com/alanchn31/Udacity-Data-Engineering-Capstone)
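The star-schema modeling behind the Postgres and Redshift projects can be sketched as a fact table referencing dimension tables. The sketch below uses SQLite in place of Postgres for a self-contained example; the table and column names are illustrative, not the projects' actual schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per user (illustrative columns).
cur.execute("CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, name TEXT)")

# Fact table: one row per songplay event, referencing the dimension.
cur.execute("""CREATE TABLE fact_songplays (
    songplay_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES dim_users(user_id),
    song TEXT)""")

cur.execute("INSERT INTO dim_users VALUES (1, 'Alice')")
cur.executemany("INSERT INTO fact_songplays VALUES (?, ?, ?)",
                [(1, 1, 'Song A'), (2, 1, 'Song B')])

# A business question ("how many plays per user?") becomes a simple join
# between the fact table and a dimension table.
rows = cur.execute("""
    SELECT u.name, COUNT(*) AS plays
    FROM fact_songplays f JOIN dim_users u ON f.user_id = u.user_id
    GROUP BY u.name""").fetchall()
print(rows)  # [('Alice', 2)]
```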
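The MongoDB project stores each scraped book as a document whose fields can be queried. A minimal sketch of that query style, using plain Python dicts rather than pymongo, with made-up field values:

```python
# A collection holds documents (modeled here as plain dicts);
# the scraped fields below are illustrative, not the project's exact schema.
books = [
    {"title": "Book A", "price": 51.77, "rating": 3, "available": True},
    {"title": "Book B", "price": 20.66, "rating": 5, "available": False},
    {"title": "Book C", "price": 33.34, "rating": 5, "available": True},
]

def find(collection, **criteria):
    """Mimic a simple MongoDB-style field-equality query."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# "Which 5-star books are in stock?" — a query over document fields.
in_stock_top_rated = find(books, rating=5, available=True)
print([d["title"] for d in in_stock_top_rated])  # ['Book C']
```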
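The data-lake project's core step is denormalizing raw JSON events into fact and dimension tables before writing them out as Parquet. The stand-in below does that split in pure Python (no Spark or S3, so it stays self-contained); the event fields are assumptions, not the project's real schema.

```python
import json

# Raw event lines, as they might arrive in a JSON log file (illustrative fields).
raw = [
    '{"user_id": 1, "user_name": "Alice", "song": "Song A", "ts": 100}',
    '{"user_id": 1, "user_name": "Alice", "song": "Song B", "ts": 200}',
]
events = [json.loads(line) for line in raw]

# Dimension: deduplicated users, keyed by user_id.
dim_users = {e["user_id"]: {"user_id": e["user_id"], "name": e["user_name"]}
             for e in events}

# Fact: one row per play, keeping only keys and measures.
fact_plays = [{"ts": e["ts"], "user_id": e["user_id"], "song": e["song"]}
              for e in events]

print(len(dim_users), len(fact_plays))  # 1 2
```

In the project itself this split is done on Spark DataFrames, and the resulting tables are written to S3 as Parquet files.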
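The "workflows as code" idea behind the Airflow project — tasks plus dependencies, executed in dependency order — can be illustrated with a tiny stand-in. This is not Airflow's API, just the underlying concept, using the standard library's topological sort:

```python
# A minimal sketch of "workflows as code": named tasks with declared
# dependencies, run in topological order (what an Airflow DAG expresses).
from graphlib import TopologicalSorter

ran = []
tasks = {
    "extract": lambda: ran.append("extract"),
    "transform": lambda: ran.append("transform"),
    "load": lambda: ran.append("load"),
}

# Each task maps to the set of tasks it depends on.
deps = {"transform": {"extract"}, "load": {"transform"}}

for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # ['extract', 'transform', 'load']
```

Because the workflow is ordinary code, it can be version-controlled, tested, and reviewed like any other module — the property the README highlights.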