https://github.com/san089/Udacity-Data-Engineering-Projects

A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.

airflow airflow-operators aws aws-ec2 aws-s3 aws-sdk cassandra cassandra-database cloudformation cluster data data-engineering data-engineering-pipeline data-lake data-modeling data-warehouse etl-pipeline infrastructure postgres postgresql-database


# Data Engineering Projects

![](https://github.com/san089/Udacity-Data-Engineering-Projects/blob/master/image.jpeg)

## Project 1: Data Modeling with Postgres
In this project, we apply data modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The data is currently collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.
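
As a rough illustration of the JSON-to-relational flow described above, here is a minimal sketch. It uses the standard library's `sqlite3` as a stand-in for Postgres so that it runs without a database server. The record fields and the `users` and `songplays` tables are assumptions made for this example, not necessarily the schema used in the project.

```python
import json
import sqlite3

# Hypothetical activity-log records; the field names are assumptions.
RAW_EVENTS = """
{"user_id": 1, "first_name": "Ada", "song": "Song A", "ts": 1541105830796}
{"user_id": 2, "first_name": "Lin", "song": "Song B", "ts": 1541106106796}
{"user_id": 1, "first_name": "Ada", "song": "Song B", "ts": 1541106352796}
"""


def extract(raw):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def load(conn, events):
    """Create one dimension table and one fact table, then insert the events."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "user_id INTEGER PRIMARY KEY, first_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songplays ("
        "songplay_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "user_id INTEGER REFERENCES users(user_id), "
        "song TEXT, start_time INTEGER)"
    )
    for e in events:
        # SQLite syntax; the Postgres counterpart is ON CONFLICT DO NOTHING.
        conn.execute(
            "INSERT OR IGNORE INTO users VALUES (?, ?)",
            (e["user_id"], e["first_name"]),
        )
        conn.execute(
            "INSERT INTO songplays (user_id, song, start_time) VALUES (?, ?, ?)",
            (e["user_id"], e["song"], e["ts"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, extract(RAW_EVENTS))

# The kind of question the analytics team might ask: which songs are played most?
top_songs = conn.execute(
    "SELECT song, COUNT(*) AS plays FROM songplays "
    "GROUP BY song ORDER BY plays DESC, song"
).fetchall()
print(top_songs)
```

With a Postgres driver such as `psycopg2`, the same pattern would apply, with `%s` placeholders in place of `?`.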

Link: [Data_Modeling_with_Postgres](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Modeling_with_Postgres)

## Project 2: Data Modeling with Cassandra
In this project, we apply data modeling with Cassandra and build an ETL pipeline using Python. We build the data model around the queries we want to answer.
For our use case, we want answers to the following:

- Get details of a song that was heard in the music app history during a particular session.
- Get songs played by a user during a particular session on the music app.
- Get all users from the music app history who listened to a particular song.
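
The query-first approach above can be sketched in plain Python, with one lookup structure per query, much as Cassandra modeling typically uses one table per query. This is an analogy rather than Cassandra driver code. The event fields, the structure names, and the primary keys mentioned in the comments are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical event rows; the column names are assumptions.
EVENTS = [
    {"session_id": 338, "item_in_session": 4, "user_id": 50,
     "user": "Ada L", "artist": "Artist X", "song": "Song A"},
    {"session_id": 182, "item_in_session": 0, "user_id": 10,
     "user": "Lin M", "artist": "Artist Y", "song": "Song B"},
    {"session_id": 182, "item_in_session": 1, "user_id": 10,
     "user": "Lin M", "artist": "Artist X", "song": "Song A"},
]

# Query 1: details of the song heard at a given position in a session.
# A CQL table for this might use PRIMARY KEY (session_id, item_in_session).
song_by_session_item = {
    (e["session_id"], e["item_in_session"]): (e["artist"], e["song"])
    for e in EVENTS
}

# Query 2: songs a user played during a session, in play order.
# A CQL table might use PRIMARY KEY ((user_id, session_id), item_in_session).
songs_by_user_session = defaultdict(list)
for e in sorted(EVENTS, key=lambda e: e["item_in_session"]):
    songs_by_user_session[(e["user_id"], e["session_id"])].append(e["song"])

# Query 3: all users who listened to a given song.
# A CQL table might use PRIMARY KEY (song, user_id).
users_by_song = defaultdict(set)
for e in EVENTS:
    users_by_song[e["song"]].add(e["user"])

print(song_by_session_item[(338, 4)])
print(songs_by_user_session[(10, 182)])
print(sorted(users_by_song["Song A"]))
```

The same events are stored three times, each time keyed for exactly one query. That duplication is the usual trade-off in this style of modeling: reads stay simple because no joins are needed.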

Link: [Data_Modeling_with_Apache_Cassandra](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Modeling_with_Apache_Cassandra)

## Project 3: Data Warehouse
In this project, we apply the data warehouse architectures we learned and build a data warehouse on AWS. We build an ETL pipeline that extracts and transforms data stored in JSON format in S3 buckets and loads it into a warehouse hosted on Amazon Redshift.
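
One common way to move JSON files from S3 into Redshift is to `COPY` them into a staging table and then insert from staging into the final tables. The sketch below only builds the SQL strings. It assumes that staging pattern, and the table names, bucket path, role ARN, and region are all hypothetical placeholders.

```python
def build_copy_sql(table, s3_path, iam_role_arn, region):
    """Return a Redshift COPY statement that loads JSON files from S3.

    String interpolation is acceptable here only because the values come
    from trusted configuration, never from user input.
    """
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS JSON 'auto' REGION '{region}';"
    )


# All names below are placeholders for illustration.
copy_sql = build_copy_sql(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    region="us-west-2",
)

# Transform step: move cleaned rows from staging into a hypothetical fact table.
insert_sql = (
    "INSERT INTO songplays (user_id, song, start_time) "
    "SELECT user_id, song, ts FROM staging_events WHERE song IS NOT NULL;"
)

print(copy_sql)
print(insert_sql)
```

Both statements would then be run in order over an ordinary database connection to the cluster.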

Use the Redshift IaC script: [Redshift_IaC_README](https://github.com/san089/Udacity-Data-Engineering-Projects/blob/master/Redshift_IaC_README.md)

Link: [Data_Warehouse](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Warehouse)

## Project 4: Data Lake
In this project, we build a data lake on AWS using Spark and an AWS EMR cluster. The data lake serves as a single source of truth for the analytics platform. We write Spark jobs to perform ELT operations that pick up data from the landing zone on S3, transform it, and store it in the processed zone on S3.
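
The landing-zone to processed-zone flow can be illustrated without a cluster. The sketch below is plain Python, not Spark: local directories stand in for the two S3 prefixes, and the `song` and `year` fields, the filter rule, and the `year=...` partition layout are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def process_zone(landing, processed):
    """Read raw JSON lines, drop incomplete rows, write one file per year."""
    partitions = {}
    for path in sorted(landing.glob("*.json")):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            if row.get("song"):  # transform: keep only rows with a song title
                partitions.setdefault(row["year"], []).append(row)
    for year, rows in partitions.items():
        out_dir = processed / f"year={year}"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "part-0000.json").write_text(
            "\n".join(json.dumps(r) for r in rows)
        )
    return sorted(partitions)


with tempfile.TemporaryDirectory() as tmp:
    landing, processed = Path(tmp, "landing"), Path(tmp, "processed")
    landing.mkdir()
    (landing / "events.json").write_text(
        '{"song": "Song A", "year": 2018}\n'
        '{"song": null, "year": 2018}\n'
        '{"song": "Song B", "year": 2019}\n'
    )
    years = process_zone(landing, processed)
    written = sorted(p.parent.name for p in processed.rglob("*.json"))

print(years, written)
```

In PySpark, the equivalent would roughly be a `spark.read.json(...)` on the landing path followed by `df.write.partitionBy("year").parquet(...)` on the processed path.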

Link: [Data_Lake](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Lake)

## Project 5: Data Pipelines with Airflow
In this project, we orchestrate our data pipeline workflow using Apache Airflow, an open-source Apache project. We schedule our ETL jobs in Airflow, create project-specific custom plugins and operators, and automate pipeline execution.
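
At its core, orchestration means running tasks in an order that respects their dependencies. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+) rather than Airflow itself. The task names and the dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact": {"stage_events", "stage_songs"},
    "load_dimensions": {"load_fact"},
    "quality_checks": {"load_dimensions"},
}


def run_pipeline(dependencies, run_task):
    """Run every task after all of its upstream tasks; return the order used."""
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        order.append(task)
    return order


executed = run_pipeline(DEPENDENCIES, lambda name: print(f"running {name}"))
```

In an Airflow DAG file, the same graph would be declared by chaining operators with `>>`, and the scheduler would add retries, scheduling intervals, and parallel execution on top.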

Link: [Airflow_Data_Pipelines](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Airflow_Data_Pipelines)

## Project 6: API Data to Postgres
In this project, we build an ETL pipeline that fetches data from the Yelp API and inserts it into a Postgres database. This project is a very basic example of fetching real-time data from a public API.
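
The fetch-and-flatten half of such a pipeline might look like the sketch below. It builds a request for the Yelp Fusion business search endpoint without sending it, then flattens a sample response into rows. The API key, the search parameters, the trimmed sample payload, and the `business` table in the closing comment are placeholders, and the fields the project actually stores may differ.

```python
import json
import urllib.parse
import urllib.request

# Yelp Fusion business search endpoint; a real API key is needed to call it.
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_request(api_key, term, location):
    """Build (but do not send) an authenticated search request."""
    query = urllib.parse.urlencode({"term": term, "location": location})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def to_rows(payload):
    """Flatten a search response into (id, name, rating) tuples."""
    return [
        (b["id"], b["name"], b.get("rating"))
        for b in payload.get("businesses", [])
    ]


# Hypothetical response, trimmed to the fields used here.
SAMPLE = json.loads(
    '{"businesses": [{"id": "abc", "name": "Cafe One", "rating": 4.5}]}'
)

request = build_request("YOUR_API_KEY", "coffee", "San Francisco")
rows = to_rows(SAMPLE)
print(request.full_url)
print(rows)

# With a Postgres driver such as psycopg2, each row could then be passed to
# cursor.execute("INSERT INTO business (id, name, rating) VALUES (%s, %s, %s)", row)
```

Keeping the request building and the flattening in separate functions makes each step easy to test without network access.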

Link: [API to Postgres](https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master/Data_Api_to_Postgres)

## Capstone Project
Udacity provides its own Capstone project, with a dataset that includes data on immigration to the United States and supplementary datasets that include data on airport codes, U.S. city demographics, and temperature.

Instead of using it, I worked on my own open-ended project.

Link: [goodreads_etl_pipeline](https://github.com/san089/goodreads_etl_pipeline)