{"id":13625290,"url":"https://github.com/alanchn31/Data-Engineering-Projects","last_synced_at":"2025-04-16T06:32:19.150Z","repository":{"id":37060206,"uuid":"257250498","full_name":"alanchn31/Data-Engineering-Projects","owner":"alanchn31","description":"Personal Data Engineering Projects","archived":false,"fork":false,"pushed_at":"2023-02-08T00:44:31.000Z","size":3067,"stargazers_count":846,"open_issues_count":5,"forks_count":185,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-10-30T04:50:08.552Z","etag":null,"topics":["airflow","aws-redshift","cassandra","data-engineering","data-engineering-nanodegree","data-lake","data-modeling","data-warehouse","ingest-data","mongodb","postgres","scrapy","spark","star-schema"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alanchn31.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-04-20T10:47:33.000Z","updated_at":"2024-10-29T04:04:03.000Z","dependencies_parsed_at":"2024-01-14T08:32:54.239Z","dependency_job_id":"f69740cb-6276-4826-acd6-db3fb344876f","html_url":"https://github.com/alanchn31/Data-Engineering-Projects","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alanchn31%2FData-Engineering-Projects","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alanchn31%2FData-Engineering-Projects/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alanchn31%2FData-Engineering-Projects/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alanchn31%2FData-Engineering-Projects/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alanchn31","download_url":"https://codeload.github.com/alanchn31/Data-Engineering-Projects/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223700336,"owners_count":17188301,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","aws-redshift","cassandra","data-engineering","data-engineering-nanodegree","data-lake","data-modeling","data-warehouse","ingest-data","mongodb","postgres","scrapy","spark","star-schema"],"created_at":"2024-08-01T21:01:53.545Z","updated_at":"2024-11-08T14:30:54.357Z","avatar_url":"https://github.com/alanchn31.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Data Engineering ##","Uncategorized"],"sub_categories":["Uncategorized"],"readme":"## Description\n---\n* This repo contains projects that apply data engineering principles.\n* Notes taken during the course can be found in the folder `0. Back to Basics`\n\n## Projects\n---\n1. \u003cins\u003e Postgres ETL \u003c/ins\u003e :heavy_check_mark:\n* This project looks at data modelling for a fictitious music startup, Sparkify, applying a star schema when ingesting data to simplify the queries that answer business questions the product owner may have\n\n2. 
\u003cins\u003e Cassandra ETL \u003c/ins\u003e :heavy_check_mark:\n* In the realm of big data, Cassandra helps ingest large amounts of data in a NoSQL context. This project adopts a query-centric approach to ingesting data into tables in Cassandra, to answer business questions about a music app\n\n3. \u003cins\u003e Web Scraping using Scrapy, MongoDB ETL \u003c/ins\u003e :heavy_check_mark:\n* One way to store semi-structured data is as documents. MongoDB makes this possible, with each collection containing related documents. Each document contains fields of data which can be queried.\n* In this project, data is scraped from a book listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored in a document in the books collection in MongoDB.\n\n4. \u003cins\u003e Data Warehousing with AWS Redshift \u003c/ins\u003e :heavy_check_mark:\n* This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer business questions based on requirements.\n\n5. \u003cins\u003e Data Lake with Spark \u0026 AWS S3 \u003c/ins\u003e :heavy_check_mark:\n* This project creates a data lake in AWS S3 using Spark.\n* Why create a data lake? A data lake provides a reliable store for large amounts of data, from unstructured to semi-structured and even structured data. In this project, we ingest JSON files, denormalize them into fact and dimension tables and upload them into an AWS S3 data lake, in the form of Parquet files.\n\n6. \u003cins\u003e Data Pipelining with Airflow \u003c/ins\u003e :heavy_check_mark:\n* This project schedules data pipelines to perform ETL from JSON files in S3 to Redshift using Airflow.\n* Why use Airflow? Airflow allows workflows to be defined as code, so they become more maintainable, versionable, testable, and collaborative\n\n7. 
\u003cins\u003e Capstone Project \u003c/ins\u003e :heavy_check_mark:  \n* This project is the finale to Udacity's data engineering nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.\n* My project builds a movies data warehouse, which can be used to build a movie recommendation system as well as to predict box-office earnings. View the project here: [Movies Data Warehouse](https://github.com/alanchn31/Udacity-Data-Engineering-Capstone)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falanchn31%2FData-Engineering-Projects","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falanchn31%2FData-Engineering-Projects","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falanchn31%2FData-Engineering-Projects/lists"}