{"id":26940395,"url":"https://github.com/yeopster/data_engineering_gcp","last_synced_at":"2025-04-02T15:18:21.963Z","repository":{"id":275281641,"uuid":"925619332","full_name":"yeopster/Data_Engineering_GCP","owner":"yeopster","description":"Data Engineering Using Google Could Platform and Mage","archived":false,"fork":false,"pushed_at":"2025-02-01T11:58:13.000Z","size":5172,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-01T12:19:13.364Z","etag":null,"topics":["data-engineering","data-pipeline","data-visualization","gcp","google-bigquery","google-cloud-platform","google-cloud-storage","google-virtualmachine","looker-studio","mage-ai-pipeline","sql"],"latest_commit_sha":null,"homepage":"https://lookerstudio.google.com/u/0/reporting/693d0b23-f69b-462b-866a-fc8313d33345/page/287iE","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yeopster.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-01T10:13:12.000Z","updated_at":"2025-02-01T12:12:32.000Z","dependencies_parsed_at":"2025-02-01T12:29:18.389Z","dependency_job_id":null,"html_url":"https://github.com/yeopster/Data_Engineering_GCP","commit_stats":null,"previous_names":["yeopster/data_engineering_gcp"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yeopster%2FData_Engineering_GCP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yeopster%2FData_Engineering_GCP/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yeopster%2FData_Engineering_GCP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yeopster%2FData_Engineering_GCP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yeopster","download_url":"https://codeload.github.com/yeopster/Data_Engineering_GCP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246837647,"owners_count":20841905,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-pipeline","data-visualization","gcp","google-bigquery","google-cloud-platform","google-cloud-storage","google-virtualmachine","looker-studio","mage-ai-pipeline","sql"],"created_at":"2025-04-02T15:18:21.497Z","updated_at":"2025-04-02T15:18:21.944Z","avatar_url":"https://github.com/yeopster.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineering Project Portfolio using Google Cloud Platform (GCP)\n\n## Introduction\nThe goal of this project is to perform data engineering using Google Cloud Storage, a Virtual Machine (VM) instance, BigQuery and the Mage AI data pipeline tool, and to build data analytics visualizations in Looker Studio using the data stored in BigQuery.\n\n## Data Architecture\n![Data Architecture](architecture.jpg)\nThe basic data architecture of this project is as follows: raw data is stored in Google Cloud Storage. Then Mage AI, running on a VM instance, loads and transforms the data and exports it to the BigQuery database. 
From BigQuery, the data is exported to Looker Studio for analysis.\n\n## Technology Used\nThe technologies used for this project are as follows:\n- Programming Languages:\n1. Python\n2. SQL\n- Google Cloud Platform:\n1. Google Cloud Storage\n2. VM Instance\n3. BigQuery\n4. Looker Studio\n- Data Pipeline Tool:\n1. Mage AI\n\n## Dataset Used\nTLC Trip Record Data: yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.\n\nData dictionary for the dataset: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf\n\n## Data Model\n![Data Model](Taxi_Data_Model.png)\n\nThe data model was created using Lucid.app. Link: https://lucid.app/documents/#/home?folder_id=recent\n\n## Step-by-Step Guide\n1. Register a Google Cloud Platform account.\n2. Create a new GCP project.\n3. Upload your dataset to Google Cloud Storage under the bucket section.\n![GCP Bucket](gcp-bucket.png)\n4. Create a VM instance, then click SSH to connect to the VM.\n![GCP VM](gcp-vm.png)\n5. Once the VM is running, run the VM commands to install the libraries inside the VM.\n6. Start Mage AI to create a data pipeline. The data pipeline files are inside Mage Data Pipeline. An example of the data pipeline is shown below.\n![Data Pipeline](data_pipeline_mageai.png)\n7. The transformed data is exported to BigQuery.\n![GCP BigQuery](gcp-bigquery.png)\n8. A SQL command combines the data into one table called Data Analytics.\n9. The data is exported to Looker Studio for analysis.\n![GCP Looker Studio](gcp-looker.png)\n\nLink for Dashboard: https://lookerstudio.google.com/u/0/reporting/693d0b23-f69b-462b-866a-fc8313d33345/page/287iE\n\n## Challenges\nThe challenges I had during this project were:\n- My data pipeline kept crashing due to high VM CPU usage. To fix this, I reduced the dataset size. 
You can see the code in https://github.com/yeopster/Data_Engineering_GCP/blob/main/Notebook%20File/Reduced_the_taxi_dataset.ipynb\n- During analysis, I wanted to find the areas with a high number of pickups, but the dataset only has latitude and longitude. I converted the latitude and longitude into area names using the code in https://github.com/yeopster/Data_Engineering_GCP/blob/main/BigQuery%20SQL/Adding%20Pickup%20Zone.sql\n\n## Conclusion\nIn conclusion, through this project I have gained knowledge of several Google Cloud Platform tools and of building data pipelines with Mage AI. I will be able to apply this knowledge to real-life situations and industry needs.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyeopster%2Fdata_engineering_gcp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyeopster%2Fdata_engineering_gcp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyeopster%2Fdata_engineering_gcp/lists"}