{"id":23238424,"url":"https://github.com/zkan/dtc-data-engineering-zoomcamp-project","last_synced_at":"2025-08-19T23:32:52.858Z","repository":{"id":44398507,"uuid":"510539667","full_name":"zkan/dtc-data-engineering-zoomcamp-project","owner":"zkan","description":"DataTalks.Club's Data Engineering Zoomcamp Project","archived":false,"fork":false,"pushed_at":"2022-07-14T02:02:42.000Z","size":1703,"stargazers_count":16,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-14T13:48:47.869Z","etag":null,"topics":["apache-airflow","bigquery","data-engineering","data-studio","dbt","docker","kafka","minio","python","stomp","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zkan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-05T00:36:35.000Z","updated_at":"2024-02-14T04:39:38.000Z","dependencies_parsed_at":"2022-08-23T15:51:22.171Z","dependency_job_id":null,"html_url":"https://github.com/zkan/dtc-data-engineering-zoomcamp-project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zkan%2Fdtc-data-engineering-zoomcamp-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zkan%2Fdtc-data-engineering-zoomcamp-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zkan%2Fdtc-data-engineering-zoomcamp-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zkan%2Fdtc-data-engineering-z
oomcamp-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zkan","download_url":"https://codeload.github.com/zkan/dtc-data-engineering-zoomcamp-project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230374248,"owners_count":18216044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","bigquery","data-engineering","data-studio","dbt","docker","kafka","minio","python","stomp","terraform"],"created_at":"2024-12-19T04:17:48.076Z","updated_at":"2024-12-19T04:17:48.542Z","avatar_url":"https://github.com/zkan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataTalks.Club's Data Engineering Zoomcamp Project\n\n**Table of Contents**\n\n* [Project Overview](#project-overview)\n* [Dataset](#dataset)\n* [Technologies](#technologies)\n* [Files and What They Do](#files-and-what-they-do)\n* [Instruction on Running the Project](#instruction-on-running-the-project)\n* [References](#references)\n\n## Project Overview\n\n![Project Overview](./img/dtc-data-engineering-zoomcamp-project-overview.png)\n\nThis project builds an automated end-to-end data pipeline that aims to get the\nlivestream of train movement data and analyze the train operating company's\nperformance.  The source of streaming data comes from the [UK's Network Rail\ncompany](https://datafeeds.networkrail.co.uk) provided through an ActiveMQ\ninterface. A train movement message is sent whenever a train arrives, passes or\ndeparts a location. 
It also records the time at which the event happens.\n\nIn this project, we first extract the live stream of train movement messages\nfrom Network Rail's ActiveMQ endpoint and stream the messages into Kafka.\nWe then consume and put them into a data lake (MinIO). After that we schedule a\ndata pipeline (Airflow) to run daily to load the data to a data warehouse\n(Google BigQuery). Later on, we transform the data in the warehouse using dbt.\nFinally, once the data is cleaned and transformed, we can monitor and analyze\nthe data on a dashboard (Google Data Studio).\n\n## Dataset\n\nIn this project, our dataset is the public data feed provided by [Network\nRail](https://datafeeds.networkrail.co.uk).\n\n![Network Rail feed](./img/networkrail-feed.png)\n\n## Technologies\n\n* [Apache Airflow](https://airflow.apache.org/) for orchestrating workflows\n* [Apache Kafka](https://kafka.apache.org/) for stream processing\n* [MinIO](https://min.io/) for data lake storage\n* [dbt](https://www.getdbt.com/) for data transformation\n* [Google BigQuery](https://cloud.google.com/bigquery) for data warehousing and analysis\n* [Google Data Studio](https://datastudio.google.com/overview) for the dashboard\n* [Terraform](https://www.terraform.io/) for provisioning the BigQuery dataset\n* [Docker](https://www.docker.com/) for running services on a local machine\n\n## Files and What They Do\n\n| Name | Description |\n| - | - |\n| `mnt/dags/load_networkrail_movement_data_to_bigquery.py` | An Airflow DAG file that runs the ETL data pipeline on Network Rail train movement data and loads it into BigQuery |\n| `networkrail/` | A dbt project used to clean and transform the train movement data |\n| `playground/` | A folder that contains code for testing ideas |\n| `terraform/` | A Terraform project used to provision the Google BigQuery dataset |\n| `.env.example` | An example environment file that contains the environment variables we use in this project |\n| `docker-compose.yaml` | A Docker Compose file 
that runs the Confluent platform (Kafka and friends), an Airflow instance, and MinIO |\n| `Makefile` | A Makefile for running commands |\n| `get_networkrail_movements.py` | A Python script that gets the live stream data through Network Rail's ActiveMQ interface |\n| `consume_networkrail_movements_to_data_lake.py` | A Python script that consumes the messages from Kafka and puts them into a data lake storage |\n| `README.md` | README file that provides the discussion on this project |\n| `requirements.txt` | A file that contains Python package dependencies used in this project |\n| `secrets.json.example` | An example secrets file that contains the Network Rail username and password |\n\n## Instruction on Running the Project\n\nHere is the list of services we use in this project:\n\n* Confluent Control Center: http://localhost:9021\n* Airflow UI: http://localhost:8080\n* MinIO Console: http://localhost:9001\n\nWe can start all services by running the commands below:\n\n```sh\nmake setup\nmake up\n```\n\nTo shut down all services, run:\n\n```sh\nmake down\n```\n\n### Getting Started\n\nBefore we can get the Network Rail data feed, we'll need to register a new\naccount on [the Network Rail website](https://datafeeds.networkrail.co.uk/)\nfirst.\n\nAfter we have the account, let's set up a virtual environment and install\npackage dependencies:\n\n```sh\npython -m venv ENV\nsource ENV/bin/activate\npip install -r requirements.txt\n```\n\n**Note:** We need to install the Apache Kafka C/C++ Library named\n[librdkafka](https://github.com/edenhill/librdkafka) first.\n\nOnce we've installed the package dependencies, we can run the following command\nto get the Network Rail livestream data and produce messages to Kafka:\n\n```sh\npython get_networkrail_movements.py\n```\n\nBefore we can consume the messages from Kafka, we need to set up a service\naccount on MinIO first, so we can put the data into a data lake. 
Please see the\n[Steps to Set Up a Service Account on\nMinIO](#steps-to-set-up-a-service-account-on-minio) section.\n\nAfter we have the service account, we'll save the AWS access key ID and AWS\nsecret access key from MinIO to the file `.env`. Here we have an example env\nfile, so we can use it as a template.\n\n```sh\ncp .env.example .env\n```\n\nTo consume the messages from Kafka, run the following commands:\n\n```sh\nexport $(cat .env)\npython consume_networkrail_movements_to_data_lake.py\n```\n\nAll the messages should be in the data lake (MinIO) by now.\n\nWe can go to the Airflow UI and manually trigger the data pipeline to load the\ndata to the data warehouse (Google BigQuery) then wait for the data to show up\non the dashboard (Google Data Studio). See the live dashboard\n[here](https://datastudio.google.com/reporting/5d38cb3d-248e-4aed-b65f-0db54fab4b9d).\n\n## References\n\n### Kafka Topic on Confluent Control Center\n\nThe screenshot below shows the Kafka topic on Confluent Control Center.\n\n![Kafka Topic on Confluent Control Center](./img/kafka-topic-on-confluent-control-center.png)\n\n### Data Pipeline on Airflow\n\nThe screenshots below show the data pipeline on Airflow.\n\n![Data Pipeline on Airflow 1](./img/airflow-data-pipeline-01.png)\n\n![Data Pipeline on Airflow 2](./img/airflow-data-pipeline-02.png)\n\n### Airflow S3 Connection to MinIO\n\n- Connection Name: `minio` or any name you like\n- Connection Type: S3\n- Login: `\u003creplace_here_with_your_minio_access_key\u003e`\n- Password: `\u003creplace_here_with_your_minio_secret_key\u003e`\n- Extra: a JSON object with the following properties:\n  ```json\n  {\n    \"host\": \"http://minio:9000\"\n  }\n  ```\n\n**Note:** If we were using AWS S3, we wouldn't need to specify the host in the extra.\n\n### Data Models on Google BigQuery\n\nThe screenshot below shows the data models on Google BigQuery.\n\n![Data Models on Google BigQuery](./img/data-models-on-bigquery.png)\n\n### Network Rail TOC's 
Performance Dashboard\n\nThe screenshot below shows the dashboard to monitor the Network Rail train\noperating company (TOC)'s performance. View the live dashboard here: [Network\nRail Train Operating Company's\nPerformance](https://datastudio.google.com/reporting/5d38cb3d-248e-4aed-b65f-0db54fab4b9d).\n\n![Network Rail TOC's Performance Dashboard](./img/toc-performance-dashboard.png)\n\n### Steps to Set Up a Service Account on MinIO\n\nThe screenshots below show how to set up a service account on MinIO. Airflow\nneeds it in order to get data from the MinIO storage.\n\n![Set up a Service Account on MinIO 1](./img/minio-set-up-service-account-01.png)\n\n![Set up a Service Account on MinIO 2](./img/minio-set-up-service-account-02.png)\n\n![Set up a Service Account on MinIO 3](./img/minio-set-up-service-account-03.png)\n\n![Set up a Service Account on MinIO 4](./img/minio-set-up-service-account-04.png)\n\n### Steps to Set Up a Service Account on Google Cloud Platform (GCP)\n\nThe screenshots below show how to set up a service account on GCP. This service\naccount is required for Airflow to load data to BigQuery as well as for dbt to\ntransform data in BigQuery.\n\n![Set up a Service Account on GCP 1](./img/gcp-set-up-service-account-01.png)\n\n![Set up a Service Account on GCP 2](./img/gcp-set-up-service-account-02.png)\n\n![Set up a Service Account on GCP 3](./img/gcp-set-up-service-account-03.png)\n\n![Set up a Service Account on GCP 4](./img/gcp-set-up-service-account-04.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzkan%2Fdtc-data-engineering-zoomcamp-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzkan%2Fdtc-data-engineering-zoomcamp-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzkan%2Fdtc-data-engineering-zoomcamp-project/lists"}