https://github.com/danherman212/streaming-systems

Production Grade Data Pipelines

# Streaming Data Pipelines


Production grade data pipelines capable of processing event streams in real time or near real time.


This implementation connects to the [MTA real-time data feed](https://api.mta.info/#/subwayRealTimeFeeds) for the New York City subway system. Of the 8 separate feeds available, this project uses a single feed, which covers the ACE subway line. The entire system can be seen on the [subway diagram](https://www.mta.info/map/5256).

The subway system is the largest in the world by number of stations, with approximately 3.5 million daily riders accessing trains through roughly 470 stations, and it operates 24/7. The ACE line travels through approximately 100 of these stations and produces about 1,200 unique trips on a weekday or 750 on a weekend day.

The MTA states that the feed is refreshed with each subway vehicle's timestamp every 30 seconds. In practice, we observed updates arriving anywhere from 5 to 25 seconds apart.
We poll the subway feed every 20 seconds, processing 3 messages per minute. Each message carries roughly 50 to 60 updates, so we see about 150 to 180 updates per minute. The feed produces nearly 1.25 GB of data every 24 hours, with roughly 250,000 updates on a weekday and about 130,000 on a weekend day.
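The per-minute figures follow directly from the polling interval. A minimal sketch of that arithmetic is shown here; the function name `estimate_updates` is hypothetical and not part of the repository.

```python
def estimate_updates(poll_interval_s, updates_per_message, seconds=60):
    """Estimate how many vehicle updates are collected in `seconds`.

    Assumes one feed message per poll and a constant number of
    updates per message (both simplifications).
    """
    polls = seconds // poll_interval_s
    return polls * updates_per_message


DAY_S = 24 * 60 * 60

# Polling every 20 seconds gives 3 messages per minute.
print("per minute:", estimate_updates(20, 50), "to", estimate_updates(20, 60))
# Upper bounds for a full day at a constant rate.
print("per day:", estimate_updates(20, 50, DAY_S), "to", estimate_updates(20, 60, DAY_S))
```

The daily figure is an upper bound at a constant rate, which is why the observed weekday and weekend totals differ: fewer trains run overnight and on weekends.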

# Video Tutorial
I will launch a video tutorial sometime soon to walk through the project.

![Architecture Diagram](/6-images/architecture2.png)



The architecture uses the following GCP services:

- Artifact Registry: Universal Package Manager

Two applications built with Flask are containerized and run on Cloud Run to poll the MTA endpoint. The first is the event processor, which fetches messages and publishes them to Pub/Sub. The second is a task queue, which sets up 3 tasks that call the event processor: the first task fires immediately, the second is scheduled with a 20-second delay, and the third with a 40-second delay. The task queue is triggered every minute by Cloud Scheduler.

- Cloud Run: Serverless Application Execution

The event processor and task queue are deployed for serverless execution on Cloud Run.

- Cloud Tasks: Queue Management

This is a workaround, since I had trouble finding a way to poll the MTA endpoint continuously. Cloud Run does not allow you to run a continuous loop in a container; the container will time out after 12 minutes. Cloud Tasks lets us distribute triggers asynchronously, giving granular control over long-running work. The task queue sends a `POST` request to the event processor every 20 seconds to fetch messages.

- Cloud Scheduler: Cron Jobs (Event Triggers)

Creates event triggers on a schedule. The shortest interval available is 1 minute, so sub-minute triggers cannot be scheduled. Therefore, we set up the workaround with Cloud Tasks: we receive a trigger every minute and distribute 3 tasks at 20-second intervals, providing more granular control.
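The staggering logic can be sketched in plain Python. This is only an illustration of how one per-minute trigger fans out into three run times; the function name `staggered_schedule` is hypothetical, and no Cloud Tasks API calls are made. In a real task queue, each computed time would presumably be set as the task's `schedule_time`.

```python
from datetime import datetime, timedelta, timezone


def staggered_schedule(trigger_time, interval_s=20, period_s=60):
    """Return the run times for the tasks created by one scheduler trigger.

    With the defaults, one trigger per minute yields three tasks at
    offsets of 0, 20 and 40 seconds.
    """
    count = period_s // interval_s
    return [trigger_time + timedelta(seconds=i * interval_s) for i in range(count)]


# Example trigger time (arbitrary, for illustration only).
trigger = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
for run_at in staggered_schedule(trigger):
    print(run_at.isoformat())
```

Because the next trigger arrives 60 seconds later, the combined effect is one poll every 20 seconds without any long-running loop inside the container.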

- Pub/Sub: Message Broker

Enterprise messaging bus provided by Google. The infrastructure processes 100 million messages per second. Messages are fetched from the MTA and published to a Pub/Sub topic. A Pub/Sub pull subscription is set up for the consumer, Dataflow, which is the data processing engine.

- Dataflow: Data Processing Engine

Dataflow consumes messages, applies transformations and writes to the data warehouse. This implementation has 4 primary transforms: we 1) flatten, 2) filter, 3) enrich and 4) apply windowing to the dataset before writing to BigQuery. The messages arrive as JSON and need to be flattened. Most information in the feed is not required, so we filter out 97% of it. After filtering, we enrich with station information provided through a static CSV file. After enrichment, we apply windowing to handle late-arriving data and ensure data consistency.
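The four transforms can be illustrated in plain Python. This is a rough sketch, not the code in `dataflow.py`: the field names (`entity`, `trip_update`, `stop_time_update`), the station lookup, and the 60-second fixed window are all assumptions made for the example, and real late-data handling in a streaming engine involves more than assigning a window.

```python
# Hypothetical station lookup standing in for the static CSV file.
STATIONS = {"STOP1": "Example Station"}


def flatten(message):
    """Yield one flat row per stop update nested inside a feed message."""
    for entity in message.get("entity", []):
        trip_update = entity.get("trip_update")
        if not trip_update:
            continue  # skip entities that carry no trip update
        trip = trip_update.get("trip", {})
        for stop in trip_update.get("stop_time_update", []):
            # Only the fields needed downstream are kept.
            yield {
                "trip_id": trip.get("trip_id"),
                "route_id": trip.get("route_id"),
                "stop_id": stop.get("stop_id"),
                "arrival_time": stop.get("arrival", {}).get("time"),
            }


def keep(row):
    """Filter: drop rows missing a stop or an arrival time."""
    return row["stop_id"] is not None and row["arrival_time"] is not None


def enrich(row, stations=STATIONS):
    """Enrich: attach the station name from the lookup table."""
    return {**row, "station_name": stations.get(row["stop_id"], "unknown")}


def assign_window(row, size_s=60):
    """Window: tag the row with the start of its fixed window (assumed size)."""
    return {**row, "window_start": row["arrival_time"] - row["arrival_time"] % size_s}


def process(message):
    return [assign_window(enrich(r)) for r in flatten(message) if keep(r)]


# Made-up sample message for illustration.
sample = {
    "entity": [
        {"id": "1", "trip_update": {
            "trip": {"trip_id": "T1", "route_id": "A"},
            "stop_time_update": [
                {"stop_id": "STOP1", "arrival": {"time": 1700000045}},
                {"stop_id": "STOP2"},  # no arrival time, filtered out
            ],
        }},
        {"id": "2", "vehicle": {"current_status": "STOPPED_AT"}},
    ]
}
for row in process(sample):
    print(row)
```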

- BigQuery: Data Warehouse

Once the data is processed it is written to BigQuery, where it is available for analysis between 7 and 35 seconds after it is generated. Where lower-latency reads and writes are a requirement, Bigtable can be plugged in as an alternative data sink.


# Implementation Steps
Most of the implementation is automated through Terraform.

# Quick Start
Deploy the entire pipeline with a single command!

## Prerequisites
- Google Cloud Project with billing enabled
- Owner or Editor permissions
- Cloud Shell or `gcloud` CLI installed

## One-Command Deployment

**Step 1:** Open [Google Cloud Shell](https://shell.cloud.google.com/)

**Step 2:** Clone the repository
```shell
git clone https://github.com/DanHerman212/streaming-systems.git
cd streaming-systems
```

**Step 3:** Run the deployment script
```shell
chmod +x deploy.sh
./deploy.sh YOUR_PROJECT_ID us-east1
```

## **That's it!**

### Here is what happens next:

The deployment should take about 5 minutes.



The last part of the deployment is the Dataflow pipeline. Once Dataflow is deployed, you will see 4 warning messages. These are expected: they mean the deployment was successful and the data stream is running. You can leave the terminal and go to the Dataflow dashboard in the GCP console for better visibility.

# Dataflow Dashboard
It will take 3 minutes for Dataflow to get up and running. You can check the data watermark lag on the first step of the pipeline. That's the primary performance metric you should be concerned about.

![Dataflow Dashboard](/6-images/dataflow.png)



You can click the three small dots and expand the dashboard for better visibility.

![Dataflow Dashboard Expanded](/6-images/1206.png)

# Data Dictionary
Data definitions can be found on the [data dictionary page](data.md).

# SQL Analysis and Data Visualization
As a frequent passenger of the ACE subway line, I answered a few common questions I was curious about:

- What is the average time between train arrivals on the ACE line during a weekday?

- What are the top 10 busiest stations on the ACE line?

- What is the average idle time per station on the ACE line?


Queries can be found in the [sql folder](/5-sql).

Make sure to update your project-id in the queries before executing.

## Avg Time Between Trains and Frequency
Wait times for a train range from less than 2 minutes to over 16 minutes. There is a clear correlation between busy stations and wait times: busier stations are served with wait times under 2 minutes.
The top 5 stations can be observed in the lower right quadrant. The next insight identifies those stations.
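As an illustration of the metric itself, the average time between arrivals at a station can be computed from sorted arrival timestamps. The sketch below is plain Python rather than the SQL in the repository, and the function name and input shape are assumptions made for the example.

```python
from collections import defaultdict


def average_headway(arrivals):
    """Return {stop_id: mean seconds between consecutive arrivals}.

    `arrivals` is an iterable of (stop_id, arrival_epoch_seconds) pairs.
    Stops with fewer than two arrivals are omitted.
    """
    by_stop = defaultdict(list)
    for stop_id, ts in arrivals:
        by_stop[stop_id].append(ts)

    result = {}
    for stop_id, times in by_stop.items():
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if gaps:
            result[stop_id] = sum(gaps) / len(gaps)
    return result


# Made-up arrivals: stop S1 sees trains at 0, 120 and 360 seconds.
print(average_headway([("S1", 0), ("S1", 360), ("S1", 120), ("S2", 0)]))
```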

![Avg Time Between Train Arrivals](/6-images/avg-time-bet-trains.png)

## Top 10 Busiest Stations
These are the busiest stations on the ACE line, based on the total number of train arrivals in a 24-hour weekday period.

![Top 10 Busiest Stations](/6-images/barplot.png)

Fun Fact: [42-St Port Authority Terminal](https://www.mta.info/agency/new-york-city-transit/subway-bus-ridership-2024) (Times Square) has the most riders, with 58 million paid passengers in 2024.

## Idle Time Per Station

The map shows the geographic footprint of all stations and the average idle time per station. Most stations on the ACE line show an idle time of less than 30 seconds.
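For readers who want to reproduce the metric outside SQL, here is a small Python sketch. It assumes idle time means the gap between a train's arrival and its departure at a stop; that definition, the function name, and the input shape are assumptions, so check the data dictionary and the queries in the sql folder for the actual calculation.

```python
from collections import defaultdict


def average_idle(stops):
    """Return {stop_id: mean seconds between arrival and departure}.

    `stops` is an iterable of (stop_id, arrival_s, departure_s) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # stop_id -> [total idle, count]
    for stop_id, arrival, departure in stops:
        totals[stop_id][0] += departure - arrival
        totals[stop_id][1] += 1
    return {stop_id: total / count for stop_id, (total, count) in totals.items()}


# Made-up records for illustration.
print(average_idle([("S1", 0, 20), ("S1", 100, 140), ("S2", 0, 10)]))
```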

![Idle Time Per Station](/6-images/idle.png)

---
# Folder Structure
```
├── 1-dataflow # dataflow pipeline script and utilities
│   ├── dataflow.py
│   └── replace_project_id.sh
├── 2-event-processor # event processor application that fetches messages
│   ├── Dockerfile
│   ├── app.py
│   └── requirements.txt
├── 3-task-queue # task queue polls the event processor every 20 seconds
│   ├── Dockerfile
│   ├── main.py
│   └── requirements.txt
├── 4-terraform # terraform infrastructure as code - automates deployment in 1 minute
│   ├── main.tf
│   ├── modules
│   │   ├── apis
│   │   ├── cloud_run
│   │   ├── cloud_tasks
│   │   ├── pubsub
│   │   ├── scheduler
│   │   ├── service_accounts
│   │   └── storage
│   ├── outputs.tf
│   ├── sample.tfvars
│   ├── schema.json
│   └── variables.tf
├── 5-sql # a few sample SQL queries to experiment with
│   ├── avg idle time by station.sql
│   └── avg wait time and total trips per station.sql
├── 6-images # just images for presentation
│   ├── 0.5 Architecture.png
│   ├── architecture2.png
│   ├── avg_time_bet_trains1.png
│   ├── barplot.png
│   ├── dataflow.png
│   ├── idle.png
│   ├── image.png
│   ├── scheduler.png
│   ├── shell.png
│   └── tf.png
├── build_images.sh # script to automate container builds
├── data.md # data dictionary
├── deploy.sh # one-command deployment script
└── readme.md # this file
```