{"id":23682469,"url":"https://github.com/iht/bigquery-dataflow-cdc-example","last_synced_at":"2026-01-04T01:30:16.473Z","repository":{"id":268389665,"uuid":"904176639","full_name":"iht/bigquery-dataflow-cdc-example","owner":"iht","description":"A Dataflow streaming pipeline written in Java, reading data from Pubsub and recovering the sessions from potentially unordered data, and upserting the session data into BigQuery with no duplicates","archived":false,"fork":false,"pushed_at":"2025-02-01T22:26:27.000Z","size":128,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-01T23:24:21.023Z","etag":null,"topics":["apache-beam","bigquery","cdc","dataflow","google-cloud","pubsub"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iht.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-16T11:50:12.000Z","updated_at":"2025-02-01T22:26:10.000Z","dependencies_parsed_at":"2024-12-16T13:37:32.139Z","dependency_job_id":"5e2b70ea-f0b1-4bd1-9784-d951b7038659","html_url":"https://github.com/iht/bigquery-dataflow-cdc-example","commit_stats":null,"previous_names":["iht/bigquery-dataflow-cdc-example"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fbigquery-dataflow-cdc-example","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fbigquery-dataflow-cdc-example/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/iht%2Fbigquery-dataflow-cdc-example/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fbigquery-dataflow-cdc-example/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iht","download_url":"https://codeload.github.com/iht/bigquery-dataflow-cdc-example/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239734484,"owners_count":19688256,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-beam","bigquery","cdc","dataflow","google-cloud","pubsub"],"created_at":"2024-12-29T19:50:38.049Z","updated_at":"2026-01-04T01:30:14.414Z","avatar_url":"https://github.com/iht.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Taxi pipeline: recover sessions, upsert in BigQuery\n\nThis repository contains a streaming Dataflow pipeline reading data from Pubsub\nand recovering the sessions from potentially unordered data, by using a common\nkey to all the points received for the same vehicle.\n\nThe pipeline can probably be easily adapted to any other Apache Beam runner,\nbut this repository assumes you are running in Google Cloud Dataflow.\n\n## Data input\n\nWe are using here a public PubSub topic with data, so we don't need to setup our\nown to run this pipeline.\n\nThe topic is `projects/pubsub-public-data/topics/taxirides-realtime`.\n\nThat topic contains messages from the NYC Taxi Ride dataset. 
Here is a sample of\nthe data contained in a message in that topic:\n\n```json\n{\n  \"ride_id\": \"328bec4b-0126-42d4-9381-cb1dbf0e2432\",\n  \"point_idx\": 305,\n  \"latitude\": 40.776270000000004,\n  \"longitude\": -73.99111,\n  \"timestamp\": \"2020-03-27T21:32:51.48098-04:00\",\n  \"meter_reading\": 9.403651,\n  \"meter_increment\": 0.030831642,\n  \"ride_status\": \"enroute\",\n  \"passenger_count\": 1\n}\n```\n\nThe messages also contain metadata that is useful for streaming pipelines.\nIn this case, the messages contain an attribute named `ts`, which contains the\nsame timestamp as the `timestamp` field in the data. Remember that\nPubSub treats the data as just a string of bytes (in topics with no schema), so\nit does not _know_ anything about the data itself. The metadata fields are\nnormally used to publish messages with specific ids and/or timestamps.\n\n## Data output\n\nThe goal is to group together all the messages that belong to the\nsame taxi ride, recover the initial and end timestamps and the initial\nand end status (ride_status), and calculate the duration of the trip in\nseconds. We have to insert these sessions in streaming in BigQuery, doing\nupserts. We want to deal with potential late data too, recalculating the\nsessions if necessary.\n\nThe pipeline uses three triggers:\n\n- An early trigger for every single message received before the watermark.\n- A trigger when the watermark is reached.\n- A late trigger for every single late message before a certain threshold,\n  which is configurable.\n\nSo in BigQuery, you will see some sessions that are \"partial\" (the end\nstatus is not `dropoff` yet, but those sessions should all eventually be complete).\n\nIn addition to this, the output is written to different tables, depending on the\nfirst character of the session id. 
This is done just to show how to write to dynamic\ndestinations with BigQueryIO and using change-data-capture / upserts.\n\n## gcloud authentication\n\nYou need to have a Google Cloud project with editor or owner permissions,\nin order to be able to create the resources for this demo.\n\nYou need to have [the Google Cloud SDK installed](https://cloud.google.com/sdk/docs/install-sdk),\nor alternatively you can use the Cloud Shell in your project.\n\nThe code snippets below set some environment variables that will be useful to\nrun other commands. You can use these code snippets locally or in the Cloud Shell.\nMake sure that you set the right values for the variables before proceeding with\nthe rest of code snippets.\n\n```shell\nexport YOUR_EMAIL=\u003cWRITE YOUR EMAIL HERE\u003e\nexport YOUR_PROJECT=\u003cWRITE YOUR PROJECT ID HERE\u003e\nexport GCP_REGION=\u003cWRITE HERE YOUR GCP REGION\u003e  # e.g. europe-southwest1\n```\n\nThese are other values that you probably don't need to change:\n\n```shell\nexport SUBNETWORK_NAME=default\nexport SUBSCRIPTION_NAME=taxis\nexport DATASET_NAME=data_playground\nexport SERVICE_ACCOUNT_NAME=taxi-pipeline-sa\nexport SESSIONS_TABLE=sessions\nexport ERRORS_TABLE=errors\n```\n\nYou need to make sure that the subnetwork above (the default subnetwork in your\nchosen region) has [Private Google Access enabled](https://cloud.google.com/vpc/docs/configure-private-google-access#enabling-pga).\n\nRun the following to create a specific configuration for your Google Cloud project.\nYou can probably skip this if you are in the Cloud Shell.\n\n```shell\ngcloud config configurations create taxipipeline-streaming\ngcloud config set account $YOUR_EMAIL\ngcloud config set project $YOUR_PROJECT\n```\n\nMake sure that you are authenticated, by running\n\n```shell\ngcloud auth login\n```\n\nand\n\n```shell\ngcloud auth application-default login\n```\n\n## Required resources\n\n### Required services\n\nIn your project you need to enable the 
following APIs:\n\n```shell\ngcloud services enable dataflow\ngcloud services enable pubsub\ngcloud services enable bigquery\n```\n\n### Bucket\n\nYou will need a GCS bucket for staging files and for temp files. We create a bucket\nwith the same name as the project:\n\n```shell\ngcloud storage buckets create gs://$YOUR_PROJECT --location=$GCP_REGION\n```\n\n### Pub/Sub subscription\n\nTo inspect the messages from this topic, you can create a subscription, and then\npull some messages.\n\nTo create a subscription, use the gcloud CLI utility (installed by default in\nthe Cloud Shell). The subscription name is taken from `$SUBSCRIPTION_NAME` (for instance, `taxis`):\n\n```shell\nexport TOPIC=projects/pubsub-public-data/topics/taxirides-realtime\ngcloud pubsub subscriptions create $SUBSCRIPTION_NAME --topic $TOPIC\n```\n\nTo pull messages:\n\n```shell\ngcloud pubsub subscriptions pull $SUBSCRIPTION_NAME --limit 3\n```\n\nor if you have `jq` installed (for pretty printing of JSON)\n\n```shell\ngcloud pubsub subscriptions pull $SUBSCRIPTION_NAME --limit 3 | grep \" {\" | cut -f 2 -d ' ' | jq\n```\n\n### BigQuery dataset\n\nCreate the dataset with the name chosen above:\n\n```shell\nbq mk -d --data_location=$GCP_REGION $DATASET_NAME\n```\n\n### Service account\n\nLet's now create a Dataflow worker service account, with permissions to read from\nthe Pub/Sub subscription and to write to BigQuery:\n\n```shell\ngcloud iam service-accounts create $SERVICE_ACCOUNT_NAME\n```\n\nAnd now let's give all the required permissions:\n\n```shell\ngcloud projects add-iam-policy-binding $YOUR_PROJECT \\\n--member=\"serviceAccount:$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\" \\\n--role=\"roles/dataflow.worker\"\n\ngcloud projects add-iam-policy-binding $YOUR_PROJECT \\\n--member=\"serviceAccount:$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\" \\\n--role=\"roles/storage.admin\"\n\ngcloud projects add-iam-policy-binding $YOUR_PROJECT 
\\\n--member=\"serviceAccount:$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\" \\\n--role=\"roles/pubsub.editor\"\n\ngcloud projects add-iam-policy-binding $YOUR_PROJECT \\\n--member=\"serviceAccount:$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\" \\\n--role=\"roles/pubsub.subscriber\"\n\ngcloud projects add-iam-policy-binding $YOUR_PROJECT \\\n--member=\"serviceAccount:$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\" \\\n--role=\"roles/bigquery.dataEditor\"\n```\n\n## Build a FAT jar\n\nTo create a FAT jar, ready to be deployed in Dataflow without additional\ndependencies, run the following command:\n\n```shell\n./gradlew build\n```\n\nAs explained below, you don't need to create a JAR for deployment if you just want to test this repo.\n\n## Test\n\nExecute the following command to run all the unit tests:\n\n```shell\n./gradlew test\n```\n\n## Run the pipeline in Dataflow\n\nMake sure that you have followed the steps above, and you are authenticated and\nhave created the input subscription and the output BigQuery dataset, prior to\nrunning the pipeline.\n\nAlso make sure that you have Java \u003e=8 and \u003c=17 installed on your machine.\nCheck your current version with:\n\n```shell\njava -version\n```\n\nYou can run the pipeline by recompiling from the sources; there is no need to generate a FAT jar:\n\n```shell\nTEMP_LOCATION=gs://$YOUR_PROJECT/tmp\nSUBSCRIPTION=projects/$YOUR_PROJECT/subscriptions/$SUBSCRIPTION_NAME\nNETWORK=regions/$GCP_REGION/subnetworks/$SUBNETWORK_NAME\nSERVICE_ACCOUNT=$SERVICE_ACCOUNT_NAME@$YOUR_PROJECT.iam.gserviceaccount.com\n\n./gradlew run -Pargs=\"\n--pipeline=taxi-sessions \\\n--runner=DataflowRunner \\\n--project=$YOUR_PROJECT \\\n--region=$GCP_REGION \\\n--tempLocation=$TEMP_LOCATION \\\n--usePublicIps=false \\\n--serviceAccount=$SERVICE_ACCOUNT \\\n--subnetwork=$NETWORK \\\n--enableStreamingEngine \\\n--rideEventsSubscription=$SUBSCRIPTION \\\n--destinationDataset=$DATASET_NAME 
\\\n--sessionsDestinationTable=$SESSIONS_TABLE \\\n--parsingErrorsDestinationTable=$ERRORS_TABLE\"\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fbigquery-dataflow-cdc-example","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiht%2Fbigquery-dataflow-cdc-example","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fbigquery-dataflow-cdc-example/lists"}