{"id":30209067,"url":"https://github.com/ksmin23/gcp-datastream-cdc-data-pipeline","last_synced_at":"2026-04-14T06:33:34.896Z","repository":{"id":309329075,"uuid":"1034006515","full_name":"ksmin23/gcp-datastream-cdc-data-pipeline","owner":"ksmin23","description":"A complete Terraform setup for creating a secure, private data replication pipeline from Cloud SQL (MySQL) to BigQuery using   Datastream and Private Service Connect (PSC).","archived":false,"fork":false,"pushed_at":"2025-09-25T03:18:38.000Z","size":187,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-25T05:30:33.364Z","etag":null,"topics":["bigquery","cloud-sql","data-pipeline","datastream","google-cloud-platform","mysql","terraform"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ksmin23.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-07T17:29:54.000Z","updated_at":"2025-09-25T03:18:42.000Z","dependencies_parsed_at":"2025-08-11T10:14:55.596Z","dependency_job_id":"5ab28afd-2e58-440c-92a9-aacc4e513862","html_url":"https://github.com/ksmin23/gcp-datastream-cdc-data-pipeline","commit_stats":null,"previous_names":["ksmin23/gcp-datastream-cdc-data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ksmin23/gcp-datastream-cdc-data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/ksmin23%2Fgcp-datastream-cdc-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksmin23%2Fgcp-datastream-cdc-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksmin23%2Fgcp-datastream-cdc-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksmin23%2Fgcp-datastream-cdc-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ksmin23","download_url":"https://codeload.github.com/ksmin23/gcp-datastream-cdc-data-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ksmin23%2Fgcp-datastream-cdc-data-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31785677,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","cloud-sql","data-pipeline","datastream","google-cloud-platform","mysql","terraform"],"created_at":"2025-08-13T19:01:06.592Z","updated_at":"2026-04-14T06:33:34.889Z","avatar_url":"https://github.com/ksmin23.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Terraform: Datastream Pipeline for Cloud SQL CDC via 
PSC\n\nThis Terraform project provisions a complete end-to-end solution for capturing Change Data Capture (CDC) data from a Cloud SQL for MySQL instance and replicating it to BigQuery. The entire connection is established securely and privately using **Private Service Connect (PSC)**.\n\nThe infrastructure is deployed in **two distinct stages** to separate network setup from application resources, which is a best practice for managing infrastructure lifecycles.\n\n## Overall Architecture Diagram\n\n```mermaid\ngraph LR\n    A[(Cloud SQL\u003cbr/\u003efor MySQL)] --\"CDC via\u003cbr/\u003ePrivate Service Connect\"--\u003e DS(Datastream)\n    DS --\"Real-time Stream\"--\u003e BQ[(BigQuery)]\n\n    style A fill:#4285F4,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    style BQ fill:#4285F4,stroke:#ffffff,stroke-width:2px,color:#ffffff\n    style DS fill:#F4B400,stroke:#ffffff,stroke-width:2px,color:#ffffff\n```\n\n## Architecture\n\n### Stage 1: Network (`terraform/01-network`)\n\nThis stage builds the foundational network infrastructure.\n\n1.  **VPC \u0026 Subnets**: A new VPC is created in custom mode, with public and private subnets in each availability zone of the specified region.\n2.  **Cloud NAT \u0026 Router**: A Cloud Router and NAT gateway are configured to allow instances in the private subnets to access the internet for outbound traffic (e.g., for system updates).\n3.  **Firewall Rules**: Basic firewall rules are created to allow internal traffic, SSH, and ICMP.\n4.  **Service Networking**: A VPC Peering connection is established with Google's services network to enable private access for Cloud SQL.\n5.  **PSC Subnet**: A dedicated subnet is created specifically for Private Service Connect, which Datastream will use.\n\n### Stage 2: Application Infrastructure (`terraform/02-app-infra`)\n\nThis stage deploys the application-specific resources on top of the network foundation.\n\n1.  
**Cloud SQL for MySQL**: A new MySQL 8.0 instance is provisioned with a private IP address. It is configured as a service producer for PSC.\n2.  **BigQuery Dataset**: A destination dataset is created in BigQuery to store the replicated data.\n3.  **Datastream**:\n    *   A **Network Attachment** is created for PSC connectivity.\n    *   A **Datastream Private Connection** uses the attachment to connect to the VPC.\n    *   Source (MySQL) and Destination (BigQuery) **Connection Profiles** are created.\n    *   A **Datastream Stream** is configured to capture changes from the MySQL source and deliver them to the BigQuery destination.\n\n## Prerequisites\n\n*   **Terraform**: `v1.5.7` or later\n*   **Google Cloud SDK**: Authenticated to your GCP account (`gcloud auth application-default login`).\n*   **Enabled APIs**: Before starting, ensure the required APIs are enabled. You can run the `gcloud` command provided in the [FAQ](FAQ.md) or let Terraform enable them automatically by running `terraform apply` in each stage.\n\n## How to Use\n\nDeployment is a two-stage process. You must deploy the network first, followed by the application infrastructure.\n\n### Stage 1: Deploy the Network\n\n1.  **Navigate to the network directory:**\n    ```bash\n    cd terraform/01-network\n    ```\n\n2.  **Create `terraform.tfvars`**:\n    Copy the example file and provide the required values.\n    ```bash\n    cp terraform.tfvars.example terraform.tfvars\n    ```\n    Edit `terraform.tfvars` and set your `project_id` and a unique `psc_subnet_cidr_range`.\n\n3.  **Initialize and Apply**:\n    ```bash\n    terraform init\n    terraform plan\n    terraform apply\n    ```\n    When prompted, type `yes` to confirm the deployment.\n\n### Stage 2: Deploy the Application Infrastructure\n\n1.  **Navigate to the app-infra directory:**\n    ```bash\n    cd ../02-app-infra\n    ```\n\n2.  
**Create `terraform.tfvars`**:\n    Copy the example file and provide the required values.\n    ```bash\n    cp terraform.tfvars.example terraform.tfvars\n    ```\n    Edit `terraform.tfvars` and set your `project_id` and the `allowed_psc_projects`. Your own project ID must be in this list.\n\n3.  **Initialize and Apply**:\n    ```bash\n    terraform init\n    terraform plan\n    terraform apply\n    ```\n    This will deploy Cloud SQL, BigQuery, and Datastream, referencing the network created in Stage 1.\n\n## Post-Deployment\n\nAfter deployment, you must complete a few manual steps.\n\n### 1. Grant SQL Permissions\n\nYou need to connect to the newly created Cloud SQL instance to grant permissions. The easiest way to perform this one-time setup is by using **Cloud SQL Studio**.\n\n#### a. Get Admin Password\n\nFirst, retrieve the generated admin password from the Terraform output. From the `terraform/02-app-infra` directory, run:\n```bash\nterraform output admin_user_password\n```\n\n#### b. Connect via Cloud SQL Studio\n\n1.  Open the [Cloud SQL instances page](https://console.cloud.google.com/sql/instances) in the GCP Console.\n2.  Find your instance (e.g., `mysql-src-ds`) and click its name.\n3.  From the left menu, select **\"Cloud SQL Studio\"**.\n4.  Log in with the username `admin` and the password from the previous step. The database name is `testdb`.\n\n#### c. Execute the GRANT Command\n\nIn the Cloud SQL Studio query editor, run the following SQL commands:\n```sql\nGRANT REPLICATION SLAVE, SELECT, REPLICATION CLIENT ON *.* TO 'datastream'@'%';\nFLUSH PRIVILEGES;\n```\n\n\u003e **Note on Connecting from Your VPC**\n\u003e \n\u003e For connections from your applications, scripts, or bastion hosts inside the VPC, you should use the stable **Private Service Connect (PSC) endpoint**. 
This provides a private, internal IP address for your Cloud SQL instance.\n\u003e \n\u003e Get the connection details from the `terraform/02-app-infra` directory:\n\u003e ```bash\n\u003e # Use this stable internal IP for your applications\n\u003e terraform output cloud_sql_psc_endpoint_ip\n\u003e \n\u003e # Use this password for the 'admin' user\n\u003e terraform output admin_user_password\n\u003e ```\n\u003e You would then connect using a standard MySQL client to the IP address provided by the `cloud_sql_psc_endpoint_ip` output.\n\n### 2. Start the Stream\n\nThe Datastream stream is created in a `NOT_STARTED` state. You must manually start it. For example:\n```bash\ngcloud datastream streams update mysql-to-bigquery-stream \\\n    --location=us-central1 \\\n    --state=RUNNING\n```\n\n\n## Testing the Pipeline with Fake Data\n\nAfter the infrastructure is deployed and the stream is running, you can insert sample data to verify the pipeline.\n\n1.  **Navigate to the scripts directory**:\n    ```bash\n    # Run this from terraform/02-app-infra\n    cd ../../scripts\n    ```\n\n2.  **Set up the Python environment**:\n    Follow the instructions in `scripts/README.md` to set up the `uv` virtual environment and install dependencies.\n\n3.  **Generate SQL statements**:\n    ```bash\n    uv run python generate_fake_sql.py --generate-ddl --max-count 1000 \u003e sample_data.sql\n    ```\n\n4.  **Import the SQL data via Cloud SQL Studio**:\n    The simplest way to import the data is to use Cloud SQL Studio again.\n    a. Connect to your database in Cloud SQL Studio as described in the \"Post-Deployment\" section.\n    b. Open the `sample_data.sql` file in a text editor and copy its contents.\n    c. Paste the SQL into the query editor and click **\"Run\"**.\n\n5.  **Verify in BigQuery**:\n    After a few minutes, you should see a new dataset (e.g., `datastream_destination_dataset_testdb`) and a `retail_trans` table in your BigQuery project. 
Query the table to confirm that the data has been replicated.\n\n## Clean Up\n\nTo destroy all resources, you must run `terraform destroy` in the reverse order of creation.\n\n1.  **Destroy Application Infrastructure**:\n    ```bash\n    cd terraform/02-app-infra\n    terraform destroy\n    ```\n\n2.  **Destroy Network**:\n    ```bash\n    cd ../01-network\n    terraform destroy\n    ```\n\n## References\n\n- [Quickstart: Replicate data from a Cloud SQL for MySQL database to BigQuery](https://cloud.google.com/datastream/docs/quickstart-replication-to-bigquery)\n- [Configure a Cloud SQL for MySQL database as a source](https://cloud.google.com/datastream/docs/configure-cloudsql-mysql)\n- [Configure Private Service Connect interfaces](https://cloud.google.com/datastream/docs/psc-interfaces)\n- [Codelab: Connecting to CloudSQL via Private Service Connect (Terraform)](https://codelabs.developers.google.com/codelabs/cloudsql-psc-terraform#0)\n- [Codelab: How to create a Private Services Connect for CloudSQL](https://codelabs.developers.google.com/devsite/codelabs/psc-cloud-sql#0)\n- [Datastream Diagnose issues - MySQL errors](https://cloud.google.com/datastream/docs/diagnose-issues#mysql-errors)\n- [Datastream Source-specific information for MySQL databases](https://cloud.google.com/datastream/docs/sources-mysql)\n- [Datastream Known limitations for MySQL databases](https://cloud.google.com/datastream/docs/sources-mysql#mysqlknownlimitations)\n- [Configure BigQuery as a destination - Configure write mode](https://cloud.google.com/datastream/docs/configure-bigquery-destination#configure-write-mode): Merge / Append-only\n- [`gcp-datastream-mysql-cdc-to-gcs` GitHub Repository](https://github.com/ksmin23/gcp-datastream-mysql-cdc-to-gcs): A foundational project that uses Terraform to provision the first part of the pipeline: streaming real-time database changes from Cloud SQL for MySQL to Google Cloud Storage using Datastream.\n- [`gcp-datastream-dataflow-analytics` GitHub 
Repository](https://github.com/ksmin23/gcp-datastream-dataflow-analytics): A reference project demonstrating a complete, end-to-end pipeline that processes the GCS data from this project's architecture using Dataflow and loads it into BigQuery.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksmin23%2Fgcp-datastream-cdc-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fksmin23%2Fgcp-datastream-cdc-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fksmin23%2Fgcp-datastream-cdc-data-pipeline/lists"}