https://github.com/ksmin23/gcp-datastream-cdc-data-pipeline
A complete Terraform setup for creating a secure, private data replication pipeline from Cloud SQL (MySQL) to BigQuery using Datastream and Private Service Connect (PSC).
bigquery cloud-sql data-pipeline datastream google-cloud-platform mysql terraform
- Host: GitHub
- URL: https://github.com/ksmin23/gcp-datastream-cdc-data-pipeline
- Owner: ksmin23
- Created: 2025-08-07T17:29:54.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-25T03:18:38.000Z (9 months ago)
- Last Synced: 2025-09-25T05:30:33.364Z (9 months ago)
- Topics: bigquery, cloud-sql, data-pipeline, datastream, google-cloud-platform, mysql, terraform
- Language: HCL
- Homepage:
- Size: 183 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Terraform: Datastream Pipeline for Cloud SQL CDC via PSC
This Terraform project provisions an end-to-end Change Data Capture (CDC) pipeline that captures changes from a Cloud SQL for MySQL instance and replicates them to BigQuery using Datastream. The entire connection is established securely and privately using **Private Service Connect (PSC)**.
The infrastructure is deployed in **two distinct stages** to separate network setup from application resources, which is a best practice for managing infrastructure lifecycles.
## Overall Architecture Diagram
```mermaid
graph LR
A[(Cloud SQL<br/>for MySQL)] --"CDC via<br/>Private Service Connect"--> DS(Datastream)
DS --"Real-time Stream"--> BQ[(BigQuery)]
style A fill:#4285F4,stroke:#ffffff,stroke-width:2px,color:#ffffff
style BQ fill:#4285F4,stroke:#ffffff,stroke-width:2px,color:#ffffff
style DS fill:#F4B400,stroke:#ffffff,stroke-width:2px,color:#ffffff
```
## Architecture
### Stage 1: Network (`terraform/01-network`)
This stage builds the foundational network infrastructure.
1. **VPC & Subnets**: A new VPC is created in custom mode, with public and private subnets in each availability zone of the specified region.
2. **Cloud NAT & Router**: A Cloud Router and NAT gateway are configured to allow instances in the private subnets to access the internet for outbound traffic (e.g., for system updates).
3. **Firewall Rules**: Basic firewall rules are created to allow internal traffic, SSH, and ICMP.
4. **Service Networking**: A VPC Peering connection is established with Google's services network to enable private access for Cloud SQL.
5. **PSC Subnet**: A dedicated subnet is created specifically for Private Service Connect, which Datastream will use.
### Stage 2: Application Infrastructure (`terraform/02-app-infra`)
This stage deploys the application-specific resources on top of the network foundation.
1. **Cloud SQL for MySQL**: A new MySQL 8.0 instance is provisioned with a private IP address. It is configured as a service producer for PSC.
2. **BigQuery Dataset**: A destination dataset is created in BigQuery to store the replicated data.
3. **Datastream**:
* A **Network Attachment** is created for PSC connectivity.
* A **Datastream Private Connection** uses the attachment to connect to the VPC.
* Source (MySQL) and Destination (BigQuery) **Connection Profiles** are created.
* A **Datastream Stream** is configured to capture changes from the MySQL source and deliver them to the BigQuery destination.
## Prerequisites
* **Terraform**: `v1.5.7` or later
* **Google Cloud SDK**: Authenticated to your GCP account (`gcloud auth application-default login`).
* **Enabled APIs**: Before starting, ensure the required APIs are enabled. You can run the `gcloud` command provided in the [FAQ](FAQ.md) or let Terraform enable them automatically by running `terraform apply` in each stage.
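For the manual option, a minimal sketch of enabling the APIs up front might look like the following. The helper name `enable_pipeline_apis` is hypothetical, and the API list is an assumption inferred from the services this README names (Compute Engine networking, Service Networking, Cloud SQL, Datastream, BigQuery). The authoritative command is the one in [FAQ.md](FAQ.md).

```shell
# Hypothetical helper: enable the APIs this pipeline appears to rely on.
# ASSUMPTION: the list below is inferred from the services named in this
# README; prefer the command in FAQ.md if the two differ.
enable_pipeline_apis() {
  project_id="$1"
  gcloud services enable \
    compute.googleapis.com \
    servicenetworking.googleapis.com \
    sqladmin.googleapis.com \
    datastream.googleapis.com \
    bigquery.googleapis.com \
    --project="$project_id"
}

# Usage (replace with your own project ID):
#   enable_pipeline_apis my-project-id
```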
## How to Use
Deployment is a two-stage process. You must deploy the network first, followed by the application infrastructure.
### Stage 1: Deploy the Network
1. **Navigate to the network directory:**
```bash
cd terraform/01-network
```
2. **Create `terraform.tfvars`**:
Copy the example file and provide the required values.
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars` and set your `project_id` and a unique `psc_subnet_cidr_range`.
3. **Initialize and Apply**:
```bash
terraform init
terraform plan
terraform apply
```
When prompted, type `yes` to confirm the deployment.
### Stage 2: Deploy the Application Infrastructure
1. **Navigate to the app-infra directory:**
```bash
cd ../02-app-infra
```
2. **Create `terraform.tfvars`**:
Copy the example file and provide the required values.
```bash
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars` and set your `project_id` and the `allowed_psc_projects`. Your own project ID must be in this list.
3. **Initialize and Apply**:
```bash
terraform init
terraform plan
terraform apply
```
This will deploy Cloud SQL, BigQuery, and Datastream, referencing the network created in Stage 1.
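If you prefer to script both stages, one possible approach is to apply each stage from a saved plan file, so that what gets applied is exactly what was reviewed. The sketch below is not part of this repository: the `deploy_stage` function and the `tfplan` file name are hypothetical, and it assumes you run it from the repository root after creating `terraform.tfvars` in each stage directory.

```shell
# Hypothetical wrapper: init, plan to a file, then apply that saved plan.
# -chdir makes Terraform run inside the stage directory, so the plan file
# is written to (and read from) that directory.
deploy_stage() {
  stage_dir="$1"   # e.g. terraform/01-network
  terraform -chdir="$stage_dir" init -input=false &&
  terraform -chdir="$stage_dir" plan -input=false -out=tfplan &&
  terraform -chdir="$stage_dir" apply -input=false tfplan
}

# Usage: network first, then the application infrastructure.
#   deploy_stage terraform/01-network && deploy_stage terraform/02-app-infra
```

Applying a saved plan does not prompt for confirmation, so review the `plan` output before letting the script continue.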
## Post-Deployment
After deployment, you must complete a few manual steps.
### 1. Grant SQL Permissions
You need to connect to the newly created Cloud SQL instance to grant permissions. The easiest way to perform this one-time setup is by using **Cloud SQL Studio**.
#### a. Get Admin Password
First, retrieve the generated admin password from the Terraform output. From the `terraform/02-app-infra` directory, run:
```bash
terraform output admin_user_password
```
#### b. Connect via Cloud SQL Studio
1. Open the [Cloud SQL instances page](https://console.cloud.google.com/sql/instances) in the GCP Console.
2. Find your instance (e.g., `mysql-src-ds`) and click its name.
3. From the left menu, select **"Cloud SQL Studio"**.
4. Log in with the username `admin` and the password from the previous step. The database name is `testdb`.
#### c. Execute the GRANT Command
In the Cloud SQL Studio query editor, run the following SQL commands:
```sql
GRANT REPLICATION SLAVE, SELECT, REPLICATION CLIENT ON *.* TO 'datastream'@'%';
FLUSH PRIVILEGES;
```
> **Note on Connecting from Your VPC**
>
> For connections from your applications, scripts, or bastion hosts inside the VPC, you should use the stable **Private Service Connect (PSC) endpoint**. This provides a private, internal IP address for your Cloud SQL instance.
>
> Get the connection details from the `terraform/02-app-infra` directory:
> ```bash
> # Use this stable internal IP for your applications
> terraform output cloud_sql_psc_endpoint_ip
>
> # Use this password for the 'admin' user
> terraform output admin_user_password
> ```
> You would then connect using a standard MySQL client to the IP address provided by the `cloud_sql_psc_endpoint_ip` output.
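The connection described in the note above could be sketched as follows. The `connect_via_psc` helper is hypothetical. It assumes the `mysql` client is installed, that the machine can reach the VPC, and that you run it from the `terraform/02-app-infra` directory. The user `admin` and database `testdb` are the values used earlier in this README.

```shell
# Hypothetical helper: open a MySQL session through the PSC endpoint.
# ASSUMPTIONS: run from terraform/02-app-infra on a host inside the VPC,
# with the mysql client installed.
connect_via_psc() {
  # -raw prints the bare value without surrounding quotes.
  host="$(terraform output -raw cloud_sql_psc_endpoint_ip)" || return 1
  # --password with no value makes the client prompt for the password.
  mysql --host="$host" --user=admin --database=testdb --password "$@"
}

# Usage: interactive session, or a one-off statement such as checking the grants.
#   connect_via_psc
#   connect_via_psc -e "SHOW GRANTS FOR 'datastream'@'%';"
```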
### 2. Start the Stream
The Datastream stream is created in a `NOT_STARTED` state. You must manually start it. For example:
```bash
gcloud datastream streams update mysql-to-bigquery-stream \
  --location=us-central1 \
  --state=RUNNING \
  --update-mask=state
```
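To confirm that the stream actually started, you can read back its state. This is a minimal sketch: the `stream_state` helper is hypothetical, and its defaults are simply the example stream name and location used above, so adjust them if your deployment differs.

```shell
# Hypothetical helper: print the current state of a Datastream stream.
# Defaults are the example stream name and location from this README.
stream_state() {
  stream="${1:-mysql-to-bigquery-stream}"
  location="${2:-us-central1}"
  gcloud datastream streams describe "$stream" \
    --location="$location" \
    --format='value(state)'
}

# Usage:
#   stream_state
#   stream_state my-stream europe-west1
```

The state may take a short while to change after the update, so re-run the check if it still reports the previous state.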
## Testing the Pipeline with Fake Data
After the infrastructure is deployed and the stream is running, you can insert sample data to verify the pipeline.
1. **Navigate to the scripts directory**:
```bash
# From terraform/02-app-infra:
cd ../../scripts
```
2. **Set up the Python environment**:
Follow the instructions in `scripts/README.md` to set up the `uv` virtual environment and install dependencies.
3. **Generate SQL statements**:
```bash
uv run python generate_fake_sql.py --generate-ddl --max-count 1000 > sample_data.sql
```
4. **Import the SQL data via Cloud SQL Studio**:
The simplest way to import the data is to use Cloud SQL Studio again.
a. Connect to your database in Cloud SQL Studio as described in the "Post-Deployment" section.
b. Open the `sample_data.sql` file in a text editor and copy its contents.
c. Paste the SQL into the query editor and click **"Run"**.
5. **Verify in BigQuery**:
After a few minutes, you should see a new dataset (e.g., `datastream_destination_dataset_testdb`) and a `retail_trans` table in your BigQuery project. Query the table to confirm that the data has been replicated.
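The final verification step can also be done from the command line with the `bq` tool. The sketch below is an illustration only: the `count_replicated_rows` helper is hypothetical, and the dataset and table names are the examples mentioned above, which may differ in your deployment.

```shell
# Hypothetical helper: count the rows replicated into a BigQuery table.
# The table defaults to retail_trans, the example table named above.
count_replicated_rows() {
  project_id="$1"
  dataset="$2"
  table="${3:-retail_trans}"
  # Backticks are escaped so the shell passes them through to BigQuery.
  bq --project_id="$project_id" query --use_legacy_sql=false \
    "SELECT COUNT(*) AS row_count FROM \`${project_id}.${dataset}.${table}\`"
}

# Usage:
#   count_replicated_rows my-project-id datastream_destination_dataset_testdb
```

If the count stays at zero for several minutes, check that the stream is running and that the SQL grants were applied.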
## Clean Up
To destroy all resources, you must run `terraform destroy` in the reverse order of creation.
1. **Destroy Application Infrastructure**:
```bash
cd terraform/02-app-infra
terraform destroy
```
2. **Destroy Network**:
```bash
cd ../01-network
terraform destroy
```
## References
- [Quickstart: Replicate data from a Cloud SQL for MySQL database to BigQuery](https://cloud.google.com/datastream/docs/quickstart-replication-to-bigquery)
- [Configure a Cloud SQL for MySQL database as a source](https://cloud.google.com/datastream/docs/configure-cloudsql-mysql)
- [Configure Private Service Connect interfaces](https://cloud.google.com/datastream/docs/psc-interfaces)
- [Codelab: Connecting to CloudSQL via Private Service Connect (Terraform)](https://codelabs.developers.google.com/codelabs/cloudsql-psc-terraform#0)
- [Codelab: How to create a Private Services Connect for CloudSQL](https://codelabs.developers.google.com/devsite/codelabs/psc-cloud-sql#0)
- [Datastream Diagnose issues - MySQL errors](https://cloud.google.com/datastream/docs/diagnose-issues#mysql-errors)
- [Datastream Source-specific information for MySQL databases](https://cloud.google.com/datastream/docs/sources-mysql)
- [Datastream Known limitations for MySQL databases](https://cloud.google.com/datastream/docs/sources-mysql#mysqlknownlimitations)
- [Configure BigQuery as a destination - Configure write mode](https://cloud.google.com/datastream/docs/configure-bigquery-destination#configure-write-mode): Merge / Append-only
- [`gcp-datastream-mysql-cdc-to-gcs` GitHub Repository](https://github.com/ksmin23/gcp-datastream-mysql-cdc-to-gcs): A foundational project that uses Terraform to provision the first part of the pipeline: streaming real-time database changes from Cloud SQL for MySQL to Google Cloud Storage using Datastream.
- [`gcp-datastream-dataflow-analytics` GitHub Repository](https://github.com/ksmin23/gcp-datastream-dataflow-analytics): A reference project demonstrating a complete, end-to-end pipeline that processes the GCS data from this project's architecture using Dataflow and loads it into BigQuery.