{"id":15003175,"url":"https://github.com/googlecloudplatform/cloud-composer-mssql-dataflow-bigquery","last_synced_at":"2025-07-08T15:02:25.889Z","repository":{"id":66041154,"uuid":"281690021","full_name":"GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery","owner":"GoogleCloudPlatform","description":"This repository contains an example of how to leverage Cloud Composer and Cloud Dataflow to move data from a Microsoft SQL Server to BigQuery. The diagrams below demonstrate the workflow pipeline.","archived":false,"fork":false,"pushed_at":"2024-05-20T23:15:17.000Z","size":193,"stargazers_count":18,"open_issues_count":2,"forks_count":13,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-01-30T09:41:44.378Z","etag":null,"topics":["airflow","bigquery","cloud-composer","dataflow","microsoft-sql-server"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-22T13:48:30.000Z","updated_at":"2024-12-19T03:28:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"a1d1d08d-0426-4aa6-aba5-3271edf37977","html_url":"https://github.com/GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fcloud-composer-mssql-dataflow-bigquery","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fcloud-composer-mssql-dataflow-bigquery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fcloud-composer-mssql-dataflow-bigquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fcloud-composer-mssql-dataflow-bigquery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/cloud-composer-mssql-dataflow-bigquery/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237236958,"owners_count":19277082,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","bigquery","cloud-composer","dataflow","microsoft-sql-server"],"created_at":"2024-09-24T18:56:35.152Z","updated_at":"2025-02-05T03:32:04.681Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cloud Composer Orchestrating Moving Data from Microsoft SQL Server to BigQuery\n## Google Cloud Composer Example\n\nThis repository contains an example of how to leverage Cloud Composer and Cloud Dataflow to move data from a Microsoft SQL Server to BigQuery. The diagrams below demonstrate the workflow pipeline.\n\n\n![Diagram Part One](images/diagrams.png)\n\n\nThe Pipeline Steps are as follows:\n\n1. 
A Cloud Composer DAG, either scheduled or manually triggered, connects to the configured Microsoft SQL Server and exports the selected data to Google Cloud Storage as a JSON file.\n\n2. A second Cloud Composer DAG is triggered by a Cloud Function once the JSON file has been written to the storage bucket.\n\n3. The second Cloud Composer DAG triggers a Dataflow batch job, which can perform transformations if needed and then writes the data to BigQuery.\n\n4. Included in both Cloud Composer DAGs is the ability to send email notifications.\n\nYou can:\n* Schedule the Cloud Composer DAG to export data as needed with date filters.\n* Perform transformations in Dataflow.\n* Get a notification on successful or failed jobs.\n\nRequirements:\n* You need a Microsoft SQL Server installed either in Google Cloud or elsewhere.\n\n## How to install\n\n1. [Install the Google Cloud SDK](https://cloud.google.com/sdk/install)\n\n2. Create an export storage bucket for **Microsoft SQL Server Exports**\n\n``` shell\ngsutil mb gs://[BUCKET_NAME]/\n```\n\n3. Create a Dataflow staging storage bucket\n\n``` shell\ngsutil mb gs://[BUCKET_NAME]/\n```\n\n4. Through the [Google Cloud Console](https://console.cloud.google.com) create a folder named **tmp** in the newly created bucket for the Dataflow staging files\n\n\n5. [Create a Cloud Composer Environment](https://cloud.google.com/composer/docs/how-to/managing/creating)\n* You need to use an image equal to or greater than: composer-1.10.6-airflow-1.10.6\n \n6. Create a BigQuery Dataset\n``` shell\nbq mk [YOUR_BIG_QUERY_DATABASE_NAME]\n```\n\n7. Enable the Cloud Dataflow API\n``` shell\ngcloud services enable dataflow.googleapis.com\n```\n\n8. Enable the Cloud Composer API\n``` shell\ngcloud services enable composer.googleapis.com\n```\n\n9. Enable the Cloud Functions API\n``` shell\ngcloud services enable cloudfunctions.googleapis.com\n```\n\n10. 
Grant blob signing permissions to the Cloud Functions service account\n```shell\ngcloud iam service-accounts add-iam-policy-binding \\\n[YOUR_PROJECT_ID]@appspot.gserviceaccount.com \\\n--member=serviceAccount:[YOUR_PROJECT_ID]@appspot.gserviceaccount.com \\\n--role=roles/iam.serviceAccountTokenCreator\n```\n\n11. Edit the index.js file\n* In the cloned repo, go to the “cloud-functions” directory, edit the index.js file, and change the variables listed below.\n\n* To get the value for your-iap-client-id, run the following:\n\n``` shell\npython get-client-id/get_client_id.py [PROJECT_ID] [GCP_REGION] [COMPOSER_ENVIRONMENT]\n```\n\n``` js\n  // The project that holds your function\n  const PROJECT_ID = 'your-project-id';\n  // Run python get-client-id/get_client_id.py [PROJECT_ID] [GCP_REGION] [COMPOSER_ENVIRONMENT] to get your client id\n  const CLIENT_ID = 'your-iap-client-id';\n  // This should be part of your webserver's URL:\n  // {tenant-project-id}.appspot.com\n  const WEBSERVER_ID = 'your-tenant-project-id';\n  // The name of the DAG you wish to trigger\n  const DAG_NAME = 'mssql_gcs_dataflow_bigquery_dag_2';\n```\n\n12. Deploy the Cloud Function\n* In the cloned repo, go to the “cloud-functions” directory and deploy the following Cloud Function.\n``` shell\ngcloud functions deploy triggerDag --region=us-central1 --runtime=nodejs8 --trigger-event=google.storage.object.finalize --trigger-resource=[YOUR_UPLOADED_EXPORT_STORAGE_BUCKET_NAME]\n```\n\n13. Deploy the Cloud Dataflow Pipeline\n* Update the fields object to match your table schema\n* In the Cloud Console go to the Composer Environments\n* Click on the DAGs Folder Icon\n* This will open a new window for the Bucket Details\n* Create a Folder called dataflow\n* Upload the cloud-dataflow/process_json.py file to the dataflow folder\n\n14. 
Create the following variables in the Airflow Web Server\n\n| Key | Val |\n| --- | ----------- |\n| bq_output_table | [DATASET.TABLE] |\n| email | [YOUR_EMAIL_ADDRESS] |\n| gcp_project | [YOUR_PROJECT_ID] |\n| gcp_temp_location | gs://[YOUR_DATAFLOW_STAGE_BUCKET]/tmp |\n| mssql_export_bucket | [YOUR_UPLOADED_EXPORT_STORAGE_BUCKET_NAME] |\n\n\n* For the [DATASET.TABLE] use the dataset name you created in step 6 and choose a name for the table. Cloud Dataflow will create the table for you on its first run.\n\n15. Create an Airflow connection\n* From the Airflow interface, go to Admin \u003e Connections\n* Edit the mssql_default connection\n* Change the details to match your Microsoft SQL Server\n\n16. In the Cloud Console go to the Composer Environments\n* In the PYPI Packages add pymssql, it should look like:\n\n![PYPI Packages](images/pypi-packages.png)\n\n17. Follow these instructions for [Configuring SendGrid email services](https://cloud.google.com/composer/docs/how-to/managing/creating#notification)\n\n18. Deploy the two Cloud Composer DAGs\n* Before uploading mssql_gcs_dataflow_bigquery_dag_1.py, edit line 51 to use your own SQL statement\n* Upload the two files below to the DAGs folder in Google Cloud Storage\n\n![Dags Folder](images/dags-folder.png)\n\n  * cloud-composer/mssql_gcs_dataflow_bigquery_dag_1.py\n  * cloud-composer/mssql_gcs_dataflow_bigquery_dag_2.py\n\n\n**This is not an officially supported Google product**","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fcloud-composer-mssql-dataflow-bigquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fcloud-composer-mssql-dataflow-bigquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fcloud-composer-mssql-dataflow-bigquery/lists"}