https://github.com/datasherlock/csql-copy-dataflow
- Host: GitHub
- URL: https://github.com/datasherlock/csql-copy-dataflow
- Owner: datasherlock
- Created: 2024-08-19T15:18:48.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-20T09:09:53.000Z (almost 2 years ago)
- Last Synced: 2025-05-19T04:11:37.948Z (about 1 year ago)
- Language: Python
# DataflowToCloudSQL
#### Developed By: Jerome Rajan, Staff Solutions Consultant, Google
## Overview
`DataflowToCloudSQL` is a Python-based project that integrates Apache Beam pipelines with Google Cloud SQL.
The pipeline reads data from a GCS bucket and writes the results to a GCP CloudSQL Postgres
database. The primary aim of this repo is to demonstrate connecting to CloudSQL from Dataflow:
- Using a private IP
- Using IAM authentication
- Without a CloudSQL Auth Proxy
The solution uses SQLAlchemy's bulk loading capabilities in conjunction with Beam's distributed framework.
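The three connection requirements above can be sketched as follows. This is a minimal sketch, not the code in `get_connection.py`: it assumes the Cloud SQL Python Connector (`cloud-sql-python-connector`) with the `pg8000` driver, and the function names `connect_kwargs` and `make_engine` are hypothetical. The `cfg` dictionary is assumed to hold the `[cloudsql]` values from `config.ini`.

```python
def connect_kwargs(cfg):
    """Build the keyword arguments passed to the connector for IAM authentication.

    With IAM authentication no password is supplied; for a service account the
    database user is its email without the ".gserviceaccount.com" suffix.
    """
    return {
        "user": cfg["user"],
        "db": cfg["database"],
        "enable_iam_auth": True,
    }


def make_engine(cfg):
    """Return a SQLAlchemy engine that connects over private IP without the Auth Proxy."""
    # Imported here so the helper above can be used without these packages installed.
    import sqlalchemy
    from google.cloud.sql.connector import Connector, IPTypes

    connector = Connector()

    def getconn():
        # cfg["instance"] is the "project:region:instance" connection name.
        return connector.connect(
            cfg["instance"],
            "pg8000",
            ip_type=IPTypes.PRIVATE,
            **connect_kwargs(cfg),
        )

    # The creator callable replaces a host/port URL, so no proxy endpoint is needed.
    return sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)
```

Private IP connectivity also requires that the Dataflow workers run in a network that can reach the instance's private address; that network setup is outside this sketch.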
## Project Structure
- **common/**: Contains common utility modules used across the project.
  - `config.ini`: Configuration file for storing database connection details and other dataflow configurations.
  - `get_connection.py`: Module for establishing a connection to Google Cloud SQL.
  - `Logger.py`: Module for logging throughout the application.
  - `parse_configs.py`: Handles the parsing of the configuration file (`config.ini`).
  - `utils.py`: Additional utility functions used across the project.
- **pipelines/**: Directory intended for storing Apache Beam pipeline definitions.
- **sinks/**: Directory intended for modules related to data sinks, such as CloudSQL.
- **sources/**: Directory intended for modules related to data sources.
  - `read_from_source.py`: Module for reading data from the specified source. This is a sample record generator to demonstrate the CloudSQL write capability.
- **Dockerfile**: Docker configuration for containerizing the application.
- **main.py**: The main entry point of the application that initializes and runs the Apache Beam pipeline.
- **requirements.txt**: Python dependencies required for the project.
- **setup.py**: Script for installing the project and its dependencies.
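To illustrate how a sink module might combine batching with SQLAlchemy bulk inserts, here is a minimal sketch. It is an assumption about the approach, not the repository's sink code: `BatchingWriter` and `sqlalchemy_flush` are hypothetical names, and the class only mirrors the `start_bundle` / `process` / `finish_bundle` lifecycle of a Beam `DoFn` in plain Python so it can run without Beam installed. In a real pipeline it would subclass `apache_beam.DoFn`, and the batch size would come from the `batch` setting in `config.ini`.

```python
class BatchingWriter:
    """Buffers rows and hands them to flush_fn in groups of batch_size."""

    def __init__(self, flush_fn, batch_size):
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self._flush()

    def finish_bundle(self):
        # Write whatever is left so no rows are dropped at the end of a bundle.
        if self._buffer:
            self._flush()

    def _flush(self):
        self._flush_fn(self._buffer)
        self._buffer = []


def sqlalchemy_flush(engine, table):
    """Return a flush function that bulk-inserts a list of dicts in one transaction."""

    def flush(rows):
        # Passing a list of dicts makes SQLAlchemy Core run an executemany insert.
        with engine.begin() as conn:
            conn.execute(table.insert(), rows)

    return flush


# Usage with an in-memory stand-in for the database write.
flushed = []
writer = BatchingWriter(lambda rows: flushed.append(list(rows)), batch_size=2)
writer.start_bundle()
for i in range(5):
    writer.process({"id": i})
writer.finish_bundle()
```

Flushing in `finish_bundle` keeps each bundle self-contained, which matters because Beam may retry a bundle; a retried bundle can therefore insert the same rows again unless the target table guards against duplicates.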
## Configuration
The application uses a `config.ini` file located in the `common/` directory for configuration. This file should contain necessary details such as database credentials, connection strings, and other configurable parameters.
Example `config.ini`:
```ini
[cloudsql]
instance=project:region:cloudsql-instance
database=postgres
schema=public
user=dataflow-sa@project_name.iam
password=''
batch=10000
[dataflow]
project=project_name
temp_location=""
staging_location=""
region=us-central1
service_account=dataflow-sa@project_name.iam.gserviceaccount.com
[source]
src_path=gs://bucket_name/df_big/output*
```
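A file in this shape can be read with Python's standard `configparser`. The sketch below is only an illustration of that; it is not the contents of `parse_configs.py`, and `load_cloudsql_settings` is a hypothetical name. Note that `configparser` keeps quote characters as part of the value, so `password=''` is read as two quote characters unless they are stripped.

```python
import configparser

SAMPLE = """\
[cloudsql]
instance=project:region:cloudsql-instance
database=postgres
schema=public
user=dataflow-sa@project_name.iam
password=''
batch=10000
"""


def load_cloudsql_settings(text):
    """Parse the [cloudsql] section into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["cloudsql"]
    return {
        "instance": section["instance"],
        "database": section["database"],
        "schema": section["schema"],
        "user": section["user"],
        # Strip surrounding quotes so '' becomes an empty string.
        "password": section["password"].strip("'\""),
        # getint converts the batch size to an integer.
        "batch": section.getint("batch"),
    }


settings = load_cloudsql_settings(SAMPLE)
```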
Copy `config.ini` to a GCS path:
```
gcloud storage cp common/config.ini gs://bucket_name/dataflowtocsql/config/config.ini
```
Build the container image with `gcloud builds submit`:
```
gcloud builds submit --tag us-central1-docker.pkg.dev/project_name/repo_name/dataflow/dataflow2csql:1.0
```
Run the job with the `DataflowRunner`, using the image tag built above:
```
python main.py \
  --config_path="gs://bucket_name/dataflowtocsql/config/config.ini" \
  --runner=DataflowRunner \
  --sdk_container_image="us-central1-docker.pkg.dev/project_name/repo_name/dataflow/dataflow2csql:1.0" \
  --setup_file=./setup.py
```
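The command mixes a project-specific flag (`--config_path`) with standard Beam pipeline options. A common way to separate the two, shown here as a sketch and not as the actual `main.py`, is `argparse.ArgumentParser.parse_known_args`; the function name `split_args` is hypothetical.

```python
import argparse


def split_args(argv):
    """Separate --config_path from the arguments meant for the Beam pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", required=True)
    # Unrecognised flags are returned untouched instead of raising an error.
    known, beam_args = parser.parse_known_args(argv)
    return known.config_path, beam_args


config_path, beam_args = split_args([
    "--config_path=gs://bucket_name/dataflowtocsql/config/config.ini",
    "--runner=DataflowRunner",
    "--setup_file=./setup.py",
])
# beam_args would then typically be passed to apache_beam's PipelineOptions.
```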