Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jolares/example-gcp-dataform
Example end-to-end ELT data pipeline using GCP Dataform.
https://github.com/jolares/example-gcp-dataform
bigquery dataform etl-pipeline
Last synced: 2 months ago
JSON representation
Example end-to-end ELT data pipeline using GCP Dataform.
- Host: GitHub
- URL: https://github.com/jolares/example-gcp-dataform
- Owner: jolares
- License: mit
- Created: 2022-10-22T04:57:34.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-09T23:35:39.000Z (over 1 year ago)
- Last Synced: 2024-10-15T03:28:03.393Z (4 months ago)
- Topics: bigquery, dataform, etl-pipeline
- Language: JavaScript
- Homepage:
- Size: 128 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# stackoverflow-data-reports
- `data-reports-etl`: A project built with BigQuery Dataform that transforms Stackoverflow
public raw data into reporting tables in a BigQuery data warehouse.- `ETL`: A project built with BigQuery Dataform that transforms Stackoverflow
public raw data into reporting tables in a BigQuery data warehouse.## Local Development Environment Setup
- Clone the project workspace repository:
`git clone https://github.com/jolares/stackoverflow-ai.git`- Install project workspace node dependencies:
`npm install`#### (Optional) Install Recommended VSCode Extensions
- `dataform.dataform`: provides syntax highlighting, compilation, and intellisense
for Dataform and SQLX projects. Refer to the [extension site](https://marketplace.visualstudio.com/items?itemName=dataform.dataform)---
## GCP Project Configuration
### Setup Secrets
#### Create a Secret Token with Dataform Service Account Permissions
A secret token is created for the GCP _Dataform Service Account_ to interact with
Dataform resources.- Enable GCP Secret Manager API for the GCP Project
- Open the GCP Secret Manager console and create a new secret token with any
secure value of your preference.
- This project named the secret token `GCP_BIGQUERY_DATAFORM_SA_TOKEN`
- After the secret is created, edit the secret's _permissions_ and
_grant access_ to the Dataform service account; for this, make the service
account a new principal for the secret, and assign to it the role
`Secret Manager Secret Accessor`A secret token is created for the GCP _Dataform Service Account_ to interact with
Dataform resources.#### Assign BigQuery Permissions to the Dataform Service Account
- Open the [Google Cloud AIM Admin console](https://console.cloud.google.com/iam-admin)
- Assign the role of `BigQuery Admin` to the dataform service account
- Edit the _Dataform Service Account_ created by Google by adding the _Role_
`BigQuery Admin` to it.
> Note: if you do not see the service account in the list of principals displayed
> to you within the Permissions page, you probably need to enable/check the
> option that indicates `Include Google-provided role grants`---
## Project Folder Structure
```
stackoverflow-ai/
├── ...
├── definitions/
├ ├── reporting/
├ ├── sources/
├ ├── testing/
├── environments.json
├── schedules.json
└── package.json
```### Dataform Dependencies
### Dataform Configuration File
```json
// dataform.json file
{
// (Required) Set this value to the GCP BigQuery dataset name (the Dataset ID without
// the GCP Project ID subdomain)
"defaultSchema": "{GCP_BIGQUERY_DATASET_NAME}",
// (Required)
"assertionSchema": "dataform_assertions",
// (Required)
"warehouse": "bigquery",
// (Required) Set this value to the GCP Project ID
"defaultDatabase": "{GCP_PROJECT_ID}"
// (Optional) Set this value the BigQuery Dataset Location (i.e. us-central-1, US)
"defaultLocation": "US"
}
```### Dataform Environments File
```json
{
"environments": [
{
"name": "production",
"configOverride": {},
// The git repository branch, or commit SHA, that triggers the workflow
// run using this environment (i.e. master, main, release, develop, etc)
"gitRef": "master"
},
{
"name": "development",
"configOverride": {},
"gitRef": "master"
},
{
"name": "testing",
"configOverride": {},
"gitRef": "master"
},// ... Other environments can be added here
]
}
```### Dataform Schedules File
```json
{
"schedules": [
{
"name": "daily",
"options": {
"includeDependencies": false,
"fullRefresh": false,
"tags": [
"daily"
]
},
"cron": "00 09 * * *",
"notification": {
"onSuccess": false,
"onFailure": false
},
"notifications": [
{
"events": [
"failure"
],
"channels": [
"email jo"
]
}
]
}
],
"notificationChannels": [
{
"name": "email jo",
"email": {
"to": [
"[email protected]"
]
}
}
]
}
```