https://github.com/takegue/bqmake
BigQuery Powered Data Build Suite.
- Host: GitHub
- URL: https://github.com/takegue/bqmake
- Owner: takegue
- License: mit
- Created: 2022-08-14T10:22:01.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-01T13:57:41.000Z (over 1 year ago)
- Last Synced: 2025-02-02T08:31:47.511Z (about 1 year ago)
- Topics: bigquery, sql
- Homepage:
- Size: 564 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
README
bqmake: BigQuery Powered Data Build Suite.
===
`bqmake` provides BigQuery routines that help you build typical data models.\
All routines are designed to be idempotent and include a smart update mechanism that refreshes data only when needed.\
This frees you from awkward DAG workflow management.
The tool provides the following utilities.
- **Dynamic whole/partial Data Refresh for BigQuery Table**:\
Like a materialized view, `bqmake.v0.partition_table__update` automatically checks freshness and then updates data only if needed.\
This is useful for building pre-computed tables that back frequent or expensive queries.\
See the [Refreshing Partition Table Data](#refreshing-partition-table-data) section for more details.
- **Data Snapshot Utilities**:\
A table snapshot captures data changes and stores them in Slowly Changing Dimension Type 2 (SCD2) format.
You can recover the table state at any point in time at which you took a snapshot.
`bqmake.v0.snapshot_table__init` and `bqmake.v0.snapshot_table__update` take advantage of BigQuery partitioning and clustering
to reduce the amount of data processed and the slots consumed.
- **Metadata Utilities**:\
Prepares useful metadata for tables.
* Embeds intra-dataset data lineage into the dataset description in mermaid.js format.
* Labels tables with their available partition information.
Currently this is in public beta and all routines are subject to change without notice.
Please send us your comments and suggestions via an issue!
## Get Started
All utilities are **BigQuery Routines (UDF or PROCEDURE)** published in the `bqmake.v0` dataset.\
You can use them without any installation.
### Refreshing Partition Table Data
`bqmake.v0.partition_table__update` keeps a derived table fresh within a specified partition range.
It dynamically analyzes the partitions of the derived table and of the tables it references, and updates data only if needed.
Combined with a [scheduled query](https://cloud.google.com/bigquery/docs/scheduling-queries?hl=ja), the procedure behaves almost like a materialized view.
Compared with a materialized view, you get some extra advantages:
* No restrictions on query syntax.
* You get a plain BigQuery table, with useful features such as Preview in the BigQuery console, BI Engine support, and so on.
```sql
declare query string;
-- Prepare dataset and table
create schema if not exists `zsandbox`;
create or replace table `zsandbox.ga4_count`(event_date date, event_name string, records int64)
partition by event_date;
-- Prepare data generation query parameterized by @begin and @end (DATE type)
set query = """
select date(timestamp_micros(event_timestamp)) as event_date, event_name, count(1) as records
from `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
where parse_date('%Y%m%d', _TABLE_SUFFIX) between @begin and @end
group by event_date, event_name
""";
-- First call procedure to update data
call `bqmake.v0.partition_table__update`(
(null, 'zsandbox', 'ga4_count')
, [('bigquery-public-data', 'ga4_obfuscated_sample_ecommerce', 'events_*')]
, `bqmake.v0.alignment_day2day`('2021-01-01', '2021-01-01')
, query
, null
);
--> Affects 16 rows
-- The second call won't update partition data because the 2021-01-01 partition is still fresh.
call `bqmake.v0.partition_table__update`(
(null, 'zsandbox', 'ga4_count')
, [('bigquery-public-data', 'ga4_obfuscated_sample_ecommerce', 'events_*')]
, `bqmake.v0.alignment_day2day`('2021-01-01', '2021-01-01')
, query
, null
);
--> No rows affected
```
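One way to confirm which partitions were actually rewritten is to look at partition metadata after each call. The query below is a minimal sketch that uses BigQuery's standard `INFORMATION_SCHEMA.PARTITIONS` view against the `zsandbox.ga4_count` table from the example above. It is not part of `bqmake`, and it does not claim to reproduce the procedure's internal freshness check.

```sql
-- Inspect partition-level metadata of the derived table.
-- Assumes the `zsandbox.ga4_count` table created in the example above.
select
  partition_id
  , total_rows
  , last_modified_time
from `zsandbox.INFORMATION_SCHEMA.PARTITIONS`
where table_name = 'ga4_count'
order by partition_id;
```

If `last_modified_time` for a partition is unchanged between two calls, the second call left that partition untouched.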
### Snapshot Table
```sql
declare query string;
set query = "select * from `bigquery-public-data.austin_bikeshare.bikeshare_stations`";
-- Initialize the snapshot table
call `bqmake.v0.snapshot_table__init`(
  (null, 'zsandbox', 'ga4_count')
  , (
    'station_id'
    , query
    , current_timestamp()
  )
  , null
);
-- Take another snapshot after some modification
call `bqmake.v0.snapshot_table__update`(
  (null, 'zsandbox', 'ga4_count')
  , null
  , (
    'station_id'
    -- This example changes some records on purpose
    , 'select * replace(if(station_id in (2499), "closed", status) as status) from `bigquery-public-data.austin_bikeshare.bikeshare_stations`'
    , current_timestamp()
  )
  , to_json(struct(
    -- The demo intentionally disables the staleness check.
    current_timestamp() as force_expired_at
  ))
);
```
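To illustrate how a Slowly Changing Dimension Type 2 table lets you recover state at a point in time, here is a self-contained sketch over inline sample rows. The `valid_from` and `valid_to` column names and the sample values are assumptions made for illustration; the columns that `bqmake` actually writes may differ, so check the schema of your snapshot table before adapting this.

```sql
-- Hypothetical SCD Type 2 rows: each row is valid during [valid_from, valid_to).
-- A null valid_to marks the current version of the record.
with snapshot as (
  select * from unnest([
    struct(
      2499 as station_id
      , 'active' as status
      , timestamp '2024-01-01 00:00:00+00' as valid_from
      , timestamp '2024-02-01 00:00:00+00' as valid_to
    )
    , struct(2499, 'closed', timestamp '2024-02-01 00:00:00+00', cast(null as timestamp))
    , struct(2500, 'active', timestamp '2024-01-01 00:00:00+00', cast(null as timestamp))
  ])
)
-- Recover the table state as of 2024-01-15.
select station_id, status
from snapshot
where valid_from <= timestamp '2024-01-15 00:00:00+00'
  and (valid_to is null or timestamp '2024-01-15 00:00:00+00' < valid_to)
order by station_id;
```

With these sample rows, the query returns station 2499 as `active`, because its `closed` version only becomes valid on 2024-02-01.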
### Metadata Updates
#### Labeling partition tables on Dataset
`bqmake.v0.dataset__update_table_labels` sets useful labels on partitioned tables.
- `partition-min`: Oldest partition_id
- `partition-max`: Latest partition_id
- `partition-skip`: Skipped partition count
```sql
call `bqmake.v0.dataset__update_table_labels`(('your_project', 'your_dataset'));
```
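After the procedure runs, the labels can be read back with BigQuery's standard `INFORMATION_SCHEMA.TABLE_OPTIONS` view. The following is a sketch only: `your_project` and `your_dataset` are the same placeholders as above, and the exact format of the label values is not specified here.

```sql
-- List the labels currently attached to each table in the dataset.
select
  table_name
  , option_value as labels
from `your_project.your_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
where option_name = 'labels'
order by table_name;
```

Tables without any labels have no `labels` row in this view, so they do not appear in the result.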
#### Generating Intra-Dataset Lineage on Dataset
`bqmake.v0.dataset__update_description` generates a dataset description with intra-dataset lineage in [mermaid.js](https://mermaid-js.github.io/mermaid/#/) representation.
```sql
call `bqmake.v0.dataset__update_description`(('your_project', 'your_dataset'));
```
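To check the generated description without opening the console, you can query BigQuery's standard `INFORMATION_SCHEMA.SCHEMATA_OPTIONS` view. This sketch assumes the dataset lives in the `US` multi-region; replace `region-us` with your dataset's location.

```sql
-- Read back the dataset description, which should contain the lineage diagram.
-- Assumes the dataset is located in the US multi-region.
select
  schema_name
  , option_value as description
from `your_project`.`region-us`.INFORMATION_SCHEMA.SCHEMATA_OPTIONS
where schema_name = 'your_dataset'
  and option_name = 'description';
```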