https://github.com/kristeligt-dagblad/dbt_ml
Package for dbt that allows users to train, audit and use BigQuery ML models.
- Host: GitHub
- URL: https://github.com/kristeligt-dagblad/dbt_ml
- Owner: kristeligt-dagblad
- License: apache-2.0
- Created: 2020-06-05T14:30:47.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-18T14:36:52.000Z (5 months ago)
- Last Synced: 2024-06-18T18:00:01.226Z (5 months ago)
- Topics: bigquery-ml, dbt
- Homepage:
- Size: 38.1 KB
- Stars: 58
- Watchers: 2
- Forks: 24
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-dbt - dbt_ml - Package for dbt that allows users to train, audit and use BigQuery ML models. (Packages)
README
## BigQuery ML models in dbt
Package for dbt that allows users to train, audit and use BigQuery ML models. The package implements a `model` materialization that trains a BigQuery ML model from a select statement and a set of parameters. In addition to the `model` materialization, it includes a set of helper macros that assist with model audit and prediction.
### Installation
To install the package, add the package path to the `packages.yml` file in your dbt project.
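A minimal sketch of such a `packages.yml` entry is shown here. It assumes installation directly from the GitHub repository and tracks the repository's default branch (`master`); the `git` and `revision` keys are standard dbt package options, but the choice of revision is an assumption, and pinning a specific tag or commit is usually preferable for reproducible builds.

```yaml
# packages.yml
packages:
  # Assumption: install from the GitHub repository rather than a package hub.
  - git: "https://github.com/kristeligt-dagblad/dbt_ml.git"
    # Assumption: track the default branch; pin a tag or commit hash instead
    # if you need reproducible builds.
    revision: master
```

After editing the file, run `dbt deps` to download the package.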
In order to use the model audit post-hook the following variables have to be set in your `dbt_project.yml` file.
| Variable | Description |
| --------------------- | -------------------------- |
| `dbt_ml:audit_schema` | Schema of the audit table. |
| `dbt_ml:audit_table` | Name of the audit table. |

You will also need to specify the post-hook in your `dbt_project.yml` file[1] as `{{ dbt_ml.model_audit() }}`. Optionally, you can use the `dbt_ml.create_model_audit_table()` macro to create the audit table automatically if it does not exist, for example in an on-run-start hook.
Example config for `dbt_project.yml` below:
```yaml
vars:
  "dbt_ml:audit_schema": "audit"
  "dbt_ml:audit_table": "ml_models"

on-run-start:
  - '{% do adapter.create_schema(api.Relation.create(target.project, "audit")) %}'
  - "{{ dbt_ml.create_model_audit_table() }}"

models:
  your_project_name:  # placeholder: replace with the name of your dbt project
    ml:
      enabled: true
      schema: ml
      materialized: model
      post-hook: "{{ dbt_ml.model_audit() }}"
```

### Usage
To use the `model` materialization, create a `.sql` file with a select statement and set the materialization to `model`. Additionally, specify any BigQuery ML options in the `ml_config` key of the config dictionary.
```sql
# model.sql

{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'logistic_reg',
            'early_stop': true,
            'ls_init_learn_rate': 0.1,
            ...
        }
    )
}}

select * from your_input
```

> Note that the materialization should not be prefixed with `dbt_ml`, since dbt does not support namespaced materializations.
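For orientation, the `model` materialization turns the config and select statement above into a BigQuery ML `CREATE MODEL` statement. The following is a rough sketch of the kind of SQL that results, not the package's exact output. The project, dataset, and model names are placeholders, and it assumes `your_input` contains a column named `label`, which BigQuery ML treats as the label column by default.

```sql
-- Sketch only: approximate shape of the statement sent to BigQuery.
-- `your-project`, `ml`, `model`, and `your_dataset` are placeholder names.
create or replace model `your-project.ml.model`
options (
    model_type = 'logistic_reg',
    early_stop = true,
    ls_init_learn_rate = 0.1
    -- any further keys from ml_config would appear here as extra options
)
as
select * from `your-project.your_dataset.your_input`;
```

Each key in `ml_config` corresponds to one entry in the `OPTIONS` clause, so the BigQuery ML syntax reference listed under References is the place to check which options a given model type accepts.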
After training your model you can reference it in downstream dbt models using the included `predict` macro.
```sql
# downstream_model.sql

{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.predict(ref('model'), 'eval_data') }}
```

If you're using a BQML **matrix_factorization** model, you can use the `recommend` macro in the same way.
```sql
# downstream_model.sql

with predict_features as (
    ...
)

select * from {{ dbt_ml.recommend(ref('model'), 'predict_features') }}
```

The ML.DETECT_ANOMALIES function provides anomaly detection for BigQuery ML.
```sql
# detect_anomalies_model.sql

{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.detect_anomalies(ref('model'), 'eval_data', threshold) }}
```

If using a forecasting model, you can use the `forecast` macro in the same way. Here we are forecasting 30 units ahead with 80% confidence.
```sql
# forecast_model.sql

select * from {{ dbt_ml.forecast(ref('model'), 30, 0.8) }}
```

### Tuning hyperparameters
BigQuery ML supports tuning model hyperparameters[2], as does `dbt_ml`. To specify which hyperparameters to tune and which parameter space to search, use the `dbt_ml.hparam_candidates` and `dbt_ml.hparam_range` macros, which map to the corresponding BigQuery ML methods.

The following example takes advantage of hyperparameter tuning:
```sql
{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'dnn_classifier',
            'auto_class_weights': true,
            'learn_rate': dbt_ml.hparam_range(0.01, 0.1),
            'early_stop': false,
            'max_iterations': 50,
            'num_trials': 4,
            'optimizer': dbt_ml.hparam_candidates(['adam', 'sgd'])
        }
    )
}}
```
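In BigQuery ML terms, the two macros correspond to the `HPARAM_RANGE` and `HPARAM_CANDIDATES` expressions inside the `OPTIONS` clause. The sketch below shows roughly what the configuration above asks BigQuery to run, followed by a query that lists the resulting trials. The model and table names are placeholders, the exact SQL rendered by the package may differ, and the training query is assumed to expose a `label` column.

```sql
-- Sketch only: approximate BigQuery ML statement for the config above.
-- `your-project.ml.tuned_model` and the training table are placeholder names.
create or replace model `your-project.ml.tuned_model`
options (
    model_type = 'dnn_classifier',
    auto_class_weights = true,
    learn_rate = hparam_range(0.01, 0.1),           -- continuous search range
    early_stop = false,
    max_iterations = 50,
    num_trials = 4,                                 -- required for tuning
    optimizer = hparam_candidates(['adam', 'sgd'])  -- discrete candidates
)
as
select * from `your-project.your_dataset.training_rows`;

-- Once training has finished, inspect the hyperparameters and
-- evaluation results of each trial.
select *
from ml.trial_info(model `your-project.ml.tuned_model`);
```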
It is worth noting that one must set the `num_trials` parameter to a positive integer, otherwise BigQuery will return an error.

### Overriding the package
If a user wishes to override/shim this package, instead of defining a var named `dbt_ml_dispatch_list`, they should now define [a config](https://next.docs.getdbt.com/reference/project-configs/dispatch-config) in `dbt_project.yml`, for instance:

```yaml
dispatch:
  - macro_namespace: dbt_ml
    search_order: ['my_project', 'dbt_ml']  # enable override
```

### Reservations
Some BigQuery ML models, e.g. Matrix Factorization, cannot be run using the on-demand pricing model. In order to train such models, please set up a flex or regular reservation[3] prior to running the model.

### Footnotes
[1] The post-hook has to be specified in `dbt_project.yml` rather than in the model file itself, because the relation is not available during parsing, so variables like `{{ this }}` are not properly templated.
[2] https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning
[3] https://cloud.google.com/bigquery/docs/reservations-tasks
### References
- [BigQuery ML Syntax and Options](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create)
- [BigQuery ML Pricing](https://cloud.google.com/bigquery-ml/pricing)