https://github.com/kristeligt-dagblad/dbt_ml

Package for dbt that allows users to train, audit and use BigQuery ML models.

Host: GitHub
URL: https://github.com/kristeligt-dagblad/dbt_ml
Owner: kristeligt-dagblad
License: apache-2.0
Created: 2020-06-05T14:30:47.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2024-02-01T16:05:55.000Z (5 months ago)
Last Synced: 2024-02-07T03:37:48.054Z (4 months ago)
Topics: bigquery-ml, dbt
Size: 32.2 KB
Stars: 55
Watchers: 2
Forks: 22
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-dbt - dbt_ml - Package for dbt that allows users to train, audit and use BigQuery ML models. (Packages)

README

## BigQuery ML models in dbt

Package for dbt that allows users to train, audit and use BigQuery ML models. The package implements a `model` materialization that trains a BigQuery ML model from a select statement and a set of parameters. In addition to the `model` materialization, the package includes a set of helper macros that assist with model auditing and prediction.

### Installation

To install the package, add it to the `packages.yml` file in your dbt project, then run `dbt deps`.

To use the model audit post-hook, set the following variables in your `dbt_project.yml` file:

| Variable              | Description                |
| --------------------- | -------------------------- |
| `dbt_ml:audit_schema` | Schema of the audit table. |
| `dbt_ml:audit_table`  | Name of the audit table.   |

You also need to specify the post-hook in your `dbt_project.yml` file^[1] as `{{ dbt_ml.model_audit() }}`. Optionally, use the `dbt_ml.create_model_audit_table()` macro, for example in an `on-run-start` hook, to create the audit table automatically if it does not exist.

Example config for `dbt_project.yml` below:

```yaml
vars:
  "dbt_ml:audit_schema": "audit"
  "dbt_ml:audit_table": "ml_models"

on-run-start:
  - '{% do adapter.create_schema(api.Relation.create(target.project, "audit")) %}'
  - "{{ dbt_ml.create_model_audit_table() }}"

models:
  your_project:  # placeholder: replace with the name of your dbt project
    ml:
      enabled: true
      schema: ml
      materialized: model
      post-hook: "{{ dbt_ml.model_audit() }}"
```

### Usage

To use the `model` materialization, create a `.sql` file with a select statement and set the materialization to `model`. Additionally, specify any BigQuery ML options in the `ml_config` key of the config dictionary.

```sql
# model.sql
{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'logistic_reg',
            'early_stop': true,
            'ls_init_learn_rate': 0.1,
            ...
        }
    )
}}

select * from your_input
```
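
For orientation, the statement below sketches the kind of BigQuery ML DDL that a `model` materialization with the config above corresponds to. It is an illustration, not the package's exact rendered SQL: the model path `my_project.ml.model` is hypothetical, and only the three options visible in the config are carried over.

```sql
-- Sketch only: `my_project.ml.model` is a hypothetical model path, and the
-- statement dbt_ml actually generates may differ in detail.
create or replace model `my_project.ml.model`
options (
    model_type = 'logistic_reg',
    early_stop = true,
    ls_init_learn_rate = 0.1
)
as
select * from your_input;
```

Each key in `ml_config` plays the role of one entry in the `options (...)` clause, and the select statement in the model file becomes the training query.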

> Note that the materialization should not be prefixed with `dbt_ml`, since dbt does not support namespaced materializations.

After training your model you can reference it in downstream dbt models using the included `predict` macro.

```sql
# downstream_model.sql
{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.predict(ref('model'), 'eval_data') }}
```
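
As a rough guide to what the `predict` macro stands for, the query below calls BigQuery ML's `ML.PREDICT` function directly. It is a sketch under assumptions: the model path and the `eval_input` table are hypothetical, and the SQL the macro actually renders may be shaped differently.

```sql
-- Sketch only: hypothetical model path and input table; the macro's real
-- output may differ.
with eval_data as (
    select * from `my_project.ml.eval_input`  -- hypothetical evaluation table
)

select *
from ml.predict(
    model `my_project.ml.model`,
    (select * from eval_data)
);
```

`ML.PREDICT` returns the input columns alongside the prediction columns, which is why the downstream model can simply `select *` from it.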

If you're using a BQML **matrix_factorization** model, you can use the `recommend` macro in the same way.

```sql
# downstream_model.sql
with predict_features as (
    ...
)

select * from {{ dbt_ml.recommend(ref('model'), 'predict_features') }}
```

For anomaly detection, the `detect_anomalies` macro wraps BigQuery ML's `ML.DETECT_ANOMALIES` function.

```sql
# detect_anomalies_model.sql
{{
    config(
        materialized='table'
    )
}}

with eval_data as (
    ...
)

select * from {{ dbt_ml.detect_anomalies(ref('model'), 'eval_data', threshold) }}
```
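
For comparison, this is roughly what a direct `ML.DETECT_ANOMALIES` call looks like in BigQuery. The README does not state which option the macro's `threshold` argument maps to, so the sketch assumes `contamination` (used by k-means, autoencoder and PCA models); time-series models take `anomaly_prob_threshold` instead. The model path, input table and the value `0.02` are all hypothetical.

```sql
-- Sketch only: assumes the threshold is passed as `contamination`.
-- Model path, input table and 0.02 are hypothetical placeholders.
with eval_data as (
    select * from `my_project.ml.eval_input`  -- hypothetical evaluation table
)

select *
from ml.detect_anomalies(
    model `my_project.ml.model`,
    struct(0.02 as contamination),
    (select * from eval_data)
);
```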

### Tuning hyperparameters

BigQuery ML supports tuning model hyperparameters^[2], as does `dbt_ml`. To specify which hyperparameters to tune and which parameter space to search, use the `dbt_ml.hparam_candidates` and `dbt_ml.hparam_range` macros, which map to the corresponding BigQuery ML methods.

The following example takes advantage of hyperparameter tuning:

```sql
{{
    config(
        materialized='model',
        ml_config={
            'model_type': 'dnn_classifier',
            'auto_class_weights': true,
            'learn_rate': dbt_ml.hparam_range(0.01, 0.1),
            'early_stop': false,
            'max_iterations': 50,
            'num_trials': 4,
            'optimizer': dbt_ml.hparam_candidates(['adam', 'sgd'])
        }
    )
}}

select * from your_input
```
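
In plain BigQuery ML, the tuning config above corresponds to a `CREATE MODEL` statement that uses `HPARAM_RANGE` and `HPARAM_CANDIDATES` in its options. The sketch below shows that shape; the model path and the `your_input` training query are placeholders, and the SQL that `dbt_ml` renders may differ in detail.

```sql
-- Sketch only: hypothetical model path and training query.
create or replace model `my_project.ml.model`
options (
    model_type = 'dnn_classifier',
    auto_class_weights = true,
    learn_rate = hparam_range(0.01, 0.1),           -- continuous search range
    early_stop = false,
    max_iterations = 50,
    num_trials = 4,                                 -- required for tuning
    optimizer = hparam_candidates(['adam', 'sgd'])  -- discrete candidates
)
as
select * from your_input;
```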

Note that when tuning hyperparameters, `num_trials` must be set to a positive integer; otherwise BigQuery returns an error.

### Overriding the package

To override or shim this package, define a [`dispatch` config](https://next.docs.getdbt.com/reference/project-configs/dispatch-config) in `dbt_project.yml` instead of a var named `dbt_ml_dispatch_list`, for instance:

```yaml
dispatch:
  - macro_namespace: dbt_ml
    search_order: ['my_project', 'dbt_ml']  # enable override
```

### Reservations

Some BigQuery ML models, e.g. Matrix Factorization, cannot be run using the on-demand pricing model. To train such models, set up a flex or regular reservation^[3] before running the model.

### Footnotes

[1] The post-hook has to be specified in `dbt_project.yml` instead of the actual model file because the relation is not available during parsing, so variables like `{{ this }}` are not templated properly.

[2] https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning

[3] https://cloud.google.com/bigquery/docs/reservations-tasks

### References

- [BigQuery ML Syntax and Options](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create)

- [BigQuery ML Pricing](https://cloud.google.com/bigquery-ml/pricing)