# dbt-incremental-stream
**This dbt package is for Snowflake ❄️ only.** **It reproduces dbt incremental materialization**, leveraging Snowflake [streams](https://docs.snowflake.com/en/user-guide/streams-intro) to:
* **Improve model performance 💨.** [Streams](https://docs.snowflake.com/en/user-guide/streams-intro) accelerate source table scans by focusing on new data only
* **Optimise FinOps 💰.** The Snowflake warehouse is started only if the [stream](https://docs.snowflake.com/en/user-guide/streams-intro) is not empty
* **Simplify code.** `is_incremental()` is generally not required, as [streams](https://docs.snowflake.com/en/user-guide/streams-intro) natively collect "new" rows

![](readme/stream.gif)
## Installation Instructions
New to dbt packages? Read more about them [here](https://docs.getdbt.com/docs/building-a-dbt-project/package-management/).
1. Include this package in your `packages.yml` (a sketch follows this list). An example can be found [here](/integration_tests/packages.yml) in the [working example project](#working-example)
2. Run `dbt deps`
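For reference, a minimal `packages.yml` entry might look like the sketch below, installing the package from git; the `revision` value is illustrative, so pin whatever tag or commit you have tested:

```
packages:
  - git: "https://github.com/arnoN7/dbt-incremental-stream"
    revision: master  # illustrative pin; prefer a tested tag or commit
```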
# Usage
Use the `incremental_stream` materialization like a [dbt incremental model](https://docs.getdbt.com/docs/build/incremental-models):
* Replace the `ref` macro with `stream_ref` to add a [stream](https://docs.snowflake.com/en/user-guide/streams-intro) on a dbt model
* Replace `source` with `stream_source` to add a [stream](https://docs.snowflake.com/en/user-guide/streams-intro) on a source
* `incr_stream.get_stream_metadata_columns()` must be included to retrieve the METADATA columns of each stream, so that the materialization can handle changes on input tables (`INSERT`, `UPDATE`, `DELETE`) accordingly. The metadata columns themselves (`METADATA$ACTION`, `METADATA$ISUPDATE`, `METADATA$ROW_ID`) will not be included in the model
* Like in a [dbt incremental model](https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional), the optional `unique_key` parameter is supported. dbt performs a `MERGE` when a `unique_key` is provided, and an `INSERT` otherwise
* Like in a [dbt incremental model](https://docs.getdbt.com/docs/build/incremental-models#how-do-i-rebuild-an-incremental-model), `--full-refresh` rebuilds the target model from the source tables
* An optional dbt [project variable](https://docs.getdbt.com/docs/build/project-variables) can be added to create streams at a specific timestamp using [Snowflake Time Travel with streams](https://docs.snowflake.com/en/sql-reference/sql/create-stream#using-time-travel-with-the-source-table). The example below shows how to force the recreation of a given model and its related streams (using the `--full-refresh` option) at a given timestamp using `--vars '{TIMESTAMP: 2024-04-13 00:00:00}'`. The date format is `yyyy-mm-dd hh24:mi:ss`
>```
>dbt run --select 'conso_client_source' --vars '{TIMESTAMP: 2024-04-13 00:00:00}' --full-refresh
>```

> [!NOTE]
> `{{ incr_stream.get_stream_metadata_columns() }}` should not be included if `*` is used to select all columns from the source, as in the example below:
> ```
>{{
>  config(
>    materialized='incremental_stream',
>    unique_key=['ID']
>  )
>}}
>
>select
>  *
>from {{ incr_stream.stream_source('src', 'src_table') }}
>```

## Basic examples using one table as input
### Example using a unique key and a ref
A test model [conso_client.sql](/integration_tests/models/dwh_ref/conso_client.sql) following this pattern can be found in the [integration_tests](/integration_tests/) sub dbt project included in the package.
```
{{
  config(
    materialized='incremental_stream',
    unique_key=['column1']
  )
}}

select
  column1 as c_1,
  ...,
  columnN as c_n
  {{ incr_stream.get_stream_metadata_columns() }}
from {{ incr_stream.stream_ref('input_model') }}
```
### Example using a unique key and a source
A test model [conso_client_source.sql](/integration_tests/models/dwh_source/conso_client_source.sql) following this pattern can be found in the [integration_tests](/integration_tests/) sub dbt project included in the package.
```
{{
  config(
    materialized='incremental_stream',
    unique_key=['column1']
  )
}}

select
  column1 as c_1,
  ...,
  columnN as c_n
  {{ incr_stream.get_stream_metadata_columns() }}
from {{ incr_stream.stream_source('src', 'src_table') }}
```
### Example not using a unique key
An example can be found in this test model: [conso_client_insert](/integration_tests/models/dwh_insert/conso_client_insert.sql)
### Example performing UNION on multiple tables
A test model [conso_client_multiple_streams.sql](/integration_tests/models/dwh_multiple_streams/conso_client_multiple_streams.sql) implementing this pattern can be found in the [integration_tests](/integration_tests/) sub dbt project included in the package. A sketch of the pattern follows the lineage image below.

![lineage_streams](/readme/multiple_streams.png)
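As a rough illustration only (the actual model lives in the linked file), the pattern unions one stream per input; this sketch assumes the two integration-test models `add_clients` and `add_clients_` and illustrative column names:

```
{{
  config(
    materialized='incremental_stream',
    unique_key=['ID']
  )
}}

select
  ID,
  FIRST_NAME,
  LAST_NAME,
  LOADED_AT
  {{ incr_stream.get_stream_metadata_columns() }}
from {{ incr_stream.stream_ref('add_clients') }}

union all

select
  ID,
  FIRST_NAME,
  LAST_NAME,
  LOADED_AT
  {{ incr_stream.get_stream_metadata_columns() }}
from {{ incr_stream.stream_ref('add_clients_') }}
```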
# Macros
The `stream_ref` and `stream_source` macros include the created [streams](https://docs.snowflake.com/en/user-guide/streams-intro) in the compiled SQL code of the model:
* [Streams](https://docs.snowflake.com/en/user-guide/streams-intro) are created in `{{this}}.database` and `{{this}}.schema`
* [Stream](https://docs.snowflake.com/en/user-guide/streams-intro) names follow this naming convention: `S_{{this.name}}_{{source_table_name}}` (illustrated after the table below)
* `get_stream_metadata_columns` is used to easily retrieve the metadata columns from the stream (`METADATA$ACTION`, `METADATA$ISUPDATE`, `METADATA$ROW_ID`)

> [!NOTE]
> When using the `--full-refresh` flag, the macros return `ref` and `source` (not streams) to perform a complete rebuild.

| macro | description |
|-------|-------------|
| `stream_ref` ([source](macros/stream_source.sql)) | Replaces `ref` with the stream name (unless the `--full-refresh` flag is set) |
| `stream_source` ([source](macros/stream_source.sql)) | Replaces `source` with the stream name (unless the `--full-refresh` flag is set) |
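To illustrate the naming convention with a hypothetical example (not output copied from the package), a model named `conso_client` built in `MY_DB.MY_SCHEMA` that calls `stream_ref('add_clients')` would compile to a query reading from the stream rather than the table:

```
-- Hypothetical compiled SQL for a model named CONSO_CLIENT in MY_DB.MY_SCHEMA:
-- stream_ref('add_clients') resolves to the stream S_CONSO_CLIENT_ADD_CLIENTS,
-- created alongside the model per the convention S_{{this.name}}_{{source_table_name}}
select
  ID,
  FIRST_NAME,
  LAST_NAME,
  LOADED_AT
from MY_DB.MY_SCHEMA.S_CONSO_CLIENT_ADD_CLIENTS
```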
# Materialization
The `incremental_stream` materialization performs the following tasks:
1. Create `{{this}}` table `IF NOT EXISTS`
2. Create required [stream](https://docs.snowflake.com/en/user-guide/streams-intro) `IF NOT EXISTS`
3. `CREATE OR REPLACE` a view to perform the transformation developed in the model
4. If the [stream](https://docs.snowflake.com/en/user-guide/streams-intro) is empty, the **materialization stops**, preventing the Snowflake warehouse from resuming (see the sketch after this list)
5. `MERGE` the view into the `{{this}}` table based on the `unique_key` provided
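A hedged sketch of steps 2 and 4 in plain Snowflake SQL; the statements the package actually generates may differ in detail, and the object names here are illustrative:

```
-- Step 2 (sketch): create the stream on the input table if it does not exist.
-- With the optional TIMESTAMP project variable, the stream can be created at a
-- point in the past via Time Travel.
create stream if not exists MY_DB.MY_SCHEMA.S_CONSO_CLIENT_ADD_CLIENTS
  on table MY_DB.STG.ADD_CLIENTS
  at (timestamp => '2024-04-13 00:00:00'::timestamp_ntz);

-- Step 4 (sketch): Snowflake's built-in SYSTEM$STREAM_HAS_DATA returns TRUE only
-- when the stream holds change records, so the MERGE (and the warehouse resume
-- it would trigger) can be skipped when it returns FALSE.
select SYSTEM$STREAM_HAS_DATA('MY_DB.MY_SCHEMA.S_CONSO_CLIENT_ADD_CLIENTS');
```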
# Working Example
[integration_tests](/integration_tests/) contains a working example with **models and a test**:
* To validate that the `incremental_stream` and `incremental` materializations produce the same output
* To run performance tests

| model | description |
|-------|-------------|
| [add_clients](#add_clients-model) ([source](/integration_tests/models/stg/add_clients.py)) and [add_clients_](#add_clients-model) ([source](/integration_tests/models/stg/add_clients_.py)) | Python 🐍 incremental models adding new random clients, to simulate a streaming source like a Kafka topic |
| [conso_client](#conso_client-model) ([source](/integration_tests/models/dwh/conso_client.sql)) | `incremental_stream` model de-duplicating clients on `ID` |
| [conso_client_incr](#conso_client-model) ([source](/integration_tests/models/dwh/conso_client_incr.sql)) | standard dbt `incremental` model to compare results and performance |

And there is more to explore in the [integration_tests](/integration_tests/) sub dbt project.
![lineage](/readme/lineage.png)
```
# 1. Add 30 new random clients to ADD_CLIENTS table
# 2. Merge it in CONSO_CLIENT and CONSO_CLIENT_INCR
# 3. Test CONSO_CLIENT and CONSO_CLIENT_INCR equals
cd ./integration_tests
dbt build
```
## add_clients model
A Python 🐍 incremental model:
1. Creates an `ADD_CLIENTS` table if it does not exist in the `STG` schema
2. Adds **random clients** using the [Faker](https://faker.readthedocs.io/en/master/) library

**ADD_CLIENTS Table**
| ID | FIRST_NAME | LAST_NAME | BIRTHDATE | LOADED_AT |
|----|------------|-----------|------------|---------------------|
| 1 | Amy | Patton | 1985-10-08 |2023-04-28 15:01:10|
| 2 | Eric | Ferguson | 1973-09-15 |2023-04-28 15:01:10|
| 3 | Amy | Weston | 1985-10-08 |2023-04-28 15:01:10|

> [!NOTE]
> `nb_clients` [variable](https://docs.getdbt.com/docs/using-variables) defines the number of random clients to add (default: 30)

**Sample commands**
```
# Add 30 random clients in add_clients model (create table if needed)
dbt run --select add_clients

# Add 100 random clients in the add_clients model (create table if needed)
dbt run --select add_clients --vars '{nb_clients: 100}'

# Rebuild the ADD_CLIENTS table with 30 new random clients
dbt run --select add_clients --full-refresh
```
## conso_client model
A sample model leveraging the `incremental_stream` custom materialization:
1. Collects the latest data ingested in [add_clients](#add_clients-model)
2. De-duplicates it on `ID`, keeping the most recent `LOADED_AT`, from the [add_clients](#add_clients-model) stream or table
3. `MERGE`s the data into the `CONSO_CLIENT` table with `ID` as the unique key (a sketch of the pattern follows this list)
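A hedged sketch of this de-duplication pattern, assuming the `ADD_CLIENTS` columns shown above; the actual model lives in [conso_client.sql](/integration_tests/models/dwh/conso_client.sql):

```
{{
  config(
    materialized='incremental_stream',
    unique_key=['ID']
  )
}}

select
  ID,
  FIRST_NAME,
  LAST_NAME,
  BIRTHDATE,
  LOADED_AT
  {{ incr_stream.get_stream_metadata_columns() }}
from {{ incr_stream.stream_ref('add_clients') }}
-- keep only the most recent row per ID
qualify row_number() over (partition by ID order by LOADED_AT desc) = 1
```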
**Sample commands**
```
# Merge last clients inserted in add_clients model
dbt run --select conso_client

# Drop the CONSO_CLIENT table and rebuild it based on the ADD_CLIENTS table
dbt run --select conso_client --full-refresh
```
## Limitations
Current known limitations:
* When a `stream_ref` or `stream_source` table is dropped and recreated, the [stream](https://docs.snowflake.com/en/user-guide/streams-intro) must be dropped as well. Otherwise you will observe the following error (a possible fix is sketched after the error below)
```
091901 (01000): Base table 'DB_NAME.SCH_NAME.XXX_TABLE' dropped, cannot read from stream 'S_YYY_XXX_TABLE' in line 1 position 21
```
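A possible fix is to drop the stale stream so that the next run recreates it; the name below is the placeholder one from the error message above:

```
-- The stream lives in the model's database and schema (see the Macros section);
-- replace the placeholder name with the one reported in the error.
drop stream if exists S_YYY_XXX_TABLE;
```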
* The `on_schema_change` parameter might work under certain conditions but is not currently supported

## Credits
Thanks to [jeremiahhansen ❄️](https://github.com/jeremiahhansen) for the inspiring [implementation](https://github.com/jeremiahhansen/snowflake-helper-dbt) (from 2020) of streams & tasks on Snowflake.