https://github.com/snowplow/data-models

⚠️ MAINTENANCE-ONLY MODE: Snowplow maintained SQL data models for working with Snowplow web and mobile behavioral data.
bigquery redshift snowflake snowplow sql

# MAINTENANCE-ONLY MODE

## ⚠️ For any new developments we highly recommend using our [dbt packages](https://docs.snowplow.io/docs/modeling-your-data/modeling-your-data-with-dbt/) instead of these data models. SQL-runner and the associated data models are no longer under active development and will only receive bug fixes in the future. Our dbt packages have all the same features as these SQL-runner data models, plus many additional features and a wider range of packages. They also support Redshift, BigQuery, Snowflake, Databricks, and Postgres.

[![maintained]][tracker-classificiation] [![License][license-image]][license]

![snowplow-logo](media/snowplow_logo.png)

Snowplow is a scalable open-source platform for rich, high-quality, low-latency data collection. It is designed to collect high-quality, complete behavioral data for enterprise businesses.

# Snowplow Pipeline Overview

![snowplow-pipeline](media/snowplow_architecture.png)

The [Snowplow trackers][tracker-docs] enable highly customizable collection of raw, unopinionated event data. The pipeline validates these events against a JSON Schema, to guarantee a high-quality dataset, and adds information via both standard and custom enrichments.

This data is then made available in-stream for real-time processing, and can also be loaded to blob storage and a data warehouse for analysis.

The Snowplow atomic data acts as an immutable log of all the actions that occurred across your digital products. The data model takes that data and transforms it into a set of derived tables optimized for analysis. [Visit our documentation site][docs-what-is-dm] for further explanation on the data modeling process.
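
As a rough illustration of what "derived tables" means here, the sketch below rolls a handful of event rows up into one row per session. It is written in Python rather than SQL and is not how these models are implemented. The sample rows are made up, only a few atomic columns are shown (`event_id`, `domain_sessionid`, `event_name`, `collector_tstamp`), and the output column names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows; a real atomic events table has many more columns.
EVENTS = [
    {"event_id": "e1", "domain_sessionid": "s1", "event_name": "page_view",
     "collector_tstamp": datetime(2021, 1, 1, 10, 0, 0)},
    {"event_id": "e2", "domain_sessionid": "s1", "event_name": "page_ping",
     "collector_tstamp": datetime(2021, 1, 1, 10, 0, 30)},
    {"event_id": "e3", "domain_sessionid": "s1", "event_name": "page_view",
     "collector_tstamp": datetime(2021, 1, 1, 10, 2, 0)},
    {"event_id": "e4", "domain_sessionid": "s2", "event_name": "page_view",
     "collector_tstamp": datetime(2021, 1, 1, 11, 0, 0)},
]


def derive_sessions(events):
    """Aggregate event rows into one summary row per session."""
    by_session = defaultdict(list)
    for event in events:
        by_session[event["domain_sessionid"]].append(event)

    sessions = []
    for session_id, rows in sorted(by_session.items()):
        timestamps = [row["collector_tstamp"] for row in rows]
        sessions.append({
            "domain_sessionid": session_id,
            "start_tstamp": min(timestamps),
            "end_tstamp": max(timestamps),
            # Count only page_view events; page pings are ignored here.
            "page_views": sum(row["event_name"] == "page_view" for row in rows),
        })
    return sessions


if __name__ == "__main__":
    for session in derive_sessions(EVENTS):
        print(session)
```

The input list is left untouched, mirroring how the atomic data stays immutable while the derived tables are rebuilt from it.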

# Repo Contents

- [Web (v1)](web/v1)
  - [Redshift](web/v1/redshift)
  - [BigQuery](web/v1/bigquery)
  - [Snowflake](web/v1/snowflake)
- [Mobile (v1)](mobile/v1)
  - [Redshift](mobile/v1/redshift)
  - [BigQuery](mobile/v1/bigquery)
  - [Snowflake](mobile/v1/snowflake)

Documentation for the data models can be found on [our documentation site][docs-data-models].

# Prerequisites

These models are written in a format that is runnable via [SQL-runner][sql-runner] - available for download as a zip file from [GitHub Releases][sql-runner-github]. The BigQuery model requires >= v0.9.2, and the Snowflake model requires >= v0.9.3 of SQL-runner.

Those who don't wish to use SQL-runner to run models can use the `-p` and `-o` flags of the `run_config.sh` script to output the pure SQL for a model according to how it has been configured for SQL-runner.

Each model also requires a dataset of Snowplow events, generated by one of [the tracking SDKs][tracker-docs], passed through the validation and enrichment steps of the pipeline, and loaded to a database.

For the testing framework, Python 3 is required. Install the requirements with:

```bash
cd .tests
pip3 install -r requirements.txt
```

# Quick start

To run a model and tests end to end, run the `.scripts/e2e.sh` bash script.

![end-to-end](media/e2e.gif)

For a quick start guide to each individual model, and specific details on each module, see the README in the model's database-specific folder (e.g. `web/v1/redshift`).

For detail on the structure of a model, see the README in the model's main folder (e.g. `web/v1`).

For detail on using the helper scripts, see the README in `.scripts/`.

# Running models in production

## Using SQL-runner

### Snowplow BDP

Snowplow BDP customers can configure jobs for SQL-runner in production via configuration files. [See our docs site for details on doing so](https://docs.snowplow.io/docs/modeling-your-data/configuring-and-running-data-models-via-snowplow-bdp/). The `configs/datamodeling.json` file in each model is an example configuration for the standard model. The `configs/example_with_custom.json` file is an example configuration with a customization.

### Open Source

For open-source users, the JSON files in the `configs` folders can't be used directly, but they serve as a representation of the dependencies between playbooks. Open-source users using SQL-runner should instrument their jobs to run playbooks individually according to the dependencies specified.

For local use, the `.scripts/run_config.sh` script can be used to run a config - note that it does not resolve dependencies but runs the playbooks in order of appearance.
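
Because `run_config.sh` runs playbooks in order of appearance, it can help to check that order against the dependencies before running. The following is a minimal Python sketch and is not part of this repo. It assumes a config shaped as a list of entries with `playbook` and `dependsOn` keys. Those key names and the playbook paths are assumptions for illustration, so check them against the actual files in the `configs` folders before relying on it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical config fragment: the key names ("playbooks", "playbook",
# "dependsOn") and the playbook paths are assumptions for illustration only.
EXAMPLE_CONFIG = {
    "playbooks": [
        {"playbook": "standard/01-base", "dependsOn": []},
        {"playbook": "standard/02-page-views", "dependsOn": ["standard/01-base"]},
        {"playbook": "standard/03-sessions", "dependsOn": ["standard/02-page-views"]},
        {"playbook": "custom/01-example", "dependsOn": ["standard/02-page-views"]},
    ]
}


def dependency_order(config):
    """Return playbook names so that each one follows all of its dependencies."""
    graph = {
        entry["playbook"]: set(entry.get("dependsOn", []))
        for entry in config["playbooks"]
    }
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


def first_out_of_order(config):
    """Return the first playbook listed before one of its dependencies, or None."""
    seen = set()
    for entry in config["playbooks"]:
        if not set(entry.get("dependsOn", [])) <= seen:
            return entry["playbook"]
        seen.add(entry["playbook"])
    return None


if __name__ == "__main__":
    for name in dependency_order(EXAMPLE_CONFIG):
        print(name)
```

`first_out_of_order` returning `None` suggests the order of appearance already respects the listed dependencies. Otherwise `dependency_order` gives one valid order in which to run the playbooks individually.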

## dbt

For users using dbt we have the [snowplow-web](https://github.com/snowplow/dbt-snowplow-web) and [snowplow-mobile](https://github.com/snowplow/dbt-snowplow-mobile) dbt packages, allowing you to run the web and/or mobile models via dbt. These packages support Redshift, BigQuery, Snowflake, Databricks, and Postgres.

## Using other tools

Those who wish to use other tools can configure playbooks and config JSON files for the desired model, then use the `-p` and `-o` flags of the `.scripts/run_config.sh` script to fill templates and output pure SQL to file:

```bash
# -b: sql-runner binary, -c: config to run; -p and -o fill templates and write pure SQL to tmp/sql
bash .scripts/run_config.sh -b ~/pathTo/sql-runner -c web/v1/bigquery/sql-runner/configs/example_with_custom.json -p -o tmp/sql
```

This SQL can then be used directly or amended to suit the relevant tool.

# Copyright and license

The Snowplow Data Models project is Copyright 2020-2021 Snowplow Analytics Ltd.

Licensed under the [Apache License, Version 2.0][license] (the "License");
you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

[license]: http://www.apache.org/licenses/LICENSE-2.0
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[tracker-classificiation]: https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/tracker-maintenance-classification/
[maintained]: https://img.shields.io/static/v1?style=flat&label=Snowplow&message=Maintained&color=9e62dd&labelColor=9ba0aa&logo=

[tracker-docs]: https://docs.snowplow.io/docs/collecting-data/collecting-from-own-applications/
[docs-what-is-dm]: https://docs.snowplow.io/docs/modeling-your-data/what-is-data-modeling/
[docs-data-models]: https://docs.snowplow.io/docs/modeling-your-data/

[sql-runner]: https://github.com/snowplow/sql-runner
[sql-runner-github]: https://github.com/snowplow/sql-runner/releases/