# Datafold Demo Project

This repo contains a demo project suited to leveraging Datafold:
- a dbt project that includes:
  - raw data (implemented via [seed CSV files](https://docs.getdbt.com/docs/building-a-dbt-project/seeds)) from a fictional app
  - a few downstream models, as shown in the project DAG below
- several 'master' branches, corresponding to the various supported cloud data platforms:
  - `master` - the 'primary' master branch, runs in Snowflake
  - `master-databricks` - a 'secondary' master branch, runs in Databricks; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
  - `master-bigquery` - a 'secondary' master branch, runs in BigQuery; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
  - `master-dremio` - a 'secondary' master branch, runs in Dremio; reset to the `master` branch daily, or manually when needed, via the `branch_replication.yml` workflow
- several GitHub Actions workflows illustrating CI/CD best practices for dbt Core:
  - dbt PR job - triggered on PRs targeting the `master` branch, runs the dbt project in Snowflake
  - dbt prod - triggered on pushes to the `master` branch, runs the dbt project in Snowflake
  - dbt PR job (Databricks) - triggered on PRs targeting the `master-databricks` branch, runs the dbt project in Databricks
  - dbt prod (Databricks) - triggered on pushes to the `master-databricks` branch, runs the dbt project in Databricks
  - dbt PR job (BigQuery) - triggered on PRs targeting the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt prod (BigQuery) - triggered on pushes to the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt PR job (Dremio) - triggered on PRs targeting the `master-dremio` branch, runs the dbt project in Dremio
  - dbt prod (Dremio) - triggered on pushes to the `master-dremio` branch, runs the dbt project in Dremio
  - Apply monitors.yaml configuration to Datafold app - applies the monitors-as-code configuration to the Datafold application
- a raw data generation tool that simulates a data flow typical of real-world projects
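The branch reset performed by `branch_replication.yml` can be sketched as a scheduled workflow that force-pushes the state of `master` onto a secondary branch. This is a hypothetical minimal version for illustration only; the actual workflow in this repo may differ:

```yaml
# Hypothetical sketch of a branch replication workflow:
# reset a secondary master branch to the primary master,
# daily on a schedule or manually via workflow_dispatch.
name: branch replication (sketch)

on:
  schedule:
    - cron: "0 4 * * *"   # once a day
  workflow_dispatch:       # manual trigger when needed

jobs:
  reset-branch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, required for a hard reset
      - name: Reset master-databricks to master
        run: |
          git checkout master-databricks
          git reset --hard origin/master
          git push --force origin master-databricks
```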



## Running this project in the pre-configured Datafold environment

### Code management
All actual changes should be committed to the `master` branch; the other `master-...` branches are reset to the `master` branch daily.

### CI demo
Note: to ensure the integrity and isolation of GitHub Actions workflows, create pull requests (PRs) for the different 'master' branches from distinct commits. This prevents cross-PR leakage and ensures that workflows run independently.

#### Snowflake
To demonstrate the Datafold experience in CI on Snowflake, create PRs targeting the `master` branch.
- production schema in Snowflake: `demo.core`
- PR schemas: `demo.pr_num_`

#### Databricks
To demonstrate the Datafold experience in CI on Databricks, create PRs targeting the `master-databricks` branch.
- production schema in Databricks: `demo.default`
- PR schemas: `demo.pr_num_`

#### BigQuery
To demonstrate the Datafold experience in CI on BigQuery, create PRs targeting the `master-bigquery` branch.
- production schema in BigQuery: `datafold-demo-429713.prod`
- PR schemas: `datafold-demo-429713.pr_num_`

#### Dremio
To demonstrate the Datafold experience in CI on Dremio, create PRs targeting the `master-dremio` branch.
- production schema in Dremio: `"Alexey S3".alexeydremiobucket.prod`
- PR schemas: `"Alexey S3".alexeydremiobucket.pr_num_`

### Data replication demo

To demonstrate Datafold's data replication monitoring, a pre-configured Postgres instance (simulating a transactional database) is populated with 'correct raw data' (the `analytics.data_source.subscription_created` table), while the `subscription__created` seed CSV file contains 'corrupted raw data'.
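The kind of corruption the demo relies on can be illustrated with a small sketch that flags rows whose price is zero or negative, one of the anomaly types described below. Column names and sample values here are made up for illustration; the actual seed schema may differ:

```python
import csv
import io


def find_corrupted_prices(csv_text, price_column="amount"):
    """Return the 0-based indices of rows whose price is zero or negative."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad_rows = []
    for i, row in enumerate(reader):
        if float(row[price_column]) <= 0:
            bad_rows.append(i)
    return bad_rows


# Hypothetical sample resembling the 'corrupted' seed data
seed = """id,amount
1,9.99
2,0
3,-4.50
4,19.99
"""

print(find_corrupted_prices(seed))  # -> [1, 2]
```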

### BI apps demo
- A Looker view, explore, and dashboard are connected to the `fct__monthly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `fct__monthly__financials` view
    - `fct__monthly__financials` explore
    - `Monthly Financials (Demo, Snowflake)` dashboard
  - Databricks
    - `fct__monthly__financials_databricks` view
    - `fct__monthly__financials_databricks` explore
    - `Monthly Financials (Demo, Databricks)` dashboard
  - BigQuery
    - `fct__monthly__financials_bigquery` view
    - `fct__monthly__financials_bigquery` explore
    - `Monthly Financials (Demo, BigQuery)` dashboard

- A Tableau data source, workbook, and dashboard are connected to the `fct__yearly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `FCT__YEARLY__FINANCIALS (DEMO.FCT__YEARLY__FINANCIALS) (CORE)` data source
    - `Yearly Financials (Snowflake)` workbook
    - `Yearly Financials Dashboard (Snowflake)` dashboard
  - Databricks
    - `fct__yearly__financials (demo.default.fct__yearly__financials) (default)` data source
    - `Yearly Financials (Databricks)` workbook
    - `Yearly Financials Dashboard (Databricks)` dashboard
  - BigQuery
    - `fct__yearly__financials (prod)` data source
    - `Yearly Financials (BigQuery)` workbook
    - `Yearly Financials Dashboard (BigQuery)` dashboard

- A Power BI table, report, and dashboard are connected to the `fct__monthly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `FCT__MONTHLY__FINANCIALS` table
    - `Monthly Financials Snowflake` report
    - `Monthly Financials Snowflake` dashboard
  - Databricks
    - `fct__monthly__financials` table
    - `fact-monthly-financials-databricks` report
    - `Fact Monthly Financials Databricks` dashboard
  - BigQuery
    - `fct__monthly__financials` table
    - `Monthly Financials BigQuery` report
    - `Monthly Financials BigQuery` dashboard

### Datafold Demo Org structure
The corresponding Datafold Demo Org contains the following integrations:
- Common
  - `datafold/demo` repository integration
  - `Postgres` data connection for Cross-DB data diff monitors
  - `Looker Public Demo` BI app integration
  - `Power BI` BI app integration
  - `Tableau Public Demo` BI app integration
- Snowflake specific
  - `Snowflake` data connection
  - `Coalesce-Demo` CI integration for the `Snowflake` data connection and the `master` branch
- Databricks specific
  - `Databricks-Demo` data connection
  - `Coalesce-Demo-Databricks` CI integration for the `Databricks-Demo` data connection and the `master-databricks` branch
- BigQuery specific
  - `BigQuery - Demo` data connection
  - `Coalesce-Demo-BigQuery` CI integration for the `BigQuery - Demo` data connection and the `master-bigquery` branch
- Dremio specific
  - `Dremio-Demo` data connection
  - `Coalesce-Demo-Dremio` CI integration for the `Dremio-Demo` data connection and the `master-dremio` branch

## Running this project in a custom environment
To get up and running with this project:
1. Install dbt using [these instructions](https://docs.getdbt.com/docs/installation).

2. Fork this repository.

3. Set up a profile called `demo` to connect to a data warehouse by following [these instructions](https://docs.getdbt.com/docs/configure-your-profile). You'll need `dev` and `prod` targets in your profile.
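A minimal `profiles.yml` sketch for the `demo` profile might look like the following. Snowflake is shown as an example; the account, user, and credential values are placeholders you must replace, and the schema layout is an assumption based on this repo's Snowflake setup (`demo.core` as the production schema):

```yaml
# ~/.dbt/profiles.yml -- hypothetical sketch; adapt to your warehouse
demo:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account        # placeholder
      user: your_user              # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: demo
      schema: dev
      threads: 4
    prod:
      type: snowflake
      account: your_account        # placeholder
      user: your_user              # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: demo
      schema: core
      threads: 4
```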

4. Ensure your profile is set up correctly from the command line:
```bash
$ dbt debug
```

5. Create your `prod` models:
```bash
$ dbt build --profile demo --target prod
```

With `prod` models created, you're clear to develop and diff changes between your `dev` and `prod` targets.

### Using Datafold with this project

Follow the [quickstart guide](https://docs.datafold.com/quickstart_guide) to integrate this project with Datafold.

## Generated data

### Generated files
- `datagen/feature_used_broken.csv` - copied to `seeds/feature__used.csv`
- `datagen/feature_used.csv`
- `datagen/org_created_broken.csv` - copied to `seeds/org__created.csv`
- `datagen/org_created.csv`
- `datagen/signed_in_broken.csv` - copied to `seeds/signed__in.csv`
- `datagen/signed_in.csv`
- `datagen/subscription_created_broken.csv` - copied to `seeds/subscription__created.csv`
- `datagen/subscription_created.csv` - pushed to Postgres (`analytics.data_source.subscription_created` table)
- `datagen/user_created_broken.csv` - copied to `seeds/user__created.csv`
- `datagen/user_created.csv`
- `datagen/persons_pool.csv` - pool of persons used for user/org generation
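The 'copied to' steps above can be sketched as a small helper. This is a hypothetical illustration of the mapping, not the repo's actual tooling, which may perform the copies differently:

```python
import shutil
from pathlib import Path

# Mapping of generated 'broken' files to their seed destinations,
# following the list above (sketch; run from the repository root).
BROKEN_TO_SEED = {
    "datagen/feature_used_broken.csv": "seeds/feature__used.csv",
    "datagen/org_created_broken.csv": "seeds/org__created.csv",
    "datagen/signed_in_broken.csv": "seeds/signed__in.csv",
    "datagen/subscription_created_broken.csv": "seeds/subscription__created.csv",
    "datagen/user_created_broken.csv": "seeds/user__created.csv",
}


def copy_broken_to_seeds(root="."):
    """Copy each generated 'broken' CSV over its dbt seed file."""
    root = Path(root)
    for src, dst in BROKEN_TO_SEED.items():
        shutil.copyfile(root / src, root / dst)
```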

### Data generation scripts

- `datagen/data_generate.py` - main data generation script
- `datagen/data_to_postgres.sh` - pushes generated data to Postgres
- `datagen/persons_pool_replenish.py` - replenishes the pool of persons using ChatGPT
- `datagen/data_delete.sh` - deletes data for further re-generation
- `datagen/dremio__upload_seeds.py` - uploads seed files to Dremio (due to limitations in the standard dbt-dremio connector)

### Data anomaly types
- zero or negative prices in the `subscription__created` seed
- corrupted emails in the `user__created` seed (e.g. `user$somecompany.com`)
- irregular spikes in the workday seasonal daily number of sign-ins in the `signed__in` seed
- `null` spikes in the `feature__used` seed
- schema change: a 'wandering' column appears ~weekly in the `signed__in` seed
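For example, the corrupted-email anomaly replaces the `@` separator with `$` (as in `user$somecompany.com`), which a simple shape check catches. The sample values below are made up for illustration:

```python
import re

# Very loose email shape check: something@domain.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def corrupted_emails(emails):
    """Return the emails that fail the shape check."""
    return [e for e in emails if not EMAIL_RE.match(e)]


sample = ["alice@somecompany.com", "bob$somecompany.com", "carol@example.org"]
print(corrupted_emails(sample))  # -> ['bob$somecompany.com']
```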

## Other

### Known issues
- PR job fails when the 2nd commit is pushed to a PR branch targeting Databricks. Most likely related to: https://github.com/databricks/dbt-databricks/issues/691.