https://github.com/datafold/dagster-data-diff-demo
Datafold + Dagster demo to validate raw data replication from source to target tables across databases
- Host: GitHub
- URL: https://github.com/datafold/dagster-data-diff-demo
- Owner: datafold
- License: mit
- Created: 2023-10-23T16:29:25.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-31T17:47:33.000Z (almost 2 years ago)
- Last Synced: 2025-03-13T03:12:02.177Z (7 months ago)
- Topics: dagster, data-diffing, datafold, diff, replication
- Language: Python
- Homepage: TODO: add link to blog
- Size: 601 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Datafold + Dagster: Better Together
This is a demo project for the Dagster + Datafold integration using [`data-diff`](https://github.com/datafold/data-diff#data-diff-compare-datasets-fast-within-or-across-sql-databases). The goal is to give you clear examples of how to use Dagster's [asset checks](https://docs.dagster.io/concepts/assets/asset-checks) to solve data replication problems in your data pipelines by validating the data diff between the source and target tables.
Learn more about Datafold: [here](https://www.datafold.com/data-replication)
Learn more about Dagster: [here](https://dagster.io/)
TODO: Add public loom video with gif thumbnail
## Demo Examples
[`simple_diff_demo.py`](data-diff-demo/data_diff_demo/assets/simple_diff_demo.py): Generates data in a duckdb source table, exports it to parquet, and creates a separate duckdb target table with intentional differences based on the parquet file. It runs a data diff between the source and target tables located in separate duckdb databases, and outputs the data diff as asset check metadata for easy review.
[`healing_diff_demo.py`](data-diff-demo/data_diff_demo/assets/healing_diff_demo.py): Generates data in a duckdb source table, exports it to parquet, and creates a separate duckdb target table with intentional differences based on the parquet file. It runs a data diff between the source and target tables located in separate duckdb databases, overwrites the target table diffs with the original source rows, and outputs the data diff as [asset observation](https://docs.dagster.io/concepts/assets/asset-observations) metadata for easy review.
[`postgres_to_snowflake_demo.py`](data-diff-demo/data_diff_demo/assets/postgres_to_snowflake_demo.py): Generates data in a Postgres source table, exports it to a pandas dataframe, and creates a Snowflake target table with intentional differences based on the dataframe. It runs a data diff between the source and target tables located in separate databases, and outputs the data diff as asset check metadata for easy review. Note: this example only works if you configure the Postgres and Snowflake environment variables described in the Quick Start below. If you skip it, the two duckdb examples above still run without any extra configuration.
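The diff-then-heal flow described for `healing_diff_demo.py` can be sketched roughly as follows. This is not the code from the demo: it swaps in the standard library's `sqlite3` for duckdb and parquet, and a naive set-based comparison for `data-diff`, so it runs without extra dependencies. The table name `events` and the helper names are hypothetical.

```python
import sqlite3


def build_demo_databases():
    """Create two in-memory databases; the target gets intentional differences."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
    source.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "2023-10-25"), (2, "2023-10-24"), (3, "2023-10-23")],
    )
    # Intentional differences: id 1 has a modified date, id 3 is missing.
    target.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "2023-10-23"), (2, "2023-10-24")],
    )
    return source, target


def fetch_rows(conn):
    """Return every (id, created_at) row of the demo table as a set."""
    return set(conn.execute("SELECT id, created_at FROM events"))


def diff_rows(source_conn, target_conn):
    """Naive diff: '-' marks source rows absent from target, '+' the reverse."""
    source, target = fetch_rows(source_conn), fetch_rows(target_conn)
    removed = [("-", row) for row in sorted(source - target)]
    added = [("+", row) for row in sorted(target - source)]
    return removed + added


def heal_target(target_conn, diff):
    """Overwrite differing target rows with the original source rows."""
    # Drop target-only or modified rows first so the inserts cannot collide.
    for sign, row in diff:
        if sign == "+":
            target_conn.execute("DELETE FROM events WHERE id = ?", (row[0],))
    for sign, row in diff:
        if sign == "-":
            target_conn.execute("INSERT INTO events VALUES (?, ?)", row)
    target_conn.commit()


if __name__ == "__main__":
    source, target = build_demo_databases()
    diff = diff_rows(source, target)
    print("before healing:", diff)
    heal_target(target, diff)
    print("after healing:", diff_rows(source, target))
```

In a Dagster project, the list returned by a diff like this is what you would attach as asset check or asset observation metadata.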
## Quick Start
```bash
# setup python dependencies
cd data-diff-demo
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -e ".[dev]"
```

> Optional: This applies only to the assets in `postgres_to_snowflake_demo.py`. For a more realistic example, we recommend defining these environment variables.
```
# define environment variables in a .env file in this directory: data-diff-demo/data_diff_demo/.env
# placeholder examples below for postgres and snowflake
SOURCE_DATABASE_HOST="ep-shrill-meadow-043325.us-west-2.aws.neon.tech"
SOURCE_DATABASE_PORT="5432"
SOURCE_DATABASE_NAME="neondb"
SOURCE_DATABASE_USER="sungwonchung3"
SOURCE_DATABASE_PASSWORD="asdfasdfasdf"
DESTINATION_SNOWFLAKE_ACCOUNT="ASDFASDFASDF"
DESTINATION_SNOWFLAKE_USER="sung"
DESTINATION_SNOWFLAKE_PASSWORD="ASDFASDFASDF"
DESTINATION_SNOWFLAKE_WAREHOUSE="INTEGRATION"
DESTINATION_SNOWFLAKE_DATABASE="DEMO"
DESTINATION_SNOWFLAKE_SCHEMA="DBT_SUNG"
DESTINATION_SNOWFLAKE_ROLE="DEMO_ROLE"
```

```bash
# start dagster development server
dagster dev
```

Open http://localhost:3000 in your browser.
Click `Materialize all` in the top right corner of the Dagster UI to materialize all assets

You should see the assets materialized, with 2 asset checks failing intentionally.

When you click to view the asset check metadata, you should see the following output

Now apply this template project to your own Dagster project and start using `data-diff` to validate real data pipelines!
## Interpreting the Data Diff Output
> How it works: `data-diff` runs hash functions built into the source and target databases to compare data, then prints the differences in a human-readable format when it finds hash mismatches. This makes comparing data across databases fast and efficient: when few rows differ, performance approaches that of a `SELECT COUNT(*)` query. [Learn More](https://docs.datafold.com/data_diff/cross-database_diffing/#high-level-algorithm)
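
The hash-and-compare idea can be illustrated with a small, simplified sketch. It is not `data-diff`'s implementation: the real tool computes checksums inside each database, while this version uses in-memory dicts keyed by an integer id as stand-ins for tables. All function names and the `threshold` parameter are hypothetical. A key range whose checksums match is skipped entirely. A mismatching range is split in half until it is small enough to compare row by row.

```python
import hashlib


def checksum(rows):
    """Hash a list of (key, value) rows into one digest (order-sensitive)."""
    digest = hashlib.sha256()
    for key, value in rows:
        digest.update(f"{key}|{value};".encode())
    return digest.hexdigest()


def segment(table, lo, hi):
    """Rows of `table` with lo <= key < hi, sorted by key."""
    return [(k, table[k]) for k in sorted(table) if lo <= k < hi]


def diff_segment(source, target, lo, hi, threshold=2):
    """Return ('-'|'+', row) tuples for rows that differ in the key range [lo, hi)."""
    src, tgt = segment(source, lo, hi), segment(target, lo, hi)
    if checksum(src) == checksum(tgt):
        return []  # identical segment: nothing to inspect further
    if hi - lo <= max(threshold, 1):
        # Small enough: compare the rows directly.
        removed = [("-", row) for row in src if row not in tgt]
        added = [("+", row) for row in tgt if row not in src]
        return removed + added
    mid = (lo + hi) // 2
    return (diff_segment(source, target, lo, mid, threshold)
            + diff_segment(source, target, mid, hi, threshold))


if __name__ == "__main__":
    source = {k: f"2023-10-{k:02d}" for k in range(1, 9)}
    target = dict(source)
    target[1] = "2023-10-23"  # modified row
    del target[7]             # missing row
    for sign, row in diff_segment(source, target, 1, 9):
        print(sign, row)
```

Only the two mismatching segments are compared row by row. The segments covering ids 3 to 6 are dismissed after a single checksum comparison each.
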
`-`: original rows in source
`+`: modified/additional rows in target
In this example, there are 2 source rows that do not exist in the target table.
`- ('-1', '2023-10-23')`
`- ('-2', '2023-10-22')`
Example of source row modified in target table:
`- ('1', '2023-10-25')`
`+ ('1', '2023-10-23')`
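
The reading rules above can be turned into a small helper that labels each row. This is a sketch that assumes the diff is available as `(sign, row)` tuples with the primary key in the first column; `classify_diff` and the summary keys are hypothetical names, not part of `data-diff`.

```python
from collections import defaultdict


def classify_diff(diff):
    """Group ('-'|'+', row) tuples by primary key and label each key."""
    by_key = defaultdict(dict)
    for sign, row in diff:
        by_key[row[0]][sign] = row

    summary = {"missing_in_target": [], "extra_in_target": [], "modified": []}
    for sides in by_key.values():
        if "-" in sides and "+" in sides:
            # Same key on both sides: the source row was modified in the target.
            summary["modified"].append((sides["-"], sides["+"]))
        elif "-" in sides:
            summary["missing_in_target"].append(sides["-"])
        else:
            summary["extra_in_target"].append(sides["+"])
    return summary


if __name__ == "__main__":
    diff = [
        ("-", ("-1", "2023-10-23")),
        ("-", ("-2", "2023-10-22")),
        ("-", ("1", "2023-10-25")),
        ("+", ("1", "2023-10-23")),
    ]
    for label, rows in classify_diff(diff).items():
        print(label, rows)
```

With the rows from this example, keys `-1` and `-2` land in `missing_in_target`, and key `1` is reported as `modified` with its source and target versions side by side.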
