{"id":20936908,"url":"https://github.com/datafold/dagster-data-diff-demo","last_synced_at":"2025-07-09T19:38:15.462Z","repository":{"id":203137636,"uuid":"708914222","full_name":"datafold/dagster-data-diff-demo","owner":"datafold","description":"Datafold + Dagster demo to validate raw data replication from source to target tables across databases","archived":false,"fork":false,"pushed_at":"2023-10-31T17:47:33.000Z","size":615,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-13T03:12:02.177Z","etag":null,"topics":["dagster","data-diffing","datafold","diff","replication"],"latest_commit_sha":null,"homepage":"TODO: add link to blog","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datafold.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-23T16:29:25.000Z","updated_at":"2023-10-27T22:30:06.000Z","dependencies_parsed_at":"2023-10-23T18:45:21.483Z","dependency_job_id":"7cc80644-03ef-4dd7-acf5-764c02180b38","html_url":"https://github.com/datafold/dagster-data-diff-demo","commit_stats":null,"previous_names":["datafold/dagster-data-diff-demo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/datafold/dagster-data-diff-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafold%2Fdagster-data-diff-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafold%2Fdagster-data-diff-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafold%2Fdagster-data-diff-demo/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafold%2Fdagster-data-diff-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datafold","download_url":"https://codeload.github.com/datafold/dagster-data-diff-demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datafold%2Fdagster-data-diff-demo/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264502493,"owners_count":23618619,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dagster","data-diffing","datafold","diff","replication"],"created_at":"2024-11-18T22:29:47.915Z","updated_at":"2025-07-09T19:38:15.401Z","avatar_url":"https://github.com/datafold.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/dagster_and_datafold.png\"\u003e\n\u003c/p\u003e\n\n# Datafold + Dagster: Better Together\n\nThis is a demo project for the Dagster + Datafold integration using [`data-diff`](https://github.com/datafold/data-diff#data-diff-compare-datasets-fast-within-or-across-sql-databases). 
The goal is to give you clear examples of how to use Dagster's [asset checks](https://docs.dagster.io/concepts/assets/asset-checks) to solve data replication problems in your data pipelines by validating the data diff between the source and target tables.\n\nLearn more about Datafold: [here](https://www.datafold.com/data-replication)\n\nLearn more about Dagster: [here](https://dagster.io/)\n\nTODO: Add public loom video with gif thumbnail\n\n## Demo Examples\n\n[`simple_diff_demo.py`](data-diff-demo/data_diff_demo/assets/simple_diff_demo.py): Generates data in a duckdb source table, exports it to parquet, and creates a separate duckdb target table with intentional differences based on the parquet file. It runs a data diff between the source and target tables located in separate duckdb databases, and outputs the data diff as asset check metadata for easy review.\n\n[`healing_diff_demo.py`](data-diff-demo/data_diff_demo/assets/healing_diff_demo.py): Generates data in a duckdb source table, exports it to parquet, and creates a separate duckdb target table with intentional differences based on the parquet file. It runs a data diff between the source and target tables located in separate duckdb databases, overwrites the target table diffs with the original source rows, and outputs the data diff as [asset observation](https://docs.dagster.io/concepts/assets/asset-observations) metadata for easy review.\n\n[`postgres_to_snowflake_demo.py`](data-diff-demo/data_diff_demo/assets/postgres_to_snowflake_demo.py): Generates data in a Postgres source table, exports it to a pandas dataframe, and creates a Snowflake target table with intentional differences based on the dataframe. It runs a data diff between the source and target tables located in separate databases, and outputs the data diff as asset check metadata for easy review. Note: this will only work if you configure the Postgres and Snowflake environment variables below. 
If you don't run this example, you can still see the functioning examples above.\n\n\n## Quick Start\n\n```bash\n# setup python dependencies\ncd data-diff-demo\npython -m venv venv\nsource venv/bin/activate\npip install --upgrade pip\npip install -e \".[dev]\"\nsource venv/bin/activate\n```\n\n\u003e Optional: This applies only to the assets contained in `postgres_to_snowflake_demo.py`. If you want a more realistic example, we recommend you define these configurations.\n\n```\n# define environment variables in a .env file in this directory: data-diff-demo/data_diff_demo/.env\n# placeholder examples below for postgres and snowflake\n\nSOURCE_DATABASE_HOST=\"ep-shrill-meadow-043325.us-west-2.aws.neon.tech\"\nSOURCE_DATABASE_PORT=\"5432\"\nSOURCE_DATABASE_NAME=\"neondb\"\nSOURCE_DATABASE_USER=\"sungwonchung3\"\nSOURCE_DATABASE_PASSWORD=\"asdfasdfasdf\"\nDESTINATION_SNOWFLAKE_ACCOUNT=\"ASDFASDFASDF\"\nDESTINATION_SNOWFLAKE_USER=\"sung\"\nDESTINATION_SNOWFLAKE_PASSWORD=\"ASDFASDFASDF\"\nDESTINATION_SNOWFLAKE_WAREHOUSE=\"INTEGRATION\"\nDESTINATION_SNOWFLAKE_DATABASE=\"DEMO\"\nDESTINATION_SNOWFLAKE_SCHEMA=\"DBT_SUNG\"\nDESTINATION_SNOWFLAKE_ROLE=\"DEMO_ROLE\"\n```\n\n```bash\n# start dagster development server\ndagster dev\n```\n\nOpen http://localhost:3000 in your browser\n\nClick `Materialize all` in the top right corner of the Dagster UI to materialize all assets\n\n![](images/start.png)\n\nYou should see the following assets materialized, with 2 asset checks intentionally failing\n\n![](images/end.png)\n\nWhen you click to view the asset check metadata, you should see the following output\n\n![](images/example_asset_check.png)\n\nNow apply this template project to your own Dagster project and start using `data-diff` to validate real data pipelines!\n\n\n## Interpreting the Data Diff Output\n\n\u003e How it works: `data-diff` uses built-in hash functions within the source and target databases to compare data and then outputs the differences in a human-readable format if 
hash mismatches are found. This is a fast and efficient way to compare data across databases. Performance is similar to a `SELECT COUNT(*)` query. [Learn More](https://docs.datafold.com/data_diff/cross-database_diffing/#high-level-algorithm)\n\n`-`: original rows in source\n\n`+`: modified/additional rows in target\n\nIn this example, there are 2 source rows that do not exist in the target table.\n\n`-   ('-1', '2023-10-23')`\n\n`-   ('-2', '2023-10-22')`\n\nExample of source row modified in target table:\n\n`-   ('1', '2023-10-25')`\n\n`+   ('1', '2023-10-23')`\n\n![](images/data_diff_output.png)","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafold%2Fdagster-data-diff-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatafold%2Fdagster-data-diff-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatafold%2Fdagster-data-diff-demo/lists"}