{"id":13704108,"url":"https://github.com/jonathanneo/data-aware-orchestration","last_synced_at":"2025-05-05T09:33:07.265Z","repository":{"id":65533175,"uuid":"590481291","full_name":"jonathanneo/data-aware-orchestration","owner":"jonathanneo","description":"Data-aware orchestration with dagster, dbt, and airbyte","archived":false,"fork":false,"pushed_at":"2023-01-20T02:39:45.000Z","size":1425,"stargazers_count":29,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-03T21:04:41.430Z","etag":null,"topics":["data-orchestration"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jonathanneo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2023-01-18T14:16:01.000Z","updated_at":"2024-04-29T12:34:51.000Z","dependencies_parsed_at":"2023-02-11T23:15:59.991Z","dependency_job_id":null,"html_url":"https://github.com/jonathanneo/data-aware-orchestration","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonathanneo%2Fdata-aware-orchestration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonathanneo%2Fdata-aware-orchestration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonathanneo%2Fdata-aware-orchestration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jonathanneo%2Fdata-aware-orchestration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jonathanneo","download_url":"https
://codeload.github.com/jonathanneo/data-aware-orchestration/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224439739,"owners_count":17311514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-orchestration"],"created_at":"2024-08-02T21:01:04.348Z","updated_at":"2024-11-13T11:30:50.575Z","avatar_url":"https://github.com/jonathanneo.png","language":"Python","funding_links":[],"categories":["Sample Projects"],"sub_categories":[],"readme":"# Data-aware orchestration demo \n\nThis project demonstrates dagster's data-aware orchestration capability. \n\n### Concepts demonstrated\n\n- **dbt cross-project lineage**: dagster's ability to create a global dependency graph between different dbt projects. This is currently [not available in dbt](https://github.com/dbt-labs/dbt-core/discussions/5244). \n- **Object-level dependencies between different assets**: dagster's ability to create object-level dependencies between different assets like an airbyte table materialization, and a dbt model materialization. \n- **Freshness policy triggers**: Most data orchestration tools use cron schedules to trigger an entire DAG. Dagster reverses this approach and allows developers to define freshness policies on nodes so that upstream nodes can be triggered to deliver data to the target node on time. \n\n### Data assets \n\nThis project has the following data assets to orchestrate: \n1. An [airbyte](https://airbyte.com/) connection \n1. 
Two [dbt](https://www.getdbt.com/) projects \n\n![global-asset-lineage](docs/images/global-asset-lineage.png)\n\nThis project forks code from a demo prepared by [airbytehq's open-data-stack repo](https://github.com/airbytehq/open-data-stack/tree/main/dagster), and adds code to demonstrate newer concepts. \n\n# Getting started \n\n## Set up virtual environment \n\nA [Pipfile](./Pipfile) has been provided for use with [pipenv](https://pipenv.pypa.io/en/latest/) to define the python version to use for this virtual environment. \n\nAssuming you already have pipenv installed, to launch the virtual environment for this project, run the following commands: \n\n```bash\ncd my-dbt-dagster\npipenv shell \n```\n\nIf you wish to instead use your local python installation, just make sure that it is python 3.8 or above. \n\n## Install python dependencies \n\nTo install the python dependencies, run: \n\n```bash\ncd stargazer\npip install -e \".[dev]\"\n```\n\n## Set up local Postgres\n\nWe'll use a local postgres instance as the destination for our data. You can imagine the \"destination\" as a data warehouse (something like Snowflake).\n\nTo get a postgres instance with the required source and destination databases running on your machine, you can run:\n\n```bash\ndocker pull postgres\ndocker run --name local-postgres -p 5433:5432 -e POSTGRES_PASSWORD=postgres -d postgres\n```\n\nNote: I am mapping local port 5433 to the container's port 5432 as my local port 5432 is already in use. \n\n## Set up Airbyte\n\nNow, you'll want to get Airbyte running locally. The full instructions can be found [here](https://docs.airbyte.com/deploying-airbyte/local-deployment). \n\nThe steps are pretty simple. Run the following in a new terminal: \n\n```bash\ngit clone https://github.com/airbytehq/airbyte.git\ncd airbyte\ndocker-compose up\n```\n\nThis should take a couple of minutes to pull the images and run them. 
\n\n## Set up airbyte connection\n\nNow that airbyte is running locally, let's create the source, destination, and connection for a data integration pipeline on airbyte. \n\nFirst we set the environment variables we need: \n\n```bash\nexport AIRBYTE_PASSWORD=password\nexport AIRBYTE_PERSONAL_GITHUB_TOKEN=\u003cyour-token-goes-here\u003e\n```\nNote: \n- The default password for airbyte is `password`. \n- We'll need to [create](https://github.com/settings/tokens) a token `AIRBYTE_PERSONAL_GITHUB_TOKEN` for fetching the stargazers from the public repositories.\n\nAfter setting the environment variables, we can check if we have everything we need to let dagster create the airbyte source, destination, and connection by running: \n\n```bash\ncd stargazer\ndagster-me check --module assets_modern_data_stack.my_asset:airbyte_reconciler\n```\n\nThis will print out the assets that dagster will create in airbyte. For example: \n\n```\n+ fetch_stargazer:\n  + source: gh_awesome_de_list\n  + normalize data: True\n  + destination: postgres\n  + destination namespace: SAME_AS_SOURCE\n  + streams:\n    + stargazers:\n      + destinationSyncMode: append_dedup\n      + syncMode: incremental\n```\n\nIf you are happy with those assets being created in airbyte, then run the following to apply it: \n\n```bash\ndagster-me apply --module assets_modern_data_stack.my_asset:airbyte_reconciler\n```\n\n## Set up dbt\n\nWe have 2 dbt projects in the [stargazer](./stargazer/) folder: \n\n- [dbt_project_1](./stargazer/dbt_project_1/)\n- [dbt_project_2](./stargazer/dbt_project_2/)\n\nInstall the dbt dependencies required by both projects by running:\n\n```bash\ncd stargazer/dbt_project_1\ndbt deps \n```\n\n```bash\ncd stargazer/dbt_project_2\ndbt deps \n```\n\n## Start dagster \n\nWe're now ready to get dagster started. 
Dagster has two services that we need to run: \n- [dagit](https://docs.dagster.io/concepts/dagit/dagit): The web-based interface for viewing and interacting with Dagster objects.\n- [dagster daemon](https://docs.dagster.io/deployment/dagster-daemon): The service that manages schedules, sensors, and run queuing. \n\nFor both services to communicate and share resources with one another, we need to create a shared directory:\n\n```bash \nmkdir -p ~/dagster_home\n```\n\nWe named our shared directory `dagster_home` for simplicity. \n\nTo run the dagster daemon service, create a new terminal and run: \n\n```bash\nexport DAGSTER_HOME=~/dagster_home\nexport AIRBYTE_PASSWORD=password\nexport POSTGRES_PASSWORD=postgres\ndagster-daemon run -m assets_modern_data_stack.my_asset\n```\n\nTo run the dagit service, create a new terminal and run: \n\n```bash\nexport DAGSTER_HOME=~/dagster_home\nexport AIRBYTE_PASSWORD=password\nexport POSTGRES_PASSWORD=postgres\ndagit -m assets_modern_data_stack.my_asset\n```\n\nLaunch the dagit UI by going to [http://localhost:3000/](http://localhost:3000/). \n\nYou'll see the airbyte and dbt assets that are created automatically in this demo.\n\n![deployment](/docs/images/deployment.png)\n\nActivate the schedule: \n\n![schedule](/docs/images/schedule.png)\n\nActivate the sensor: \n\n![sensor](/docs/images/sensor.png)\n\n# Interact with dagster \n\nNow you can sit back and watch the [global asset lineage](http://localhost:3000/asset-groups/) materialize as the schedule and/or sensor fires. \n\nYou'll notice the following behaviours: \n\n1. The airbyte assets will materialize every 30 minutes based on a schedule. \n\n![airbyte](/docs/images/airbyte-assets.png)\n\n2. The two dbt projects, [dbt_project_1](./stargazer/dbt_project_1/) and [dbt_project_2](./stargazer/dbt_project_2/), are now seen as part of the same global asset lineage in dagster without any separation between dbt projects. 
\n\n![dbt-projects](/docs/images/dbt-projects.png)\n\n3. `mart_gh_cumulative` will materialize every 5 minutes because its dbt model [mart_gh_cumulative.sql](stargazer/dbt_project_2/models/mart/mart_gh_cumulative.sql) has a freshness policy of `dagster_freshness_policy={\"maximum_lag_minutes\": 5}`. This in turn will also trigger the airbyte assets to be materialized first. \n\n![5-minute-freshness](/docs/images/5-minute-freshness.png)\n\n4. `mart_gh_join` and `mart_gh_stargazer`: by 09:00 AM UTC, these assets should incorporate all data up to 9 hours before that time. This is because a dbt project-level configuration has been set for [project 1](stargazer/dbt_project_1/dbt_project.yml) and [project 2](stargazer/dbt_project_2/dbt_project.yml) with a freshness policy of `maximum_lag_minutes: 540` and `cron_schedule: \"0 9 * * *\"`. \n\n![9am-freshness](/docs/images/9am-freshness.png)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonathanneo%2Fdata-aware-orchestration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjonathanneo%2Fdata-aware-orchestration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjonathanneo%2Fdata-aware-orchestration/lists"}