{"id":13704142,"url":"https://github.com/spbail/dag-stack","last_synced_at":"2025-05-05T09:33:15.558Z","repository":{"id":52037576,"uuid":"352799190","full_name":"spbail/dag-stack","owner":"spbail","description":"Data pipeline with dbt, Airflow, Great Expectations","archived":false,"fork":false,"pushed_at":"2021-07-14T16:58:41.000Z","size":6223,"stargazers_count":155,"open_issues_count":2,"forks_count":32,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-03T21:04:41.654Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spbail.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-29T22:17:03.000Z","updated_at":"2024-07-12T02:32:06.000Z","dependencies_parsed_at":"2022-09-06T21:40:09.310Z","dependency_job_id":null,"html_url":"https://github.com/spbail/dag-stack","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spbail%2Fdag-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spbail%2Fdag-stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spbail%2Fdag-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spbail%2Fdag-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spbail","download_url":"https://codeload.github.com/spbail/dag-stack/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224439769,"owners_count":173115
22,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T21:01:04.699Z","updated_at":"2024-11-13T11:30:54.944Z","avatar_url":"https://github.com/spbail.png","language":"HTML","readme":"# dag-stack\n\nDemo data pipeline with dbt, Airflow, Great Expectations.\n\nSee another possible architecture for this at https://github.com/astronomer/airflow-dbt-demo\n\n---\n\n### ☕ Buy me a coffee ☕\n\nIf you enjoy this workshop and want to say thanks, you can buy me a coffee here: https://www.buymeacoffee.com/sambail\nThank you 😄\n\n---\n\n\n## How to run\n\n This repo contains a runnable demo using [Astronomer](https://www.astronomer.io/) (containerized Airflow), which is a convenient option to run everything in a Docker container.\n* Install the Astronomer CLI (containerized Airflow), [instructions here](https://www.astronomer.io/docs/cloud/stable/develop/cli-quickstart)\n  * *Note:* If you only want to run Airflow locally for development, you do not need to sign up for an Astronomer Cloud account. 
Simply follow the instructions to install the Astronomer CLI.\n* Run `astro dev start` to start up the Airflow Docker containers\n  * I had to follow the [Docker config instructions here](https://forum.astronomer.io/t/buildkit-not-supported-by-daemon-error-command-docker-build-t-airflow-astro-bcb837-airflow-latest-failed-failed-to-execute-cmd-exit-status-1/857) to handle a \"buildkit not supported\" error\n  * I also had to reduce the number of `AIRFLOW__WEBSERVER__WORKERS` in the Dockerfile as well as allocate more resources to Docker in order for the webserver to run on my very old very slow laptop :) (2013 MacBook Air ftw)\n  * Thanks to [this post](https://dev.to/corissa/2-critical-things-about-dbt-0-19-0-installation-20j) for the `agate` version pin to work with dbt\n* This will start up the Airflow scheduler, webserver, and a Postgres database\n* Once the webserver is up (takes about 30 seconds), you can access the Airflow web UI at `localhost:8080`\n* You can run `astro dev stop` to stop the container again\n\nYou can also run the DAG in this repo with a **standard Airflow installation** if you want. 
You'll have to install the relevant dependencies (Airflow, dbt, Great Expectations, the respective operators, etc.) and probably handle some more configurations to get it to work.\n\n## Development\n\nIn order to develop the dbt DAG and Great Expectations locally instead of in the containers (for faster dev loops), I created a new virtual environment and installed the relevant packages with `pip install -r requirements.txt`\n\n**dbt setup**\n\n- Ran `dbt init dbt` to create the dbt directory in this repo\n- I copied `~/.dbt/profiles.yml` into the root of this project and added the Astronomer postgres creds to have a database available -- **you wouldn't use this database in production or keep the file in the repo, this is just a shortcut for this demo!!**\n- The `profiles.yml` target setup allows me to run the dbt pipeline both locally and within the container:\n  - Container:\n    - connect to shell within the scheduler container\n    - run `cd dbt`\n    - run `dbt run --profiles-dir /usr/local/airflow --target astro_dev`\n  - Local:\n    - run `cd dbt`\n    - run `dbt run --profiles-dir /Users/sam/code/dag-stack --target local_dev`\n\n**Great Expectations setup**\n\n- Ran `great_expectations init` to create the great_expectations directory in this repo\n- Created Datasources for the `data` directory and the Astronomer postgres database using the `great_expectations datasource new` command\n  - Note that I have two Datasources for the different host names, similar to the two dbt targets\n  - I copied the credentials from `uncommitted/config_variables.yml` into the `datasources` section in `great_expectations.yml` for the purpose of this demo, since the `uncommitted` directory is git-ignored\n- Created new Expectation Suites using `great_expectations suite scaffold` against the `data_dir` and `postgres_local` Datasources and manually tweaked the scaffold output a little using `suite edit`\n\n**Airflow DAG setup**\n\n- I'm using the custom dbt and Great Expectations 
Airflow operators, but this could also be done with Python and bash operators\n- Note that the source data and loaded data validation both use the same Expectation Suite, which is a neat feature of Great Expectations -- a test suite can be run against any data asset to assert the same properties\n\n## Serving dbt and Great Expectations docs\n\n- The DAG contains example tasks that copy the docs for each framework into the `include` folder in the container which is mapped to the host machine, so you can inspect them manually\n- In production (and when deploying the container to Astronomer Cloud), both docs could (should) be copied to and hosted on an external service, e.g. on Netlify or in an S3 bucket\n\n## Additional resources\n\nThis repo is based on several existing resources:\n- [Great Expectations Airflow + dbt tutorial](https://github.com/superconductive/ge_tutorials/tree/main/ge_dbt_airflow_tutorial) (which I had originally built)\n- [The example DAGs in the Great Expectations Airflow Provider](https://github.com/great-expectations/airflow-provider-great-expectations/blob/main/great_expectations_provider/example_dags/example_great_expectations_dag.py) (which I also originally built haha)\n- [Building a Scalable Analytics Architecture with Airflow and dbt](https://www.astronomer.io/blog/airflow-dbt-1)\n","funding_links":["https://www.buymeacoffee.com/sambail"],"categories":["Sample Projects","HTML"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspbail%2Fdag-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspbail%2Fdag-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspbail%2Fdag-stack/lists"}