{"id":23934092,"url":"https://github.com/GokuMohandas/data-engineering","last_synced_at":"2025-09-11T16:33:45.358Z","repository":{"id":59185506,"uuid":"535543420","full_name":"GokuMohandas/data-engineering","owner":"GokuMohandas","description":"Construct a modern data stack and orchestration the workflows to create high quality data for analytics and ML applications.","archived":false,"fork":false,"pushed_at":"2022-09-12T12:30:16.000Z","size":43,"stargazers_count":216,"open_issues_count":2,"forks_count":37,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-06-10T06:49:56.787Z","etag":null,"topics":["airflow","data-engineering","data-warehouse","dbt","etl","machine-learning","mlops","orchestration"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GokuMohandas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-09-12T06:47:18.000Z","updated_at":"2025-06-07T00:10:31.000Z","dependencies_parsed_at":"2022-09-13T02:30:49.148Z","dependency_job_id":null,"html_url":"https://github.com/GokuMohandas/data-engineering","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GokuMohandas/data-engineering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Fdata-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Fdata-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Fdata-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/GokuMohandas%2Fdata-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GokuMohandas","download_url":"https://codeload.github.com/GokuMohandas/data-engineering/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GokuMohandas%2Fdata-engineering/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274670688,"owners_count":25328288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-11T02:00:13.660Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data-engineering","data-warehouse","dbt","etl","machine-learning","mlops","orchestration"],"created_at":"2025-01-06T00:30:10.988Z","updated_at":"2025-09-11T16:33:44.978Z","avatar_url":"https://github.com/GokuMohandas.png","language":"Jupyter Notebook","readme":"# Data Engineering for Machine Learning\n\nLearn data engineering fundamentals by constructing a modern data stack for analytics and machine learning applications. 
We'll also learn how to orchestrate our data workflows and programmatically execute tasks to prepare our high-quality data for downstream consumers (analytics, ML, etc.).\n\n\u003cdiv align=\"left\"\u003e\n    \u003ca target=\"_blank\" href=\"https://madewithml.com\"\u003e\u003cimg src=\"https://img.shields.io/badge/Subscribe-40K-brightgreen\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://github.com/GokuMohandas/Made-With-ML\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/GokuMohandas/Made-With-ML.svg?style=social\u0026label=Star\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://www.linkedin.com/in/goku\"\u003e\u003cimg src=\"https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn\u0026logo=linkedin\u0026style=social\"\u003e\u003c/a\u003e\u0026nbsp;\n    \u003ca target=\"_blank\" href=\"https://twitter.com/GokuMohandas\"\u003e\u003cimg src=\"https://img.shields.io/twitter/follow/GokuMohandas.svg?label=Follow\u0026style=social\"\u003e\u003c/a\u003e\n    \u003cbr\u003e\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n👉 \u0026nbsp;This repository contains the code that complements the [data stack](https://madewithml.com/courses/mlops/data-stack/) and [orchestration](https://madewithml.com/courses/mlops/orchestration/) lessons, which are part of the [MLOps course](https://github.com/GokuMohandas/mlops-course). 
If you haven't already, be sure to check out the lessons because all the concepts are covered extensively and tied to data engineering best practices for building the data stack for ML systems.\n\n\u003cdiv align=\"left\"\u003e\n\u003ca target=\"_blank\" href=\"https://madewithml.com/courses/mlops/data-stack/\"\u003e\u003cimg src=\"https://img.shields.io/badge/📖 Read-lesson-9cf\"\u003e\u003c/a\u003e\u0026nbsp;\n\u003ca href=\"https://github.com/GokuMohandas/data-engineering\" role=\"button\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=\u0026amp;message=View%20On%20GitHub\u0026amp;color=586069\u0026amp;logo=github\u0026amp;labelColor=2f363d\"\u003e\u003c/a\u003e\u0026nbsp;\n\u003c/div\u003e\n\n\u003cbr\u003e\n\n## Data stack\n- [Set up](#setup)\n- [Extract via Airbyte](#extract-via-airbyte)\n- [Load into BigQuery](#load-into-bigquery)\n- [Transform via dbt-cloud](#transform-via-dbt-cloud)\n- [Applications](#applications)\n\n## Orchestration\n- [Set up Airflow](#set-up-airflow)\n- [Extract and load](#extract-and-load)\n- [Validate via GE](#validate-via-ge)\n- [Transform via dbt-core](#transform-via-dbt-core)\n\n### Setup\n\nAt a high level, we're going to:\n\n1. [**E**xtract and **L**oad](#extract-and-load) data from [sources](#sources) to [destinations](#destinations).\n2. [**T**ransform](#transform-via-dbt-cloud) for downstream [applications](#applications).\n\nThis process is more commonly known as ELT, but there are variants such as ETL, reverse ETL, etc. They are all essentially the same underlying workflows but have slight differences in the order of data flow and where data is processed and stored.\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"800\" src=\"https://madewithml.com/static/images/mlops/data_stack/data.png\" alt=\"data stack\"\u003e\n\u003c/div\u003e\n\n### Extract via Airbyte\n\nThe first step in our data pipeline is to extract data from a source and load it into the appropriate destination. 
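Before reaching for any tooling, it helps to see the whole EL→T pattern in miniature. Below is a hedged, self-contained sketch (not the course's actual pipeline) that uses Python's built-in `sqlite3` as a stand-in warehouse; the inlined CSV strings are hypothetical stand-ins for the projects and tags sources used later in this lesson:

```python
import csv
import io
import sqlite3

# Inlined stand-ins for the projects.csv and tags.csv sources (hypothetical rows)
projects_csv = "id,title\n6,Comparison between YOLO and RCNN\n7,Show Infer & Tell"
tags_csv = "id,tag\n6,computer-vision\n7,computer-vision"

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Extract + Load: land the raw data as-is, one table per source
for name, raw in [("projects", projects_csv), ("tags", tags_csv)]:
    rows = list(csv.DictReader(io.StringIO(raw)))
    cols = rows[0].keys()
    conn.execute(f"CREATE TABLE {name} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {name} VALUES ({', '.join('?' * len(cols))})",
        [tuple(r.values()) for r in rows],
    )

# Transform: business logic expressed as SQL on top of the loaded tables
conn.execute("""CREATE VIEW labeled_projects AS
                SELECT p.id, p.title, t.tag FROM projects p
                LEFT JOIN tags t ON p.id = t.id""")
print(conn.execute("SELECT * FROM labeled_projects").fetchall())
```

The point of the tools in this lesson (Airbyte, BigQuery, dbt) is that they replace each of these hand-rolled steps with managed, observable equivalents.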
While we could construct custom scripts to do this manually or on a schedule, an ecosystem of data ingestion tools has already standardized the entire process. They all come equipped with connectors that allow for extraction, normalization, cleaning and loading between sources and destinations. These pipelines can be scaled, monitored, etc., all with little to no code.\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"600\" src=\"https://madewithml.com/static/images/mlops/data_stack/pipelines.png\" alt=\"ingestion pipelines\"\u003e\n\u003c/div\u003e\n\nWe're going to use the open-source tool [Airbyte](https://airbyte.com/) to create connections between our data sources and destinations. Let's set up Airbyte and define our data sources. As we progress in this lesson, we'll set up our destinations and create connections to extract and load data.\n\n1. Ensure that we still have Docker installed from our [Docker lesson](https://madewithml.com/courses/mlops/docker), but if not, download it [here](https://www.docker.com/products/docker-desktop/). For Windows users, be sure to have these [configurations](https://docs.airbyte.com/deploying-airbyte/local-deployment/#deploy-on-windows) enabled.\n2. In a parent directory, outside our project directory for the MLOps course, execute the following commands to clone the Airbyte repository locally and launch the service.\n```bash\ngit clone https://github.com/airbytehq/airbyte.git\ncd airbyte\ndocker-compose up\n```\n3. After a few minutes, visit [http://localhost:8000/](http://localhost:8000/) to view the launched Airbyte service.\n\n#### Sources\n\nWe'll start our ELT process by defining the data source in Airbyte:\n\n1. On our [Airbyte UI](http://localhost:8000/), click on `Sources` on the left menu. Then click the `+ New source` button on the top right corner.\n2. Click on the `Source type` dropdown and choose `File`. 
This will open a view to define our file data source.\n```yaml\nName: Projects\nURL: https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv\nFile Format: csv\nStorage Provider: HTTPS: Public Web\nDataset Name: projects\n```\n3. Click the `Set up source` button and our data source will be tested and saved.\n4. Repeat steps 1-3 for our tags data source as well:\n```yaml\nName: Tags\nURL: https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv\nFile Format: csv\nStorage Provider: HTTPS: Public Web\nDataset Name: tags\n```\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"1000\" src=\"https://madewithml.com/static/images/mlops/data_stack/sources.png\" alt=\"data sources\"\u003e\n\u003c/div\u003e\n\n### Load into BigQuery\n\nOnce we know the source we want to extract data from, we need to decide on the destination to load it into. The choice depends on what our downstream applications want to be able to do with the data. It's also common to store data in one location (ex. data lake) and move it somewhere else (ex. data warehouse) for specific processing.\n\n#### Set up Google BigQuery\n\nOur destination will be a [data warehouse](#data-warehouse) since we'll want to use the data for downstream analytical and machine learning applications. We're going to use [Google BigQuery](https://cloud.google.com/bigquery), which is free under Google Cloud's [free tier](https://cloud.google.com/bigquery/pricing#free-tier) for up to 10 GB of storage and 1 TB of queries (which is significantly more than we'll ever need for our purpose).\n\n1. Log into your [Google account](https://accounts.google.com/signin) and then head over to [Google Cloud](https://cloud.google.com/). If you haven't already used Google Cloud's free trial, you'll have to sign up. It's free and you won't be autocharged unless you manually upgrade your account. 
Once the trial ends, we'll still have the free tier, which is more than enough for us.\n2. Go to the [Google BigQuery page](https://console.cloud.google.com/bigquery) and click on the `Go to console` button.\n3. We can create a new project by following these [instructions](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console), which will lead us to the [create project page](https://console.cloud.google.com/projectcreate).\n```yaml\nProject name: made-with-ml  # Google will append a unique ID to the end of it\nLocation: No organization\n```\n4. Once the project has been created, refresh the page and we should see it (along with a few other default projects from Google).\n\n```bash\n# Google BigQuery projects\n├── made-with-ml-XXXXXX   👈 our project\n├── bigquery-publicdata\n├── imjasonh-storage\n└── nyc-tlc\n```\n\n#### Define BigQuery destination in Airbyte\n\nNext, we need to establish the connection between Airbyte and BigQuery so that we can load the extracted data to the destination. In order to authenticate our access to BigQuery with Airbyte, we'll need to create a service account and generate a secret key. This is basically creating an identity with certain access that we can use for verification. Follow these [instructions](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#iam-service-account-keys-create-console) to create a service account and generate the key file (JSON). Note down the location of this file because we'll be using it throughout this lesson. For example, ours is `/Users/goku/Downloads/made-with-ml-XXXXXX-XXXXXXXXXXXX.json`.\n\n1. On our [Airbyte UI](http://localhost:8000/), click on `Destinations` on the left menu. Then click the `+ New destination` button on the top right corner.\n2. Click on the `Destination type` dropdown and choose `BigQuery`. 
This will open a view to define our BigQuery destination.\n```yaml\nName: BigQuery\nDefault Dataset ID: mlops_course  # where our data will go inside our BigQuery project\nProject ID: made-with-ml-XXXXXX  # REPLACE this with your Google BigQuery Project ID\nCredentials JSON: SERVICE-ACCOUNT-KEY.json  # REPLACE this with your service account JSON location\nDataset location: US  # select US or EU, all other options will not be compatible with dbt later\n```\n3. Click the `Set up destination` button and our data destination will be tested and saved.\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"1000\" src=\"https://madewithml.com/static/images/mlops/data_stack/destinations.png\" alt=\"data destinations\"\u003e\n\u003c/div\u003e\n\n#### Connecting File source to BigQuery destination\n\nNow we're ready to create the connection between our sources and destination:\n\n1. On our [Airbyte UI](http://localhost:8000/), click on `Connections` on the left menu. Then click the `+ New connection` button on the top right corner.\n2. Under `Select an existing source`, click on the `Source` dropdown, choose `Projects` and click `Use existing source`.\n3. Under `Select an existing destination`, click on the `Destination` dropdown, choose `BigQuery` and click `Use existing destination`.\n```yaml\nConnection name: Projects \u003c\u003e BigQuery\nReplication frequency: Manual\nDestination Namespace: Mirror source structure\nNormalized tabular data: True  # leave this selected\n```\n4. Click the `Set up connection` button and our connection will be tested and saved.\n5. Repeat the same for our `Tags` source with the same `BigQuery` destination.\n\n\u003e Notice that our sync mode is `Full refresh | Overwrite`, which means that every time we sync data from our source, it'll overwrite the existing data in our destination. 
As opposed to `Full refresh | Append`, which adds entries from the source to the bottom of the previous syncs.\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"1000\" src=\"https://madewithml.com/static/images/mlops/data_stack/connections.png\" alt=\"data connections\"\u003e\n\u003c/div\u003e\n\n#### Data sync\n\nOur replication frequency is `Manual` because we'll trigger the data syncs ourselves:\n\n1. On our [Airbyte UI](http://localhost:8000/), click on `Connections` on the left menu. Then click the `Projects \u003c\u003e BigQuery` connection we set up earlier.\n2. Press the `🔄 Sync now` button and once it's completed, we'll see that the projects are now in our BigQuery data warehouse.\n3. Repeat the same with our `Tags \u003c\u003e BigQuery` connection.\n\n```bash\n# Inside our data warehouse\nmade-with-ml-XXXXXX               - Project\n└── mlops_course                  - Dataset\n    ├── _airbyte_raw_projects     - table\n    ├── _airbyte_raw_tags         - table\n    ├── projects                  - table\n    └── tags                      - table\n```\n\n\u003e In our [orchestration lesson](https://madewithml.com/courses/mlops/orchestration), we'll use Airflow to programmatically execute the data sync.\n\nWe can easily explore and query this data using SQL directly inside our warehouse:\n\n1. On our BigQuery project page, click on the `🔍 QUERY` button and select `In new tab`.\n2. 
Run the following SQL statement and view the data:\n```sql linenums=\"1\"\nSELECT *\nFROM `made-with-ml-XXXXXX.mlops_course.projects`\nLIMIT 1000\n```\n\n\u003cdiv class=\"output_subarea output_html rendered_html output_result\" dir=\"auto\"\u003e\u003cdiv\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003ecreated_on\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003edescription\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e2020-02-20 06:43:18\u003c/td\u003e\n      \u003ctd\u003eComparison between YOLO and RCNN on real world...\u003c/td\u003e\n      \u003ctd\u003eBringing theory to experiment is cool. We can ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e7\u003c/td\u003e\n      \u003ctd\u003e2020-02-20 06:47:21\u003c/td\u003e\n      \u003ctd\u003eShow, Infer \u0026amp; Tell: Contextual Inference for C...\u003c/td\u003e\n      \u003ctd\u003eThe beauty of the work lies in the way it arch...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e9\u003c/td\u003e\n      \u003ctd\u003e2020-02-24 16:24:45\u003c/td\u003e\n      \u003ctd\u003eAwesome Graph Classification\u003c/td\u003e\n      \u003ctd\u003eA collection of important graph embedding, cla...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e15\u003c/td\u003e\n      \u003ctd\u003e2020-02-28 23:55:26\u003c/td\u003e\n      \u003ctd\u003eAwesome Monte Carlo Tree Search\u003c/td\u003e\n      \u003ctd\u003eA curated list of Monte Carlo tree search papers...\u003c/td\u003e\n    
\u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e19\u003c/td\u003e\n      \u003ctd\u003e2020-03-03 13:54:31\u003c/td\u003e\n      \u003ctd\u003eDiffusion to Vector\u003c/td\u003e\n      \u003ctd\u003eReference implementation of Diffusion2Vec (Com...\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\u003c/div\u003e\n\n### Transform via dbt-cloud\n\nOnce we've extracted and loaded our data, we need to transform the data so that it's ready for downstream applications. These transformations are different from the [preprocessing](https://madewithml.com/courses/mlops/preprocessing#transformations) we've seen before but are instead reflective of business logic that's agnostic to downstream applications. Common transformations include defining schemas, filtering, cleaning and joining data across tables, etc. While we could do all of these things with SQL in our data warehouse (save queries as tables or views), dbt delivers production functionality around version control, testing, documentation, packaging, etc. out of the box. This becomes crucial for maintaining observability and high quality data workflows.\n\n\u003cdiv class=\"ai-center-all mb-4\"\u003e\n    \u003cimg width=\"500\" src=\"https://madewithml.com/static/images/mlops/data_stack/transform.png\" alt=\"data transform\"\u003e\n\u003c/div\u003e\n\n\u003e In addition to data transformations, we can also process the data using large-scale analytics engines like Spark, Flink, etc. We'll learn more about batch and stream processing in our [systems design lesson](https://madewithml.com/courses/mlops/systems-design#processing).\n\n### dbt Cloud\n\nNow we're ready to transform our data in our data warehouse using [dbt](https://www.getdbt.com/). 
We'll be using a developer account on dbt Cloud (free), which provides us with an IDE, unlimited runs, etc.\n\n\u003e We'll learn how to use [dbt-core](https://github.com/dbt-labs/dbt-core) in our [orchestration lesson](https://madewithml.com/courses/mlops/orchestration/). Unlike dbt Cloud, dbt-core is completely open-source and we can programmatically connect to our data warehouse and perform transformations.\n\n1. Create a [free account](https://www.getdbt.com/signup/) and verify it.\n2. Go to [https://cloud.getdbt.com/](https://cloud.getdbt.com/) to get set up.\n3. Click `continue` and choose `BigQuery` as the database.\n4. Click `Upload a Service Account JSON file` and upload our file to autopopulate everything.\n5. Click `Test` \u003e `Continue`.\n6. Click `Managed` repository and name it `dbt-transforms` (or anything else you want).\n7. Click `Create` \u003e `Continue` \u003e `Skip and complete`.\n8. This will open the project page, where we can click the `\u003e_ Start Developing` button.\n9. This will open the IDE, where we can click `🗂 initialize your project`.\n\nNow we're ready to start developing our models:\n\n1. Click the `···` next to the `models` directory on the left menu.\n2. Click `New folder` and call it `models/labeled_projects`.\n3. Create a `New file` under `models/labeled_projects` called `labeled_projects.sql`.\n4. Repeat for another file under `models/labeled_projects` called `schema.yml`.\n\n```bash\ndbt-cloud-XXXXX-dbt-transforms\n├── ...\n├── models\n│   ├── example\n│   └── labeled_projects\n│       ├── labeled_projects.sql\n│       └── schema.yml\n├── ...\n└── README.md\n```\n\n### Joins\n\nInside our `models/labeled_projects/labeled_projects.sql` file, we'll create a view that joins our project data with the appropriate tags. This will create the labeled data necessary for downstream applications such as machine learning models. 
Here we're joining based on the matching id between the projects and tags:\n\n```sql linenums=\"1\"\n-- models/labeled_projects/labeled_projects.sql\nSELECT p.id, created_on, title, description, tag\nFROM `made-with-ml-XXXXXX.mlops_course.projects` p  -- REPLACE\nLEFT JOIN `made-with-ml-XXXXXX.mlops_course.tags` t  -- REPLACE\nON p.id = t.id\n```\n\nWe can view the queried results by clicking the `Preview` button and view the data lineage as well.\n\n### Schemas\n\nInside our `models/labeled_projects/schema.yml` file we'll define the schemas for each of the features in our transformed data. We also define several tests that each feature should pass. View the full list of [dbt tests](https://docs.getdbt.com/docs/building-a-dbt-project/tests) but note that we'll use [Great Expectations](https://madewithml.com/courses/mlops/testing/#expectations) for more comprehensive tests when we orchestrate all these data workflows in our [orchestration lesson](https://madewithml.com/courses/mlops/orchestration/).\n\n\n```yaml linenums=\"1\"\n# models/labeled_projects/schema.yml\n\nversion: 2\n\nmodels:\n    - name: labeled_projects\n      description: \"Tags for all projects\"\n      columns:\n          - name: id\n            description: \"Unique ID of the project.\"\n            tests:\n                - unique\n                - not_null\n          - name: title\n            description: \"Title of the project.\"\n            tests:\n                - not_null\n          - name: description\n            description: \"Description of the project.\"\n            tests:\n                - not_null\n          - name: tag\n            description: \"Labeled tag for the project.\"\n            tests:\n                - not_null\n\n```\n\n### Runs\n\nAt the bottom of the IDE, we can execute runs based on the transformations we've defined. 
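Under the hood, `dbt test` compiles each test in `schema.yml` into a SQL query that must return zero failing rows. As a rough, hedged sketch of what the `unique` and `not_null` tests check, here is the same idea in plain Python (the rows are hypothetical stand-ins for the `labeled_projects` view):

```python
# Hypothetical rows standing in for the labeled_projects view
rows = [
    {"id": 6, "title": "Comparison between YOLO and RCNN", "description": "...", "tag": "computer-vision"},
    {"id": 7, "title": "Show, Infer & Tell", "description": "...", "tag": "natural-language-processing"},
]

def unique(rows, column):
    """dbt's `unique` test: no value in the column repeats."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def not_null(rows, column):
    """dbt's `not_null` test: no NULL (None) values in the column."""
    return all(row[column] is not None for row in rows)

# Mirrors schema.yml: id is unique + not_null; title/description/tag are not_null
assert unique(rows, "id")
for column in ["id", "title", "description", "tag"]:
    assert not_null(rows, column)
```

dbt performs these checks inside the warehouse with SQL, which is why they scale to tables far too large to pull into memory.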
We'll run each of the following commands, and once they finish, we can see the transformed data inside our data warehouse.\n\n```bash\ndbt run\ndbt test\n```\n\nOnce these commands run successfully, we're ready to move our transformations to a production environment where we can insert this view into our data warehouse.\n\n### Jobs\n\nIn order to apply these transformations to the data in our data warehouse, it's best practice to create an [Environment](https://docs.getdbt.com/guides/legacy/managing-environments) and then define [Jobs](https://docs.getdbt.com/guides/getting-started/building-your-first-project/schedule-a-job):\n\n1. Click `Environments` on the left menu \u003e `New Environment` button (top right corner) and fill out the details:\n```yaml\nName: Production\nType: Deployment\n...\nDataset: mlops_course\n```\n2. Click `New Job` with the following details and then click `Save` (top right corner).\n```yaml\nName: Transform\nEnvironment: Production\nCommands: dbt run\n          dbt test\nSchedule: uncheck \"RUN ON SCHEDULE\"\n```\n3. Click `Run Now` and view the transformed data in our data warehouse under a view called `labeled_projects`.\n\n```bash\n# Inside our data warehouse\nmade-with-ml-XXXXXX               - Project\n└── mlops_course                  - Dataset\n    ├── _airbyte_raw_projects     - table\n    ├── _airbyte_raw_tags         - table\n    ├── labeled_projects          - view\n    ├── projects                  - table\n    └── tags                      - table\n```\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg width=\"800\" src=\"https://madewithml.com/static/images/mlops/data_stack/dbt_run.png\" alt=\"dbt run\"\u003e\n\u003c/div\u003e\n\n\n\u003e There is so much more to dbt, so be sure to check out their [official documentation](https://docs.getdbt.com/docs/building-a-dbt-project/documentation) to really customize any workflows. 
And be sure to check out our [orchestration lesson](https://madewithml.com/courses/mlops/orchestration) where we'll programmatically create and execute our dbt transformations.\n\n\n### Applications\n\nWe created our data stack to gain actionable insights about our business, users, etc. It's these use cases that dictate which sources of data we extract from, how often, and how that data is stored and transformed. Downstream applications of our data typically fall into one of these categories:\n\n- `data analytics`: use cases focused on reporting trends, aggregate views, etc. via charts, dashboards, etc., for the purpose of providing operational insight for business stakeholders.\n- `machine learning`: use cases centered around using the transformed data to construct predictive models (forecasting, personalization, etc.).\n\n```bash\n!pip install google-cloud-bigquery==1.21.0 -q\n```\n```python\nfrom google.cloud import bigquery\nfrom google.oauth2 import service_account\n\n# Replace these with your own values\nproject_id = \"made-with-ml-XXXXXX\"\nSERVICE_ACCOUNT_KEY_JSON = \"/Users/goku/Downloads/made-with-ml-XXXXXX-XXXXXXXXXXXX.json\"\n\n# Establish connection\ncredentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_KEY_JSON)\nclient = bigquery.Client(credentials=credentials, project=project_id)\n\n# Query data\nquery_job = client.query(\"\"\"\n   SELECT *\n   FROM mlops_course.labeled_projects\"\"\")\nresults = query_job.result()\nresults.to_dataframe().head()\n```\n\n\u003cdiv class=\"output_subarea output_html rendered_html output_result\" dir=\"auto\"\u003e\u003cdiv\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003ecreated_on\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      
\u003cth\u003edescription\u003c/th\u003e\n      \u003cth\u003etag\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e1994.0\u003c/td\u003e\n      \u003ctd\u003e2020-07-29 04:51:30\u003c/td\u003e\n      \u003ctd\u003eUnderstanding the Effectivity of Ensembles in ...\u003c/td\u003e\n      \u003ctd\u003eThe report explores the ideas presented in Dee...\u003c/td\u003e\n      \u003ctd\u003ecomputer-vision\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e1506.0\u003c/td\u003e\n      \u003ctd\u003e2020-06-19 06:26:17\u003c/td\u003e\n      \u003ctd\u003eUsing GitHub Actions for MLOps \u0026amp; Data Science\u003c/td\u003e\n      \u003ctd\u003eA collection of resources on how to facilitate...\u003c/td\u003e\n      \u003ctd\u003emlops\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e807.0\u003c/td\u003e\n      \u003ctd\u003e2020-05-11 02:25:51\u003c/td\u003e\n      \u003ctd\u003eIntroduction to Machine Learning Problem Framing\u003c/td\u003e\n      \u003ctd\u003eThis course helps you frame machine learning (...\u003c/td\u003e\n      \u003ctd\u003emlops\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e1204.0\u003c/td\u003e\n      \u003ctd\u003e2020-06-05 22:56:38\u003c/td\u003e\n      \u003ctd\u003eSnaked: Classifying Snake Species using Images\u003c/td\u003e\n      \u003ctd\u003eProof of concept that it is possible to identi...\u003c/td\u003e\n      \u003ctd\u003ecomputer-vision\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e1706.0\u003c/td\u003e\n      \u003ctd\u003e2020-07-04 11:05:28\u003c/td\u003e\n      \u003ctd\u003ePokeZoo\u003c/td\u003e\n      \u003ctd\u003eA deep learning based web-app developed 
using ...\u003c/td\u003e\n      \u003ctd\u003ecomputer-vision\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\u003c/div\u003e\n\n### Set up Airflow\n\nNow it's time to programmatically execute the workflows we set up above. We'll be using [Airflow](https://airflow.apache.org/) to author, schedule, and monitor our workflows. If you're not familiar with orchestration, be sure to check out the [lesson](https://madewithml.com/courses/mlops/orchestration/) first.\n\nTo install and run Airflow, we can either do so [locally](https://airflow.apache.org/docs/apache-airflow/stable/start/local.html) or with [Docker](https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html). If using `docker-compose` to run Airflow inside Docker containers, we'll want to allocate at least 4 GB in memory.\n\n```bash\n# Configurations\nexport AIRFLOW_HOME=${PWD}/airflow\nAIRFLOW_VERSION=2.3.3\nPYTHON_VERSION=\"$(python --version | cut -d \" \" -f 2 | cut -d \".\" -f 1-2)\"\nCONSTRAINT_URL=\"https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt\"\n\n# Install Airflow (may need to upgrade pip)\npip install \"apache-airflow==${AIRFLOW_VERSION}\" --constraint \"${CONSTRAINT_URL}\"\n\n# Initialize DB (SQLite by default)\nairflow db init\n```\n\nThis will create an `airflow` directory with the following components:\n\n```bash\nairflow/\n├── logs/\n├── airflow.cfg\n├── airflow.db\n├── unittests.cfg\n└── webserver_config.py\n```\n\nWe're going to edit the [airflow.cfg](https://github.com/GokuMohandas/data-engineering/blob/main/airflow/airflow.cfg) file to best fit our needs:\n```bash\n# Inside airflow.cfg\nenable_xcom_pickling = True  # needed for Great Expectations airflow provider\nload_examples = False  # don't clutter webserver with examples\n```\n\nAnd we'll perform a reset to implement these configuration changes.\n\n```bash\nairflow db reset -y\n```\n\nNow we're ready to 
initialize our database with an admin user, which we'll use to log in and access our workflows in the webserver.\n\n```bash\n# We'll be prompted to enter a password\nairflow users create \\\n    --username admin \\\n    --firstname FIRSTNAME \\\n    --lastname LASTNAME \\\n    --role Admin \\\n    --email EMAIL\n```\n\n#### Webserver\n\nOnce we've created a user, we're ready to launch the webserver and log in using our credentials.\n\n```bash\n# Launch webserver\nsource venv/bin/activate\nexport AIRFLOW_HOME=${PWD}/airflow\nairflow webserver --port 8080  # http://localhost:8080\n```\n\nThe webserver allows us to run and inspect workflows, establish connections to external data storage, manage users, etc. through a UI. Similarly, we could also use Airflow's [REST API](https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html) or [Command-line interface (CLI)](https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html) to perform the same operations. However, we'll be using the webserver because it's convenient to visually inspect our workflows.\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg src=\"https://madewithml.com/static/images/mlops/orchestration/webserver.png\" width=\"700\" alt=\"airflow webserver\"\u003e\n\u003c/div\u003e\n\nWe'll explore the different components of the webserver as we learn about Airflow and implement our workflows.\n\n#### Scheduler\n\nNext, we need to launch our scheduler, which will execute and monitor the tasks in our workflows. The scheduler executes tasks by reading from the metadata database and ensures that each task has what it needs to finish running. 
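As noted above, anything the webserver does can also be driven through Airflow's stable REST API. As a hedged sketch (not part of the course code), here's how one could build, without sending, an authenticated request to trigger a DAG run; the `dataops` DAG id and `admin` credentials are illustrative:

```python
import base64
import json
import urllib.request

def build_dag_trigger_request(host, dag_id, user, password):
    """Build (but do not send) a POST request that triggers a DAG run
    via Airflow's stable REST API, using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"http://{host}/api/v1/dags/{dag_id}/dagRuns",
        data=json.dumps({"conf": {}}).encode(),  # optional run-level configuration
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_dag_trigger_request("localhost:8080", "dataops", "admin", "password")
# urllib.request.urlopen(req) would submit the run (requires the webserver to be up)
```

This is handy when an external system (e.g., a CI job) needs to kick off a workflow without a human in the UI.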
We'll go ahead and execute the following commands in a *separate terminal* window:\n\n```bash\n# Launch scheduler (in separate terminal)\nsource venv/bin/activate\nexport AIRFLOW_HOME=${PWD}/airflow\nexport OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES\nairflow scheduler\n```\n\n### Extract and load\n\nWe're going to use the Airbyte connections we set up [above](#extract-via-airbyte) but this time we're going to programmatically trigger the data syncs with Airflow. First, let's ensure that Airbyte is running in a separate terminal from its repository:\n\n```bash\ngit clone https://github.com/airbytehq/airbyte.git  # skip if already cloned in the data-stack lesson\ncd airbyte\ndocker-compose up\n```\n\nNext, let's install the required packages and establish the connection between Airbyte and Airflow:\n\n```bash\npip install apache-airflow-providers-airbyte==3.1.0\n```\n\n1. Go to the [Airflow webserver](http://localhost:8080/) and click `Admin` \u003e `Connections` \u003e ➕\n2. Add the connection with the following details:\n```yaml\nConnection ID: airbyte\nConnection Type: HTTP\nHost: localhost\nPort: 8000\n```\n\n\u003e We could also establish connections [programmatically](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#connection-cli) but it's good to use the UI to understand what's happening under the hood.\n\nIn order to execute our extract and load data syncs, we can use the [`AirbyteTriggerSyncOperator`](https://airflow.apache.org/docs/apache-airflow-providers-airbyte/stable/operators/airbyte.html):\n\n```python linenums=\"1\"\n# airflow/dags/workflows.py\n@dag(...)\ndef dataops():\n    \"\"\"Production DataOps workflows.\"\"\"\n    # Extract + Load\n    extract_and_load_projects = AirbyteTriggerSyncOperator(\n        task_id=\"extract_and_load_projects\",\n        airbyte_conn_id=\"airbyte\",\n        connection_id=\"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX\",  # REPLACE\n        asynchronous=False,\n        timeout=3600,\n  
      wait_seconds=3,\n    )\n    extract_and_load_tags = AirbyteTriggerSyncOperator(\n        task_id=\"extract_and_load_tags\",\n        airbyte_conn_id=\"airbyte\",\n        connection_id=\"XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX\",  # REPLACE\n        asynchronous=False,\n        timeout=3600,\n        wait_seconds=3,\n    )\n\n    # Define DAG\n    extract_and_load_projects\n    extract_and_load_tags\n```\n\nTo find the `connection_id` for each Airbyte connection:\n\n1. Go to our [Airbyte webserver](http://localhost:8000/) and click `Connections` on the left menu.\n2. Click on the specific connection we want to use and the URL will look like this:\n```bash\nhttps://demo.airbyte.io/workspaces/\u003cWORKSPACE_ID\u003e/connections/\u003cCONNECTION_ID\u003e/status\n```\n3. The string in the `CONNECTION_ID` position is the connection's id.\n\nWe can trigger our DAG right now and watch the extracted data being loaded into our BigQuery data warehouse, but we'll continue developing and execute our DAG once the entire DataOps workflow has been defined.\n\n### Validate via GE\n\nThe specific process of where and how we extract our data can be bespoke but what's important is that we have validation at every step of the way. We'll once again use [Great Expectations](https://greatexpectations.io/), as we did in our [testing lesson](https://madewithml.com/courses/mlops/testing#data), to [validate](https://madewithml.com/courses/mlops/testing#expectations) our extracted and loaded data before transforming it.\n\nWith the Airflow concepts we've learned so far, there are many ways to use our data validation library to validate our data. Regardless of which data validation tool we use (ex. [Great Expectations](https://greatexpectations.io/), [TFX](https://www.tensorflow.org/tfx/data_validation/get_started), [AWS Deequ](https://github.com/awslabs/deequ), etc.), we could use the BashOperator, PythonOperator, etc. to run our tests. 
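Stripped of any particular library, such a validation task boils down to: pull the loaded rows, apply each expectation, and fail the task if anything is violated. A toy sketch of that idea (the column names match our projects table; the helper function and sample rows are hypothetical, purely for intuition):

```python
def validate_rows(rows, checks):
    """Apply every check to every row; return (row_index, check_name) for each failure."""
    failures = []
    for i, row in enumerate(rows):
        for name, check in checks.items():
            if not check(row):
                failures.append((i, name))
    return failures

# Toy data standing in for rows loaded into the warehouse
rows = [
    {"id": 1, "title": "llm tuning", "description": "Fine-tuning notes."},
    {"id": 2, "title": None, "description": "Missing a title."},
]
checks = {
    "title_not_null": lambda row: row["title"] is not None,
    "id_present": lambda row: "id" in row,
}

failures = validate_rows(rows, checks)
# In an Airflow task we would raise an exception when failures is non-empty,
# which marks the task as failed (the role fail_task_on_validation_failure plays).
```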
However, Great Expectations has an [Airflow Provider package](https://github.com/great-expectations/airflow-provider-great-expectations) to make it even easier to validate our data. This package contains a [`GreatExpectationsOperator`](https://registry.astronomer.io/providers/great-expectations/modules/greatexpectationsoperator) which we can use to execute specific checkpoints as tasks.\n\n```bash\npip install airflow-provider-great-expectations==0.1.1 great-expectations==0.15.19\ngreat_expectations init\n```\n\nThis will create the following directory within our data-engineering repository:\n\n```bash\ntests/great_expectations/\n├── checkpoints/\n├── expectations/\n├── plugins/\n├── uncommitted/\n├── .gitignore\n└── great_expectations.yml\n```\n\n#### Data source\n\nBefore we can create our tests, we need to define a new `datasource` within Great Expectations for our Google BigQuery data warehouse. This will require several packages and exports:\n\n```bash\npip install pybigquery==0.10.2 sqlalchemy_bigquery==1.4.4\nexport GOOGLE_APPLICATION_CREDENTIALS=/Users/goku/Downloads/made-with-ml-XXXXXX-XXXXXXXXXXXX.json  # REPLACE\n```\n\n```bash\ngreat_expectations datasource new\n```\n```bash\nWhat data would you like Great Expectations to connect to?\n    1. Files on a filesystem (for processing with Pandas or Spark)\n    2. Relational database (SQL) 👈\n```\n```bash\nWhich database backend are you using?\n1. MySQL\n2. Postgres\n3. Redshift\n4. Snowflake\n5. BigQuery 👈\n6. 
other - Do you have a working SQLAlchemy connection string?\n```\n\nThis will open up an interactive notebook where we can fill in the following details:\n```yaml\ndatasource_name = \"dwh\"\nconnection_string = \"bigquery://made-with-ml-359923/mlops_course\"\n```\n\n#### Suite\n\nNext, we can create a [suite of expectations](https://madewithml.com/courses/mlops/testing#suites) for our data assets:\n\n```bash\ngreat_expectations suite new\n```\n\n```bash\nHow would you like to create your Expectation Suite?\n    1. Manually, without interacting with a sample batch of data (default)\n    2. Interactively, with a sample batch of data 👈\n    3. Automatically, using a profiler\n```\n```bash\nSelect a datasource\n    1. dwh 👈\n```\n```bash\nWhich data asset (accessible by data connector \"default_inferred_data_connector_name\") would you like to use?\n    1. mlops_course.projects 👈\n    2. mlops_course.tags\n```\n```bash\nName the new Expectation Suite [mlops.projects.warning]: projects\n```\n\nThis will open up an interactive notebook where we can define our expectations. 
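For reference, each expectation we add in the notebook gets serialized into the suite's JSON file under `tests/great_expectations/expectations/`. It looks roughly like this (abridged and hand-written here, so exact fields may vary by Great Expectations version):

```json
{
  "expectation_suite_name": "projects",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {"column": "id"}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "title"}
    }
  ]
}
```

Because the suite lives as a plain file, it can be versioned alongside our code and reused by checkpoints.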
Repeat the same for creating a suite for our tags data asset as well.\n\nExpectations for `mlops_course.projects`:\n\n```python linenums=\"1\"\n# data leak\nvalidator.expect_compound_columns_to_be_unique(column_list=[\"title\", \"description\"])\n```\n```python linenums=\"1\"\n# id\nvalidator.expect_column_values_to_be_unique(column=\"id\")\n\n# created_on\nvalidator.expect_column_values_to_not_be_null(column=\"created_on\")\n\n# title\nvalidator.expect_column_values_to_not_be_null(column=\"title\")\nvalidator.expect_column_values_to_be_of_type(column=\"title\", type_=\"STRING\")\n\n# description\nvalidator.expect_column_values_to_not_be_null(column=\"description\")\nvalidator.expect_column_values_to_be_of_type(column=\"description\", type_=\"STRING\")\n```\n\nExpectations for `mlops_course.tags`:\n\n```python linenums=\"1\"\n# id\nvalidator.expect_column_values_to_be_unique(column=\"id\")\n\n# tag\nvalidator.expect_column_values_to_not_be_null(column=\"tag\")\nvalidator.expect_column_values_to_be_of_type(column=\"tag\", type_=\"STRING\")\n```\n\n#### Checkpoints\n\nOnce we have our suite of expectations, we're ready to create [checkpoints](https://madewithml.com/courses/mlops/testing#checkpoints) to execute these expectations:\n\n```bash\ngreat_expectations checkpoint new projects\n```\n\nThis will, of course, open up an interactive notebook. 
Just ensure that the following information is correct (the default values may not be):\n```yaml\ndatasource_name: dwh\ndata_asset_name: mlops_course.projects\nexpectation_suite_name: projects\n```\n\nAnd repeat the same for creating a checkpoint for our tags suite.\n\n#### Tasks\n\nWith our checkpoints defined, we're ready to apply them to our data assets in our warehouse.\n\n```python linenums=\"1\"\nGE_ROOT_DIR = Path(BASE_DIR, \"great_expectations\")\n\n@dag(...)\ndef dataops():\n    ...\n    validate_projects = GreatExpectationsOperator(\n        task_id=\"validate_projects\",\n        checkpoint_name=\"projects\",\n        data_context_root_dir=GE_ROOT_DIR,\n        fail_task_on_validation_failure=True,\n    )\n    validate_tags = GreatExpectationsOperator(\n        task_id=\"validate_tags\",\n        checkpoint_name=\"tags\",\n        data_context_root_dir=GE_ROOT_DIR,\n        fail_task_on_validation_failure=True,\n    )\n\n    # Define DAG\n    extract_and_load_projects \u003e\u003e validate_projects\n    extract_and_load_tags \u003e\u003e validate_tags\n```\n\n### Transform via dbt-core\n\nOnce we've validated our extracted and loaded data, we're ready to [transform](https://madewithml.com/courses/mlops/data-stack#transform) it. Our DataOps workflows are not specific to any particular downstream application so the transformation must be globally relevant (ex. cleaning missing data, aggregation, etc.). Just like in our [data stack lesson](https://madewithml.com/courses/mlops/data-stack), we're going to use [dbt](https://www.getdbt.com/) to transform our data. 
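Conceptually, the transformation we need here is just a left join of projects onto their tags. In-memory with toy data, the same operation looks like this (purely for intuition; the real transformation is the SQL model we define in dbt):

```python
# Toy stand-ins for the warehouse tables
projects = [
    {"id": 1, "title": "llm tuning", "description": "Fine-tuning notes.", "created_on": "2022-01-01"},
    {"id": 2, "title": "cv basics", "description": "Vision intro.", "created_on": "2022-02-01"},
]
tags = [{"id": 1, "tag": "natural-language-processing"}]

# LEFT JOIN projects with tags on id: unmatched projects keep a null (None) tag
tag_by_id = {t["id"]: t["tag"] for t in tags}
labeled = [{**p, "tag": tag_by_id.get(p["id"])} for p in projects]
```

The dbt model expresses exactly this join in SQL, with dbt handling materialization and testing for us.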
However, this time, we're going to do everything programmatically using the open-source [dbt-core](https://github.com/dbt-labs/dbt-core) package.\n\nIn the root of our data-engineering repository, initialize our dbt directory with the following command:\n```bash\ndbt init dbt_transforms\n```\n```bash\nWhich database would you like to use?\n[1] bigquery 👈\n```\n```bash\nDesired authentication method option:\n[1] oauth\n[2] service_account 👈\n```\n```yaml\nkeyfile: /Users/goku/Downloads/made-with-ml-XXXXXX-XXXXXXXXXXXX.json  # REPLACE\nproject (GCP project id): made-with-ml-XXXXXX  # REPLACE\ndataset: mlops_course\nthreads: 1\njob_execution_timeout_seconds: 300\n```\n```bash\nDesired location option:\n[1] US  👈  # or what you picked when defining your dataset in Airbyte DWH destination setup\n[2] EU\n```\n\n#### Models\n\nWe'll prepare our dbt models as we did using the [dbt Cloud IDE](https://madewithml.com/courses/mlops/data-stack#dbt-cloud) in the previous lesson.\n\n```bash\ncd dbt_transforms\nrm -rf models/example\nmkdir models/labeled_projects\ntouch models/labeled_projects/labeled_projects.sql\ntouch models/labeled_projects/schema.yml\n```\n\nAnd add the following code to our model files:\n\n```sql linenums=\"1\"\n-- models/labeled_projects/labeled_projects.sql\nSELECT p.id, created_on, title, description, tag\nFROM `made-with-ml-XXXXXX.mlops_course.projects` p  -- REPLACE\nLEFT JOIN `made-with-ml-XXXXXX.mlops_course.tags` t  -- REPLACE\nON p.id = t.id\n```\n\n```yaml linenums=\"1\"\n# models/labeled_projects/schema.yml\n\nversion: 2\n\nmodels:\n    - name: labeled_projects\n      description: \"Tags for all projects\"\n      columns:\n          - name: id\n            description: \"Unique ID of the project.\"\n            tests:\n                - unique\n                - not_null\n          - name: title\n            description: \"Title of the project.\"\n            tests:\n                - not_null\n          - name: description\n            
description: \"Description of the project.\"\n            tests:\n                - not_null\n          - name: tag\n            description: \"Labeled tag for the project.\"\n            tests:\n                - not_null\n\n```\n\nAnd we can use the BashOperator to execute our dbt commands like so:\n\n```python linenums=\"1\"\nDBT_ROOT_DIR = Path(BASE_DIR, \"dbt_transforms\")\n\n@dag(...)\ndef dataops():\n    ...\n    # Transform\n    transform = BashOperator(task_id=\"transform\", bash_command=f\"cd {DBT_ROOT_DIR} \u0026\u0026 dbt run \u0026\u0026 dbt test\")\n\n    # Define DAG\n    extract_and_load_projects \u003e\u003e validate_projects\n    extract_and_load_tags \u003e\u003e validate_tags\n    [validate_projects, validate_tags] \u003e\u003e transform\n```\n\n#### Validate\n\nAnd of course, we'll want to validate our transformations beyond dbt's built-in methods, using Great Expectations. We'll create a suite and checkpoint as we did above for our projects and tags data assets.\n```bash\ngreat_expectations suite new  # for mlops_course.labeled_projects\n```\n\nExpectations for `mlops_course.labeled_projects`:\n\n```python linenums=\"1\"\n# data leak\nvalidator.expect_compound_columns_to_be_unique(column_list=[\"title\", \"description\"])\n```\n\n```python linenums=\"1\"\n# id\nvalidator.expect_column_values_to_be_unique(column=\"id\")\n\n# created_on\nvalidator.expect_column_values_to_not_be_null(column=\"created_on\")\n\n# title\nvalidator.expect_column_values_to_not_be_null(column=\"title\")\nvalidator.expect_column_values_to_be_of_type(column=\"title\", type_=\"STRING\")\n\n# description\nvalidator.expect_column_values_to_not_be_null(column=\"description\")\nvalidator.expect_column_values_to_be_of_type(column=\"description\", type_=\"STRING\")\n\n# tag\nvalidator.expect_column_values_to_not_be_null(column=\"tag\")\nvalidator.expect_column_values_to_be_of_type(column=\"tag\", type_=\"STRING\")\n```\n\n```bash\ngreat_expectations checkpoint new 
labeled_projects\n```\n\n```yaml\ndatasource_name: dwh\ndata_asset_name: mlops_course.labeled_projects\nexpectation_suite_name: labeled_projects\n```\n\nAnd just like we added the validation task for our extracted and loaded data, we can do the same for our transformed data in Airflow:\n\n```python linenums=\"1\"\n@dag(...)\ndef dataops():\n    ...\n    # Transform\n    transform = BashOperator(task_id=\"transform\", bash_command=f\"cd {DBT_ROOT_DIR} \u0026\u0026 dbt run \u0026\u0026 dbt test\")\n    validate_transforms = GreatExpectationsOperator(\n        task_id=\"validate_transforms\",\n        checkpoint_name=\"labeled_projects\",\n        data_context_root_dir=GE_ROOT_DIR,\n        fail_task_on_validation_failure=True,\n    )\n\n    # Define DAG\n    extract_and_load_projects \u003e\u003e validate_projects\n    extract_and_load_tags \u003e\u003e validate_tags\n    [validate_projects, validate_tags] \u003e\u003e transform \u003e\u003e validate_transforms\n```\n\n\u003chr\u003e\n\nNow we have our entire DataOps DAG defined, and executing it will prepare our data from extraction to loading to transformation (with validation at every step of the way) for [downstream applications](https://madewithml.com/courses/mlops/data-stack#applications).\n\n\u003cdiv class=\"ai-center-all\"\u003e\n    \u003cimg src=\"https://madewithml.com/static/images/mlops/orchestration/dataops.png\" width=\"700\" alt=\"dataops\"\u003e\n\u003c/div\u003e\n\n## Learn more\n\nLearn a lot more about data engineering, including infrastructure that we haven't covered in code here and how it's poised for downstream analytics and machine learning applications, in our [data stack](https://madewithml.com/courses/mlops/data-stack/), [orchestration](https://madewithml.com/courses/mlops/orchestration/), and [feature store](https://madewithml.com/courses/mlops/feature-store/) 
lessons.