{"id":25750490,"url":"https://github.com/riju18/apache-airflow-fundamentals","last_synced_at":"2026-03-04T10:31:19.810Z","repository":{"id":279569224,"uuid":"934316111","full_name":"riju18/Apache-Airflow-Fundamentals","owner":"riju18","description":"Quick start setup, best practices for constructing DAGs, configuration, deployment, and utilizing essential operators for effective workflows are all covered in this repository's useful manual for using Apache Airflow. ","archived":false,"fork":false,"pushed_at":"2025-03-16T05:41:08.000Z","size":13,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T06:24:46.643Z","etag":null,"topics":["apache-airflow","gcp","gcp-composer","python","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/riju18.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-17T16:17:34.000Z","updated_at":"2025-03-16T05:41:13.000Z","dependencies_parsed_at":"2025-02-26T09:27:24.580Z","dependency_job_id":"22e11edd-b285-4d19-a2b5-d75104b6a21e","html_url":"https://github.com/riju18/Apache-Airflow-Fundamentals","commit_stats":null,"previous_names":["riju18/apache-airflow-fundamentals"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/riju18/Apache-Airflow-Fundamentals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riju18%2FApache-Airflow-Fundamentals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riju18%2FAp
ache-Airflow-Fundamentals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riju18%2FApache-Airflow-Fundamentals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riju18%2FApache-Airflow-Fundamentals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/riju18","download_url":"https://codeload.github.com/riju18/Apache-Airflow-Fundamentals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riju18%2FApache-Airflow-Fundamentals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30078308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T08:01:56.766Z","status":"ssl_error","status_checked_at":"2026-03-04T08:00:42.919Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","gcp","gcp-composer","python","sql"],"created_at":"2025-02-26T13:16:52.120Z","updated_at":"2026-03-04T10:31:19.730Z","avatar_url":"https://github.com/riju18.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# index\n\n+ [What is Airflow](#airflow)\n+ [Benefits](#benefits)\n+ [Core Components](#core-components)\n+ [Core Concept](#core-concept)\n+ [Other Concepts](#other-concepts)\n+ [How Airflow 
works](#how-airflow-works)\n+ [Usage of Airflow](#usage-of-airflow)\n+ [Define a DAG](#define_dag)\n+ [Condition on Task](#branching)\n+ [Pool](#pool)\n+ [Airflow Webserver Problem](#airflow-webserver-problem)\n+ [Interact with Sqlite3](#interact-with-sqlite3)\n+ [Deploy](#deploy)\n+ [DAG Optimization](#dag-optimization)\n+ [Amazing Airflow Operators](#airflow_operators)\n+ [Airflow Security](#airflow_roles)\n+ [version](#version)\n\n# airflow\n\n+ It's an orchestrator that **executes tasks** at the **right time**, in the **right way**, and in the **right order**.\n\n# **benefits**\n\n+ **Dynamic**\n  + Pipelines are written in Python, so anything Python can do is available when building them.\n+ **Scalability**\n  + It's possible to run as many tasks as we want in parallel.\n+ **UI**\n  + Monitor the data pipeline.\n  + Retry our tasks.\n  + Data profiling:\n    + run SQL queries\n    + show data in charts\n+ **Extensible**\n\n# core-components\n\n+ **Web server**\n  + Flask server with Gunicorn that serves the UI.\n\n+ **Scheduler**\n\n+ **Metastore**\n  + The DB that stores all the metadata about Airflow itself and about our data pipelines, plans, tasks \u0026 so on.\n\n+ **Executor**\n  + It defines how our tasks are going to be executed.\n  + Type\n    + **SequentialExecutor**: It executes tasks one after another.\n    + **LocalExecutor**: It can execute tasks in parallel.\n\n+ **Worker**\n  + It defines where the task will be executed.\n\n# **core-concept**\n\n+ **DAG**: Directed Acyclic Graph. Tasks depend on one another, but there are no loops.\n\n+ **Operator**: It's a kind of wrapper around a task. 
Ex: to connect to our DB and insert data into it, we use an operator. **One operator for one task.**\n  + **Action** : It executes a function or a command.\n  \n  + **Transfer** : It transfers data from a source to a destination.\n  \n  + **Sensor** : It waits for something to happen before moving to the next task.\n    + **poke_interval(sec)** : How many seconds the sensor waits between two checks.\n    \n    + **timeout(sec)** : Max time limit to wait.\n    \n    + **soft_fail(boolean)** : If set to **true**, the task is marked as skipped on failure.\n\n+ **Backfilling \u0026 catchup** : It runs the DAG for previous dates that were missed. Catchup is **True** by default. When catchup is **True**, the DAG runs for every missed interval since the last run date; when it is **False**, the DAG is only triggered from the current date.  \n  \n# **other-concepts**\n\n+ **Task Instance**\n\n+ **Workflow** : It's the combination of all concepts.\n\n+ **Hook** : It embodies a connection to a remote server, service or platform. It's used to transfer data between a source and a destination.\n\n+ **Pool** : limits how many tasks can run concurrently on a shared resource.\n\n+ **plugin** : Airflow lets us create custom plugins such as operators, hooks, sensors etc.\n\n+ **.airflowignore** : Patterns of DAG files we want Airflow to ignore. 
File must be put in dags dir.\n\n+ **zombies/undeads** : theory.\n\n# how-airflow-works\n\n+ **Single node architecture**\n  ```mermaid\n    flowchart LR\n\n    web_server --\u003e metastore\n    scheduler --\u003e metastore\n    executor_queue \u003c--\u003e metastore\n    ```\n  + How it works\n      ```mermaid\n      flowchart LR\n\n      web_server \u003c--parse_python_files --\u003e folder_dags\n      scheduler\u003c--parse_python_files --\u003e folder_dags\n      scheduler--parse_the_info--\u003e metastore\n      executor--runs_the_task_and_update_metadata--\u003e metastore\n      ```\n+ **Multi node architecture**\n  + wip...\n\n# usage-of-airflow\n\n+ **airflow dir architecture**\n  + **airflow.cfg** : airflow configuration\n\n    ```\n    load_examples: True/False\n    sql_alchemy_conn: sqlite/MySQL/postgres connection string\n    ```\n\n  + **airflow.db** : DB information\n  + **logs** : log information\n  + **webserver_config.py** : webserver configuration\n  + **make a dir named ```dags```**\n+ **airflow -h** : all available cmd\n+ **DB**\n  + **initialize the metastore/db(for the 1st time)**\n\n    ```\n    airflow db init     # deprecated in 2.7\n    airflow db migrate\n    ```\n\n  + **Update db version. 
Ex: 1.10.x to 2.2.x**\n\n    ```\n    airflow db upgrade  # deprecated in 2.7\n    airflow db migrate\n    ```\n\n  + **reset the DB**\n\n    ```\n    airflow db reset\n    ```\n\n  + **Check the status after changing the configuration.**\n\n    ```\n    airflow db check\n    ```\n\n+ **UI**\n  + **running the UI**\n\n    ```\n    airflow webserver\n    ```\n\n+ **Connection**\n  + **It returns all the connection names \u0026 details.**\n\n    ```\n    airflow connections list\n    ```\n\n+ **Create a user**\n\n  ```\n  airflow users create -u uname -f firstname -l lastname -p password  -e email -r role[Admin, Viewer, User, Op, Public]\n  ```\n\n+ **Enable scheduler**\n\n    ```\n    airflow scheduler\n    ```\n\n+ **Dags**\n  + **All dag list**\n\n    ```\n    airflow dags list\n    ```\n\n  + **All tasks of a particular DAG**\n\n    ```\n    airflow tasks list dagName\n    ```\n  \n  + **export DAG dependency as img/pdf or anything**\n    ```sh\n    sudo apt-get install graphviz\n    ```\n\n    ```sh\n    airflow dags show dag_name --save file_name.pdf\n    ```\n\n+ **Test**\n  + **It shows whether the task succeeds or fails. 
It's a good practice to test every task before deploy.**\n\n    ```\n    airflow tasks test dag_id task_id date\n    ```\n\n+ **Tasks**:\n  + **Sequential ordering** : task1 \u003e\u003e task2 \u003e\u003e task3 \u003e\u003e task4\n  + **Parallel ordering** : task1 \u003e\u003e [task2, task3] \u003e\u003e task4\n  + **Trigger a DAG from another DAG**\n      ```python\n      from airflow.models import DAG\n      from airflow.operators.trigger_dagrun import TriggerDagRunOperator\n      from airflow.operators.bash import BashOperator\n      from airflow.utils.edgemodifier import Label\n      from datetime import timedelta, datetime\n\n      default_args = {\n          'owner': 'admin',\n          'email_on_failure': False,\n          'email_on_retry': False,\n          'email_on_success': False,\n          'email': 'samrat.mitra@vivasoftltd.com',\n          'retries': 1,\n          'retry_delay': timedelta(seconds=10)\n      }\n\n      with DAG(dag_id='DAG2'\n              , default_args=default_args\n              , description='A simple Test Dag which runs every 2 min inerval'\n              , start_date=datetime(2023, 11, 8)\n              , schedule_interval=None  # only once\n              , catchup=False):\n          \n          # Tasks\n          # ==============\n\n          task1 = BashOperator(task_id='task1'\n                              , bash_command='sleep 1')\n          \n          task2 = BashOperator(task_id='task2'\n                              , bash_command='sleep 2')\n          \n          task3 = BashOperator(task_id='task3'\n                              , bash_command='sleep 3')\n        \n          # trigger DAG\n          trigger_child_dag1 = TriggerDagRunOperator(task_id='trigger_child_dag1',\n                                                    trigger_dag_id='DAG1',\n                                                    execution_date='{{ds}}',\n                                                    reset_dag_run=True,\n                        
                            wait_for_completion=True,\n                                                    poke_interval=2\n                                                    )\n\n          # Task flow\n\n          trigger_child_dag1 \u003e\u003e task1 \u003e\u003e task2 \u003e\u003e task3\n      ```\n  + **ScaleUp task**\n    1) **airflow.cfg** :\n       + **executor**: What kind of execution (sequential or parallel)\n       + **sql_alchemy_conn**: DB connection\n       + **parallelism**: 1....n (How many tasks can run in parallel across the entire airflow instance.)\n       + **dag_concurrency**: 1....n (How many tasks can run in parallel for a given DAG; renamed to **max_active_tasks_per_dag** in Airflow 2.2.)\n       + **max_active_runs_per_dag**: 1....n (How many runs of the same DAG can be active at a single time.)\n    2) **celery**: Distribute \u0026 execute the task asynchronously.\n\n        ```\n        pip install 'apache-airflow[celery]'\n        ```\n\n       + \u003cspan style=\"color: red;\"\u003e**caution**\u003c/span\u003e : **Celery can't be used with sqlite. 
Use MySQL/Postgres.**\n    3) **redis(in memory DB)**:\n        + installation:\n          + [link](https://phoenixnap.com/kb/install-redis-on-ubuntu-20-04)\n          + cmd:\n\n            ```\n              sudo apt update\n              sudo apt install redis-server -y\n              sudo nano /etc/redis/redis.conf\n              # in redis.conf: change `supervised no` to `supervised systemd`\n              sudo systemctl restart redis.service  # run redis server\n              sudo systemctl status redis.service   # check server status\n            ```\n\n        + airflow.cfg:\n          + executor: CeleryExecutor\n          + broker_url: redis url (localhost or IP)\n          + result_backend: sql_alchemy_conn\n    4) **airflow redis package**\n\n        ```\n        pip3 install 'apache-airflow-providers-redis'\n        ```\n\n    5) **Flower**: the UI for monitoring the Celery workers that execute the tasks.\n\n        ```\n        airflow celery flower\n        ```\n\n        ```\n          Problem: flower import error\n          Solution: pip install --upgrade apache-airflow-providers-celery==2.0.0 (for airflow 2.1.0)\n        ```\n\n       + ip: localhost:5555\n       + Add worker in celery\n\n          ```\n          airflow celery worker\n          ```\n\n    6) **TaskGroup**: To group similar kinds of tasks, which can then run in parallel\n\n        ```python\n        from airflow.utils.task_group import TaskGroup\n        ```\n\n    7) **Xcom**: It's used to push/pull data between tasks\n    8) **Trigger**: conditional task execution\n        + [documentation](https://tinyurl.com/bddfzajn)\n\n# define_dag\n\n- [DAG all params](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/dag/index.html)\n\n  ```python\n  from datetime import datetime, timedelta\n\n  from airflow import DAG\n  from airflow.models import Variable\n\n  from utility.ms_teams_notification import send_fail_notification_teams_message,\\\n        send_success_notification_teams_message\n\n  default_args = {\n    'owner' : 'admin',\n    
'email_on_failure': True,\n    'email_on_retry': False,\n    'retries': int(Variable.get(\"no_of_retry\")),\n    'retry_delay': timedelta(seconds=int(Variable.get(\"task_retry_delay_in_sec\"))),\n    'on_failure_callback': send_fail_notification_teams_message,\n    'on_success_callback': send_success_notification_teams_message,\n    'is_paused_upon_creation': True  # by default the DAG will be disabled\n    }\n\n  with DAG(dag_id='TrggerFileTransferAndIngestionDAG'\n        , dag_display_name='Trigger File Transfer And Ingestion DAG'\n         , default_args=default_args\n         , description=f'Trigger SFTPfileTransferDefaultSource, SFTPfileTransferSaviyntIDM and KCCIngestDataToBigQuery DAG'\n         , start_date=datetime(2025, 2, 21)\n         , schedule_interval='0 17 * * *'  # every day at 17\n         , tags=['bigquery', 'schedule', 'daily']\n         , catchup=False\n         , owner_links={\"admin\": \"mailto:username@gmail.com\"}\n         # or, owner_links={\"admin\": \"https://www.example.com\"}\n         , fail_stop=True  # the downstream tasks will skipped instead of getting failed\n         , dagrun_timeout=timedelta(seconds=10)\n         ):\n         # define tasks\n         pass\n  ```\n\n# branching\n\n```python\nfrom airflow.operators.python import PythonOperator, BranchPythonOperator\n\n\"\"\"\n--\u003e define the DAG\n  --\u003e then use the code\n\"\"\"\n\ndef _connect_to_api()-\u003e str:\n  \"\"\"\n  try to establish API connection\n\n  --\u003e returns task id according to status code\n  --\u003e if status_code == 200:\n      returns next task id\n  --\u003e else:\n      returns termination task id\n  \"\"\"\n\n  falcon = CSPMRegistration(client_id=Variable.get('falcon_client_id'),\n                            client_secret=Variable.get('falcon_client_secret')\n                            )\n  status_code = falcon.get_policy_settings(cloud_platform=cloud_provide_list).get('status_code', None)\n  if status_code:\n      if int(status_code) == 
200:\n          return 'get_api_data'  # next task\n      else:\n          return 'invalid_connection'  # termination task\n  return 'invalid_connection'  # no status code returned: treat as a failed connection\n\nconnect_to_api = BranchPythonOperator(task_id='connect_to_api',\n                                          task_display_name='🌐 connect to API',\n                                          python_callable=_connect_to_api,\n                                          trigger_rule=\"all_success\")\n\n# task #2: invalid connection\ninvalid_connection = PythonOperator(task_id='invalid_connection',\n                                    task_display_name='🚫 Invalid connection',\n                                    python_callable=lambda: print('API connection failed'),\n                                    trigger_rule=\"all_success\") \n\n# task #3: fetch API data\nget_api_data = PythonOperator(task_id='get_api_data',\n                            python_callable=_get_api_data)\n\n# taskflow\nconnect_to_api \u003e\u003e invalid_connection\nconnect_to_api \u003e\u003e get_api_data\n```\n\n# pool\n\u003e `Pools` limit how many tasks can run `concurrently` within the defined number of slots.\n\n\u003e `Priority Weights` decide the `order of execution` when multiple tasks are waiting in a queue\n\n```python\n# task 1:\nsleep1 = BashOperator(task_id='sleep1'\n                      , task_display_name='😴 sleep 1'\n                      , bash_command='sleep 1'\n                      , pool='my_pool'\n                      , priority_weight=1)\n        \n# task 2:\nsleep2 = BashOperator(task_id='sleep2'\n                      , task_display_name='😴 sleep 2'\n                      , bash_command='sleep 2'\n                      , pool='my_pool'\n                      , priority_weight=2\n                      , weight_rule='downstream')\n```\n\u003e Here, sleep2 (priority weight 2) will run before sleep1 (priority weight 1) if both are waiting for execution in my_pool, because the higher weight goes first.\n\n**Example Scenario**\n  \u003e You have a pool (api_calls) with 3 slots. 
You have 5 tasks assigned to this pool, each with its own priority weight:\n\n| Task ID  | Pool Name | Priority Weight\n| -------- | -------   | ---------------\n| Task A   | api_calls | 10\n| Task B   | api_calls | 5\n| Task C   | api_calls | 20\n| Task D   | api_calls | 15\n| Task E   | api_calls | 5\n\n\u003e Since the pool has only `3 slots`, only `3 tasks` can run at a time.\n\u003e `Tasks C (20), D (15), and A (10)` will run first, because they have the `highest` priority.\n\u003e `Tasks B (5) and E (5)` will remain in the queue until a `slot is free`.\n\n\n# airflow-webserver-problem\n\n+ **Problem**: the webserver reports that it is already running with some PID (e.g. 4006)\n  + **Solution**\n\n      ```\n      kill -9 PID\n      ```\n\n# interact-with-sqlite3\n\n+ **Access DB**\n  + **sqlite3 path/db_name.db** -\u003e To access DB\n  + **All table list**\n\n    ```\n    .tables\n    ```\n\n  + **select * from tableName** -\u003e Particular table\n\n# deploy\n\n+ **GCP Composer**\n  + create a vpc\n    - subnet creation mode: ```custom```\n    - add a subnet\n    - private google access: ```on```\n  + create an env in GCP composer \u0026 upload the files in **DAG** folder\n    1. give proper role to **default service** account(*-compute@developer.gserviceaccount.com)\n        - cloud sql client\n        - editor\n        - Eventarc Event Receiver\n    2. create a **service account**\n    3. go to ```IAM```\n    4. click the checkbox in middle right side\n    5. find the cloud_composer_service_account like ```service-*@cloudcomposer-accounts.iam.gserviceaccount.com```, click checkbox and click Edit principal\n        - Cloud Composer API Service Agent\n        - Cloud Composer v2 API Service Agent Extension\n    6. Click ```GRANT ACCESS```\n        - Add principals\n          - select the created service account(```step #2```)\n        - Assign roles\n          - Cloud Composer v2 API Service Agent Extension\n          - Eventarc Event Receiver\n          - save\n    7. 
goto ```Service Accounts```\n        - select the created service account\n        - goto ```permissions```\n          - select ```*-compute@developer.gserviceaccount.com```\n            - role: ```Editor```\n          - select ```*@mxs-cmdatalake-prd.iam.gserviceaccount.com```\n            -  role: ```Cloud Composer v2 API Service Agent Extension``` and ```Service Account Token Creator```\n          - select ```service-*@cloudcomposer-accounts.iam.gserviceaccount.com```\n            - role: ```Cloud Composer API Service Agent```, ```Cloud Composer v2 API Service Agent Extension``` and ```Service Account Admin```\n    \n    8. **bind**\n        ```sh\n        gcloud iam service-accounts add-iam-policy-binding \\\n        weselect-data-dev@we-select-data-dev-422614.iam.gserviceaccount.com \\\n        --member serviceAccount:service-126779322718@cloudcomposer-accounts.iam.gserviceaccount.com \\\n        --role roles/composer.ServiceAgentV2Ext\n        ```\n    9. **create**\n        - console\n\n          ```sh\n          gcloud composer environments create env_name \\\n          --location us-central1 \\\n          --image-version composer-2.7.1-airflow-2.7.3 \\\n          --service-account \"weselect-data-dev@we-select-data-dev-422614.iam.gserviceaccount.com\"\n          ```\n    10. [doc](https://cloud.google.com/composer/docs/composer-2/create-environments)\n  \n  + if composer in ```private``` env:\n    1. goto ```cloud NAT```\n    2. create ```cloud NAT gateway```\n    3. NAT type ```public```\n    4. Select Cloud Router\n        - network: vpc\n        - region: as same as ```composer```\n        - cloud router: create a new router\n    5. Network service tier: ```Standard```(**for dev**)\n\n        \n  + **how to access the DB from `GCP composer`**:\n    + GCP composer uses the **PostgreSQL** by default which is kept in **GKE**\n    + steps:\n      1. 
get the GKE cluster name from\n          ```mermaid\n          flowchart LR\n\n          composer --\u003e env_name --\u003e environment_configuration\n          ```\n      2. get the full sqlAlchemy conn from\n          ```mermaid\n          flowchart LR\n\n          composer --\u003e env_name --\u003e airflow_webserver --\u003e Admin --\u003e Configurations\n          ```\n      3. save 2 IPs from composer sql proxy service\n          ```mermaid\n          flowchart TB\n\n          composer --\u003e env_name --\u003e environment_configuration --\u003e GKE_cluster --\u003e details --\u003e Networking --\u003e service_ingress --\u003e airflow_sqlproxy_service\n\n          airflow_sqlproxy_service --\u003e cluster_ip\n\n          airflow_sqlproxy_service --\u003e serving_pods_endpoint\n          ```\n      4. create a virtual machine with the same region and airflow_sqlproxy_service(cluster_ip)\n      5. Finally, execute psql cmd to get the db details\n          ```sh\n          # get the dbname, user, password, port from sqlAlchemy connection\n          # psql has no password flag (-p is the port), so pass the password via PGPASSWORD\n\n          PGPASSWORD=password psql -h airflow_sqlproxy_service_serving_pods_endpoint -p 3306 -U root -d db_name\n          ```\n  \n  + **User authentication**\n    - composer config:\n      ```mermaid\n      flowchart LR\n\n      composerName --\u003e overwrite_airflow_config --\u003e rbac_user_role:viewer\n      ```\n    \n    - create new user\n      ```bash\n      gcloud composer environments run example-environment \\\n      --location us-central1 \\\n      users create -- \\\n      -r Op \\\n      -e \"example-user@example.com\" \\\n      -u \"example-user@example.com\" \\\n      -f \"Name\" \\\n      -l \"Surname\" \\\n      --use-random-password\n      ```\n    - update user role\n      ```bash\n      gcloud composer environments run ENVIRONMENT_NAME \\\n      --location LOCATION \\\n      users add-role -- -e USER_EMAIL -r Admin\n      ```\n\n# dag-optimization\n\n- keep tasks `atomic`\n\n- use a `static` start date \n\n- change the 
`name` of the DAG when you change the `start date` \n\n- Don't read an `airflow variable` at the top level of the DAG file; read it inside `methods/operators`, where it is used.\n\n- Break down a big pipeline into smaller pipelines/tasks, not a single task or pipeline.\n\n- Use `template fields`, `variable`, and `macros`.\n\n- Executor\n    + use `LocalExecutor/CeleryExecutor/DaskExecutor/KubernetesExecutor/CeleryKubernetesExecutor`(in Cloud we can ignore it)\n\n- **idempotency**: an operation can be applied multiple times and still produce the same result \n\n- Never pull/process a large dataset using pandas/any library in airflow \n\n- For dataOps use `dbt`/`sqlmesh`/`whatever`.\n\n- use `TaskGroup` to group similar kinds of tasks so they can run simultaneously\n\n- use a loop to create dynamic tasks for similar types of task \n\n- Make proper calculation of `parallelism, max_active_tasks_per_dag and max_active_runs_per_dag`\n\n- XCOM\n  \u003e XCOMs have size limitations. With Postgres, we can't share more than `1 GB` of data in an XCOM.\n  - **custom xcom backend**\n    - Instead of the `Airflow DB` to store your data, you can use an `S3 bucket/GCP cloud storage`.\n    - No more DB size limitation 🙌 \n    - **configurations**\n      - Install the `common io`,  and `amazon providers/GCP providers`\n      - Create a connection to your S3 bucket in UI\n      - Define the `XCOM_BACKEND` setting to `airflow.providers.common.io.xcom.backend.XComObjectStorageBackend`\n      - **XCOM_OBJECTSTORAGE_THRESHOLD**=`1048576`, Anything above `1MB` will be stored in S3, otherwise in the DB\n\n      - **xcom_objectstorage_path**:\n        - for aws: `s3://conn_id@bucket/path`\n        - for local: `Users/riju/airflow/xcoms` --\u003e absolute path\n      \n      - **xcom_objectstorage_threshold**: -1 always stores in the DB, 0 always stores in object storage, a positive value stores anything \u003e that many bytes in object storage\n\n- pool\n  - Use pools for resource-constrained tasks (e.g., API calls, database queries).\n  - Assign meaningful priority weights to ensure critical tasks run first.\n  - Do not rely on pools alone; also set 
`max_active_tasks_per_dag` and `max_active_runs_per_dag`.\n  - Avoid giving all tasks the same priority weight, or their execution order is not guaranteed.\n\n\n- `N.B:` Airflow is an Orchestrator. Don't ever process large amounts of data via `airflow`. Use a dedicated tool/software/library/framework (e.g., `spark`)\n\n# airflow_operators\n\n- **SQLExecuteQueryOperator**\n  \u003e Execute any SQL query from any SQL DB.\n  \u003e [doc](https://tinyurl.com/uy26ncne)\n  ```python\n  from datetime import datetime, timedelta\n\n  from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator\n\n  # task 1:\n  execute_sql_query = SQLExecuteQueryOperator(task_id='execute_sql_query'\n                                              , task_display_name='get sample data'\n                                              , conn_id='postgres_local_connection'\n                                              , sql='SELECT * FROM PUBLIC.ACTOR LIMIT 1;'\n                                              , show_return_value_in_logs=True)\n  ```\n\n- Execute large SQL file\n\n  ```python\n  from airflow.models import Variable\n  from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator\n\n  QUERY_SQL_PATH = 'utility/sql/sp_dev_score_tmp.sql' \n\n  create_or_replace_score_generation_sp = BigQueryInsertJobOperator(\n    task_id=\"create_or_replace_score_generation_sp\", \n    task_display_name=\"🛢 Create/replace dev score generations\", \n    gcp_conn_id='dev_gcp_service_account',\n    location=Variable.get('dataset_location'),\n    configuration={\n      \"query\": {\n        \"query\": \"{% include '\" + QUERY_SQL_PATH + \"' %}\",  # the SQL file is rendered by Jinja\n        \"useLegacySql\": False,\n        \"priority\": \"BATCH\",\n      }\n    }\n  )\n  ```\n\n# airflow_roles\n\n- **Public**\n  \u003e Public users (anonymous) don’t have any permissions.\n\n- **Viewer**\n  \u003e Viewer users have limited read permissions.\n\n- **User**\n  \u003e User users have Viewer 
permissions plus additional permissions.\n\n- **Op**\n  \u003e Op users have User permissions plus additional permissions. \n\n- **Admin**\n  \u003e Admin users have all possible permissions, including granting or revoking permissions from other users. Admin users have Op permission plus additional permissions.\n\n- [doc](https://airflow.apache.org/docs/apache-airflow-providers-fab/stable/auth-manager/access-control.html)\n\n# version\n\n+ **2.9.0**","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friju18%2Fapache-airflow-fundamentals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Friju18%2Fapache-airflow-fundamentals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friju18%2Fapache-airflow-fundamentals/lists"}