https://github.com/riju18/apache-airflow-fundamentals
# index
+ [What is Airflow](#airflow)
+ [Benefits](#benefits)
+ [Core Components](#core-components)
+ [Core Concept](#core-concept)
+ [Other Concepts](#other-concepts)
+ [How Airflow works](#how-airflow-works)
+ [Usage of Airflow](#usage-of-airflow)
+ [Define a DAG](#define_dag)
+ [Condition on Task](#branching)
+ [Airflow Webserver Problem](#airflow-webserver-problem)
+ [Interact with Sqlite3](#interact-with-sqlite3)
+ [Deploy](#deploy)
+ [DAG Optimization](#dag-optimization)
+ [Amazing Airflow Operators](#airflow_operators)
+ [Airflow Security](#airflow_roles)
+ [version](#version)

# airflow
+ It's an orchestrator to **execute tasks** at the **right time**, in the **right way**, and in the **right order**.
# **benefits**
+ **Dynamic**
  + Everything is written in Python, so the possibilities are practically limitless.
+ **Scalability**
  + It's possible to run as many tasks as we want in parallel.
+ **UI**
  + Monitor the data pipelines.
  + Retry failed tasks.
  + Data profiling:
    + run SQL queries
    + show data in charts
+ **Extensible**

# core-components
+ **Web server**
  + A Flask server (served with Gunicorn) that exposes the UI.
+ **Scheduler**
  + It schedules the DAG runs and submits tasks to the executor at the right time.
+ **Metastore**
  + The database where all metadata about Airflow itself, as well as about our data pipelines, runs, tasks and so on, is stored.
+ **Executor**
+ It defines how our tasks are going to be executed.
  + Types
    + **SequentialExecutor**: executes tasks one after another.
    + **LocalExecutor**: can execute tasks in parallel.
+ **Worker**
  + It defines where the task will be executed.

# **core-concept**
+ **DAG** (Directed Acyclic Graph): tasks can depend on one another, but there is no loop.
+ **Operator**: It's a kind of wrapper around a task. Ex: if we want to connect to our DB and insert data into it, we'll use an operator to do that. **One operator for one task.**
  + **Action** : It executes a function or command.
  + **Transfer** : It transfers data from a source to a destination.
  + **Sensor** : It waits for something to happen before moving on to the next task (see the sketch after this list).
    + **poke_interval (sec)** : How often (every n seconds) the sensor re-checks the condition.
    + **timeout (sec)** : Max time limit to wait.
    + **soft_fail (boolean)** : If set to **True**, the task is marked as skipped instead of failed when the timeout is reached.
+ **Backfilling & catchup** : Catchup runs the DAG for schedule intervals that were missed in the past. By default it is **True**. When catchup is **True** the DAG runs from the last run date (or the start date); when it is **False** the DAG only runs from the current date onwards.
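
A minimal sketch tying the sensor parameters above together, along with the DAG-level `catchup` flag. The `FileSensor`, the `fs_default` connection id and the file path are illustrative assumptions, not part of this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

# catchup=False: don't backfill runs for past schedule intervals
with DAG(dag_id='sensor_demo',
         start_date=datetime(2024, 1, 1),
         schedule_interval='@daily',
         catchup=False):

    # Wait for a file to appear before the downstream task runs.
    wait_for_file = FileSensor(task_id='wait_for_file',
                               fs_conn_id='fs_default',    # illustrative filesystem connection
                               filepath='/tmp/input.csv',  # illustrative path
                               poke_interval=30,           # re-check every 30 seconds
                               timeout=60 * 10,            # give up after 10 minutes
                               soft_fail=True)             # skipped instead of failed on timeout

    process_file = BashOperator(task_id='process_file',
                                bash_command='echo "processing /tmp/input.csv"')

    wait_for_file >> process_file
```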
# **other-concepts**
+ **Task Instance** : A specific run of a task for a given DAG run (a task at a point in time).
+ **Workflow** : It's the combination of all concepts.
+ **Hook** : It embodies a connection to a remote server, service or platform. It's used to transfer data between a source and a destination (see the sketch after this list).
+ **Pool** : Limits how many tasks can run concurrently on a fixed set of worker slots; task priority can be set within the pool.
+ **Plugin** : Airflow lets us create custom plugins such as operators, hooks, sensors, etc.
+ **.airflowignore** : Lists the DAG files we want the scheduler to ignore. The file must be placed in the dags directory.
+ **Zombies/undeads** : Zombies are tasks that Airflow thinks are running but whose process has died; undeads are task processes that are still running although Airflow no longer expects them to be.
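
A minimal sketch of using a hook inside a task, reusing the `postgres_local_connection` connection id and `actor` table that appear later in this README; both are assumed to already exist:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def fetch_sample_rows():
    # The hook wraps a connection defined in the Airflow metastore/UI.
    hook = PostgresHook(postgres_conn_id='postgres_local_connection')  # illustrative connection id
    rows = hook.get_records('SELECT * FROM actor LIMIT 5;')            # illustrative table
    for row in rows:
        print(row)


with DAG(dag_id='hook_demo',
         start_date=datetime(2024, 1, 1),
         schedule_interval=None,
         catchup=False):

    fetch_data = PythonOperator(task_id='fetch_sample_rows',
                                python_callable=fetch_sample_rows)
```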
# how-airflow-works
+ **Single node architecture**
```mermaid
flowchart LR
web_server --> metastore
scheduler --> metastore
executor_queue <--> metastore
```
+ How it works
```mermaid
flowchart LR
web_server <-- parse_python_files --> folder_dags
scheduler <-- parse_python_files --> folder_dags
scheduler--parse_the_info--> metastore
executor--runs_the_task_and_update_metadata--> metastore
```
+ **Multi node architecture**
  + wip...

# usage-of-airflow
+ **airflow dir architecture**
+ **airflow.cfg** : airflow configuration
```
load_examples: True/False
sql_alchemy_conn: sqlite/MySQL/postgres connection string
```
+ **airflow.db** : DB information
+ **logs** : log information
+ **webserver_config.py** : webserver configuration
+ **make a dir named ```dags```**
+ **airflow -h** : all available cmd
+ **DB**
+ **Initialize the metastore/db (for the 1st time)**
```
airflow db init (deprecated in 2.7)
airflow db migrate
```
+ **Update the db version. Ex: 1.10.x to 2.2.x**
```
airflow db upgrade (deprecated in 2.7)
airflow db migrate
```
+ **Reset the DB**
```
airflow db reset
```
+ **Check the status after changing the configuration.**
```
airflow db check
```
+ **UI**
+ **Run the UI**
```
airflow webserver
```
+ **Connection**
+ **List all connection names & details.**
```
airflow connections list
```
+ **Create a user**
```
airflow users create -u uname -f firstname -l lastname -p password -e email -r role[Admin, Viewer, User, Op, Public]
```
+ **Enable the scheduler**
```
airflow scheduler
```
+ **DAGs**
+ **List all DAGs**
```
airflow dags list
```
+ **List all tasks of a particular DAG**
```
airflow tasks list dagName
```
+ **Export the DAG dependency graph as an image/PDF**
```sh
sudo apt-get install graphviz
```
```sh
airflow dags show dag_name --save file_name.pdf
```
+ **Test**
+ **It shows whether the task succeeds or fails. It's good practice to test every task before deploying.**
```
airflow tasks test dag_id task_id date
```
+ **Tasks**:
+ **Sequential ordering** : task1 >> task2 >> task3 >> task4
+ **Parallel ordering** : task1 >> [task2, task3] >> task4
+ **Trigger a DAG from another DAG**
```python
from airflow.models import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.operators.bash import BashOperator
from airflow.utils.edgemodifier import Label
from datetime import timedelta, datetime

default_args = {
    'owner': 'admin',
    'email_on_failure': False,
    'email_on_retry': False,
    'email_on_success': False,
    'email': '[email protected]',
    'retries': 1,
    'retry_delay': timedelta(seconds=10)
}

with DAG(dag_id='DAG2'
         , default_args=default_args
         , description='A simple test DAG, triggered manually'
         , start_date=datetime(2023, 11, 8)
         , schedule_interval=None  # only run when triggered
         , catchup=False):

    # Tasks
    # ==============
    task1 = BashOperator(task_id='task1'
                         , bash_command='sleep 1')
    task2 = BashOperator(task_id='task2'
                         , bash_command='sleep 2')
    task3 = BashOperator(task_id='task3'
                         , bash_command='sleep 3')

    # trigger DAG
    trigger_child_dag1 = TriggerDagRunOperator(task_id='trigger_child_dag1',
                                               trigger_dag_id='DAG1',
                                               execution_date='{{ ds }}',
                                               reset_dag_run=True,
                                               wait_for_completion=True,
                                               poke_interval=2)

    # Task flow
    trigger_child_dag1 >> task1 >> task2 >> task3
```
+ **ScaleUp task**
1) **airflow.cfg** :
+ **executor**: What kind of execution (sequential or parallel)
+ **sql_alchemy_conn**: DB connection
+ **parallelism**: 1....n (How many tasks will be executed in parallel for the entire airflow instance.)
+ **dag_concurrency**: 1....n (How many tasks can be run in parallel for a given DAG.)
+ **max_active_runs_per_dag**: 1....n (How many runs of a given DAG can be active in parallel at a time.)
2) **celery**: Distributes & executes tasks asynchronously.
```
pip install 'apache-airflow[celery]'
```
+ **caution** : **Celery can't be used with sqlite. Use MySQL/Postgres.**
3) **redis(in memory DB)**:
+ installation:
+ [link](https://phoenixnap.com/kb/install-redis-on-ubuntu-20-04)
+ cmd:
```
sudo apt update
sudo apt install redis-server -y
sudo nano /etc/redis/redis.conf
# change "supervised no" to "supervised systemd"
# run redis server:
sudo systemctl restart redis.service
# check server status:
sudo systemctl status redis.service
```
+ airflow.cfg:
+ executor: CeleryExecutor
+ broker_url: redis url (localhost or IP)
+ result_backend: sql_alchemy_conn
4) **airflow redis package**
```
pip3 install 'apache-airflow-providers-redis'
```
5) **flower**: The UI which allows monitoring the workers that execute the tasks.
```
airflow celery flower
```
```
Problem: flower import error
Solution: pip install --upgrade apache-airflow-providers-celery==2.0.0 (for airflow 2.1.0)
```
+ ip: localhost:5555
6) **Add a worker in celery**
```
airflow celery worker
```
7) **TaskGroup**: To run similar kinds of tasks in parallel.
```python
from airflow.utils.task_group import TaskGroup
```
8) **XCom**: It's used to push/pull data between tasks.
9) **Trigger**: conditional task execution.
+ [documentation](https://tinyurl.com/bddfzajn)

# define_dag
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable

from utility.ms_teams_notification import send_fail_notification_teams_message,\
    send_success_notification_teams_message

default_args = {
'owner' : 'admin',
'email_on_failure': True,
'email_on_retry': False,
'retries': int(Variable.get("no_of_retry")),
'retry_delay': timedelta(seconds=int(Variable.get("task_retry_delay_in_sec"))),
'on_failure_callback': send_fail_notification_teams_message,
'on_success_callback': send_success_notification_teams_message
}

with DAG(dag_id='TrggerFileTransferAndIngestionDAG'
         , dag_display_name='Trigger File Transfer And Ingestion DAG'
         , default_args=default_args
         , description='Trigger SFTPfileTransferDefaultSource, SFTPfileTransferSaviyntIDM and KCCIngestDataToBigQuery DAG'
         , start_date=datetime(2025, 2, 21)
         , schedule_interval='0 17 * * *'  # every day at 17:00
         , tags=['bigquery', 'schedule', 'daily']
         , catchup=False
         , owner_links={"admin": "mailto:[email protected]"}
         # or, owner_links={"admin": "https://www.example.com"}
         ):
    # define tasks
    pass
```

# branching
- wip... (a minimal illustrative sketch follows below)
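
Until this section is filled in, here is a minimal sketch of conditional task execution with `BranchPythonOperator`, which also shows an XCom push/pull. The DAG id, task ids and the even/odd condition are all illustrative assumptions:

```python
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def generate_number():
    # The return value is automatically pushed to XCom under the key 'return_value'.
    return random.randint(1, 100)


def choose_branch(ti):
    # Pull the number pushed by the upstream task from XCom and pick a branch.
    number = ti.xcom_pull(task_ids='generate_number')
    return 'even_task' if number % 2 == 0 else 'odd_task'


with DAG(dag_id='branching_demo',
         start_date=datetime(2024, 1, 1),
         schedule_interval=None,
         catchup=False):

    generate = PythonOperator(task_id='generate_number',
                              python_callable=generate_number)

    branch = BranchPythonOperator(task_id='choose_branch',
                                  python_callable=choose_branch)

    even_task = BashOperator(task_id='even_task', bash_command='echo "even"')
    odd_task = BashOperator(task_id='odd_task', bash_command='echo "odd"')

    # Only the branch returned by choose_branch runs; the other task is skipped.
    generate >> branch >> [even_task, odd_task]
```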
# airflow-webserver-problem
+ **Problem**: The webserver refuses to start because it reports it is already running under some PID (e.g., PID 4006).
+ **Solution**
```
kill -9 PID
```

# interact-with-sqlite3
+ **Access the DB**
+ **sqlite3 path/db_name.db** -> opens the DB
+ **List all tables**
```
.tables
```
+ **select * from tableName;** -> shows a particular table
# deploy
+ **GCP Composer**
+ create a vpc
- subnet creation mode: ```custom```
- add a subnet
- private google access: ```on```
+ create an env in GCP composer & upload the files in **DAG** folder
1. give proper role to **default service** account(*[email protected])
- cloud sql client
- editor
- Eventarc Event Receiver
2. create a **service account**
3. goto ```IAM```
4. click the checkbox in middle right side
5. find the cloud_composer_service_account like ```service-*@cloudcomposer-accounts.iam.gserviceaccount.com```, click checkbox and click Edit principal
- Cloud Composer API Service Agent
- Cloud Composer v2 API Service Agent Extension
6. Click ```GRANT ACCESS```
- Add principals
- select the created service account(```step #2```)
- Assign roles
- Cloud Composer v2 API Service Agent Extension
- Eventarc Event Receiver
- save
7. goto ```Service Accounts```
- select the created service account
- goto ```permissions```
- select ```*[email protected]```
- role: ```Editor```
- select ```*@mxs-cmdatalake-prd.iam.gserviceaccount.com```
- role: ```Cloud Composer v2 API Service Agent Extension``` and ```Service Account Token Creator```
- select ```service-*@cloudcomposer-accounts.iam.gserviceaccount.com```
- role: ```Cloud Composer API Service Agent```, ```Cloud Composer v2 API Service Agent Extension``` and ```Service Account Admin```
8. **bind**
```sh
gcloud iam service-accounts add-iam-policy-binding \
weselect-data-dev@we-select-data-dev-422614.iam.gserviceaccount.com \
--member serviceAccount:service-126779322718@cloudcomposer-accounts.iam.gserviceaccount.com \
--role roles/composer.ServiceAgentV2Ext
```
9. **create**
- console
```sh
gcloud composer environments create env_name \
--location us-central1 \
--image-version composer-2.7.1-airflow-2.7.3 \
--service-account "weselect-data-dev@we-select-data-dev-422614.iam.gserviceaccount.com"
```
10. [doc](https://cloud.google.com/composer/docs/composer-2/create-environments)
+ if composer in ```private``` env:
1. goto ```cloud NAT```
2. create ```cloud NAT gateway```
3. NAT type ```public```
4. Select Cloud Router
- network: vpc
- region: as same as ```composer```
- cloud router: create a new router
5. Network service tier: ```Standard```(**for dev**)
+ **how to access the DB from `GCP composer`**:
  + GCP Composer uses **PostgreSQL** by default, which is kept in **GKE**
+ steps:
1. get the GKE cluster name from
```mermaid
flowchart LR
composer --> env_name --> environment_configuration
```
2. get the full sqlAlchemy connection string from
```mermaid
flowchart LR
composer --> env_name --> airflow_webserver --> Admin --> Configurations
```
3. save 2 IPs from composer sql proxy service
```mermaid
flowchart TB
composer --> env_name --> environment_configuration --> GKE_cluster --> details --> Networking --> service_ingress --> airflow_sqlproxy_service
airflow_sqlproxy_service --> cluster_ip
airflow_sqlproxy_service --> serving_pods_endpoint
```
4. create a virtual machine in the same region/network as the airflow_sqlproxy_service (cluster_ip)
5. Finally, execute the psql cmd to get the DB details
```sh
# get the dbname, user, password and port from the sqlAlchemy connection string
# (psql will prompt for the password)
psql -h airflow_sqlproxy_service_serving_pods_endpoint -p 3306 -U root -d db_name
```
+ **User authentication**
- composer config:
```mermaid
flowchart LR
composerName --> overwrite_airflow_config --> rbac_user_role:viewer
```
- create new user
```bash
gcloud composer environments run example-environment \
--location us-central1 \
users create -- \
-r Op \
-e "[email protected]" \
-u "[email protected]" \
-f "Name" \
-l "Surname" \
--use-random-password
```
- update user role
```bash
gcloud composer environments run ENVIRONMENT_NAME \
--location LOCATION \
users add-role -- -e USER_EMAIL -r Admin
```

# dag-optimization
- keep tasks `atomic`
- use a `static` start date
- change the `name` of the DAG when you change the `start date`
- Don't fetch an `airflow variable` at the top level of the DAG file; read it inside `methods/operators` (or via templates) instead.
- Break down a big pipeline into smaller pipelines/tasks, not a single task or pipeline.
- Use `template fields`, `variable`, and `macros`.
- Executor
  + use `LocalExecutor/CeleryExecutor/DaskExecutor/KubernetesExecutor/CeleryKubernetesExecutor` (in the Cloud we can ignore this)
- **idempotency**: an operation can be applied multiple times without changing the result
- Never pull/process a large dataset with pandas or any other library inside Airflow
- For dataOps use `dbt`/`sqlmesh`/`whatever`.
- use `TaskGroup` to run similar kinds of tasks simultaneously
- use a loop to create dynamic tasks for similar types of tasks (see the sketch after this list)
- Make proper calculation of `parallelism, max_active_tasks_per_dag and max_active_runs_per_dag`
- `N.B.:` Airflow is an orchestrator. Don't ever process large amounts of data via `airflow`. Use the corresponding tool/software/library/framework (e.g., `spark`)
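
A minimal sketch of the `TaskGroup` and loop-based dynamic task ideas above; the DAG id, group name and table list are illustrative assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

# Illustrative list of similar items we want one task each for.
TABLES = ['customers', 'orders', 'payments']

with DAG(dag_id='taskgroup_loop_demo',
         start_date=datetime(2024, 1, 1),
         schedule_interval='@daily',
         catchup=False):

    start = BashOperator(task_id='start', bash_command='echo "start"')
    end = BashOperator(task_id='end', bash_command='echo "end"')

    # Similar tasks created in a loop and grouped together; they can run in
    # parallel, subject to parallelism / max_active_tasks_per_dag settings.
    with TaskGroup(group_id='export_tables') as export_tables:
        for table in TABLES:
            BashOperator(task_id=f'export_{table}',
                         bash_command=f'echo "exporting {table}"')

    start >> export_tables >> end
```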
# airflow_operators
- **SQLExecuteQueryOperator**
> Execute any SQL query from any SQL DB.
> [doc](https://tinyurl.com/uy26ncne)
```python
from datetime import datetime, timedelta

from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# task 1:
execute_sql_query = SQLExecuteQueryOperator(task_id='execute_sql_query'
, task_display_name='get sample data'
, conn_id='postgres_local_connection'
, sql='SELECT * FROM PUBLIC.ACTOR LIMIT 1;'
, show_return_value_in_logs=True)
```

# airflow_roles
- **Public**
> Public (anonymous) users don't have any permissions.
- **Viewer**
> Viewer users have limited read permissions.
- **User**
> User users have Viewer permissions plus additional permissions.
- **Op**
> Op users have User permissions plus additional permissions.
- **Admin**
> Admin users have all possible permissions, including granting or revoking permissions from other users. Admin users have Op permissions plus additional permissions.
- [doc](https://airflow.apache.org/docs/apache-airflow-providers-fab/stable/auth-manager/access-control.html)
# version
+ **2.9.0**