{"id":25249787,"url":"https://github.com/mateuseap/apache-airflow","last_synced_at":"2025-08-22T02:09:07.588Z","repository":{"id":180321423,"uuid":"664934536","full_name":"mateuseap/apache-airflow","owner":"mateuseap","description":"Learning the fundamental concepts of Apache Airflow","archived":false,"fork":false,"pushed_at":"2023-07-15T18:43:59.000Z","size":491,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T22:11:40.222Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mateuseap.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-11T04:42:16.000Z","updated_at":"2023-08-13T22:45:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"e5b805a0-118b-44d7-bc2c-a9e7b4d3cf8c","html_url":"https://github.com/mateuseap/apache-airflow","commit_stats":null,"previous_names":["mateuseap/apache-airflow"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mateuseap/apache-airflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mateuseap%2Fapache-airflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mateuseap%2Fapache-airflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mateuseap%2Fapache-airflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mateuseap%2Fapache-airflow/manifests","owner_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/owners/mateuseap","download_url":"https://codeload.github.com/mateuseap/apache-airflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mateuseap%2Fapache-airflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271574431,"owners_count":24783319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-12T03:50:53.082Z","updated_at":"2025-08-22T02:09:07.556Z","avatar_url":"https://github.com/mateuseap.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Apache Airflow\n\nLearning the fundamental concepts of Apache Airflow\n\n    .\n    ├── materials\n    │   └── *\n    ├── .gitignore\n    └── README.md\n\n➡️ Course: [Apache Airflow: The Hands-On Guide](https://www.udemy.com/course/the-ultimate-hands-on-course-to-master-apache-airflow/)\n\n## The basics of Apache Airflow\n\n### What is Apache Airflow?\n**Apache Airflow** is an open-source platform to programmatically author, schedule, and monitor workflows.\n\nPros:\n\n- It is **dynamic**, everything is done in Python.\n- It is **scalable**.\n- Has a **user interface**.\n- It is **extensible**, so you can add plugins to it on your own.\n\n### Core Components\n- **Web server** \n    - 
Flask server with Gunicorn serving the UI.\n- **Scheduler**\n    - Daemon in charge of scheduling workflows.\n- **Metastore**\n    - Database where metadata is stored.\n- **Executor**\n    - Class defining **how** your tasks should be executed.\n- **Worker**\n    - Process/subprocess **executing** your task.\n\n### What is a DAG?\n\n**DAG** stands for **directed acyclic graph**; in other words, it is a directed graph with no cycles. Here is an example:\n\n![Directed Acyclic Graph](./assets/DAG.png)\n\nBasically, a DAG in Apache Airflow represents a **data pipeline**. The **vertices** of a DAG represent the **tasks**, and the **directed edges** represent dependencies between the tasks.\n\n### Operators\n\n**Operators** are the building blocks of Airflow DAGs. They contain the logic of how data is processed in a pipeline. Each task in a DAG is defined by instantiating an operator. There are many different types of operators available in Airflow.\n\n- **Action Operators**\n    - Operators in charge of executing something.\n- **Transfer Operators**\n    - Operators that transfer data from a source to a destination.\n- **Sensor Operators**\n    - Operators that wait for something to happen before moving forward.\n\n### Task instances\n\nA **task instance** represents a specific run of a task and is identified by the combination of a DAG, a task, and a point in time.\n\n### Workflow\n\nA **workflow** in Apache Airflow is the combination of all the concepts covered so far in this README file.\n\n### What is Apache Airflow not?\n\nIt is not a data streaming solution, nor a data processing framework. You should not process terabytes or gigabytes of data inside Airflow itself. 
Apache Airflow is a way to trigger external tools; it's an excellent orchestrator.\n\n## How does Apache Airflow work?\n\n### One Node Architecture\n\n![One Node Architecture](./assets/One%20Node%20Architecture.png)\n\n### Multi Nodes Architecture (Celery)\n\n![Multi Nodes Architecture (Celery)](./assets/Multi%20Nodes%20Architecture%20(Celery).png)\n\n## Apache Airflow dependencies\n\n### Extra dependencies\n\nThe basic ```apache-airflow``` PyPI package only installs what’s needed to get started. Additional packages can be installed depending on what will be useful in your environment. For instance, if you don’t need connectivity with **Postgres**, you won’t have to go through the trouble of installing the ```postgres-devel``` yum package, or whatever equivalent applies on the distribution you are using.\n\nMost of the **extra dependencies** are linked to a corresponding **provider package**. For example, the “amazon” extra has a corresponding ```apache-airflow-providers-amazon``` provider package. When you install Airflow with such extras, the necessary provider packages are installed automatically (latest versions from PyPI for those packages). However, you can freely upgrade and install provider packages independently from the main Airflow installation.\n\nFor the list of the extras and what they enable, check out the Airflow documentation: [Reference for package extras](https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html)\n\n### Provider packages\n\nAirflow 2.0 is delivered in multiple, separate, but connected packages. The core of the Airflow scheduling system is delivered as the ```apache-airflow``` package, and there are around 60 packages that can be installed separately as so-called **Airflow provider packages**. 
The default Airflow installation doesn’t include many integrations, so you have to install them yourself.\n\nFor more information, check out the Airflow documentation: [Provider packages](https://airflow.apache.org/docs/apache-airflow-providers/index.html)\n\n### Differences between extras and providers\n\n**Extras** and **providers** are different things, though many extras lead to installing providers. Extras are a standard Python setuptools feature that lets you add sets of dependencies as optional features to “core” Apache Airflow. Provider packages are just one type of such optional feature, and not all optional features of Apache Airflow have corresponding providers.\n\n## Apache Airflow files\n\nAfter installing Apache Airflow, you'll get the following files inside your ```airflow``` folder:\n\n    .\n    ├── airflow.cfg\n    ├── airflow.db\n    ├── logs\n    │   └── *\n    └── webserver_config.py\n\n- **airflow.cfg**\n    - Stores all configuration settings of Apache Airflow.\n- **airflow.db**\n    - The SQLite database of Apache Airflow.\n- **logs**\n    - Folder that stores the logs of the ```scheduler``` and the ```tasks```.\n- **webserver_config.py**\n    - File used to configure the web server, specifically how users are authenticated in the Apache Airflow **user interface**.\n\n## Running Apache Airflow locally using Docker\n\nThe fastest way is to use a **Docker image** of Apache Airflow. You can use the **Dockerfile** inside the [materials/section-2/](/materials/section-2/Dockerfile) folder to build your first Apache Airflow Docker image. 
After getting the Dockerfile, run the commands below:\n\n```bash\n# Build a Docker image from the Dockerfile in the current directory and name it 'airflow-basic'\ndocker build -t airflow-basic .\n\n# Create a Docker container named 'airflow' from the 'airflow-basic' image and bind container port 8080 to local port 8080. The '-d' flag runs the container in the background\ndocker run --name airflow -d -p 8080:8080 airflow-basic\n```\n\nAfter running the above commands and waiting a little while, you can view the Apache Airflow **user interface** by opening [http://localhost:8080/](http://localhost:8080/) in your web browser:\n\n![Apache Airflow UI](./assets/Apache%20Airflow%20UI.png)\n\nYou'll be able to log in using the credentials below:\n\n- Username: **admin**\n- Password: **admin**\n\nThere are a lot of useful commands in the [materials/section-2/docs/cli_commands.txt](/materials/section-2/docs/cli_commands.txt) file, and I recommend you take a look at it! I also strongly recommend learning about Docker ([Docker reference documentation](https://docs.docker.com/reference/)); it's really important.\n\n## Apache Airflow CLI\n\nThere are some operations that you can't do through the friendly Apache Airflow user interface; for those, you'll need to use the **Apache Airflow CLI (command-line interface)**. The first thing you have to know is that the commands in the Apache Airflow CLI are **grouped**: most of the commands are separated into different groups according to the **resources** they interact with. 
Below are some useful commands:\n\n- ```airflow -h```\n    - List all available groups and commands.\n- ```airflow db init```\n    - Initialize the **metadatabase** and generate the files and folders needed by Apache Airflow.\n- ```airflow db reset```\n    - Delete all the **metadata** in the Apache Airflow **metadatabase**.\n- ```airflow db upgrade```\n    - Upgrade the schemas that are in the Apache Airflow **metadatabase**.\n- ```airflow webserver```\n    - Start the **web server** and the **user interface** of Apache Airflow.\n- ```airflow scheduler```\n    - Start the **scheduler**.\n- ```airflow celery worker```\n    - Start a Celery **worker** on the machine, so you can run **tasks** on it.\n- ```airflow dags list```\n    - List all the **DAGs**.\n- ```airflow dags trigger [dag_id] -e [execution_date]```\n    - Trigger a specific **DAG** and set its **execution date**.\n- ```airflow dags list-runs -d [dag_id]```\n    - List all **runs** of a specific **DAG**.\n- ```airflow dags backfill -s [start_date] -e [end_date] [dag_id] --reset-dagruns```\n    - Rerun past **DAG runs** of a specific **DAG** between two dates (e.g. ```[start_date] -\u003e 2021-01-01``` and ```[end_date] -\u003e 2021-01-05```).\n- ```airflow tasks list [dag_id]```\n    - List all the **tasks** of a given **DAG**.\n- ```airflow tasks test [dag_id] [task_id] [execution_date]```\n    - Check whether a specific **task** of a given **DAG** works.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmateuseap%2Fapache-airflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmateuseap%2Fapache-airflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmateuseap%2Fapache-airflow/lists"}