Airflow Concepts: Difference between revisions
Line 29: | Line 29: | ||
{{External|https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships}} | {{External|https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships}} | ||
The task dependencies, or their relationships, are a key part in using Tasks. If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph. | The task dependencies, or their relationships, are a key part in using Tasks. If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph. | ||
The upstream term has a very strict semantics: an upstream task is the task that is '''directly''' preceding the other task. This concept does not describe the tasks that are higher in the task hierarchy (they are not a direct parent of the task). Same constrains apply to a downstream task, which need to be a direct child of the other task. | |||
==Task Types== | ==Task Types== |
Revision as of 02:22, 11 July 2022
External
Internal
Workflow
DAG
The edges can be labeled in the UI.
SubDAG
A DAG is made of tasks among which there are relations of dependency. The DAG is not concerned about what happens inside the tasks, it is only concerned about how to run them: order, retries, timeouts. etc.
Declaring a DAG
- Via a context manager.
- With the
DAG()
constructor. - With the
@dag
decorator. TO PARSE: https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#the-dag-decorator.
Control Flow
Dynamic DAG
The DAGs can be purely declarative, or they can be declared in Python code, by adding tasks dynamically.
Task
A Task is the basic unit of execution in Airflow. Every task must be assigned to a DAG to run. Tasks have dependencies on each other. There could be upstream dependencies (if B depends on A, A → B, then A is an upstream dependency of B).
Task Dependencies
The task dependencies, or their relationships, are a key part in using Tasks. If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph.
The upstream term has a very strict semantics: an upstream task is the task that is directly preceding the other task. This concept does not describe the tasks that are higher in the task hierarchy (they are not a direct parent of the task). Same constrains apply to a downstream task, which need to be a direct child of the other task.
Task Types
Airflow has three types of tasks: Operator, Sensor, which is a subclass of Operator, and TaskFlow-decorated Task. All these are subclasses of Airflow's BaseOperator
. Operators and Sensor are templates, and when one is called in a DAG, it is made into a Task.
Operator
An Operator is a predefined task template.
Sensor
A Sensor is a subclass of Operator that wait for an external event to happen.
TaskFlow-decorated Task
Decorated with @task
. A custom Python function packaged up as a Task.
Task Assignment to DAG
Passing Data between Tasks
Tasks pass data among each other using:
- XComs, when the amount of metadata to be exchanged is small.
- Uploading and downloading large files from a storage service.
TaskGroup
This is a pure UI concept.
XComs
"Cross-communications".