Airflow Concepts
External
Internal
Workflow
DAG
The edges can be labeled in the UI.
SubDAG
A DAG is made of tasks among which there are relations of dependency. The DAG is not concerned about what happens inside the tasks, it is only concerned about how to run them: order, retries, timeouts. etc.
Declaring a DAG
- Via a context manager.
- With the
DAG()
constructor. - With the
@dag
decorator. TO PARSE: https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#the-dag-decorator.
DAG Run
A DAG instantiates in a DAG Run at runtime.
Control Flow
Dynamic DAG or Dynamic Task Mapping
The DAGs can be purely declarative, or they can be declared in Python code, by adding tasks dynamically. For more details, see:
DAG File Processing
Task
A Task is the basic unit of execution in Airflow. Every task must be assigned to a DAG to run. Tasks have dependencies on each other. There could be upstream dependencies (if B depends on A, A → B, then A is an upstream dependency of B). To be scheduled, a task have all its dependencies met.
Task Relationships
The task relationships, are a key part in using Tasks.
There are two types of relationships: dependency and
Upstream and Downstream Dependency
If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph.
The upstream term has a very strict semantics: an upstream task is the task that is directly preceding the other task. This concept does not describe the tasks that are higher in the task hierarchy (they are not a direct parent of the task). Same constrains apply to a downstream task, which need to be a direct child of the other task.
Previous and Next
There may also be instances of the same task, but for different data intervals - from other runs of the same DAG. These are previous and next.
Task Types
Airflow has three types of tasks: Operator, Sensor, which is a subclass of Operator, and TaskFlow-decorated Task. All these are subclasses of Airflow's BaseOperator
. Operators and Sensor are templates, and when one is called in a DAG, it is made into a Task.
Operator
An Operator is a predefined task template.
Sensor
A Sensor is a subclass of Operator that wait for an external event to happen.
Also see Deferrable Operators and Triggers.
TaskFlow-decorated Task
Decorated with @task
. A custom Python function packaged up as a Task. For more details see:
Task Assignment to DAG
Task Instance
The same way a DAG is instantiated at runtime into a DAG Run, the tasks under a DAG are instantiated into Task Instances.
Task States
none
The task has not yet been queued for execution because its dependencies are not yet met.
scheduled
The task dependencies have been met, and the scheduled has determined that the task should run.
queued
The task has been assigned to an executor and it is awaiting a worker.
running
The task is running on a worker or on a local/synchronous executor.
success
The task finished running without errors.
shutdown
The task was externally requested to shut down when it was running.
restarting
The task was externally requested to restart when it was running.
failed
The task had an error during execution and failed to run.
skipped
The task was skipped due to branching, LatestOnly or similar.
upstream_failed
An upstream task failed and the Trigger Rule says we needed it.
up_for_retry
The task failed, but has retry attempts left and will be rescheduled.
up_for_reschedule
The task is a sensor that is in reschedule mode.
sensing
The task is a Smart Sensor.
deferred
The task has been deferred to a trigger.
removed
The task has vanished from the DAG since the run started.
Task Lifecycle
The normal lifecycle of a task instance is none → scheduled → queued → running → success.
Passing Data between Tasks
Tasks pass data among each other using:
- XComs, when the amount of metadata to be exchanged is small.
- Uploading and downloading large files from a storage service.
TaskGroup
This is a pure UI concept.
Task Timeout
Task SLA
An SLA, or a Service Level Agreement, is an expectation for the maximum time a Task should take.
Zombie/Undead Tasks
Per-Task Executor Configuration
Deferrable Operators and Triggers
XComs
Variables
Variables are an Airflow runtime configuration concept. Variables are maintained in a general key/value store, which is global, and which can be queried from the tasks.
Programming model:
from airflow.models import Variable
# Normal call style
some_variable = Variable.get("some_variable")
# Auto-deserializes a JSON value
some_other_variable = Variable.get("some_other_variable", deserialize_json=True)
# Returns the value of default_var (None) if the variable is not set
some_variable_2 = Variable.get("some_variable_2", default_var=None)
The variables can be used from templates.
Variables are global and should be only used for overall configuration that covers the entire installation. To pass data to and from tasks, XComs should be used instead.
Workload
Scheduler
Executor
Executors are the mechanism by which task instances get run. All executors have a common API and they are "pluggable", meaning they can be swapped based on operational needs.
There is no need to run a separate executor process (though you can). For local executors, the executor’s logic runs inside the scheduler process. If a scheduler is running, then the executor is running.
Executor Types
Local Executors
Local executors run tasks locally inside the scheduler process.
- Debug Executor
- Local Executor
- Sequential Executor. Airflow comes configured with the Sequential Executor by default.
Remote Executors
Remote executors run tasks remotely, usually via a pool of workers.
- Celery Executor
- CeleryKubernetes Executor
- Dask Executor
- Kuberentes Executor
- LocalKubernetes Executor