Airflow Concepts: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
Line 57: Line 57:
====<tt>none</tt>====
====<tt>none</tt>====
The task has not yet been queued for execution because its dependencies are not yet met.
The task has not yet been queued for execution because its dependencies are not yet met.
====<tt>scheduled</tt>====
====<tt>scheduled</tt>====
The task dependencies have been met, and the scheduled has determined that the task should run.
====<tt>queued</tt>====
====<tt>queued</tt>====
The task has been assigned to an [[#Executor|executor]] and it is awaiting a [[#Worker|worker]].
====<tt>running</tt>====
====<tt>running</tt>====
The task is running on a [[#Worker|worker]] or on a local/synchronous executor.
====<tt>success</tt>====
====<tt>success</tt>====
The task finished running without errors.
====<tt>shutdown</tt>====
====<tt>shutdown</tt>====
The task was externally requested to shut down when it was running.
====<tt>restarting</tt>====
====<tt>restarting</tt>====
The task was externally requested to restart when it was running.
====<tt>failed</tt>====
====<tt>failed</tt>====
The task had an error during execution and failed to run.
====<tt>skipped</tt>====
====<tt>skipped</tt>====
The task was skipped due to branching, LatestOnly or similar.
====<tt>upstream_failed</tt>====
====<tt>upstream_failed</tt>====
An upstream task failed and the Trigger Rule says we needed it.
====<tt>up_for_retry</tt>====
====<tt>up_for_retry</tt>====
The task failed, but has retry attempts left and will be rescheduled.
====<tt>up_for_reschedule</tt>====
====<tt>up_for_reschedule</tt>====
The task is a [[#Sensor|sensor]] that is in reschedule mode.
====<tt>sensing</tt>====
====<tt>sensing</tt>====
The task is a Smart Sensor.
====<tt>deferred</tt>====
====<tt>deferred</tt>====
The task has been deferred to a trigger.
====<tt>removed</tt>====
====<tt>removed</tt>====
The task has vanished from the DAG since the run started.


==Passing Data between Tasks==
==Passing Data between Tasks==

Revision as of 19:24, 11 July 2022

External

Internal

Workflow

DAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html
Graph Concepts | Directed Acyclic Graph

The edges can be labeled in the UI.

SubDAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#concepts-subdags

A DAG is made of tasks among which there are relations of dependency. The DAG is not concerned about what happens inside the tasks, it is only concerned about how to run them: order, retries, timeouts. etc.

Declaring a DAG

DAG Run

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#concepts-dag-run

A DAG instantiates in a DAG Run at runtime.

Control Flow

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#control-flow

Dynamic DAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#dynamic-dags

The DAGs can be purely declarative, or they can be declared in Python code, by adding tasks dynamically.

Task

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html

A Task is the basic unit of execution in Airflow. Every task must be assigned to a DAG to run. Tasks have dependencies on each other. There could be upstream dependencies (if B depends on A, A → B, then A is an upstream dependency of B). To be scheduled, a task have all its dependencies met.

Task Dependencies

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships

The task dependencies, or their relationships, are a key part in using Tasks. If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph.

The upstream term has a very strict semantics: an upstream task is the task that is directly preceding the other task. This concept does not describe the tasks that are higher in the task hierarchy (they are not a direct parent of the task). Same constrains apply to a downstream task, which need to be a direct child of the other task.

Task Types

Airflow has three types of tasks: Operator, Sensor, which is a subclass of Operator, and TaskFlow-decorated Task. All these are subclasses of Airflow's BaseOperator. Operators and Sensor are templates, and when one is called in a DAG, it is made into a Task.

Operator

https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html

An Operator is a predefined task template.

Sensor

https://airflow.apache.org/docs/apache-airflow/stable/concepts/sensors.html

A Sensor is a subclass of Operator that wait for an external event to happen.

TaskFlow-decorated Task

https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html

Decorated with @task. A custom Python function packaged up as a Task.

Task Assignment to DAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#dag-assignment

Task Instance

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#task-instances

The same way a DAG is instantiated at runtime into a DAG Run, the tasks under a DAG are instantiated into Task Instances.

Task States

none

The task has not yet been queued for execution because its dependencies are not yet met.

scheduled

The task dependencies have been met, and the scheduled has determined that the task should run.

queued

The task has been assigned to an executor and it is awaiting a worker.

running

The task is running on a worker or on a local/synchronous executor.

success

The task finished running without errors.

shutdown

The task was externally requested to shut down when it was running.

restarting

The task was externally requested to restart when it was running.

failed

The task had an error during execution and failed to run.

skipped

The task was skipped due to branching, LatestOnly or similar.

upstream_failed

An upstream task failed and the Trigger Rule says we needed it.

up_for_retry

The task failed, but has retry attempts left and will be rescheduled.

up_for_reschedule

The task is a sensor that is in reschedule mode.

sensing

The task is a Smart Sensor.

deferred

The task has been deferred to a trigger.

removed

The task has vanished from the DAG since the run started.

Passing Data between Tasks

Tasks pass data among each other using:

  • XComs, when the amount of metadata to be exchanged is small.
  • Uploading and downloading large files from a storage service.

TaskGroup

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#concepts-taskgroups

This is a pure UI concept.

XComs

https://airflow.apache.org/docs/apache-airflow/stable/concepts/taskflow.html

"Cross-communications".

Workload

Scheduler

https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html

Executor

https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html

Worker

Metadata Database

Connections & Hooks

https://airflow.apache.org/docs/apache-airflow/stable/concepts/connections.html

Pool

https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html