Airflow Concepts

From NovaOrdis Knowledge Base
Jump to navigation Jump to search

External

Internal

Workflow

DAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#dags
Graph Concepts | Directed Acyclic Graph

The edges can be labeled in the UI.

DAG Name

When the DAG is declared with the DAG() constructor, the name is the first argument of the constructor. When the DAG is declared with the @dag decorator, the name of the DAG is the name of the function.

Declaring a DAG

DAGs are declared in Airflow Python script files, which are just configuration file specifying the DAG's structure as code. The actual tasks defined in it will run in a different context from the context of the script. Different tasks run on different workers at different points in time, which means that the script cannot be used to communicate between tasks. People sometimes think of the DAG definition file as a place where they can do some actual data processing. That is not the case at all. The script’s purpose is to define a DAG object. It needs to evaluate quickly (seconds, not minutes) since the scheduler will execute it periodically to reflect the changes if any.

With @dag Decorator

Declare a DAG with @dag decorator

With DAG() Constructor

Declare a DAG with a DAG() constructor

Via ContextManager

The DAG can be declared via a context manager.

DAG Configuration Parameters

Programming model usage example:

DAG Configuration Parameters Programming Model

schedule_interval

start_date

Required parameter.

catchup

max_active_runs

dagrun_timeout

default_args

DAG Run

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#concepts-dag-run
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#dag-runs
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html

A DAG instantiates in a DAG Run at runtime. A DAG Run has an associated execution context.

External Triggers

https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#external-triggers

Control Flow

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#control-flow

Dynamic DAG or Dynamic Task Mapping

The DAGs can be purely declarative, or they can be declared in Python code, by adding tasks dynamically. For more details, see:

Dynamic Task Mapping

DAG File Processing

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dagfile-processing.html

DAG Serialization

https://airflow.apache.org/docs/apache-airflow/stable/dag-serialization.html

DAG Scope

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#scope

DAG Configuration

Default Arguments

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#default-arguments

Task

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#tasks

A Task is the basic unit of execution in Airflow. Every task must be assigned to a DAG to run. Tasks have dependencies on each other. There could be upstream dependencies (if B depends on A, A → B, then A is an upstream dependency of B). To be scheduled, a task have all its dependencies met.

Task Relationships

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships

The task relationships, are a key part in using Tasks.

There are two types of relationships: dependency and

Upstream and Downstream Dependency

If a task B has a dependency on task A (A → B), it is said that A is upstream of B and B is downstream of A. The dependencies are the directed edges of the directed acyclic graph.

The upstream term has a very strict semantics: an upstream task is the task that is directly preceding the other task. This concept does not describe the tasks that are higher in the task hierarchy (they are not a direct parent of the task). Same constrains apply to a downstream task, which need to be a direct child of the other task.

Previous and Next

There may also be instances of the same task, but for different data intervals - from other runs of the same DAG. These are previous and next.

Task Types

Airflow has three types of tasks: Operator, Sensor, which is a subclass of Operator, and TaskFlow-decorated Task. All these are subclasses of Airflow's BaseOperator. Operators and Sensor are templates, and when one is called in a DAG, it is made into a Task.

Operator

https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#operators

An Operator is a predefined task template.

Sensor

https://airflow.apache.org/docs/apache-airflow/stable/concepts/sensors.html
https://airflow.apache.org/docs/apache-airflow/stable/concepts/smart-sensors.html
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#sensors

A Sensor is a subclass of Operator that wait for an external event to happen.

Also see Deferrable Operators and Triggers.

@task Task Decorator, TaskFlow-decorated Task

Decorated with @task. A custom Python function packaged up as a Task. For more details see:

TaskFlow

Task Assignment to DAG

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#dag-assignment

Task Instance

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#task-instances
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#task-instances

The same way a DAG is instantiated at runtime into a DAG Run, the tasks under a DAG are instantiated into task instances.

The task instance is passed as the first argument of the methods annotated with @task, which provides the task behavior.

Task States

none

The task has not yet been queued for execution because its dependencies are not yet met.

scheduled

The task dependencies have been met, and the scheduled has determined that the task should run.

queued

The task has been assigned to an executor and it is awaiting a worker.

running

The task is running on a worker or on a local/synchronous executor.

success

The task finished running without errors.

shutdown

The task was externally requested to shut down when it was running.

restarting

The task was externally requested to restart when it was running.

failed

The task had an error during execution and failed to run.

skipped

The task was skipped due to branching, LatestOnly or similar.

upstream_failed

An upstream task failed and the Trigger Rule says we needed it.

up_for_retry

The task failed, but has retry attempts left and will be rescheduled.

up_for_reschedule

The task is a sensor that is in reschedule mode.

sensing

The task is a Smart Sensor.

deferred

The task has been deferred to a trigger.

removed

The task has vanished from the DAG since the run started.

Task Lifecycle

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#task-lifecycle

The normal lifecycle of a task instance is nonescheduledqueuedrunningsuccess.

Task Configuration and Data Exchange

Variables

https://airflow.apache.org/docs/apache-airflow/stable/concepts/variables.html
https://airflow.apache.org/docs/apache-airflow/stable/howto/variable.html
https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#variables

Variables are an Airflow runtime configuration concept. Variables are maintained in a general key/value store, which is global and shared by the entire Airflow instance, and which can be queried from the tasks. While the variables can be created via API, they can also be created and updated via UI, gy going to Admin → Variables. Internally, the variables are stored in the variable table. They can be encrypted.

Programming model:

from airflow.models import Variable

# Normal call style
some_variable = Variable.get("some_variable")

# Auto-deserializes a JSON value
some_other_variable = Variable.get("some_other_variable", deserialize_json=True)

# Returns the value of default_var (None) if the variable is not set
some_variable_2 = Variable.get("some_variable_2", default_var=None)

The variables can be used from templates.

Variables are global and should be only used for overall configuration that covers the entire installation. To pass data to and from tasks, XComs should be used instead.

Params

https://airflow.apache.org/docs/apache-airflow/stable/concepts/params.html

Params are used to provide runtime configuration to tasks. When a DAG is started manually, its Params can be modified before the DAG run starts.

DAG-level Params

Task-level Params

https://airflow.apache.org/docs/apache-airflow/stable/concepts/params.html#task-level-params

XComs

Tasks pass data among each other using:

  • XComs, when the amount of metadata to be exchanged is small.
  • Uploading and downloading large files from a storage service.

More details:

XComs

Execution Context

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#accessing-current-context

What is the execution context?

An execution context is associated with a DAG run. The current execution context can be retrieved from a task, during the task execution. The context is not accessible during pre_execute or post_execute..

Programming model:

from airflow.operators.python import task, get_current_context

@task
def some_task():
    context = get_current_context()
    a = context["a"]

Airflow populates the context with keys like "dag", "task", and the associated values:

Key Value Note
conf airflow.configuration.AirflowConfigParser instance
dag DAG instance
dag_run DAG Run instance
data_interval_start 2022-07-16T04:12:58.698129+00:00
data_interval_end 2022-07-16T04:12:58.698129+00:00
ds 2022-07-16
ds_nodash 20220716
execution_date 2022-07-16T04:12:58.698129+00:00
inlets
logical_date 2022-07-16T04:12:58.698129+00:00
macros
next_ds 2022-07-16
next_ds_nodash 20220716
next_execution_date 2022-07-16T04:12:58.698129+00:00
outlets
params
prev_data_interval_start_success 2022-07-16T03:38:00.597026+00:00
prev_data_interval_end_success 2022-07-16T03:38:00.597026+00:00
prev_ds 2022-07-16
prev_ds_nodash 20220716
prev_execution_date 2022-07-16T04:12:58.698129+00:00
prev_execution_date_success
prev_start_date_success 2022-07-16T03:38:01.381156+00:00
run_id manual__2022-07-16T04:12:58.698129+00:00
task
task_instance TaskInstance instance
task_instance_key_str task_experiment_2__task_d__20220716
test_mode False
ti TaskInstance instance See task_instance.
tomorrow_ds 2022-07-17
tomorrow_ds_nodash 20220717
ts 2022-07-16T04:12:58.698129+00:00
ts_nodash 20220716T041258
ts_nodash_with_tz 20220716T041258.698129+0000
var {'json': None, 'value': None}
conn None
yesterday_ds 2022-07-15
yesterday_ds_nodash 20220715
templates_dict None

Is the context a valid method to exchange data between tasks? How about task being executed in parallel and concurrently mutating the context? Is the context thread safe?

TaskGroup

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#concepts-taskgroups

This is a pure UI concept.

Task Timeout

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#timeouts

Task SLA

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#slas

An SLA, or a Service Level Agreement, is an expectation for the maximum time a Task should take.

Zombie/Undead Tasks

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#zombie-undead-tasks

Per-Task Executor Configuration

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#executor-configuration

Deferrable Operators and Triggers

https://airflow.apache.org/docs/apache-airflow/stable/concepts/deferring.html

Task Logging

See:

Airflow Logging and Monitoring | Logging for Tasks

Branching

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#branching

Workload

Scheduler

https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html

Executor

https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html

Executors are the mechanism by which task instances get run. All executors have a common API and they are "pluggable", meaning they can be swapped based on operational needs.

There is no need to run a separate executor process (though you can). For local executors, the executor’s logic runs inside the scheduler process. If a scheduler is running, then the executor is running.

Executor Types

Local Executors

Local executors run tasks locally inside the scheduler process.

Remote Executors

Remote executors run tasks remotely, usually via a pool of workers.

Worker

Metadata Database

Connections & Hooks

https://airflow.apache.org/docs/apache-airflow/stable/concepts/connections.html

Connection

https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html

Hook

https://airflow.apache.org/docs/apache-airflow/stable/concepts/connections.html#hooks
https://airflow.apache.org/docs/apache-airflow/stable/python-api-ref.html#pythonapi-hooks

Pool

https://airflow.apache.org/docs/apache-airflow/stable/concepts/pools.html

Macros

https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#templates-ref

Timetables

https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html

Priority Weights

https://airflow.apache.org/docs/apache-airflow/stable/concepts/priority-weight.html

Cluster Policies

https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html

Plugins

https://airflow.apache.org/docs/apache-airflow/stable/plugins.html

Security

Airflow Security

Logging and Monitoring

Logging and Monitoring

Integration

https://airflow.apache.org/docs/apache-airflow/stable/integration.html

Python Module Management

https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html

Trigger Rule

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#trigger-rules

Jinja Support

https://airflow.apache.org/docs/apache-airflow/2.0.0/concepts.html#jinja-templating

Airflow Programming Model

Airflow Programming Model