Kubernetes Pod and Container Concepts
External
- https://kubernetes.io/docs/concepts/workloads/pods/ (fully synced ✓)
Internal
Overview
A pod is the fundamental, atomic compute unit created and managed by Kubernetes. An application is deployed as one or more equivalent pods. There are various strategies to partition applications into pods. A pod groups together one or more containers. There are several types of containers: application containers, init containers and ephemeral containers. Pods are deployed on worker nodes. A pod has a well-defined lifecycle with several phases, and each of the pod's containers can only be in one of a well-defined number of states. Kubernetes learns what happens with a container through container probes.
Pod
A pod is a group of one or more containers that Kubernetes deploys and manages as a compute unit, together with the specification for how to run the containers. Kubernetes will not manage compute entities with smaller granularity, such as containers or processes. From a resource footprint perspective, a pod is bigger than a container, but smaller than a virtual machine. The containers of a pod are atomically deployed and managed as a group. A useful mental model when thinking of a pod is that of a logical host, where all its containers share a context. A pod contains one or more application containers and zero or more init containers.
The equivalent Amazon ECS construct is the task.
Pod Manifest
A pod manifest or a workload resource manifest includes a pod template.
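As an illustration, a minimal pod manifest might look like the sketch below. The pod name, the label, the container name and the image are hypothetical placeholders chosen for the example, not values taken from a real deployment.

```yaml
# Minimal pod manifest (sketch; all names and the image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: example
spec:
  containers:
    - name: app              # the single application container
      image: nginx:1.25      # assumed image; any runnable image would do
      ports:
        - containerPort: 80
```

When the same content appears under the .spec.template field of a workload resource such as a deployment, it acts as the pod template for that resource.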
Pod Operation Atomicity
Atomic Success or Failure
The deployment of a pod is an atomic operation. This means that a pod is either entirely deployed, with all its containers co-located on the same node, or not deployed at all. There will never be a situation where a partially deployed pod will be servicing application requests.
All Containers of a Pod are Scheduled on the Same Node
A pod can be scheduled on one node and one node only, regardless of how many containers the pod has. All containers in the pod are always co-located and co-scheduled on the same node. Only when all pod resources are ready does the pod become available and application traffic is directed to it.
The containers in a pod share a virtual network device with a unique IP address, storage in the form of filesystem volumes, and access to shared memory. From this perspective, a pod can be thought of as an application-specific logical host with all its processes (containers) sharing the network stack and the storage available to the host. In a pre-container world, these processes would have run on the same physical or virtual host. In line with this analogy, the pod cannot span hosts. The pod's containers are relatively tightly coupled and run within the shared context provided by the pod. The shared context of a pod is a set of Linux namespaces and cgroups. Within a pod's context, individual containers may have further sub-isolations applied.
Pods enable data sharing and communication among their constituent containers.
Networking
Each pod is assigned a unique IP address in the pod network. Inside the pod, all containers share the network namespace, including the IP address and network ports, and can communicate among themselves using localhost. When containers in the pod communicate with entities outside the pod, they must coordinate how they use shared network resources such as ports. The containers in a pod can also communicate with each other using standard inter-process communication mechanisms like System V semaphores and POSIX shared memory. Containers in different pods have distinct IP addresses and cannot communicate via IPC primitives without special configuration. Containers belonging to different pods that want to communicate with each other must use IP networking.
More details about networking in:
Pod Hostname
Containers within a pod see the system hostname as being the same as the configured name for the pod.
Storage
The files that are created in the root filesystem of a container are stored in the writable layer of the container, which is discarded when the container is stopped or fails. This makes these files ephemeral. If the containers of a pod intend to store state beyond their existence, they can use the volumes provided by the pod. A pod can specify a set of shared storage volumes. All containers in the pod can access shared volumes.
The most common way to provide storage to pods is in the form of Persistent Volumes, which are a type of cluster-level Kubernetes resource. Persistent volumes can be shared among the containers of a pod and also among different pods. The volumes are declared in the pod specification section of the pod manifest (.spec.volumes). The volume declarations are shared by all containers of that pod. A volume is mounted inside a container as a container volume mount. The volume mounts are specific to a container, and are declared in the .spec.containers[*].volumeMounts field. Each container in the pod must independently specify where to mount each volume. A process in a container sees a filesystem view composed from its container image and volumes. The container image is at the root of the filesystem hierarchy, and any volumes are mounted at the specified paths within the image.
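The split between pod-level volume declarations and container-level volume mounts can be sketched as follows. This is an illustrative manifest only: it uses an emptyDir scratch volume instead of a persistent volume to stay self-contained, and the pod name, container names, image and paths are all assumptions made for the example.

```yaml
# Sketch: one volume declared at pod level, mounted at different paths by two containers.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-example        # hypothetical name
spec:
  volumes:
    - name: shared-data              # declared once, in .spec.volumes
      emptyDir: {}                   # scratch volume that exists as long as the pod does
  containers:
    - name: writer
      image: busybox:1.36            # assumed image
      command: ["sh", "-c", "while true; do date >> /data/out.log; sleep 5; done"]
      volumeMounts:
        - name: shared-data
          mountPath: /data           # mount point chosen by this container
    - name: reader
      image: busybox:1.36
      # waits for the file written by the other container, then follows it
      command: ["sh", "-c", "until [ -f /input/out.log ]; do sleep 1; done; tail -f /input/out.log"]
      volumeMounts:
        - name: shared-data
          mountPath: /input          # same volume, different path
          readOnly: true
```

Each container refers to the volume by name and picks its own mount path, which is why the mounts live under .spec.containers[*].volumeMounts rather than next to the volume declaration.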
Also see:
Security Context
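Security restrictions and privileges for constituent containers can be set by defining a security context, either at the pod level (.spec.securityContext), where it applies to all containers, or for an individual container (.spec.containers[*].securityContext). Some settings, such as running the container in privileged mode, are only available at the container level. More details about pod and container security concepts are available in: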
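A security context might be declared as in the sketch below. The user and group IDs, the pod name and the image are arbitrary values assumed for the example.

```yaml
# Sketch: pod-level and container-level security contexts.
apiVersion: v1
kind: Pod
metadata:
  name: security-context-example     # hypothetical name
spec:
  securityContext:                   # pod level: applies to all containers
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
    - name: app
      image: busybox:1.36            # assumed image
      command: ["sh", "-c", "id; sleep 3600"]
      securityContext:               # container level: settings for this container only
        runAsNonRoot: true
        allowPrivilegeEscalation: false
```

Where the same field is set at both levels, the container-level value takes precedence for that container.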
Single-Container Pods vs. Multi-Container Pods
Pods are used in two main ways: pods that run a single container and pods that run multiple containers that work together.
The most common case is to declare a single container in a pod. In this case the pod is an extra wrapper around one container - Kubernetes manages the pod instead of managing the container directly. Even if a pod can accommodate multiple containers, the preferred way to scale an application is to add more one-container pods, instead of adding more containers in a pod.
There are advanced use cases that require running multiple containers inside a pod. Containers share a pod when they execute tightly-coupled workloads, provide complementary functionality and need to share resources. Configuring two or more containers in the same pod guarantees that the containers will run on the same node. Some commonly accepted use cases for collocated containers are service meshes and logging. A typical pattern for this arrangement is the sidecar pattern.
Each container of a multi-container pod can be exposed externally on its own port. The containers share the pod's network namespace, and thus the TCP and UDP port ranges, so two containers in the same pod cannot listen on the same port.
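A multi-container pod in the spirit of the sidecar pattern could look like the following sketch. The names, images and the polling command are assumptions made for illustration; a real sidecar would typically be a proxy or a log shipper.

```yaml
# Sketch: a main container plus a sidecar sharing the pod's network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-example              # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25              # assumed image, listens on port 80
      ports:
        - containerPort: 80
    - name: checker-sidecar
      image: busybox:1.36            # assumed image
      # reaches the web container over localhost, because both share one network namespace
      command: ["sh", "-c", "while true; do wget -q -O /dev/null http://localhost:80/ && echo ok; sleep 10; done"]
```

If the second container also ran a server, it would have to bind a port other than 80, because the port range belongs to the pod as a whole.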
Pod State
Pods should not maintain state and should be handled as expendable. Kubernetes treats pods as static and largely immutable (changes cannot be made to a pod definition while the pod is running) and as expendable (they do not maintain state when they are destroyed and recreated). Therefore, they are managed as workload resources backed by controllers, such as deployments or jobs, not directly by users, though pods can be started and managed individually if the user so wishes. To modify a pod configuration, the current pod must be terminated, and a new one with a modified base image and/or configuration must be created.
For pods that must maintain state, Kubernetes provides a specialized workload resource named stateful set.
Pod Lifecycle
Pods are usually created by the controllers which manage workload resources, but they can also be created individually. A pod instance is created from a pod template, which can exist by itself in a pod manifest or it can be a part of a workload resource manifest. During their creation phase, the pods are assigned a unique ID (UID).
Once created, a pod is scheduled to run on a node: all its containers are scheduled on the same node. Once scheduled on the node, the pod remains on that node until:
- the pod finishes execution
- the pod resource is deleted
- the pod is evicted for lack of resources
- the node fails. If the node fails, all pods running on the node are scheduled for deletion after a timeout period.
This is another way of saying that a pod is scheduled once in its lifetime. Once the pod is scheduled (assigned) to a node, the pod will run on that node until one of the conditions listed above is met. Pods do not "self-heal" by themselves: if conditions like node failure or eviction occur, the pods are deleted, and a higher level abstraction, the workload resource and its controller, starts equivalent pods on other nodes. A given pod as defined by its UID is never rescheduled to a different node. Instead, that pod can be replaced by a new, almost-identical pod, even with the same name if desired, but with a different UID. Included objects, such as volumes, have the same lifecycle as their enclosing pod: they exist as long as the specific pod, with the exact UID, exists. If that pod is deleted for any reason, and a quasi-identical replacement is created, the related objects (the volume, for example) are also destroyed and created anew.
This lifecycle is reflected in the pod's phases: Pending, Running, Succeeded, Failed or Unknown. While the pod is running, if any of its containers fails, the kubelet will attempt to restart the failed container, depending on its configuration. To be able to do that, the kubelet tracks the states of the pod's containers.
If the pod template from which a set of pods was created changes, the workload resource controller that created the pods detects the change and creates new pods while the old pods are deleted, rather than updating or patching the existing pods.
It is possible to manage pods directly, by updating some of the fields of a running pod in place with kubectl patch or kubectl replace. However, updating pods in place has limitations. Most of the pod metadata (namespace, name, uid, creationTimestamp, etc.) is immutable. generation can only be incremented. More details on in-place pod updates are available here: Pod Update and Replacement.
Pod Phases
The pod phase is reflected by the .status.phase field of the pod status. The phase is a simple high-level summary of where the pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or pod state, nor is it intended to be a comprehensive state machine. The number and meanings of pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about pods that have a given phase value:
apiVersion: v1
kind: Pod
status:
  phase: Running
Pending
The pod has been accepted by the Kubernetes cluster, but one or more containers have not been set up and made ready to run. "Pending" includes the time a pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running
A pod transitions to the Running phase if it has been bound to a node, all of its containers have been created, and at least one of its primary containers is still running or is in the process of starting or restarting.
Succeeded
A pod transitions to the Succeeded phase if all of its primary containers have terminated successfully, and will not be restarted.
Failed
A pod transitions to the Failed phase if all of its primary containers have terminated, and at least one container has terminated in failure: it exited with non-zero status or it was terminated by the system. If a node dies or is disconnected from the cluster, Kubernetes applies a policy for setting the phase of all its pods to "Failed".
Unknown
The state of the pod cannot be obtained, usually due to an error in communication with the node where the pod should be running.
Terminating Pods
When a pod is being deleted, it is shown as "Terminating" by some kubectl commands. "Terminating" is not one of the pod phases. A pod is granted a grace period to terminate gracefully, which defaults to 30 seconds. Pods can be forcefully terminated by using the --force flag of kubectl delete.
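The grace period can be changed per pod through the .spec.terminationGracePeriodSeconds field, as in this sketch. The pod name, image and the 60-second value are assumptions chosen for the example.

```yaml
# Sketch: giving a pod a longer grace period than the 30-second default.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-example    # hypothetical name
spec:
  terminationGracePeriodSeconds: 60  # time allowed for a graceful stop before containers are killed
  containers:
    - name: app
      image: nginx:1.25              # assumed image
```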
Pod Status and Conditions
In the Kubernetes API, pods have both a specification .spec and an actual status .status, which includes, among other status elements, a set of pod conditions listed below. It is possible to inject custom readiness information into the condition data for a pod, if that makes sense for the application (TODO: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate).
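As a rough sketch of custom readiness information, a pod can list extra condition types under .spec.readinessGates; the pod is then reported as ready only when its containers are ready and each listed condition has status "True" in .status.conditions. The condition type, pod name and image below are hypothetical, and some external controller would be responsible for setting the condition on the pod status.

```yaml
# Sketch: a readiness gate that waits for a custom condition.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-gate-example                    # hypothetical name
spec:
  readinessGates:
    - conditionType: "example.com/feature-ready"  # hypothetical condition, set by an external controller
  containers:
    - name: app
      image: nginx:1.25                           # assumed image
```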
Pod Conditions
The pod conditions are listed in .status.conditions as an array:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: PodScheduled
Available conditions:
PodScheduled
The pod has been scheduled to a node.
ContainersReady
All containers in the pod are ready.
Initialized
All init containers have completed successfully.
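Ready
The pod is able to serve requests and should be added to the load balancing pools of all matching services.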
Pods and Nodes
Once bound to a node, a pod will never be detached from the node and re-bound to another node. The IP address of the node a pod is bound to can be retrieved by pulling the pod metadata, and searching the status for "hostIP". The name of the node can be found in the specification, searching for "nodeName".
apiVersion: v1
kind: Pod
metadata:
  name: [...]
spec:
  nodeName: ip-10-0-12-209.us-west-2.compute.internal
  [...]
status:
  hostIP: 10.0.12.209
  [...]
Pod Placement
There are situations when we want to schedule specific pods to specific nodes, for example a pod running an application with special memory requirements that only some of the nodes can satisfy. Pods can be configured to be scheduled on a specific node, defined by the node name, or on nodes that match a specific node selector.
To assign a pod to nodes that match a node selector, add the "nodeSelector" element to the pod configuration, with a value consisting of key/value pairs. After a successful placement, either by a replication controller or by a DaemonSet, the pod records the successful node selector expression as part of its definition, which can be rendered with kubectl get pod -o yaml. Once bound to a node, a pod will never be relocated to another node.
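A node selector might be declared as in the sketch below. The label key and value are hypothetical and would have to match a label actually present on the target nodes; the pod name and image are likewise placeholders.

```yaml
# Sketch: restricting a pod to nodes that carry a given label.
apiVersion: v1
kind: Pod
metadata:
  name: node-selector-example        # hypothetical name
spec:
  nodeSelector:
    memory-class: high               # hypothetical node label (key: value)
  containers:
    - name: app
      image: nginx:1.25              # assumed image
```

If no node carries all the listed labels, the pod stays in the Pending phase until a matching node becomes available.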
Pod Security
Pod Horizontal Scaling
Every pod is meant to run a single instance of a given application. If the application needs to scale to sustain more load, multiple pods should be started. In Kubernetes, this is typically referred to as replication. The equivalent pod instances are referred to as replicas. They are usually created and managed as a group by a workload resource and its controller.
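Replication is usually expressed on the workload resource rather than on the pod itself. The sketch below assumes a deployment as the workload resource; the names, the label and the image are placeholders.

```yaml
# Sketch: a deployment that keeps three equivalent pods (replicas) running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment           # hypothetical name
spec:
  replicas: 3                        # number of equivalent pods to maintain
  selector:
    matchLabels:
      app: example                   # must match the labels in the pod template
  template:                          # the pod template
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: app
          image: nginx:1.25          # assumed image
```

Scaling the application then amounts to changing the replicas value; the controller creates or deletes pods to match it.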
Static Pods
A static pod is managed directly by the kubelet process on a specific node, without the API server observing it. The kubelet directly supervises each static pod and restarts it if it fails, in contrast to regular pods, which are managed by the control plane through a workload resource of some sort. Static pods are always bound to one kubelet on a specific node. The main use for static pods is to run self-hosted control plane components such as the API server, etcd, the scheduler, etc. The kubelet automatically tries to create a mirror pod on the Kubernetes API server for each static pod. This means the static pods running on a node are visible on the API server, but cannot be controlled from there. The specification of a static pod cannot refer to other API objects such as service accounts, config maps, secrets, etc.
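One way the kubelet can be told where to find static pod manifests is the staticPodPath field of its configuration file, sketched below. The directory shown is the one conventionally used by kubeadm-based clusters and is an assumption here; other installations may use a different path.

```yaml
# Sketch: kubelet configuration fragment pointing at a static pod directory.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests   # assumed path; ordinary pod manifests placed here run as static pods
```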
Pods and Workload Resources
Pods and Containers
A pod and its containers have independent lifecycles. A pod is not a process, but an environment for running containers. Containers can be restarted in a pod, but a pod is never restarted: if a pod is gone, it is never resurrected. In the best case, another quasi-identical pod is created to take its place - more details available in the "Pod Lifecycle" section above.
Pod Operations
Container
Once the scheduler assigns a pod to a node, the kubelet starts creating containers for the pod, using the node's container runtime. There are three possible container states: Waiting, Running or Terminated. The kubectl describe pod <pod-name> command shows the state for each container within the pod.
Container Types
Application Container
The application container is also referred to as "primary container".
Init Container
Ephemeral Container
Container States
The container states are tracked by the kubelet, which may restart failed containers, depending on the configuration. This way, a pod is kept running. Container states can also be used as triggers for container lifecycle hooks.
Waiting
If a container is not in either the Running or Terminated state, it is in Waiting, where it is still running the operations it requires in order to complete start up: pulling the container image from the registry or applying Secret data. A kubectl query shows a "Reason" field when a container is in the Waiting state.
Running
The container is executing without issues. If there was a "postStart" hook configured, it has already executed and finished.
Terminated
The container began execution and then either ran to completion or failed for some reason. When queried with kubectl, the result shows a reason, an exit code and the start and finish time for the container. If a container has a "preStop" hook configured, that hook runs before the container enters the Terminated state.
Container Restart Policy
Pods are never restarted. Constituent containers may be restarted by the kubelet, subject to the pod's restart policy, as configured in .spec.restartPolicy. The possible values are Always, OnFailure and Never. The default value is Always. The restart policy applies to all containers in the pod. It only refers to restarts of the containers by the kubelet on the same node. After containers in a pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, ...) that is capped at five minutes. Once a container has run for 10 minutes without any problems, the kubelet resets the restart back-off timer for that container.
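A restart policy other than the default could be declared as in this sketch. The pod name, image and the deliberately failing command are assumptions made to illustrate the behavior.

```yaml
# Sketch: containers are restarted only when they exit with a failure.
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-example       # hypothetical name
spec:
  restartPolicy: OnFailure           # one of Always (default), OnFailure, Never
  containers:
    - name: task
      image: busybox:1.36            # assumed image
      # exits with a non-zero status, so the kubelet restarts it with back-off
      command: ["sh", "-c", "echo working; exit 1"]
```

With the same command and a policy of Never, the container would not be restarted and the pod would end up in the Failed phase.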
Container Probes and Pod Health
A probe is a diagnostic performed periodically by the kubelet on a container. Each container can declare a set of probes - liveness, readiness and startup - that are used to evaluate the health of individual containers and the pod as a whole. TODO: summarize the relationship between container probe results and the overall pod situation.
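The three probe types might be declared on a container as in the sketch below. The paths, ports, periods and thresholds are illustrative assumptions, as are the pod name and image.

```yaml
# Sketch: startup, liveness and readiness probes on one container.
apiVersion: v1
kind: Pod
metadata:
  name: probes-example               # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25              # assumed image, serves HTTP on port 80
      ports:
        - containerPort: 80
      startupProbe:                  # other probes are held off until this one succeeds
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
        failureThreshold: 12         # allows up to about 60 seconds for start up
      livenessProbe:                 # on failure the container is killed and handled per the restart policy
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
      readinessProbe:                # on failure the pod stops receiving service traffic
        tcpSocket:
          port: 80
        periodSeconds: 5
```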
Container Lifecycle Hooks
TODO: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/.
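Until this section is written, the following sketch shows how the "postStart" and "preStop" hooks mentioned under the container states could be declared. The commands, pod name and image are assumptions chosen for the example.

```yaml
# Sketch: postStart and preStop lifecycle hooks on a container.
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-hooks-example      # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25              # assumed image
      lifecycle:
        postStart:                   # runs right after the container is created
          exec:
            command: ["sh", "-c", "echo started > /tmp/started"]
        preStop:                     # runs before the container is terminated
          exec:
            command: ["sh", "-c", "nginx -s quit; sleep 5"]
```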