Kubernetes Pod and Container Concepts
External
- https://kubernetes.io/docs/concepts/workloads/pods/ (fully synced ✓)
Internal
Overview
A pod is the fundamental, atomic compute unit created and managed by Kubernetes. An application is deployed as one or more equivalent pods. There are various strategies to partition applications into pods. A pod groups together one or more containers. There are several types of containers: application containers, init containers and ephemeral containers. Pods are deployed on worker nodes. A pod has a well-defined lifecycle with several phases, and each of the pod's containers can only be in one of a small, well-defined set of states. Kubernetes learns what is happening inside a container through container probes.
Pod
A pod is a group of one or more containers that Kubernetes deploys and manages as a compute unit, together with the specification for how to run the containers. Kubernetes will not manage compute entities with smaller granularity, such as containers or processes. From a resource footprint perspective, a pod is bigger than a container, but smaller than a virtual machine. The containers of a pod are atomically deployed and managed as a group. A useful mental model when thinking of a pod is that of a logical host, where all its containers share a context. A pod contains one or more application containers and zero or more init containers.
The equivalent Amazon ECS construct is the task.
Pod Manifest
A pod manifest or a workload resource manifest includes a pod template.
Pod Operation Atomicity
Atomic Success or Failure
The deployment of a pod is an atomic operation. This means that a pod is either entirely deployed, with all its containers co-located on the same node, or not deployed at all. There will never be a situation where a partially deployed pod will be servicing application requests.
All Containers of a Pod are Scheduled on the Same Node
A pod can be scheduled on one node and one node only, regardless of how many containers the pod has. All containers in the pod will always be co-located and co-scheduled on the same node. Only when all pod resources are ready does the pod become available and application traffic is directed to it.
The containers in a pod share a virtual network device, which results in all containers sharing a single IP address, storage in the form of filesystem volumes, and access to shared memory. From this perspective, a pod can be thought of as an application-specific logical host with all its processes (containers) sharing the network stack and the storage available to the host. In a pre-container world, these processes would have run on the same physical or virtual host. In line with this analogy, a pod cannot span hosts.
The pod's containers are relatively tightly coupled and run within the shared context provided by the pod, which is a set of Linux namespaces and cgroups:
- Network namespace. Because the containers share the same virtual network stack provided by the network namespace, they share the same IP address: the pod IP address. For the same reason, the containers share the TCP and UDP port ranges and the routing table. If co-located containers need to talk to each other, they can use the pod's localhost interface. Pod IP addresses are used for inter-pod communication, as they are routable on the pod network.
- UTS namespace. This namespace is dedicated to setting hostnames, so all containers of the pod will share the same hostname.
- IPC namespace.
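- Shared memory. The containers can exchange data through POSIX shared memory and System V IPC, although each process still has its own address space.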
- Volumes.
Within a pod's context, individual containers may have further sub-isolations applied.
Networking
Each pod is assigned a unique IP address in the pod network. Inside the pod, all containers share the network namespace, including the IP address and network ports, and can communicate among themselves using localhost. When containers in the pod communicate with entities outside the pod, they must coordinate how they use shared network resources such as ports. The containers in a pod can also communicate with each other using standard inter-process communication mechanisms such as System V semaphores and POSIX shared memory. Containers in different pods have distinct IP addresses and cannot communicate via IPC primitives without special configuration; they must use IP networking to communicate with each other.
More details about networking in:
Pod Hostname
Containers within a pod see the system hostname as being the same as the configured name for the pod.
Storage
The files created in the root filesystem of a container are stored in the container's writable layer. These files are ephemeral: they are discarded together with the writable layer when the container stops or fails. If the containers of a pod need to store state beyond their own existence, they can use the volumes provided by the pod. A pod can specify a set of shared storage volumes, and all containers in the pod can access them.
The most common way to provide storage to pods is in the form of Persistent Volumes, a type of cluster-level Kubernetes resource. Persistent volumes can be shared among the containers of a pod and also among different pods. The volumes are declared in the pod specification section of the pod manifest (.spec.volumes). The volume declarations are shared by all containers of that pod. A volume is mounted inside a container as a container volume mount. The volume mounts are specific to a container, and are declared in the .spec.containers[*].volumeMounts field. Each container in the pod must independently specify where to mount each volume. A process in a container sees a filesystem view composed from its container image and volumes. The container image is at the root of the filesystem hierarchy, and any volumes are mounted at the specified paths within the image.
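The relationship between the pod-level volume declaration and the per-container volume mounts can be sketched with a minimal manifest. The pod name, container names, images and paths are illustrative assumptions, and an emptyDir volume is used only to keep the example self-contained; a persistent volume would be referenced through a persistentVolumeClaim entry instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo          # hypothetical name
spec:
  volumes:
    - name: shared-data             # declared once, at pod level (.spec.volumes)
      emptyDir: {}                  # assumption: scratch volume that lives as long as the pod
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date >> /out/log.txt; sleep 5; done"]
      volumeMounts:                 # mount point chosen by this container
        - name: shared-data
          mountPath: /out
    - name: reader
      image: busybox:1.36
      command: ["sh", "-c", "sleep 10; tail -f /in/log.txt"]
      volumeMounts:                 # same volume, different path, read-only
        - name: shared-data
          mountPath: /in
          readOnly: true
```

Both containers see the same files, but each one decides independently where the volume appears in its own filesystem view.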
Also see:
Security Context
Security restrictions and privileges for constituent containers, such as running the container in privileged mode, can be set at the pod level, by defining a security context. More details about pod and container security concepts are available in:
Single-Container Pods vs. Multi-Container Pods
Pods are used in two main ways: pods that run a single container and pods that run multiple containers that work together.
The most common case is to declare a single container in a pod. In this case the pod is an extra wrapper around one container - Kubernetes manages the pod instead of managing the container directly. Even if a pod can accommodate multiple containers, the preferred way to scale an application is to add more one-container pods, instead of adding more containers in a pod.
There are advanced use cases that require running multiple containers inside a pod. Containers share a pod when they execute tightly-coupled workloads, provide complementary functionality and need to share resources. Configuring two or more containers in the same pod guarantees that the containers will run on the same node. Some commonly accepted use cases for co-located containers are service meshes and logging. A typical pattern for this arrangement is the sidecar pattern.
Each container of a multi-container pod can be exposed externally on its own port. Because the containers share the pod's network namespace, and thus the TCP and UDP port ranges, each of them must use a different port.
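A minimal sketch of a two-container pod follows, assuming an nginx application container and a hypothetical metrics sidecar. The sidecar image name and its port are assumptions made for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar            # hypothetical name
  labels:
    app: web
spec:
  containers:
    - name: web                     # application container
      image: nginx:1.25
      ports:
        - containerPort: 80
    - name: metrics-sidecar         # hypothetical sidecar container
      image: example.com/metrics-exporter:1.0   # hypothetical image
      ports:
        - containerPort: 9113       # must differ from 80: the port range is shared
```

Because both containers live in the same network namespace, the sidecar can reach the web server at localhost:80, while clients outside the pod use the pod IP with either port.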
Pod State
Pods should not maintain state; they should be handled as expendable. Kubernetes treats pods as static and largely immutable (changes cannot be made to a pod definition while the pod is running) and as expendable: they do not maintain state when they are destroyed and recreated. Therefore, they are usually managed through workload resources backed by controllers, such as deployments or jobs, rather than directly by users, though pods can be started and managed individually if the user so wishes. To modify a pod's configuration, the current pod must be terminated, and a new one with a modified base image and/or configuration must be created.
If the pods do need to maintain state, Kubernetes provides a specialized workload resource named StatefulSet.
Pod Lifecycle
Pods are usually created by the controllers which manage workload resources, but they can also be deployed individually. A pod instance is created from a pod template, which can exist by itself in a pod manifest or it can be a part of a workload resource manifest.
When a pod is created individually, the control plane validates the manifest and writes it into the cluster store, then the scheduler schedules the pod on a node with sufficient resources. A pod deployed via a pod manifest is called a singleton, or a bare pod. However, this scenario is not common, as a singleton pod is not replicated and has no self-healing capabilities. Workload resource controllers usually start and stop pods, based on the pod specifications configured in their manifests. Regardless of how a pod is declared, individually or as part of a workload resource specification, the scheduler handles it in the same way: the pod is instantiated and assigned to run on a node as a result of the scheduling process.
During their creation phase, the pods are assigned a unique ID (UID).
Once created, a pod is scheduled to run on a node: all its containers are scheduled on the same node. Once scheduled on the node, the pod remains on that node until:
- the pod finishes execution
- the pod resource is deleted
- the pod is evicted for lack of resources
- the node fails. If the node fails, all pods running on the node are scheduled for deletion after a timeout period.
This is another way of saying that a pod is scheduled once in its lifetime. Once the pod is scheduled (assigned) to a node, the pod will run on that node until one of the conditions listed above is met. Pods do not "self-heal": if conditions like node failure or eviction occur, the pods are deleted, and a higher-level abstraction, the workload resource and its controller, starts equivalent pods on other nodes. A given pod, as defined by its UID, is never rescheduled to a different node. Instead, that pod can be replaced by a new, almost-identical pod, even with the same name if desired, but with a different UID. Included objects, such as volumes, have the same lifecycle as their enclosing pod: they exist as long as the specific pod, with the exact UID, exists. If that pod is deleted for any reason and a quasi-identical replacement is created, the related objects (the volume, for example) are also destroyed and created anew.
This lifecycle is reflected in the pod's phases: Pending, Running, Succeeded, Failed or Unknown. If any of its containers fails while the pod is running, the kubelet will attempt to restart the failed container, depending on its configuration. To be able to do that, the kubelet tracks the states of the pod's containers.
If the pod template from which a set of pods was created changes, the workload resource controller that created the pods detects the change and creates new pods while the old pods are deleted, rather than updating or patching the existing pods.
It is possible to manage pods directly by updating some fields of a running pod in place with kubectl patch or kubectl replace. However, updating pods in place has limitations. Most pod metadata (namespace, name, uid, creationTimestamp, etc.) is immutable, and generation can only be incremented. More details on in-place pod updates are available here: Pod Update and Replacement.
Pod Phases
The pod phase is reflected by the .status.phase field of the pod status. The phase is a simple high-level summary of where the pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or pod state, nor is it intended to be a comprehensive state machine. The number and meanings of pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about pods that have a given phase value.
The phase of a pod can be obtained from the API Server:
kubectl get pod <pod-name> -o jsonpath='{.status.phase}'
apiVersion: v1
kind: Pod
status:
  phase: Running
Pending
The pod has been accepted by the Kubernetes cluster, but one or more containers have not been set up and made ready to run. In this phase, the scheduler attempts to find a suitable node, chosen by evaluating various predicates and selecting the most appropriate node if several are available. Once the node is chosen, the pod is scheduled to that node, and the node downloads images and starts the pod's container(s). The pod remains in the Pending phase until all of its resources are ready. "Pending" includes the time a pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running
A pod transitions to the Running phase if it has been bound to a node, all of its containers have been created, and at least one of its primary containers is still running or is in the process of starting or restarting.
Succeeded
A pod transitions to the Succeeded phase if all of its primary containers have terminated successfully and will not be restarted.
Failed
However, the pod may also fail while in either the Pending or the Running phase. When the system reports that a pod is in the Failed phase, it means that all its primary containers have terminated, and at least one container has terminated in failure: it exited with a non-zero status or was terminated by the system. If a node dies or is disconnected from the cluster, Kubernetes applies a policy for setting the phase of all its pods to "Failed".
A failed pod is never "resurrected". Instead, a new pod based on the same manifest may be created as a replacement, and while in many respects it is similar to the old pod, it will always have a new UID and, in most cases, a new IP address. As a consequence, the IP address of an individual pod cannot be relied on. To provide a stable access point to a set of equivalent pods, which is how most applications are deployed in Kubernetes, Kubernetes uses the concept of Service.
The fact that failed pods are replaced with equivalent pods is also the reason for not storing state in pods. When they produce state and need to store it, the pods should rely on external resources: volumes and other services that are specialized in storing state.
An individual pod by itself has no built-in resilience: if it fails for any reason, it is gone. A higher-level primitive, the Deployment, is used to manage a set of pods from a high availability perspective: the Deployment ensures that a specific number of equivalent pods is always running, and if one or more pods fail, the Deployment brings up replacement pods. There are other higher-level pod controllers that manage sets of pods in different ways: DaemonSets and StatefulSets. Individual pods can be managed as Jobs or CronJobs.
Unknown
The state of the pod cannot be obtained, usually due to an error in communication with the node where the pod should be running.
Pod Status and Conditions
In the Kubernetes API, pods have both a specification (.spec) and an actual status (.status), which includes, among other status elements, a set of pod conditions listed below. It is possible to inject custom readiness information into the condition data for a pod, if that makes sense for the application (TODO: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate).
Pod Conditions
The pod conditions are listed in .status.conditions as an array:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: PodScheduled
Available conditions:
PodScheduled
The pod has been scheduled to a node.
ContainersReady
All containers in the pod are ready.
Initialized
All init containers have completed successfully.
Ready
The pod is able to serve requests and should be added to the load balancing pools of all matching services.
Terminating Pods
When a pod is being deleted, it is shown as "Terminating" by some kubectl commands. "Terminating" is not one of the pod phases. A pod is granted a grace period in which to terminate gracefully, which defaults to 30 seconds. Pods can be forcefully terminated by using the --force flag.
TODO:
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced
Terminating vs. NonTerminating Pods
A terminating pod has a positive integer as the value of "spec.activeDeadlineSeconds". A non-terminating pod has no "spec.activeDeadlineSeconds" specification (nil). Long-running pods, such as a web server or a database, are non-terminating pods. The pod type can be specified as a scope for resource quotas.
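The distinction can be illustrated with a short sketch: a pod that sets activeDeadlineSeconds, and a resource quota scoped to terminating pods. The object names, the deadline and the quota limit are assumptions chosen for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bounded-task                # hypothetical name
spec:
  activeDeadlineSeconds: 120        # makes this a "terminating" pod
  restartPolicy: Never
  containers:
    - name: task
      image: busybox:1.36
      command: ["sh", "-c", "echo working; sleep 60"]
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-pods-quota      # hypothetical name
spec:
  hard:
    pods: "10"                      # illustrative limit
  scopes:
    - Terminating                   # counts only pods that set activeDeadlineSeconds
```

A quota with the NotTerminating scope would instead count the pods that leave the field unset.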
Pod Readiness and Readiness Gates
An application can inject extra feedback into the pod status in form of "pod readiness". TODO:
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-status
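As a rough sketch of how a readiness gate is declared, the manifest below adds one extra condition type to the pod specification. The condition type name is a made-up example; it is assumed that some external controller sets the corresponding entry in .status.conditions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gated-pod                   # hypothetical name
spec:
  readinessGates:
    - conditionType: "example.com/load-balancer-attached"   # hypothetical condition type
  containers:
    - name: app
      image: nginx:1.25
```

With a gate declared, the pod is reported as Ready only when all its containers are ready and every listed condition has status "True".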
Pods and Nodes
TODO Pod topology spread constraints, Interaction with Node Affinity and Node Selectors, Comparison with PodAffinity/PodAntiAffinity:
- https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
- https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/
Once bound to a node, a pod will never be detached from the node and re-bound to another node. The IP address of the node a pod is bound to can be retrieved by pulling the pod metadata, and searching the status for "hostIP". The name of the node can be found in the specification, searching for "nodeName".
apiVersion: v1
kind: Pod
metadata:
  name: [...]
spec:
  nodeName: ip-10-0-12-209.us-west-2.compute.internal
  [...]
status:
  hostIP: 10.0.12.209
  [...]
Pod Placement
There are situations when we want to schedule specific pods to specific nodes, for example a pod running an application with special memory requirements that only some of the nodes can satisfy. Pods can be configured to be scheduled on a specific node, defined by the node name, or on nodes that match a specific node selector.
To assign a pod to nodes that match a node selector, add the "nodeSelector" element to the pod configuration, with a value consisting of key/value pairs. After a successful placement, either by a replication controller or by a DaemonSet, the pod records the successful node selector expression as part of its definition, which can be rendered with kubectl get pod -o yaml. Once bound to a node, a pod will never be relocated to another node.
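A node selector can be sketched as follows. The label key and value are assumptions; they only work if matching nodes have been labeled accordingly, for example with kubectl label node <node-name> memory-class=high.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-memory-app             # hypothetical name
spec:
  nodeSelector:
    memory-class: high              # hypothetical node label (key/value pair)
  # Alternative: pin the pod to one node by name instead of using a selector:
  # nodeName: <node-name>
  containers:
    - name: app
      image: nginx:1.25
```

If no node carries the label, the pod stays in the Pending phase until a matching node becomes available.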
Pod Security
The privileges and the access control settings for the containers in a pod are defined by security contexts - the pod security context and containers security contexts. Security contexts are enforced as part of a pod security policy. For more details see:
Pod Identity
The identity of the applications running inside a pod is conferred by its service account.
TODO:
- Define network identity
- StatefulSets reschedule pods in such a way that they retain their identity.
Pod Service Account
The pod service account can be explicitly specified in the pod manifest as serviceAccountName. If the service account is not explicitly provided, it is implied to be the default service account for the namespace the pod is deployed under (system:serviceaccount:<namespace>:default).
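A minimal sketch of an explicitly assigned service account is shown below. The namespace and the names are illustrative assumptions, and the namespace is assumed to exist already.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-identity                # hypothetical service account
  namespace: demo                   # hypothetical, pre-existing namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: demo
spec:
  serviceAccountName: app-identity  # omit to fall back to the namespace's "default" account
  containers:
    - name: app
      image: nginx:1.25
```

The resulting identity would be system:serviceaccount:demo:app-identity.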
The pod service account is used in the following situations:
- When a higher level controller creates the pod, it does so under the identity of the pod service account. For more details see Identity under which Pods are Created.
- The service account provides the identity for processes that run in the pod. Processes will authenticate using the identity provided by the service account.
Also see:
TODO: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
Identity under which Pods are Created
When pods are created directly by a user with:
kubectl apply -f pod.yaml
the identity used at creation and which subsequently is evaluated and used by the admission controllers, is the identity of the external user.
This identity can be changed on command line with:
kubectl --as system:serviceaccount:some-namespace:some-service-account apply -f pod.yaml
When a higher level controller such as a deployment (which in turn uses a ReplicaSet) attempts to create the pod, it does that under the identity of the service account specified in the pod template. By default, no particular service account is specified in the pod template, and that means the default service account for the target namespace will be used. However, if the pod manifest explicitly requests a different service account, the identity of that service account will be used to create the pod. For more details on relationship between the pod and its service account, see Pod Service Account.
Pod Horizontal Scaling
Every pod is meant to run a single instance of a given application. If the application needs to scale to sustain more load, multiple pods should be started. In Kubernetes, this is typically referred to as replication. The equivalent pod instances are referred to as replicas. They are usually created and managed as a group by a workload resource and its controller.
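Replication through a workload resource can be sketched with a Deployment that asks for three replicas of the same pod template. The names, labels and image are assumptions made for the example.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                         # hypothetical name
spec:
  replicas: 3                       # number of equivalent pods to keep running
  selector:
    matchLabels:
      app: web                      # must match the labels in the pod template
  template:                         # the pod template the replicas are created from
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
```

Scaling then means changing the replicas value rather than adding containers to an existing pod.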
Static Pods
A static pod is managed directly by the kubelet process on a specific node, without the API server observing it. The kubelet directly supervises each static pod and restarts it if it fails, in contrast to regular pods, which are managed by the control plane through a workload resource of some sort. Static pods are always bound to one kubelet on a specific node. The main use for static pods is to run self-hosted control plane components such as the API server, etcd, the scheduler, etc. The kubelet automatically tries to create a mirror pod on the Kubernetes API server for each static pod. This means the static pods running on a node are visible on the API server, but cannot be controlled from there. The specification of a static pod cannot refer to other API objects such as service accounts, config maps, secrets, etc.
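A static pod is described by an ordinary pod manifest that the kubelet reads from a directory on the node. The sketch below assumes the kubelet's static pod path is /etc/kubernetes/manifests, which is the kubeadm default; other installations may configure a different directory. The pod name and label are illustrative.

```yaml
# Assumed location on the node: /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web                  # hypothetical name
  labels:
    role: static-example            # hypothetical label
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
```

The mirror pod that appears on the API server carries the node name as a suffix, and removing the file from the directory is what stops the pod.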
Bare Pods
Pods and Workload Resources
Pods Disruptions
Accessing the API Server from Inside a Pod
Pods and Containers
A pod and its containers have independent lifecycles. A pod is not a process, but an environment for running containers. Containers can be restarted in a pod, but a pod is never restarted: if a pod is gone, it is never resurrected. In the best case, another quasi-identical pod is created to take its place - more details available in the "Pod Lifecycle" section above.
Pod Operations
Container
Once the scheduler assigns a pod to a node, the kubelet starts creating containers for the pod, using the node's container runtime. There are three possible container states: Waiting, Running or Terminated. The kubectl describe pod <pod-name> command shows the state for each container within the pod.
Container Types
Application Container
The application container is also referred to as the "primary container". If a pod declares init containers, the application containers are only run after all init containers complete successfully.
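The ordering between init containers and application containers can be sketched as follows. The names, images and file contents are assumptions; the init container writes a file into a shared volume and must exit successfully before the application container starts.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo                   # hypothetical name
spec:
  volumes:
    - name: workdir
      emptyDir: {}
  initContainers:
    - name: prepare-content         # runs to completion first
      image: busybox:1.36
      command: ["sh", "-c", "echo 'prepared by init container' > /work/index.html"]
      volumeMounts:
        - name: workdir
          mountPath: /work
  containers:
    - name: web                     # started only after the init container succeeds
      image: nginx:1.25
      volumeMounts:
        - name: workdir
          mountPath: /usr/share/nginx/html
```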
Init Container
Ephemeral Container
Container States
The container states are tracked by the kubelet, which may restart failed containers, depending on the configuration. This way, a pod is kept running. Container states can also be used as triggers for container lifecycle hooks. To check the state of a container, you can use:
kubectl describe pod <pod-name>
The container state is displayed for each container within that pod.
Waiting
If a container is not in either the Running or the Terminated state, it is in Waiting, where it is still running the operations it requires in order to complete startup: pulling the container image from the registry or applying Secret data. A kubectl query will show a "Reason" field when a container is in the Waiting state.
Running
The container is executing without issues. If there was a "postStart" hook configured, it has already executed and finished.
Terminated
The container began execution and then either ran to completion or failed for some reason. When queried with kubectl, the result will show a reason, an exit code and the start and finish time for the container. If a container has a "preStop" hook configured, that hook runs before the container enters the Terminated state.
Container Unready State
When a pod is deleted, it automatically puts itself into an unready state, regardless of whether a readiness probe exists. The pod remains in the unready state while it waits for the containers in the pod to stop.
Container Images
A container image represents binary data that encapsulates an application and all its dependencies. More details about Docker images are available here. A pod's containers pull their images from their respective repositories while the pod is in the Pending phase. More details about how images are pulled and about Kubernetes' relationship with image registries are available here:
Container Restart Policy
Pods are never restarted. Constituent containers may be restarted by the kubelet, subject to the pod's restart policy, as configured in .spec.restartPolicy. The possible values are Always, OnFailure and Never. The default value is Always. The restart policy applies to all containers in the pod. It only refers to restarts of the containers by the kubelet on the same node. After containers in a pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, ...) that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart back-off timer for that container.
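A restart policy can be sketched with a pod whose only container always exits with an error. The name and the command are assumptions chosen to make the back-off behavior observable.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flaky-task                  # hypothetical name
spec:
  restartPolicy: OnFailure          # Always (default) | OnFailure | Never
  containers:
    - name: task
      image: busybox:1.36
      # Exits with a non-zero status on purpose, so the kubelet keeps
      # restarting the container on the same node, with growing delays.
      command: ["sh", "-c", "echo attempt; exit 1"]
```

With restartPolicy set to Never, the same container would run once and the pod would end in the Failed phase.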
Container Probes and Pod Health
A probe is a diagnostic performed periodically by the kubelet on a container. Each container can declare a set of probes (liveness, readiness and startup) that are used to evaluate the health of individual containers and of the pod as a whole. For more details on how the health of individual containers influences the health of a pod, see Container and Pod Startup Check, Container and Pod Liveness Check and Container and Pod Readiness Check in:
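As an illustration of the three probe types, the sketch below declares a startup, a liveness and a readiness probe on a single container. The paths, ports, periods and thresholds are assumptions, not recommended values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probed-web                  # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      startupProbe:                 # the other probes are held off until this one succeeds
        httpGet:
          path: /
          port: 80
        periodSeconds: 5
        failureThreshold: 12        # allows up to about 60s for startup
      livenessProbe:                # repeated failure makes the kubelet restart the container
        httpGet:
          path: /
          port: 80
        periodSeconds: 10
      readinessProbe:               # failure removes the pod from service endpoints
        tcpSocket:
          port: 80
        periodSeconds: 5
```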
Container Lifecycle Hooks
TODO: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/.
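Until this section is filled in, the two hooks mentioned under the container states can be sketched as follows. The commands are illustrative assumptions; each hook here uses an exec handler that runs inside the container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hooks-demo                  # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25
      lifecycle:
        postStart:                  # runs right after the container is created
          exec:
            command: ["sh", "-c", "echo started > /tmp/started"]
        preStop:                    # runs before the container is terminated
          exec:
            command: ["sh", "-c", "nginx -s quit; sleep 5"]
```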