Kubernetes Pod and Container Concepts

From NovaOrdis Knowledge Base

Overview

A pod is the fundamental, atomic compute unit created and managed by Kubernetes. An application is deployed as one or more equivalent pods. There are various strategies to partition applications to pods. A pod groups together one or more containers. There are several types of containers: application containers, init containers and ephemeral containers. Pods are deployed on worker nodes. A pod has a well-defined lifecycle with several phases, and the pod's containers can only be in one of a well-defined number of states. Kubernetes learns what happens with a container through container probes.

Pod

A pod is a group of one or more containers Kubernetes deploys and manages as a compute unit, and the specification for how to run the containers. Kubernetes will not manage compute entities with smaller granularity, such as containers or processes. From a resource footprint perspective, a pod is bigger than a container, but smaller than a Virtual Machine. The containers of a pod are atomically deployed and managed as a group. A useful mental model when thinking of a pod is that of a logical host, where all its containers share a context. A pod contains one or more application containers and zero or more init containers.

Technically, a pod is built around a pause container, which isolates an area of the host OS and holds the network stack and the set of kernel namespaces in which the pod's containers run.

The equivalent Amazon ECS construct is the task.

Pod Manifest

A pod manifest or a workload resource manifest includes a pod template.

Pod Manifest

Pod Operation Atomicity

Atomic Success or Failure

The deployment of a pod is an atomic operation. This means that a pod is either entirely deployed, with all its containers co-located on the same node, or not deployed at all. There will never be a situation where a partially deployed pod will be servicing application requests.

All Containers of a Pod are Scheduled on the Same Node

A pod can be scheduled on one node and one node only, regardless of how many containers the pod has. All containers in the pod are always co-located and co-scheduled on the same node. Only when all pod resources are ready does the pod become available and application traffic is directed to it.

Shared Context

The containers in a pod share a virtual network device, which results in all containers having the same IP address, storage in the form of filesystem volumes, and access to shared memory. From this perspective, a pod can be thought of as an application-specific logical host with all its processes (containers) sharing the network stack and the storage available to the host. In a pre-container world, these processes would have run on the same physical or virtual host. In line with this analogy, a pod cannot span hosts.

The pod's containers are relatively tightly coupled and run within the shared context provided by the pod: a set of Linux namespaces and cgroups.

System resource consumption at pod level is bounded using a cgroups-based mechanism. cgroups is a Linux kernel feature that allows allocation of system resources (CPU, memory, network bandwidth) among user-defined groups of processes. Individual containers in a pod may have further sub-isolations applied via their own cgroups limits.
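As an illustration of per-container limits, the following is a minimal sketch of a pod manifest in which one container declares CPU and memory requests and limits. The pod name, container name, image and the specific amounts are assumptions chosen for the example, not values taken from this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-example          # hypothetical pod name
spec:
  containers:
  - name: app                   # hypothetical container name
    image: nginx:1.25           # assumed image; any image would do
    resources:
      requests:                 # amounts the scheduler reserves on the node
        cpu: 250m
        memory: 64Mi
      limits:                   # ceilings enforced on this container
        cpu: 500m
        memory: 128Mi
```

The requests influence which node the pod can be scheduled on, while the limits bound what the container may consume once it is running.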

Networking

Each pod is assigned a unique IP address in the pod network. Inside the pod, every container shares the same virtual network stack provided by the network namespace, so all containers share the same IP address - the pod IP address. For the same reason, containers share the TCP and UDP port ranges and the routing table. They can communicate among themselves using the pod's localhost interface. When containers in the pod communicate with entities outside the pod, they must coordinate how they use shared network resources such as ports. The containers in a pod can also communicate with each other using standard inter-process communication like System V semaphores and POSIX shared memory. Containers in different pods have distinct IP addresses and cannot communicate via IPC primitives without special configuration. Containers belonging to different pods that want to communicate with each other must use IP networking. The pod IP address is routable on the pod network. More details about networking in:

Kubernetes Networking Concepts

Pod Hostname

Containers within a pod see the system hostname as being the same as the configured name for the pod.

Storage

The files that are created in the root filesystem of a container are stored in the writable layer of the container, which is discarded when the container stops or fails. This makes these files ephemeral. If the containers of a pod intend to store state beyond their existence, they can use the volumes provided by the pod. A pod can specify a set of shared storage volumes. All containers in the pod can access shared volumes.

The most common way to provide storage to pods is in the form of Persistent Volumes, which are a type of cluster-level Kubernetes resource. Persistent volumes can be shared among the containers of a pod and also among different pods. The volumes are declared in the pod specification section of the pod manifest (.spec.volumes). The volume declarations are shared by all containers of that pod. A volume is mounted inside a container as a container volume mount. The volume mounts are specific to a container, and are declared in the .spec.containers[*].volumeMounts field. Each container in the pod must independently specify where to mount each volume. A process in a container sees a filesystem view composed from its container image and volumes. The container image is at the root of the filesystem hierarchy, and any volumes are mounted at the specified paths within the image.
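The split between pod-level volume declarations and container-level volume mounts can be sketched as follows. This is a minimal example that assumes an emptyDir volume rather than a Persistent Volume, to keep it self-contained; all names, images and paths are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-example   # hypothetical pod name
spec:
  volumes:                      # declared once, at pod level
  - name: shared-data
    emptyDir: {}                # assumed volume type for this sketch
  containers:
  - name: writer
    image: busybox:1.36
    command: ["sh", "-c", "while true; do date >> /data/out.log; sleep 5; done"]
    volumeMounts:               # each container chooses its own mount path
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox:1.36
    command: ["sh", "-c", "while true; do cat /input/out.log 2>/dev/null; sleep 5; done"]
    volumeMounts:
    - name: shared-data
      mountPath: /input         # same volume, different path
      readOnly: true
```

Both containers see the same files, but under the paths each of them declared independently.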

Also see:

Kubernetes Storage Concepts
Pod Volumes
Persistent Volumes

Security Context

Security restrictions and privileges for the constituent containers, such as the user and group IDs they run as, can be set at the pod level by defining a pod security context. Settings such as privileged mode are set per container, in the container's own security context. More details about pod and container security concepts are available in:
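A minimal sketch of how security contexts appear in a manifest, assuming arbitrary user and group IDs and hypothetical names. Note that privileged is a container-level field.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-example   # hypothetical pod name
spec:
  securityContext:                 # pod level: applies to all containers
    runAsUser: 1000                # assumed IDs, for illustration only
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "id; sleep 3600"]
    securityContext:               # container level: refines the pod-level settings
      allowPrivilegeEscalation: false
      privileged: false            # privileged mode is a per-container setting
```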

Pod and Container Security

Single-Container Pods vs. Multi-Container Pods

Pods are used in two main ways: pods that run a single container and pods that run multiple containers that work together.

The most common case is to declare a single container in a pod. In this case the pod is an extra wrapper around one container - Kubernetes manages the pod instead of managing the container directly. Even if a pod can accommodate multiple containers, the preferred way to scale an application is to add more one-container pods, instead of adding more containers in a pod.

There are advanced use cases - for example, service meshes - that require running multiple containers inside a pod. Containers share a pod when they execute tightly-coupled workloads, provide complementary functionality and need to share resources. Configuring two or more containers in the same pod guarantees that the containers will be run on the same node. Some commonly accepted use cases for collocated containers are service meshes and logging. A typical pattern for this arrangement is the sidecar pattern.

Each container of a multi-container pod can be exposed externally on its individual port. The containers share the pod's network namespace, thus the TCP and UDP port ranges.
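A multi-container pod along these lines might look like the sketch below. The container names, images and port numbers are assumptions for the example. The second container uses the busybox httpd applet purely as a stand-in for a helper process.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-container-example    # hypothetical pod name
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80          # first container listens on port 80
  - name: helper
    image: busybox:1.36
    # The port space is shared, so the helper must pick a different port.
    command: ["httpd", "-f", "-p", "8080"]
    ports:
    - containerPort: 8080
```

From inside either container, the other one is reachable at localhost on its port, and both are reachable from outside at the pod IP address.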

Pod State

Pods should not maintain state and should be handled as expendable. Kubernetes treats pods as static and largely immutable - changes cannot be made to a pod definition while the pod is running - and as expendable: they do not maintain state when they are destroyed and recreated. Therefore, they are usually managed through workload resources backed by controllers, such as deployments or jobs, rather than directly by users, though pods can be started and managed individually if the user so wishes. To modify a pod configuration, the current pod must be terminated, and a new one with a modified base image and/or configuration must be created.

In case the pods maintain state, Kubernetes provides a specialized workload resource named StatefulSet.

Pod Lifecycle

Pods are usually created by the controllers which manage workload resources, but they can also be deployed individually. A pod instance is created from a pod template, which can exist by itself in a pod manifest or it can be a part of a workload resource manifest.

When a pod is created individually, the control plane will validate the manifest, write it into the cluster store, then the scheduler will schedule the pod on a node with sufficient resources.

More details about the Kubernetes scheduler and scheduling are available in:

Kubernetes Scheduling, Preemption and Eviction Concepts

A pod deployed via a pod manifest is called a singleton, or a bare pod. However, this scenario is not common, as a singleton pod is not replicated and has no self-healing capabilities. Workload resource controllers usually start and stop pods, based on pod specifications configured in their manifests. Regardless of how a pod is declared, individually or as part of a workload resource controller specification, when it comes to scheduling, the pod is handled in the same way by the scheduler: the pod is instantiated and assigned to run on a node as a result of the scheduling process.

During their creation phase, the pods are assigned a unique ID (UID).

Once created, a pod is scheduled to run on a node: all its containers are scheduled on the same node. Once scheduled on the node, the pod remains on that node until:

  • the pod finishes execution
  • the pod resource is deleted
  • the pod is evicted for lack of resources
  • the node fails. If the node fails, all pods running on the node are scheduled for deletion after a timeout period.

This is another way of saying that a pod is scheduled once in its lifetime. Once the pod is scheduled (assigned) to a node, the pod will run on that node until one of the conditions listed above is met. Pods do not "self-heal": if conditions like node failure or eviction occur, the pods are deleted, and a higher-level abstraction, the workload resource and its controller, starts equivalent pods on other nodes. A given pod, as defined by its UID, is never rescheduled to a different node. Instead, that pod can be replaced by a new, almost-identical pod, even with the same name if desired, but with a different UID. Included objects, such as volumes, have the same lifecycle as their enclosing pod: they exist as long as the specific pod, with the exact UID, exists. If that pod is deleted for any reason, and a quasi-identical replacement is created, the related objects - the volume, for example - are also destroyed and created anew.

This lifecycle is reflected in the pod's phases: Pending, Running, Succeeded, Failed or Unknown. If any of its containers fails while the pod is running, the kubelet will attempt to restart the failed container, depending on its configuration. To be able to do that, the kubelet tracks the states of the pod's containers.

If the pod template from which a set of pods was created changes, the workload resource controller that created the pods detects the change and creates new pods while the old pods are deleted, rather than updating or patching the existing pods.

It is possible to manage pods directly, by updating some of the fields of a running pod in place with kubectl patch or kubectl replace. However, updating pods in place has limitations. Most of the pod metadata (namespace, name, uid, creationTimestamp, etc.) is immutable, and generation can only be incremented. More details on in-place pod updates are available here: Pod Update and Replacement.

Pod Phases

The pod phase is reflected by the .status.phase field of the pod status. The phase is a simple high-level summary of where the pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or pod state, nor is it intended to be a comprehensive state machine. The number and meanings of pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about pods that have a given phase value.

The phase of a pod can be obtained from the API Server:

kubectl get pod <pod-name> -o jsonpath='{.status.phase}'

apiVersion: v1
kind: Pod
status:
  phase: Running

Pending

The pod has been accepted by the Kubernetes cluster, but one or more containers have not been set up and made ready to run. In this phase, the scheduler attempts to find a suitable node, chosen by evaluating various predicates and selecting the most appropriate node from multiple nodes, if there are multiple available nodes. Once the node is chosen, the pod is scheduled to that node, and the node downloads images and starts the pod container(s). The pod remains in the Pending phase until all of its resources are ready. "Pending" includes time a pod spends waiting to be scheduled as well as time spent downloading container images over the network.

Running

A pod transitions to the Running phase if it has been bound to a node, all of its containers have been created, and at least one of its primary containers is still running or it is in process of starting or restarting.

Succeeded

A pod transitions to Succeeded phase if all of its primary containers have terminated successfully, and will not be restarted.

Failed

However, the pod may also fail, while being either in pending or running phase. When the system reports that a pod is in Failed phase, it means that all its primary containers have terminated, and at least one container has terminated in failure - exited with a non-zero status or was terminated by the system. If a node dies or it is disconnected from the cluster, Kubernetes applies a policy for setting the phase of all its pods to "Failed".

A failed pod is never "resurrected". Instead, a new pod based on the same manifest may be created as a replacement, and while in many aspects it is similar to the old pod, it will always have a new UID and, in most cases, a new IP address. In consequence, the IP address of an individual pod cannot be relied on. To provide a stable access point to a set of equivalent pods, which is how most applications are deployed in Kubernetes, Kubernetes uses the concept of Service.

The fact that failed pods are replaced with equivalent pods is also the reason for not storing state in pods. When they produce state and need to store it, the pods should rely on external resources: volumes and other services that are specialized in storing state.

An individual pod by itself has no built-in resilience: if it fails for any reason, it is gone. A higher level primitive - the Deployment - is used to manage a set of pods from a high availability perspective: the Deployment ensures that a specific number of equivalent pods is always running, and if one or more pods fail, the Deployment brings up replacement pods. There are higher-level pod controllers that manage sets of pods in different ways: DaemonSets and StatefulSets. Individual pods can be managed as Jobs or CronJobs.

Unknown

The state of the pod cannot be obtained, usually due to an error in communication with the node where the pod should be running.

Pod Status and Conditions

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions

In the Kubernetes API, pods have both a specification .spec and an actual status .status, which includes, among other status elements, a set of pod conditions listed below. It is possible to inject custom readiness information into the condition data for a pod, if that makes sense for the application (TODO: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate)
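Custom readiness information is declared through the readinessGates field of the pod specification. The following is a minimal sketch. The condition type, pod name and image are hypothetical, and the example assumes that some external controller sets the corresponding condition in the pod status.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-gate-example           # hypothetical pod name
spec:
  readinessGates:
  # Hypothetical condition type; the pod is considered Ready only once
  # a condition of this type is reported as "True" in .status.conditions.
  - conditionType: "example.com/feature-ready"
  containers:
  - name: app
    image: nginx:1.25
```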

Pod Conditions

The pod conditions are listed in .status.conditions as an array:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:52:28Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-26T20:51:54Z"
    status: "True"
    type: PodScheduled

Available conditions:

PodScheduled

The pod has been scheduled to a node.

ContainersReady

All containers in the pod are ready.

Initialized

All init containers have completed successfully.

Ready

The pod is able to serve requests and should be added to the load balancing pools of all matching services.

Terminating Pods

When the deletion of a pod is requested via the API, the kubelet asks the container runtime to stop the containers. The container runtime first sends a SIGTERM signal to the main process of each container, then allows a grace period so the processes have a chance to exit gracefully. Any processes still running when the grace period expires are forcefully stopped with the KILL signal. Then the pod is deleted from the API server.

When a pod is being deleted, it is shown as "Terminating" by some kubectl commands. "Terminating" is not one of the pod phases. A pod is granted a grace period to terminate gracefully, which defaults to 30 seconds. Pods can be forcefully terminated by passing the --force flag to kubectl delete.
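The grace period can be changed per pod with the terminationGracePeriodSeconds field. A minimal sketch, assuming a 60 second grace period and hypothetical names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-example         # hypothetical pod name
spec:
  terminationGracePeriodSeconds: 60  # assumed value; the default is 30
  containers:
  - name: app
    image: nginx:1.25
```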

TODO:

Terminating vs. NonTerminating Pods

A terminating pod has a non-zero positive integer as the value of "spec.activeDeadlineSeconds". A non-terminating pod has no "spec.activeDeadlineSeconds" specification (nil). Long running pods, such as a web server or a database, are non-terminating pods. The pod type can be specified as a scope for resource quotas.
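As a sketch of both ideas, the manifests below declare a terminating pod and a resource quota scoped to terminating pods. The names, the deadline and the quota amount are assumptions for the example. In the quota API the two scopes are spelled Terminating and NotTerminating.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bounded-pod               # hypothetical pod name
spec:
  activeDeadlineSeconds: 300      # makes this a "terminating" pod
  restartPolicy: Never
  containers:
  - name: task
    image: busybox:1.36
    command: ["sh", "-c", "echo working; sleep 60"]
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota         # hypothetical quota name
spec:
  hard:
    pods: "10"                    # assumed limit
  scopes:
  - Terminating                   # counts only pods with activeDeadlineSeconds set
```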

Pod Readiness and Readiness Gates

An application can inject extra feedback into the pod status in form of "pod readiness". TODO:

Pods and Nodes

TODO Pod topology spread constraints, Interaction with Node Affinity and Node Selectors, Comparison with PodAffinity/PodAntiAffinity:


Once bound to a node, a pod will never be detached from the node and re-bound to another node. The IP address of the node a pod is bound to can be retrieved by pulling the pod metadata, and searching the status for "hostIP". The name of the node can be found in the specification, searching for "nodeName".

apiVersion: v1
kind: Pod
metadata:
  name: [...]
spec:
  nodeName: ip-10-0-12-209.us-west-2.compute.internal
  [...]
status:
  hostIP: 10.0.12.209
  [...]

Pod Placement

There are situations when we want to schedule specific pods to specific nodes - for example a pod running an application that has special memory requirements only some of the nodes can satisfy. Pods can be configured to be scheduled on a specific node, defined by the node name, or on nodes that match a specific node selector.

To assign a pod to nodes that match a node selector, add the "nodeSelector" element to the pod configuration, with a value consisting of key/value pairs. After a successful placement, either by a replication controller or by a DaemonSet, the pod records the successful node selector expression as part of its definition, which can be rendered with kubectl get pod -o yaml. Once bound to a node, a pod will never be relocated to another node.
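A minimal sketch of a node selector in a pod manifest. The label key and value are hypothetical and would have to match a label actually applied to the target nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-selector-example        # hypothetical pod name
spec:
  nodeSelector:
    example.com/memory-class: high   # hypothetical node label
  containers:
  - name: app
    image: nginx:1.25
```

If no node carries the label, the pod stays in the Pending phase until a matching node becomes available.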

Assign Pod to Specific Node

Pod Security

The privileges and the access control settings for the containers in a pod are defined by security contexts - the pod security context and containers security contexts. Security contexts are enforced as part of a pod security policy. For more details see:

Pod and Container Security

Pod Identity

The identity of the applications running inside a pod is conferred by its service account.

TODO:

  • Define network identity
  • StatefulSets reschedule pods in such a way that they retain their identity.

Pod Service Account

The pod service account can be explicitly specified in the pod manifest as serviceAccountName. If the service account is not explicitly provided, it is implied to be the default service account for the namespace the pod is deployed under (system:serviceaccount:<namespace>:default).
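A minimal sketch of an explicitly specified service account, assuming a hypothetical service account named app-sa created in the same namespace as the pod:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa                     # hypothetical service account
---
apiVersion: v1
kind: Pod
metadata:
  name: service-account-example    # hypothetical pod name
spec:
  serviceAccountName: app-sa       # omit this field to use "default"
  containers:
  - name: app
    image: nginx:1.25
```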

The pod service account is used in the following situations:

  • When a workload resource creates the pod, it does so under the identity of the pod service account. For more details see Identity under which Pods are Created.
  • The service account provides the identity for processes that run in the pod. Processes will authenticate using the identity provided by the service account.

Also see:

Service Account

TODO: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/

Identity under which Pods are Created

When pods are created directly by a user with:

kubectl apply -f pod.yaml

the identity used at creation and which subsequently is evaluated and used by the admission controllers, is the identity of the external user.

This identity can be changed on command line with:

kubectl --as system:serviceaccount:some-namespace:some-service-account apply -f pod.yaml

When a workload resource such as a deployment (which in turn uses a ReplicaSet) attempts to create the pod, it does that under the identity of the service account specified in the pod template. By default, no particular service account is specified in the pod template, and that means the default service account for the target namespace will be used. However, if the pod manifest explicitly requests a different service account, the identity of that service account will be used to create the pod. For more details on relationship between the pod and its service account, see Pod Service Account.

Pod Horizontal Scaling

Every pod is meant to run a single instance of a given application. If the application needs to scale to sustain more load, multiple pods should be started. In Kubernetes, this is typically referred to as replication. The equivalent pod instances are referred to as replicas. They are usually created and managed as a group by a workload resource and its controller.
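Replication is usually expressed declaratively in a workload resource rather than by creating pods one by one. A minimal sketch using a Deployment, with hypothetical names and labels and an assumed replica count:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical deployment name
spec:
  replicas: 3                # assumed number of equivalent pods
  selector:
    matchLabels:
      app: web               # must match the pod template labels
  template:                  # pod template used for every replica
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
```

Changing the replicas value and re-applying the manifest scales the application out or in.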

Static Pods

A static pod is managed directly by the kubelet process on a specific node, without the API server observing it. The kubelet directly supervises each static pod and restarts it if it fails, in contrast to regular pods, which are managed by the control plane through a workload resource of some sort. Static pods are always bound to one kubelet on a specific node. The main use for static pods is to run self-hosted control plane components such as the API server, etcd, the scheduler, etc. The kubelet automatically tries to create a mirror pod on the Kubernetes API server for each static pod. This means the static pods running on a node are visible on the API server, but cannot be controlled from there. The specification of a static pod cannot refer to other API objects such as service accounts, config maps, secrets, etc.
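The kubelet finds static pod manifests in a directory named in its configuration. The fragment below is a sketch of the relevant setting. The path shown is an assumption (it is the one commonly used on kubeadm-provisioned nodes) and other installations may differ.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Directory the kubelet watches for static pod manifests (assumed path).
staticPodPath: /etc/kubernetes/manifests
```

Ordinary pod manifests placed in that directory are started by the kubelet as static pods.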

Bare Pods

Bare Pods

Pods and Workload Resources

Workload Resources

Pods Disruptions

Pods Disruptions

Accessing the API Server from Inside a Pod

https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod

Pods and Containers

A pod and its containers have independent lifecycles. A pod is not a process, but an environment for running containers. Containers can be restarted in a pod, but a pod is never restarted: if a pod is gone, it is never resurrected. In the best case, another quasi-identical pod is created to take its place - more details available in the "Pod Lifecycle" section above.

Pod Operations

Container

https://kubernetes.io/docs/concepts/containers/
https://developers.redhat.com/blog/2018/02/22/container-terminology-practical-introduction

Once the scheduler assigns a pod to a node, the kubelet starts creating containers for the pod, using the node's container runtime. There are three possible container states: Waiting, Running or Terminated. The kubectl describe pod <pod-name> command shows the state for each container within the pod.

Container Types

Application Container

The application container is also referred to as "primary container". If a pod declares init containers, the application containers are only run after all init containers complete successfully.
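The ordering between init containers and application containers can be sketched as follows. The names, images and commands are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-example              # hypothetical pod name
spec:
  initContainers:                 # run to completion, in order, first
  - name: prepare
    image: busybox:1.36
    command: ["sh", "-c", "echo preparing; sleep 5"]
  containers:                     # started only after "prepare" succeeds
  - name: app
    image: nginx:1.25
```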

Init Container

Init Containers

Ephemeral Container

Ephemeral Containers

Container States

The container states are tracked by the kubelet, which may restart failed containers, depending on the configuration. This way, a pod is kept running. Container states can also be used as triggers for container lifecycle hooks. To check the state of a container, you can use:

kubectl describe pod <pod-name>

The container state is displayed for each container within that pod.

Waiting

If a container is in neither the Running nor the Terminated state, it is in the Waiting state, in which it is still running the operations it requires to complete startup, such as pulling the container image from the registry or applying Secret data. Querying the pod with kubectl shows a "Reason" field when a container is in the Waiting state.

Running

The container is executing without issues. If there was a "postStart" hook configured, it has already executed and finished.

Terminated

The container began execution and then either ran to completion or failed for some reason. When queried with kubectl, the result shows a reason, an exit code and the start and finish time for the container. If a container has a "preStop" hook configured, the hook runs before the container enters the Terminated state.

Container Unready State

A container may put itself into an unready state regardless of whether the readiness probe exists. The pod remains in the unready state while it waits for the containers in the pod to stop.

Container Images

A container image represents binary data encapsulated as an application and all its dependencies. More details about Docker images are available here. A pod's containers pull their images from their respective repositories while the pod is in Pending phase. More details about how images are pulled and Kubernetes relationship with image registries are available here:

Container Image Pull Concepts

Container Restart Policy

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

Pods are never restarted. Constituent containers may be restarted by the kubelet, subject to the pod's restart policy, as configured in .spec.restartPolicy. The possible values are Always, OnFailure and Never. The default value is Always. The restart policy applies to all containers in the pod. It only refers to restarts of the containers by the kubelet on the same node. After containers in a pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, ...) that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart back-off timer for that container.
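A minimal sketch of a non-default restart policy, with a hypothetical pod whose single container exits with an error so that the kubelet restarts it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-example    # hypothetical pod name
spec:
  restartPolicy: OnFailure        # restart only on non-zero exit
  containers:
  - name: task
    image: busybox:1.36
    # Exits with status 1, so the kubelet restarts it with back-off.
    command: ["sh", "-c", "echo failing; exit 1"]
```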

Container Probes and Pod Health

A probe is a diagnostic performed periodically by the kubelet on a container. Each container can declare a set of probes - liveness, readiness and startup - that are used to evaluate the health of individual containers and the pod as a whole. For more details on how the health of individual containers influences the health of a pod, see Container and Pod Startup Check, Container and Pod Liveness Check and Container and Pod Readiness Check in:

Container Probes
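The three kinds of probes can be declared on a container as in the sketch below. The paths, ports, periods and thresholds are assumptions chosen for illustration, not recommended values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probes-example            # hypothetical pod name
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
    startupProbe:                 # gates the other probes until it succeeds
      httpGet:
        path: /
        port: 80
      periodSeconds: 2
      failureThreshold: 30
    livenessProbe:                # failure leads to a container restart
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
    readinessProbe:               # failure removes the pod from service endpoints
      tcpSocket:
        port: 80
      periodSeconds: 5
```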

Container Lifecycle Hooks

TODO: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/.
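The "postStart" and "preStop" hooks mentioned under Container States are declared per container in the lifecycle field. A minimal sketch, with hypothetical commands:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-example         # hypothetical pod name
spec:
  containers:
  - name: web
    image: nginx:1.25
    lifecycle:
      postStart:                  # runs right after the container is created
        exec:
          command: ["sh", "-c", "echo started > /tmp/started"]
      preStop:                    # runs before the container is terminated
        exec:
          command: ["sh", "-c", "nginx -s quit; sleep 5"]
```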

Scheduling, Preemption and Eviction

Kubernetes Scheduling, Preemption and Eviction Concepts