Kubernetes Pod and Container Concepts: Difference between revisions
Revision as of 20:25, 26 September 2021
External
- https://kubernetes.io/docs/concepts/workloads/pods/ (fully synced ✓)
Internal
Overview
A pod is the fundamental, atomic compute unit created and managed by Kubernetes. An application is deployed as one or more equivalent pods. There are various strategies to partition applications into pods. A pod groups together one or more containers. There are several types of containers: application containers, init containers and ephemeral containers. Pods are deployed on worker nodes. A pod has a well-defined lifecycle with several phases, and each of the pod's containers can only be in one of a well-defined number of states. Kubernetes learns what happens inside a container through container probes.
Pod
A pod is a group of one or more containers that Kubernetes deploys and manages as a compute unit, together with the specification for how to run the containers. Kubernetes will not manage compute entities with smaller granularity, such as containers or processes. From a resource footprint perspective, a pod is bigger than a container, but smaller than a Virtual Machine. The containers of a pod are atomically deployed and managed as a group. A useful mental model when thinking of a pod is that of a logical host, where all its containers share a context. A pod contains one or more application containers and zero or more init containers.
The equivalent Amazon ECS construct is the task.
Pod Manifest
A pod manifest or a workload resource manifest includes a pod template.
Pod Operation Atomicity
Atomic Success or Failure
The deployment of a pod is an atomic operation. This means that a pod is either entirely deployed, with all its containers co-located on the same node, or not deployed at all. There will never be a situation where a partially deployed pod will be servicing application requests.
All Containers of a Pod are Scheduled on the Same Node
A pod can be scheduled on one node and one node only, regardless of how many containers the pod has. All containers in the pod will always be co-located and co-scheduled on the same node. Only when all pod resources are ready does the pod become available and application traffic get directed to it.
The containers in a pod share a virtual network device (and therefore a unique IP address), storage in the form of filesystem volumes, and access to shared memory. From this perspective, a pod can be thought of as an application-specific logical host with all its processes (containers) sharing the network stack and the storage available to the host. In a pre-container world, these processes would have run on the same physical or virtual host. In line with this analogy, the pod cannot span hosts. The pod's containers are relatively tightly coupled and run within the shared context provided by the pod. The shared context of a pod is a set of Linux namespaces and cgroups. Within a pod's context, individual containers may have further sub-isolations applied.
Pods enable data sharing and communication among their constituent containers.
Networking
Each pod is assigned a unique IP address in the pod network. Inside the pod, all containers share the network namespace, including the IP address and network ports, and can communicate among themselves using localhost. When containers in the pod communicate with entities outside the pod, they must coordinate how they use shared network resources such as ports. The containers in a pod can also communicate with each other using standard inter-process communication mechanisms like System V semaphores and POSIX shared memory. Containers in different pods have distinct IP addresses and cannot communicate via IPC primitives without special configuration; they must use IP networking to communicate with each other.
More details about networking in:
Pod Hostname
Containers within a pod see the system hostname as being the same as the configured name for the pod.
Storage
The files that are created in the root filesystem of a container are stored in the writable layer of the container. This makes these files ephemeral: they are discarded together with the writable layer when the container exits, is stopped or fails. If the containers of a pod intend to store state beyond their existence, they can use the volumes provided by the pod. A pod can specify a set of shared storage volumes. All containers in the pod can access shared volumes.
The most common way to provide storage to pods is in the form of Persistent Volumes, which are a type of cluster-level Kubernetes resource. Persistent volumes can be shared among the containers of a pod and also among different pods. The volumes are declared in the pod specification section of the pod manifest (.spec.volumes). The volume declarations are shared by all containers of that pod. A volume is mounted inside a container as a container volume mount. The volume mounts are specific to a container, and are declared in the .spec.containers[*].volumeMounts field. Each container in the pod must independently specify where to mount each volume. A process in a container sees a filesystem view composed from its container image and volumes. The container image is at the root of the filesystem hierarchy, and any volumes are mounted at the specified paths within the image.
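The relationship between the pod-level volume declaration and the per-container volume mounts can be sketched with a minimal manifest. The pod name, container names, images and mount paths are illustrative assumptions, and an emptyDir volume is used instead of a Persistent Volume only to keep the example self-contained.

```yaml
# Illustrative sketch: all names, images and paths below are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-example
spec:
  volumes:
    # Declared once at pod level (.spec.volumes); available to every container.
    - name: shared-data
      emptyDir: {}
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "while true; do date >> /out/log.txt; sleep 5; done"]
      volumeMounts:
        # Each container chooses its own mount path for the same volume.
        - name: shared-data
          mountPath: /out
    - name: reader
      image: busybox:1.36
      command: ["sh", "-c", "while true; do tail -n 1 /in/log.txt 2>/dev/null; sleep 5; done"]
      volumeMounts:
        - name: shared-data
          mountPath: /in
          readOnly: true
```

Both containers see the same files, but at different paths: the writer appends under /out while the reader sees the same data under /in.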
Also see:
Security Context
Security restrictions and privileges for constituent containers can be set at the pod level, by defining a security context. Some settings, such as running a container in privileged mode, are set in the security context of the individual container instead. More details about pod and container security concepts are available in:
Single-Container Pods vs. Multi-Container Pods
Pods are used in two main ways: pods that run a single container and pods that run multiple containers that work together.
The most common case is to declare a single container in a pod. In this case the pod is an extra wrapper around one container - Kubernetes manages the pod instead of managing the container directly. Even if a pod can accommodate multiple containers, the preferred way to scale an application is to add more one-container pods, instead of adding more containers in a pod.
There are advanced use cases, such as service meshes, that require running multiple containers inside a pod. Containers share a pod when they execute tightly-coupled workloads, provide complementary functionality and need to share resources. Configuring two or more containers in the same pod guarantees that the containers will run on the same node. Some commonly accepted use cases for co-located containers are service meshes and logging. A typical pattern for which this arrangement is common is the sidecar pattern.
Each container of a multi-container pod can be exposed externally on its individual port. The containers share the pod's network namespace, thus the TCP and UDP port ranges.
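As a minimal sketch of a multi-container pod, the manifest below declares two containers that each listen on their own port. The pod name, container names, the metrics image and the port numbers are hypothetical and chosen only for illustration.

```yaml
# Illustrative sketch: names, images and ports are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: two-containers-example
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
    - name: metrics
      image: example.com/metrics-exporter:1.0   # hypothetical image
      ports:
        # Must differ from the port used by "web": both containers
        # share the pod's network namespace and port range.
        - containerPort: 9100
```

Because the port range is shared, two containers in the same pod cannot both bind the same port, which is why each one is given a distinct containerPort.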
Pod State
Pods should not maintain state; they should be handled as expendable. Kubernetes treats pods as static and largely immutable (most changes cannot be made to a pod definition while the pod is running) and as expendable: they do not retain state when they are destroyed and recreated. Therefore, they are managed as workload resources backed by controllers, such as deployments or jobs, not directly by users, though pods can be started and managed individually, if the user so wishes. To modify a pod configuration, the current pod must be terminated, and a new one with a modified base image and/or configuration must be created.
For pods that do maintain state, Kubernetes provides a specialized workload resource named stateful set.
Pod Lifecycle
Pods are usually created by the controllers which manage workload resources, but they can also be created individually. A pod instance is created from a pod template, which can exist by itself in a pod manifest or it can be a part of a workload resource manifest. During their creation phase, the pods are assigned a unique ID (UID).
Once created, a pod is scheduled to run on a node: all its containers are scheduled on the same node. Once scheduled on the node, the pod remains on that node until:
- the pod finishes execution
- the pod resource is deleted
- the pod is evicted for lack of resources
- the node fails.
This is another way of saying that a pod is scheduled once in its lifetime. Once the pod is scheduled (assigned) to a node, the pod will run on that node until one of the conditions listed above is met. This lifecycle is reflected in the pod's phases: Pending, Running, Succeeded, Failed or Unknown. If any of its containers fails while the pod is running, the kubelet will attempt to restart the failed container, depending on the pod's configuration. To be able to do that, the kubelet tracks the pod's container states.
If the template from which a set of pods was created changes, the workload resource controller that created the pods detects the change, creates new pods and deletes the old ones, rather than updating or patching the existing pods.
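To illustrate how a pod template is embedded in a workload resource manifest, here is a minimal Deployment sketch. The resource name, label, image and replica count are illustrative assumptions.

```yaml
# Illustrative sketch: names, labels and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  # The pod template: every pod managed by this Deployment is created from it.
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```

Changing anything under .spec.template, for example the image tag, causes the controller to create new pods from the updated template and delete the old ones.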
It is possible to manage pods directly, by updating some of the fields of a running pod in place with kubectl patch or kubectl replace. However, updating pods in place has limitations. Most pod metadata (namespace, name, uid, creationTimestamp, etc.) is immutable, and generation can only be incremented. More details on in-place pod updates are available here: Pod Update and Replacement.
Pod Phases
Pending
Running
A pod transitions to the Running phase if at least one of its primary containers starts successfully.
Succeeded
A pod transitions to the Succeeded phase if all of its primary containers terminate successfully.
Failed
A pod transitions to the Failed phase if any of its primary containers terminates in failure.
Unknown
Pod Status and Conditions
In the Kubernetes API, pods have both a specification (.spec) and an actual status (.status), which includes the set of pod conditions listed below. It is possible to inject custom readiness information into the condition data for a pod, if that makes sense for the application (TODO: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate).
Pod Conditions
PodScheduled
Pods and Nodes
Once bound to a node, a pod will never be detached from the node and re-bound to another node. The IP address of the node a pod is bound to can be retrieved by pulling the pod metadata, and searching the status for "hostIP". The name of the node can be found in the specification, searching for "nodeName".
apiVersion: v1
kind: Pod
metadata:
  name: [...]
spec:
  nodeName: ip-10-0-12-209.us-west-2.compute.internal
  [...]
status:
  hostIP: 10.0.12.209
  [...]
Pod Placement
There are situations when we want to schedule specific pods to specific nodes, for example a pod running an application that has special memory requirements only some of the nodes can satisfy. Pods can be configured to be scheduled on a specific node, defined by the node name, or on nodes that match a specific node selector.
To assign a pod to nodes that match a node selector, add the "nodeSelector" element to the pod configuration, with a value consisting of key/value pairs. After a successful placement, either by a replication controller or by a DaemonSet, the pod records the successful node selector expression as part of its definition, which can be rendered with kubectl get pod -o yaml. Once bound to a node, a pod will never be relocated to another node.
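A node selector can be sketched as follows. The label key and value ("memory-class: high") are hypothetical and would have to match a label actually applied to the target nodes; the pod name and image are also assumptions.

```yaml
# Illustrative sketch: the node label, pod name and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: high-memory-example
spec:
  # The pod is only scheduled on nodes carrying all of these labels.
  nodeSelector:
    memory-class: high
  containers:
    - name: app
      image: nginx:1.25
```

If no node carries the requested label, the pod remains in the Pending phase until a matching node becomes available.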
Pods and Containers
A pod and its containers have independent lifecycles. A pod is not a process, but an environment for running containers. Containers can be restarted in a pod, but a pod is never restarted: if a pod is gone, it is never resurrected. In the best case, another quasi-identical pod is created to take its place.
Pod Security
Pod Horizontal Scaling
Every pod is meant to run a single instance of a given application. If the application needs to scale to sustain more load, multiple pods should be started. In Kubernetes, this is typically referred to as replication. The equivalent pod instances are referred to as replicas. They are usually created and managed as a group by a workload resource and its controller.
Static Pods
A static pod is managed directly by the kubelet process on a specific node, without the API server observing it. The kubelet directly supervises each static pod and restarts it if it fails, in contrast to regular pods, which are managed by the control plane through a workload resource of some sort. Static pods are always bound to one kubelet on a specific node. The main use for static pods is to run self-hosted control plane components such as the API server, etcd, the scheduler, etc. The kubelet automatically tries to create a mirror pod on the Kubernetes API server for each static pod. This means the static pods running on a node are visible on the API server, but cannot be controlled from there. The specification of a static pod cannot refer to other API objects such as service accounts, config maps, secrets, etc.
Pods and Workload Resources
Pod Operations
Container
TODO:
Container Types
Application Container
The application container is also referred to as "primary container".
Init Container
Ephemeral Container
Container States
The container states are tracked by the kubelet, which may restart failed containers, depending on the configuration. This way, a pod is kept running.
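The configuration that governs container restarts is the pod's restart policy, set in .spec.restartPolicy. The sketch below uses a hypothetical pod whose single container always exits with an error, so the effect of the policy can be observed; the pod name, container name and image are assumptions.

```yaml
# Illustrative sketch: pod name, container name and image are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-example
spec:
  # Accepted values: Always (the default), OnFailure, Never.
  restartPolicy: OnFailure
  containers:
    - name: task
      image: busybox:1.36
      # Exits with a non-zero code, so the kubelet restarts the container.
      command: ["sh", "-c", "exit 1"]
```

With restartPolicy set to Never, the same container would not be restarted and the pod would transition to the Failed phase instead.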
Container Probes and Pod Health
A probe is a diagnostic performed periodically by the kubelet on a container. Each container can declare a set of probes (liveness, readiness and startup) that are used to evaluate the health of individual containers and of the pod as a whole. TODO: summarize the relationship between container probe results and the overall pod status.
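The three probe types can be declared on a container as in the sketch below. The image, port, HTTP paths and timing values are illustrative assumptions, not recommended settings.

```yaml
# Illustrative sketch: image, port, paths and timings are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: probes-example
spec:
  containers:
    - name: app
      image: example.com/app:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      # Liveness and readiness checks are held off until this probe succeeds.
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 30
      # On failure, the kubelet kills the container, which is then
      # subject to the pod's restart policy.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      # On failure, the pod is removed from the endpoints of matching services.
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```

Besides httpGet, a probe can also use an exec command or a tcpSocket check, depending on what the application exposes.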