Kubernetes Storage Concepts
- 1 External
- 2 Internal
- 3 Overview
- 4 Pod Volumes
- 4.1 Volume Types
- 4.1.1 configMap
- 4.1.2 secret
- 4.1.3 downwardAPI
- 4.1.4 emptyDir
- 4.1.5 hostPath
- 4.1.6 local
- 4.1.7 nfs
- 4.1.8 persistentVolumeClaim
- 4.1.9 projected
- 4.1.10 awsElasticBlockStore
- 4.1.11 glusterfs
- 4.1.12 azureDisk
- 5 Mounting Volumes in Pods
- 6 Storage Providers
- 7 Storage Plugins
- 8 Container Storage Interface (CSI)
- 9 API Resources
- 9.1 Storage Class (SC)
- 9.2 Persistent Volume (PV)
- 9.2.1 Difference between a Pod Volume and a Persistent Volume
- 9.2.2 Persistent Volume Manifest
- 9.2.3 Access Mode
- 9.2.4 Reclaim Policy
- 9.2.5 Storage Class Name
- 9.2.6 Capacity
- 9.2.7 Node Affinity
- 9.2.8 CSI
- 9.3 Persistent Volume Claim (PVC)
- 9.4 Volume and Claim Lifecycle and Binding
- 10 Dynamic Volume Provisioning
- 11 Persistent Volume Controller
- 12 Storage Operations
Kubernetes has a mature and feature-rich subsystem called the persistent volume subsystem, which exposes external storage to applications.
The volume outlives any containers that run within the pod, and data is preserved across container restarts. However, when a pod ceases to exist, the volume ceases to exist too. A pod can use multiple volumes at the same time. Conceptually, a pod volume is just a directory that is accessible to the containers in the pod. However, the actual backing medium of the directory and its contents are determined by the particular volume type used. More details on how volumes and volume mounts are declared in the pod manifests are available in:
This type of volume is backed by a ConfigMap API resource instance. For more details, see:
This type of volume is backed by a Secret API resource instance. Secret volumes are backed by tmpfs (a RAM-backed filesystem), so they are never written to non-volatile storage. For more details, see:
A typical secret volume definition looks as follows:
    kind: Pod
    spec:
      [...]
      volumes:
      - name: my-secret-volume
        secret:
          defaultMode: 256
          secretName: my-secret
When projected into the pod, the secret files belong to root:root, even if the pod's security context specifies a runAsUser or runAsGroup. However, if fsGroup is defined in the pod security context, the secret files belong to fsGroup and the file permissions are automatically adjusted so they are readable by the group.
defaultMode specifies the permissions for the files created by the secret volume mount. JSON does not support octal notation, so the "0400" octal value must be converted to decimal (256). If YAML is used, octal notation can be used directly. Note that if fsGroup is declared in the pod security context, the file permissions are automatically adjusted so they are readable by the group, even if the defaultMode is 0400.
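As a sketch of the YAML form, the secret volume defined earlier can declare its mode in octal. The pod name, image, mount path and fsGroup value are illustrative assumptions; only my-secret-volume and my-secret come from the definition above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-mode-example          # hypothetical pod name
spec:
  securityContext:
    fsGroup: 2000                    # assumed value; the secret files become group-owned by 2000
  containers:
  - name: app
    image: busybox:1.36              # assumed image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: my-secret-volume
      mountPath: /etc/my-secret      # assumed mount path
      readOnly: true
  volumes:
  - name: my-secret-volume
    secret:
      secretName: my-secret
      defaultMode: 0400              # octal in YAML; the JSON equivalent is decimal 256
```

With fsGroup set as in this sketch, the files are expected to end up group-readable despite the 0400 mode, as described above.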
An emptyDir volume is erased when the pod is removed.
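A minimal sketch of a pod that uses an emptyDir volume as scratch space follows. The pod name, image, command and mount path are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-example             # hypothetical pod name
spec:
  containers:
  - name: writer
    image: busybox:1.36              # assumed image
    # Write a marker file into the volume, then stay alive.
    command: ["sh", "-c", "date > /cache/started && sleep 3600"]
    volumeMounts:
    - name: cache-volume
      mountPath: /cache              # assumed mount path
  volumes:
  - name: cache-volume
    emptyDir: {}                     # starts empty; removed together with the pod
```

The file survives restarts of the writer container, but not the removal of the pod.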
A hostPath volume mounts a file or a directory from the node's host file system into the pod.
Normally, this is not something that pods should do, as it couples a pod with a specific node. This type of volume might introduce non-determinism in the pod behavior because pods with identical configuration may behave differently on different nodes due to different files on the nodes. The recommended way to consume local storage is via local volumes.
The files or directories created on the underlying hosts are only writable by root. You either need to run your process as root in a privileged container or modify the file permissions on the host to be able to write to a hostPath volume.
    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        ...
        volumeMounts:
        - mountPath: /test-pd
          name: test-volume
      volumes:
      - name: test-volume
        hostPath:
          # directory location on host
          path: /data
          # this field is optional
          type: Directory
Required parameter that specifies the path on the local host filesystem.
An empty string (default) is for backward compatibility, which means that no checks will be performed before mounting the hostPath volume.
Other supported values:
- DirectoryOrCreate. If nothing exists at the given path, an empty directory will be created there as needed, with permissions set to 0755 and the same group and ownership as the kubelet.
- Directory. A directory must exist at the given path.
If we rely on the existence of the directory on the host, and we don't want to create it upon projection, then it is best to use 'type: Directory'. If the directory does not exist on the host path, the pod creation will fail early with "MountVolume.SetUp failed for volume volume-1: hostPath type check failed: /tmp/x is not a directory". Also see hostPath on single-node Kubernetes Clusters (minikube, Docker Desktop Kubernetes) below.
- FileOrCreate. If nothing exists at the given path, an empty file will be created there as needed, with permissions set to 0644 and the same group and ownership as the kubelet.
- File. A file must exist at the given path.
- Socket. A UNIX socket must exist at the given path.
- CharDevice. A character device must exist at the given path.
- BlockDevice. A block device must exist at the given path.
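To illustrate one of the "OrCreate" types listed above, the following sketch mounts a host directory that is created on demand. The pod name, image and host path are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-type-example        # hypothetical pod name
spec:
  containers:
  - name: test
    image: busybox:1.36              # assumed image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch-dir
      mountPath: /scratch            # assumed mount path
  volumes:
  - name: scratch-dir
    hostPath:
      path: /var/local/scratch       # assumed host path
      # Created with mode 0755 if missing. Use "Directory" instead to fail
      # early when the path does not already exist on the node.
      type: DirectoryOrCreate
```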
hostPath on single-node Kubernetes Clusters (minikube, Docker Desktop Kubernetes)
Single-node Kubernetes clusters running in VMs, such as Docker Desktop Kubernetes or minikube, allow access to their host paths only if those paths are "shared" via the cluster's configuration. For Docker Desktop Kubernetes, host directories can be "shared" via Preferences → Resources → File Sharing (see Docker Desktop File Sharing). For minikube running with a VM driver, directories need to be individually mounted, although several are mounted by default (see minikube mount). Minikube in bare-metal mode offers direct access to host directories.
If the path to be mounted as "hostPath" is not among the shared directories, it is interpreted as being relative to the embedded VM that runs the single-node Kubernetes cluster, not to the "outer" host, and it is usually created inside the VM. Because the directory is newly created, it belongs to root:root, which explains why a non-root user cannot write into it. To prevent this behavior and fail early, use "type: Directory" for hostPath.
A local volume is storage physically attached to the node host. As such, a local volume on a certain node will be only available to pods scheduled on that node. This storage model makes sense for StatefulSets, but not for other pod deployment models: using local storage ties the application to specific nodes, making it harder to schedule. If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. In addition, many cloud providers do not provide extensive data durability guarantees for local storage, so all data could be lost in certain scenarios. Applications that are suitable for local storage should be tolerant of node failures, data unavailability, and data loss (e.g. Cassandra).
The local volume mechanism allows exposing a local disk, partition or directory. The storage can be exposed to the pod as block storage (an alpha feature at the time of writing, useful for workloads that need to directly access block devices and manage their own data format) or as a filesystem.
Local volumes are available since v1.14.
Before any persistent volume claims for local persistent volumes are created, a dedicated storage class with the volumeBindingMode set to 'WaitForFirstConsumer' must be created. An example is available here.
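A minimal sketch of such a storage class is shown below. The class name local-storage is an assumption; kubernetes.io/no-provisioner indicates that volumes of this class are not dynamically provisioned.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage                        # assumed class name
# No dynamic provisioning: local persistent volumes are created statically.
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod that uses the claim is scheduled, so the
# scheduler can take the node where the volume lives into account.
volumeBindingMode: WaitForFirstConsumer
```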
local Volume Operations
An nfs volume allows an existing NFS share to be mounted into pods. Unlike emptyDir, which is erased when a pod is removed, the contents of an nfs volume are preserved and the volume is merely unmounted. This makes it possible to pre-populate nfs volumes with data, and to hand off data to pods and between pods. NFS can be mounted by multiple writers simultaneously. The NFS server must be running and the share exported before it can be used as an nfs volume. This is how a pod mounts an NFS volume:
    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        ...
        volumeMounts:
        - mountPath: "/something"
          name: nfs-volume
      volumes:
      - name: nfs-volume
        nfs:
          # the hostname or IP address of the NFS server
          server: 10.10.2.249
          path: /opt/nfs0
Important: The Kubernetes node hosts on which pods that attempt to mount nfs volumes are scheduled must have the NFS client dependencies installed, as described in NFS Client Installation, otherwise the mount will fail with messages similar to "mount: wrong fs type, bad option, bad superblock on 1..."
NFS volume example:
A persistentVolumeClaim volume is used to mount a persistent volume into the pod, by raising a "claim" to storage, in the form of a persistent volume claim API object. This mode allows getting storage without knowing the details of a particular environment. This is how a pod requests a persistent volume:
    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        ...
        volumeMounts:
        - mountPath: "/something"
          name: pvc-volume
      volumes:
      - name: pvc-volume
        persistentVolumeClaim:
          claimName: test-pvc
If the same claim name is reused for a volume with a different name, the pod will fail to start with:
Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-j6wgp persistent-storage persistent-storage-2]: timed out waiting for the condition
persistentVolumeClaim and hostPath
A hostPath (local directory) can be exposed to a pod as a persistent volume, attached to the pod via a persistent volume claim:
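One way this arrangement might look is sketched below: a persistent volume backed by a hostPath directory, and a claim that can bind to it. The names hostpath-pv and manual, the /data path and the 1Gi size are assumptions; test-pvc is the claim name used by the pod example above.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpath-pv                # hypothetical PV name
spec:
  storageClassName: manual         # assumed class name, used only to pair PV and PVC
  capacity:
    storage: 1Gi                   # assumed size
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data                    # assumed directory on the node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  storageClassName: manual         # must match the PV's class for the two to bind
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                 # must not exceed the PV capacity
```

A pod then references the claim through a persistentVolumeClaim volume with claimName: test-pvc, without needing to know that the backing storage is a host directory.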
TODO: A projected volume maps several existing volume sources into the same directory.
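As a sketch of what such a mapping might look like, the pod below projects a secret, a config map and a downward API item into one directory. The pod name, image, mount path and the my-config config map are assumptions; my-secret is the secret name used earlier in this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: projected-example            # hypothetical pod name
spec:
  containers:
  - name: test
    image: busybox:1.36              # assumed image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: all-in-one
      mountPath: /projected-volume   # assumed mount path
      readOnly: true
  volumes:
  - name: all-in-one
    projected:
      sources:
      - secret:
          name: my-secret
      - configMap:
          name: my-config            # hypothetical ConfigMap
      - downwardAPI:
          items:
          - path: labels             # exposes the pod labels as a file named "labels"
            fieldRef:
              fieldPath: metadata.labels
```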
An awsElasticBlockStore volume mounts an Amazon Elastic Block Store volume into the pod. The EBS volume is a raw block volume. When the pod is removed, the contents of the EBS volume are preserved, and the EBS volume is merely unmounted. This means it can be pre-populated with data, which can be handed off to pods. To use awsElasticBlockStore volumes, the nodes on which pods are running must be AWS EC2 instances, and those instances need to be in the same region and availability zone as the EBS volume. EBS only supports a single EC2 instance mounting a volume.
    apiVersion: v1
    kind: Pod
    metadata:
      name: test
    spec:
      containers:
      - name: test
        ...
        volumeMounts:
        - mountPath: /test-ebs
          name: test-volume
      volumes:
      - name: test-volume
        # This AWS EBS volume must already exist
        awsElasticBlockStore:
          volumeID: <volume-id>
          fsType: ext4
A glusterfs volume allows a GlusterFS volume to be mounted into the pod.
Mounting Volumes in Pods
Storage is made available to a Kubernetes cluster by storage providers. The Kubernetes persistent volume subsystem supports, among others:
- iSCSI volumes
- NFS volumes
- Enterprise storage arrays from vendors like EMC and NetApp
- object storage blobs
- Amazon Elastic Block Store block devices
- Azure File resources, AzureDisk. See Azure Kubernetes Storage below.
- GCE Persistent Disks
- GlusterFS volumes
Each storage provider has its own plugin that handles the details of exposing the storage to the Kubernetes cluster.
Azure Kubernetes Storage
The terms "storage plugin" and "provisioner" can be used interchangeably. "Provisioner" is used especially when dynamic provisioning is involved. "Driver" is another equivalent term for storage plugin.
Old storage plugins used to be implemented as part of the main Kubernetes code tree (in-tree storage plugins), which raised a series of problems: all of them had to be open-source, and their release cycle was tied to the Kubernetes release cycle. Newer plugins are based on the Container Storage Interface (CSI) and can be developed out-of-tree.
- https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner - a local volume static provisioner that manages the persistent volume lifecycle for pre-allocated disks by detecting and creating PVs for each local disk on the host, and cleaning up the disks when released. It does not support dynamic provisioning.
Container Storage Interface (CSI)
Container Storage Interface (CSI) is an open standard that provides a clean interface for storage plugins and abstracts the internal Kubernetes storage details. CSI provides the means for external storage to be leveraged in a uniform way across multiple container orchestrators, not only Kubernetes. Both block and filesystem storage can be exposed via CSI.
CSI drivers installed in the cluster are represented by CSIDriver API resources, which can be listed with:
kubectl get csidriver
Amazon EFS CSI
Persistent Volume CSI Configuration
See CSI below.
Storage Class (SC)
A storage class is an API resource that allows the definition of a class or tier of storage, from which an application can then dynamically request storage. Storage classes are not namespaced.
For an overview of how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
Different classes might map to quality-of-service levels, or to backup policies, or to arbitrary policies defined by the cluster administrators. The types of storage classes that can be defined depend on the types of external storage the Kubernetes cluster has access to. A pod can use a dynamically-provisioned persistent volume from a specific storage class by using a persistent volume claim that references that storage class by name. The persistent volume that will provide the storage does not need to be created or declared: the provisioner associated with the storage class creates the persistent volume dynamically. Once the storage class is deployed, its provisioner watches the API server for new PVC objects that reference the class name. When a matching persistent volume claim appears, the provisioner dynamically creates the required persistent volume.
The storage class resources are defined in the storage.k8s.io/v1 API group. Each storage class object relates to a single provisioner. StorageClass objects are immutable: they cannot be modified once deployed.
Storage Class Manifest
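A sketch of a storage class manifest is shown below. The class name fast and the parameter values are assumptions; the provisioner shown is the Amazon EBS CSI driver, picked only as an example, and the supported parameters differ from one provisioner to another.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                       # hypothetical class name, referenced by claims
provisioner: ebs.csi.aws.com       # example provisioner; each class relates to exactly one
parameters:
  type: gp3                        # provisioner-specific parameter (assumed value)
reclaimPolicy: Delete              # applied to volumes created dynamically from this class
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```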
Default Storage Class
For the time being, the default storage class is set via annotations. If the cluster has a default storage class, a pod can be deployed using just a persistent volume claim - the storage class does not need to be manually created.
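The annotation in question is storageclass.kubernetes.io/is-default-class. A minimal sketch follows; the class name standard and the provisioner are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard                   # hypothetical class name
  annotations:
    # Marks this class as the cluster default. Only one class should carry it.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com       # example provisioner (assumption)
```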
Persistent Volume (PV)
The persistent volume is the API resource that maps onto external storage assets and makes them accessible to the Kubernetes cluster and to applications. Each persistent volume is an object in the Kubernetes cluster that maps back to a specific storage asset (LUN, share, blob, etc.). A single external storage asset can only be used by a single persistent volume.
The persistent volume lasts for the cluster lifetime, unlike a pod volume, which lasts for the pod lifetime.
A pod can use a persistent volume by indicating a persistent volume claim (see below) whose access mode, storage class name and capacity match those of the persistent volume. The pod cannot specify a persistent volume directly: the match is intermediated by the Kubernetes cluster. For an overview of how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
Difference between a Pod Volume and a Persistent Volume
Persistent Volume Manifest
The binding between a Persistent Volume and its Persistent Volume Claims can be made in one mode only. It is not possible for a persistent volume to have one Persistent Volume Claim bound to a Persistent Volume in ReadOnlyMany mode and another Persistent Volume Claim bound to the same volume in ReadWriteMany mode.
This mode defines a Persistent Volume that can only be bound in read/write mode by a single Persistent Volume Claim. An attempt to bind it via multiple Persistent Volume Claims will fail. Block storage normally supports only RWO.
This mode defines a Persistent Volume that can be bound in read only mode by multiple Persistent Volume Claims.
The reclaim policy tells Kubernetes what to do with a persistent volume when its persistent volume claim has been released.
This policy deletes the persistent volume and the underlying associated external storage resource, on the external storage system. This is the default policy for volumes that are created dynamically via a storage class.
This policy keeps the persistent volume in the cluster, as well as the underlying associated external storage resource, on the external storage system. However, it will prevent another persistent volume claim from using the persistent volume. To reuse the space associated with a retained persistent volume, the persistent volume should be manually deleted, the underlying external storage reformatted and then the persistent volume should be recreated.
Local persistent volumes can only support a "Retain" reclaim policy. The administrator must manually clean up and set up the local volume again for reuse.
Storage Class Name
The capacity, expressed in the persistent volume manifest, can be less than the actual underlying physical storage, but cannot be more.
The Kubernetes scheduler uses the node affinity configuration of a local persistent volume to understand which node host the storage for the volume is available on.
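The attributes discussed in this section (access mode, reclaim policy, storage class name, capacity and node affinity) can be seen together in a sketch of a local persistent volume manifest. All names, the path and the size are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv                 # hypothetical PV name
spec:
  capacity:
    storage: 100Gi                       # may be less than the physical disk, never more
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain  # local volumes are cleaned up manually
  storageClassName: local-storage        # assumed class with WaitForFirstConsumer binding
  local:
    path: /mnt/disks/ssd1                # assumed mount point of the disk on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node                 # hypothetical node that physically holds the disk
```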
Persistent Volume Claim (PVC)
Pods do not act directly on persistent volumes. They need a Persistent Volume Claim, which is an API resource object that is bound to the Persistent Volume the pod wants to use. A Persistent Volume Claim is similar to a ticket that authorizes a pod to use a certain Persistent Volume. Once an application has a Persistent Volume Claim, it can mount the respective volume into its pod.
Persistent Volume Claims are namespaced, so their "effective" name is <namespace>/<claim-name>. Two different Persistent Volume Claims with the same name, but declared in different namespaces are different, so if one is bound to a Persistent Volume, the other cannot be bound to the same volume.
A Persistent Volume Claim can be bound to one and only one Persistent Volume. However, multiple pods can use the same Persistent Volume Claim, accessing and sharing the same Persistent Volume, if the persistent volume storage allows sharing. For an in-depth discussion on how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
Persistent Volume Claims and Storage Class
A claim may request a particular storage class by specifying its name, using the attribute storageClassName. If the claim expressly requests a class, only the persistent volumes of that class can be bound to the claim. Claims do not necessarily have to request a class. A claim with its storageClassName set to "" is always interpreted to be requesting a persistent volume with no class, so it can only be bound to persistent volumes with no class (no annotation or one set equal to "").
A claim with no storageClassName is not quite the same and is treated differently by the cluster, depending on whether the DefaultStorageClass admission controller is turned on. The DefaultStorageClass admission controller observes the creation of PersistentVolumeClaim objects that do not request any specific storage class and automatically adds a default storage class to them. This way, users that do not request any special storage class do not need to care about storage classes at all: they will get the default one. When more than one storage class is marked as default, the admission controller rejects the creation of any persistent volume claim with an error, and an administrator must revisit the StorageClass objects and mark only one as default. This admission controller ignores any persistent volume claim updates; it acts only on creation.
The admission controller does not do anything when no default storage class is configured: the claims with no explicit storage class will only be bound to matching persistent volumes with no storage class, if any. If the matching persistent volumes belong to an explicit storage class, they won't bind, because the storage classes of the claim and of the persistent volume must match to bind.
An optional persistent volume name can be specified in the persistent volume claim spec, using the volumeName attribute. More qualified content here.
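The three cases described above can be sketched side by side. The claim names, the class name fast and the sizes are assumptions.

```yaml
# Case 1: the claim expressly requests a class; only volumes of that class can bind.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-with-class             # hypothetical name
spec:
  storageClassName: fast             # hypothetical class name
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Case 2: the empty string requests a volume with no class at all.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-no-class               # hypothetical name
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
# Case 3: storageClassName is omitted. If the DefaultStorageClass admission
# controller is on and a default class exists, that class is added at creation.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-default-class          # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```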
Persistent Volume Claim Manifest
Persistent Volume Claim Template
Volume and Claim Lifecycle and Binding
TODO: next time I am here, process https://kubernetes.io/docs/concepts/storage/persistent-volumes/#lifecycle-of-a-volume-and-claim and then integrate this:
A persistent volume is a cluster-level resource. A persistent volume claim is a request for a persistent volume resource, and acts as a claim check to the resource. To get access to storage, a pod lists a persistent volume claim in its volumes list. The persistent volume claim must exist as an API resource. It usually specifies a storage class. Can a PVC request a specific PV, not a generic PV from a storage class? During the pod deployment, the appropriate persistent volume from the storage class is identified (if it exists), allocated and bound to the persistent volume claim, and thus bound to the pod. A persistent volume can be associated with one and only one persistent volume claim. However, multiple pods can use the same persistent volume claim, thus sharing the persistent volume. The binding between a persistent volume and a persistent volume claim is reflected both in the manifest of the persistent volume claim:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: efs-claim
      [...]
    spec:
      [...]
      volumeMode: Filesystem
      volumeName: efs-pv
and in the manifest of the persistent volume:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: efs-pv
    spec:
      [...]
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: efs-claim
        namespace: dev
        resourceVersion: "18663986"
        uid: a139cd2f-3223-4caa-bdd1-9b6d80ca7b1
Dynamic Volume Provisioning
As of 2019, dynamic provisioning of local volumes is under design.
Persistent Volume Controller
The persistent volume controller matches persistent volume claims with suitable persistent volumes. It is part of the controller manager.