Kubernetes Storage Concepts
Internal
Overview
Kubernetes has a mature and feature-rich subsystem called the persistent volume subsystem, which exposes external storage to applications.
Pod Volumes
Regardless of where it comes from, external storage is exposed to pods in the form of volumes (or pod volumes, as opposed to persistent volumes).
A Kubernetes pod volume has the same lifetime as the pod that encloses it.
The volume outlives any containers that run within the pod, and data is preserved across container restarts. However, when a pod ceases to exist, its volumes cease to exist as well. A pod can use multiple volumes at the same time. Conceptually, a pod volume is just a directory that is accessible to the containers in the pod. However, the actual backing medium of the directory and its contents are determined by the particular volume type used. More details on how volumes and volume mounts are declared in the pod manifests are available in:
Also see the difference between a pod volume and a persistent volume.
Volume Types
configMap
This type of volume is backed by a ConfigMap API resource instance. For more details, see:
secret
This type of volume is backed by a Secret API resource instance. secret volumes are backed by tmpfs (RAM-backed filesystem) so they are never written to non-volatile storage. For more details, see:
downwardAPI
emptyDir
An emptyDir volume is erased when the pod is removed.
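As an illustration, a pod that declares an emptyDir volume and mounts it into its container could be sketched as follows. The image, the command, the mount path and the volume name are placeholders chosen for this example, not values taken from a real deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: busybox               # placeholder image
    command: ["sleep", "3600"]   # placeholder command that keeps the container running
    volumeMounts:
    - mountPath: /cache          # placeholder mount path
      name: cache-volume
  volumes:
  - name: cache-volume
    # starts out empty and is erased when the pod is removed
    emptyDir: {}
```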
hostPath
A hostPath volume mounts a file or a directory from the node's host file system into the pod.
Normally, this is not something that pods should do, as it couples a pod with a specific node. This type of volume might introduce non-determinism in the pod behavior because pods with identical configuration may behave differently on different nodes due to different files on the nodes. The recommended way to consume local storage is via local volumes.
The files or directories created on the underlying hosts are only writable by root. You either need to run your process as root in a privileged container or modify the file permissions on the host to be able to write to a hostPath volume.
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    ...
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data
      # this field is optional
      type: Directory
path
Required parameter that specifies the path on the local host filesystem.
type
An empty string (default) is for backward compatibility, which means that no checks will be performed before mounting the hostPath volume.
Other supported values:
- DirectoryOrCreate. If nothing exists at the given path, an empty directory will be created there as needed, with permissions set to 0755 and the same group and ownership as the kubelet.
- Directory. A directory must exist at the given path.
- FileOrCreate. If nothing exists at the given path, an empty file will be created there as needed, with permissions set to 0644 and the same group and ownership as the kubelet.
- File. A file must exist at the given path.
- Socket. A UNIX socket must exist at the given path.
- CharDevice. A character device must exist at the given path.
- BlockDevice. A block device must exist at the given path.
local
A local volume is storage physically attached to the node host. As such, a local volume on a certain node will be only available to pods scheduled on that node. This storage model makes sense for StatefulSets, but not for other pod deployment models: using local storage ties the application to specific nodes, making it harder to schedule. If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. In addition, many cloud providers do not provide extensive data durability guarantees for local storage, so all data could be lost in certain scenarios. Applications that are suitable for local storage should be tolerant of node failures, data unavailability, and data loss (e.g. Cassandra).
The local volume mechanism allows exposing a local disk, partition or directory. The storage can be exposed to the pod as a filesystem or as block storage. Block storage was an alpha feature at the time of writing and is useful to workloads that need to access block devices directly and manage their own data format.
Local volumes are available since v1.14.
Before any persistent volume claims for local persistent volumes are created, a dedicated storage class with the volumeBindingMode set to 'WaitForFirstConsumer' must be created. An example is available here.
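For illustration, such a storage class could be sketched as follows. The name local-storage is a hypothetical choice. The kubernetes.io/no-provisioner value reflects the assumption that the local volumes are statically provisioned.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage   # hypothetical name
# assumption: local volumes are created statically, so no dynamic provisioner is involved
provisioner: kubernetes.io/no-provisioner
# delay binding until a pod that uses the claim is scheduled
volumeBindingMode: WaitForFirstConsumer
```

Delaying the binding lets the scheduler take the pod's other constraints into account before a local persistent volume, and with it a node, is selected.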
local Volume Operations
nfs
An nfs volume allows an existing NFS share to be mounted into pods. Unlike emptyDir, which is erased when a pod is removed, the contents of an nfs volume are preserved and the volume is merely unmounted. This makes it possible to pre-populate nfs volumes with data, and to hand off data to pods and between pods. NFS can be mounted by multiple writers simultaneously. The NFS server must be running and the share exported before it can be used as an nfs volume. This is how a pod mounts an NFS volume:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    ...
    volumeMounts:
    - mountPath: "/something"
      name: nfs-volume
  volumes:
  - name: nfs-volume
    nfs:
      # the hostname or IP address of the NFS server
      server: 10.10.2.249
      path: /opt/nfs0
Important: The Kubernetes nodes on which pods that mount nfs volumes are scheduled must have the NFS client dependencies installed, as described in NFS Client Installation. Otherwise the mount will fail with messages similar to "mount: wrong fs type, bad option, bad superblock on 1..."
NFS volume example:
Also see:
persistentVolumeClaim
A persistentVolumeClaim volume is used to mount a persistent volume into the pod, by raising a "claim" to storage, in form of a persistent volume claim API object. This mode allows getting storage without knowing the details of a particular environment. This is how a pod requests a persistent volume:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    ...
    volumeMounts:
    - mountPath: "/something"
      name: pvc-volume
  volumes:
  - name: pvc-volume
    persistentVolumeClaim:
      claimName: test-pvc
persistentVolumeClaim Idiosyncrasies
If the same claim name is reused for a volume with a different name, the pod will fail to start, with an error similar to:
Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-j6wgp persistent-storage persistent-storage-2]: timed out waiting for the condition
persistentVolumeClaim and hostPath
A hostPath (local directory) can be exposed to a pod as a persistent volume, attached to the pod via a persistent volume claim:
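A minimal sketch of this arrangement is shown below, assuming a /data directory on the node and a 1Gi capacity. The resource names and the storage class name manual are hypothetical. The class name only serves to match the claim to the volume and does not have to correspond to an existing StorageClass object.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-hostpath-pv       # hypothetical name
spec:
  storageClassName: manual     # hypothetical class name, used only for matching
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data                # assumed directory on the node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  storageClassName: manual     # must match the class name of the persistent volume
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

A pod then references test-pvc through a persistentVolumeClaim volume, in the same way as for any other claim.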
projected
TODO: A projected volume maps several existing volume sources into the same directory.
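As a sketch of the idea, the pod below projects a secret, a config map and downward API data into a single directory. The names test-secret and test-config are hypothetical and are assumed to exist in the pod's namespace. The image, the command and the mount path are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: busybox               # placeholder image
    command: ["sleep", "3600"]   # placeholder command
    volumeMounts:
    - mountPath: /projected      # all sources appear under this directory
      name: all-in-one
      readOnly: true
  volumes:
  - name: all-in-one
    projected:
      sources:
      - secret:
          name: test-secret      # hypothetical Secret
      - configMap:
          name: test-config      # hypothetical ConfigMap
      - downwardAPI:
          items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels
```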
awsElasticBlockStore
An awsElasticBlockStore volume mounts an Amazon Elastic Block Store volume into the pod. The EBS volume is a raw block volume. When the pod is removed, the contents of the EBS volume are preserved, and the EBS volume is merely unmounted. This means it can be pre-populated with data, which can be handed off to pods. To use awsElasticBlockStore volumes, the nodes on which pods are running must be AWS EC2 instances, and those instances need to be in the same region and availability zone as the EBS volume. EBS only supports a single EC2 instance mounting a volume.
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    ...
    volumeMounts:
    - mountPath: /test-ebs
      name: test-volume
  volumes:
  - name: test-volume
    # This AWS EBS volume must already exist
    awsElasticBlockStore:
      volumeID: <volume-id>
      fsType: ext4
glusterfs
A glusterfs volume allows a GlusterFS volume to be mounted into the pod.
Also see:
azureDisk
Mounting a Volume in Pod
Mounting the same volume (specified by its name) multiple times, with different mount characteristics, such as different mount points or subPaths, is permitted.
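A sketch of such a configuration follows: the same volume is mounted twice in one container, each time with a different mount point and subPath. The image, the command, the paths and the volume name are placeholders, and an emptyDir volume is used only to keep the example self-contained.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: busybox               # placeholder image
    command: ["sleep", "3600"]   # placeholder command
    volumeMounts:
    # first mount: the "config" subdirectory of the volume, read-only
    - mountPath: /var/app/config
      name: shared-volume
      subPath: config
      readOnly: true
    # second mount: the "data" subdirectory of the same volume, read/write
    - mountPath: /var/app/data
      name: shared-volume
      subPath: data
  volumes:
  - name: shared-volume
    emptyDir: {}
```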
TODO consolidate with Pod Manifest - volumeMounts.
Storage Providers
Storage is made available to a Kubernetes cluster by storage providers. The Kubernetes persistent volume subsystem supports, among others:
- iSCSI volumes
- SMB
- NFS volumes
- Enterprise storage arrays from vendors like EMC and NetApp
- object storage blobs
- Amazon Elastic Block Store block devices
- Azure File resources, AzureDisk. See Azure Kubernetes Storage below.
- GCE Persistent Disks
- GlusterFS volumes
Each storage provider has its own plugin that handles the details of exposing the storage to the Kubernetes cluster.
Azure Kubernetes Storage
See: Azure Kubernetes Storage
Storage Plugins
The terms "storage plugin" and "provisioner" can be used interchangeably. "Provisioner" is used especially when dynamic provisioning is involved. "Driver" is another equivalent term for storage plugin.
Old storage plugins used to be implemented as part of the main Kubernetes code tree (in-tree), which raised a series of problems, such as that all had to be open-source and their release cycle was tied to the Kubernetes release cycle. Newer plugins are based on the Container Storage Interface (CSI) and can be developed out-of-tree.
Plugin Types
kubernetes.io/no-provisioner
kubernetes.io/aws-ebs
kubernetes.io/gce-pd
Other Provisioners
- quay.io/kubernetes_incubator/nfs-provisioner
- https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner - a local volume static provisioner that manages the persistent volume lifecycle for pre-allocated disks by detecting and creating PVs for each local disk on the host, and cleaning up the disks when released. It does not support dynamic provisioning.
Container Storage Interface (CSI)
Container Storage Interface (CSI) is an open standard that provides a clean interface for storage plugins and abstracts the internal Kubernetes storage details. CSI provides means so the external storage can be leveraged in a uniform way across multiple container orchestrators - not only Kubernetes. Both block and filesystem storage can be exposed via CSI.
CSIDriver
The Kubernetes resources supporting the CSIDriver.
kubectl get csidriver
Amazon EFS CSI
Azure CSI
Persistent Volume CSI Configuration
See CSI below.
API Resources
The persistent volume subsystem consists of the following three API resource types that allow applications to consume storage: persistent volumes, persistent volume claims and storage classes.
Storage Class (SC)
A storage class is an API resource that allows the definition of a class or tier of storage, from which an application can then dynamically request storage. Storage classes are not namespaced.
For an overview of how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
Different classes might map to quality-of-service levels, to backup policies, or to arbitrary policies defined by the cluster administrators. The kinds of storage classes that can be defined depend on the types of external storage the Kubernetes cluster has access to. A pod can use a dynamically-provisioned persistent volume from a specific storage class by using a persistent volume claim that references that storage class by name. The persistent volume that will provide the storage does not need to be created or declared in advance: it is created dynamically by the provisioner associated with the storage class. Once the storage class is deployed, the provisioner watches the API server for new persistent volume claims that reference the class by name. When a matching persistent volume claim appears, the provisioner dynamically creates the required persistent volume.
The storage class resources are defined in the storage.k8s.io/v1 API group. Each storage class object relates to a single provisioner. StorageClass objects are immutable: they cannot be modified once deployed.
Storage Class Manifest
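A minimal sketch of a storage class manifest, together with a claim that requests storage from it, could look like the following. The class name fast, the gp2 volume type and the requested size are assumptions made for this example. The manifest also assumes a cluster in which the kubernetes.io/aws-ebs provisioner is available.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                   # hypothetical class name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2                    # provisioner-specific parameter, assumed for this example
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  storageClassName: fast       # references the storage class by name
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```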
Default Storage Class
For the time being, the default storage class is set via annotations. If the cluster has a default storage class, a pod can be deployed using just a persistent volume claim - the storage class does not need to be manually created.
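As a sketch, a storage class is marked as default with the storageclass.kubernetes.io/is-default-class annotation. The class name standard and the choice of provisioner are assumptions made for this example.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard               # hypothetical class name
  annotations:
    # marks this class as the cluster default
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd   # assumed provisioner
```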
Examples
Persistent Volume (PV)
The persistent volume is the API resource that maps onto external storage assets and makes them accessible to the Kubernetes cluster and to applications. Each persistent volume is an object in the Kubernetes cluster that maps back to a specific storage asset (LUN, share, blob, etc.). A single external storage asset can only be used by a single persistent volume.
The persistent volume lasts for the cluster lifetime, unlike a pod volume, which lasts for the pod lifetime.
A pod can use a persistent volume by indicating a persistent volume claim (see below) whose access mode, storage class name and capacity match those of the persistent volume. The pod cannot specify a persistent volume directly: the match is mediated by the Kubernetes cluster. For an overview of how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
From a declarative perspective, to get a persistent volume storage, the pod lists a persistentVolumeClaim volume among the required volumes in its manifest, as shown above.
Difference between a Pod Volume and a Persistent Volume
Persistent Volume Manifest
Access Mode
A Persistent Volume can be mounted using only one access mode at a time, even if it supports several. It is not possible for the same Persistent Volume to be used in ReadOnlyMany mode and in ReadWriteMany mode at the same time. Note that the access modes describe how nodes can mount the volume: a Persistent Volume is always bound to a single Persistent Volume Claim, regardless of its access mode.
ReadWriteOnce (RWO)
This mode defines a Persistent Volume that can be mounted in read/write mode by a single node. Multiple pods running on that node can still share the volume. In general, block storage supports only RWO.
ReadWriteMany (RWX)
This mode defines a Persistent Volume that can be mounted in read/write mode by multiple nodes. In general, file storage and object storage support RWX.
ReadOnlyMany (ROX)
This mode defines a Persistent Volume that can be mounted in read-only mode by multiple nodes.
Reclaim Policy
The reclaim policy tells Kubernetes what to do with a persistent volume when its persistent volume claim has been released.
Delete
This policy deletes the persistent volume and the underlying associated external storage resource, on the external storage system. This is the default policy for volumes that are created dynamically via a storage class.
Retain
This policy keeps the persistent volume in the cluster, as well as the underlying associated external storage resource, on the external storage system. However, it will prevent another persistent volume claim from using the persistent volume. To reuse the space associated with a retained persistent volume, the persistent volume should be manually deleted, the underlying external storage reformatted and then the persistent volume should be recreated.
Local persistent volumes can only support a "Retain" reclaim policy. The administrator must manually clean up and set up the local volume again for reuse.
Storage Class Name
Capacity
The capacity, expressed in the persistent volume manifest, can be less than the actual underlying physical storage, but cannot be more.
Node Affinity
The Kubernetes scheduler uses the node affinity configuration of a local persistent volume to determine which node hosts the storage for the volume.
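The attributes described above can be seen together in the following sketch of a local persistent volume. The volume name, the disk path, the capacity, the storage class name and the node name are hypothetical values chosen for this example.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-local-pv                    # hypothetical name
spec:
  capacity:
    storage: 100Gi                       # must not exceed the underlying physical storage
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage        # hypothetical class name
  local:
    path: /mnt/disks/ssd1                # hypothetical path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node                 # hypothetical node name
```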
CSI
driver
volumeHandle
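As a sketch, the csi section of a persistent volume names the CSI driver that serves the volume and a driver-specific handle that identifies the storage asset. The example below assumes the Amazon EFS CSI driver. The volume name, capacity and storage class name are hypothetical, and <file-system-id> is a placeholder to be replaced with a real identifier.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                         # assumed value
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc               # hypothetical class name
  csi:
    driver: efs.csi.aws.com              # name of the CSI driver
    volumeHandle: <file-system-id>       # placeholder, identifies the storage asset to the driver
```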
Persistent Volume Claim (PVC)
Pods do not act directly on persistent volumes. They need a Persistent Volume Claim, which is an API resource object that is bound to the Persistent Volume the pod wants to use. A Persistent Volume Claim is similar to a ticket that authorizes a pod to use a certain Persistent Volume. Once an application has a Persistent Volume Claim, it can mount the respective volume into its pod.
Persistent Volume Claims are namespaced, so their "effective" name is <namespace>/<claim-name>. Two different Persistent Volume Claims with the same name, but declared in different namespaces are different, so if one is bound to a Persistent Volume, the other cannot be bound to the same volume.
A Persistent Volume Claim can be bound to one and only one Persistent Volume. However, multiple pods can use the same Persistent Volume Claim, accessing and sharing the same Persistent Volume, if the persistent volume storage allows sharing. For an in-depth discussion on how storage classes, volumes and volume claims work together, see Volume and Claim Lifecycle and Binding below.
From a declarative perspective, to get a persistent volume storage, the pod lists a persistentVolumeClaim volume among the required volumes in its manifest, as shown above.
Persistent Volume Claims and Storage Class
A claim may request a particular storage class by specifying its name, using the attribute storageClassName. If the claim expressly requests a class, only the persistent volumes of that class can be bound to the claim. Claims do not necessarily have to request a class. A claim with its storageClassName set to "" is always interpreted as requesting a persistent volume with no class, so it can only be bound to persistent volumes with no class (no annotation or one set equal to "").
A claim with no storageClassName is not quite the same and is treated differently by the cluster, depending on whether the DefaultStorageClass admission controller is turned on. The DefaultStorageClass admission controller observes the creation of PersistentVolumeClaim objects that do not request any specific storage class and automatically adds a default storage class to them. This way, users that do not request any special storage class do not need to care about storage classes at all: they get the default one. When more than one storage class is marked as default, the admission controller rejects the creation of any persistent volume claim with an error, and an administrator must revisit the StorageClass objects and mark only one as default. This admission controller ignores any persistent volume claim updates; it acts only on creation.
The admission controller does not do anything when no default storage class is configured: the claims with no explicit storage class will only be bound to matching persistent volumes with no storage class, if any. If the matching persistent volumes belong to an explicit storage class, they will not bind, because the storage classes of the claim and of the persistent volume must match.
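The three cases can be sketched as follows. The claim names, the class name fast and the requested sizes are hypothetical values chosen for this example.

```yaml
# Case 1: the claim expressly requests a class.
# Only persistent volumes of class "fast" can be bound to it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-with-class
spec:
  storageClassName: fast
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# Case 2: storageClassName is set to "".
# The claim can only be bound to persistent volumes with no class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-with-empty-class
spec:
  storageClassName: ""
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# Case 3: storageClassName is omitted.
# If the DefaultStorageClass admission controller is on and a default class
# exists, the default class is added to the claim at creation time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim-without-class
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```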
An optional persistent volume name can be specified in the persistent volume claim spec, using the volumeName attribute. More qualified content here.
Persistent Volume Claim Manifest
Persistent Volume Claim Template
Volume and Claim Lifecycle and Binding
TODO: next time I am here, process https://kubernetes.io/docs/concepts/storage/persistent-volumes/#lifecycle-of-a-volume-and-claim and then integrate this:
A persistent volume is a cluster-level resource. A persistent volume claim is a request for a persistent volume resource, and acts as a claim check to the resource. To get access to storage, a pod lists a persistent volume claim in its volumes list. The persistent volume claim must exist as an API resource. It usually specifies a storage class. Can a PVC request a specific PV, not a generic PV from a storage class? During the pod deployment, the appropriate persistent volume from the storage class is identified (if it exists), allocated and bound to the persistent volume claim, and thus bound to the pod. A persistent volume can be associated with one and only one persistent volume claim. However, multiple pods can use the same persistent volume claim, thus sharing the persistent volume. The binding between a persistent volume and a persistent volume claim is reflected both in the manifest of the persistent volume claim and in the manifest of the persistent volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  [...]
spec:
  [...]
  volumeMode: Filesystem
  volumeName: efs-pv
The persistent volume, in turn, refers back to the claim:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  [...]
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: efs-claim
    namespace: dev
    resourceVersion: "18663986"
    uid: a139cd2f-3223-4caa-bdd1-9b6d80ca7b1
Dynamic Volume Provisioning
As of 2019, dynamic provisioning of local volumes is under design.
Persistent Volume Controller
The persistent volume controller matches persistent volume claims with suitable persistent volumes.