Kubernetes Pod and Container Security: Difference between revisions
Line 8: | Line 8: | ||
* [[Kubernetes_Security_Concepts#Pod_Security_Context_and_Container_Security_Context|Kubernetes Security Concepts]] | * [[Kubernetes_Security_Concepts#Pod_Security_Context_and_Container_Security_Context|Kubernetes Security Concepts]] | ||
* [[Linux Security Concepts]] | * [[Linux Security Concepts]] | ||
* [[Kubernetes_Pod_Security_Policy_Concepts#Overview|Pod Security Policy Concepts]] | |||
=Overview= | =Overview= | ||
A container instantiated from its image by a container runtime executes by default with access control settings and privileges defined in the image metadata. For example the user and the group various container processes run under are by default specified with the [[Dockerfile#USER|USER directive]] in the container image. The processes in the container run by default in [[Linux_Security_Concepts#Unprivileged_Container|unprivileged mode]] and get by default only a limited set of [[Linux_Capabilities#Overview|Linux capabilities]]. The [[#Pod_Security_Context|pod]] and [[#Container_Security_Context|container]] security contexts, described below, are a declarative method to modify all these run-time settings and get the containers to run with a different runtime configuration. As the name implies, all configuration elements controlled by security contexts are security sensitive. | A container instantiated from its image by a container runtime executes by default with access control settings and privileges defined in the image metadata. For example the user and the group various container processes run under are by default specified with the [[Dockerfile#USER|USER directive]] in the container image. The processes in the container run by default in [[Linux_Security_Concepts#Unprivileged_Container|unprivileged mode]] and get by default only a limited set of [[Linux_Capabilities#Overview|Linux capabilities]]. The [[#Pod_Security_Context|pod]] and [[#Container_Security_Context|container]] security contexts, described below, are a declarative method to modify all these run-time settings and get the containers to run with a different runtime configuration. As the name implies, all configuration elements controlled by security contexts are security sensitive. All privileges and access control settings requested by the security context are subject to verification and override by [[#Pod_Security_Policy|pod security policies]]. 
The cluster admin can restrict the use of the security-related features by creating one or more PodSecurityPolicy resources. | ||
=Pod Security Context= | =Pod Security Context= | ||
Line 31: | Line 32: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
==Elements Specific to the Pod Security Context== | ==Elements Specific to the Pod Security Context== | ||
* <tt>[[#fsGroup|fsGroup]]</tt> | * <tt>[[#fsGroup|fsGroup]]</tt> | ||
* <tt>[[#fsGroupChangePolicy|fsGroupChangePolicy]]</tt> | * <tt>[[#fsGroupChangePolicy|fsGroupChangePolicy]]</tt> | ||
* <tt>[[#supplementalGroups|supplementalGroups]]</tt> | * <tt>[[#supplementalGroups|supplementalGroups]]</tt> | ||
Line 37: | Line 38: | ||
==Elements Shared by the Pod Security Context and Container Security Context== | ==Elements Shared by the Pod Security Context and Container Security Context== | ||
* <tt>[[#runAsUser|runAsUser]]</tt> | * <tt>[[#runAsUser|runAsUser]]</tt> | ||
* <tt>[[#runAsGroup|runAsGroup]]</tt> | * <tt>[[#runAsGroup|runAsGroup]]</tt> | ||
* <tt>[[#runAsNonRoot|runAsNonRoot]]</tt> | * <tt>[[#runAsNonRoot|runAsNonRoot]]</tt> | ||
* <tt>[[#SELinux|seLinuxOptions]]</tt> | * <tt>[[#SELinux|seLinuxOptions]]</tt> | ||
Line 60: | Line 61: | ||
* <tt>[[#privileged|privileged]]</tt> | * <tt>[[#privileged|privileged]]</tt> | ||
* <tt>[[#allowPrivilegeEscalation|allowPrivilegeEscalation]]</tt> | * <tt>[[#allowPrivilegeEscalation|allowPrivilegeEscalation]]</tt> | ||
* <tt>[[ | * <tt>[[#readOnlyRootFilesystem|readOnlyRootFilesystem]]</tt> | ||
* <tt>[[ | * <tt>[[#Linux_.28Kernel.29_Capabilities|capabilities]]</tt> | ||
* <tt>[[Kubernetes_Pod_Security_Policy_Concepts#Others|procMount]]</tt> | * <tt>[[Kubernetes_Pod_Security_Policy_Concepts#Others|procMount]]</tt> | ||
* <tt>seccompProfile</tt> | * <tt>[[#Seccomp|seccompProfile]]</tt> | ||
=Pod Security Policy= | =Pod Security Policy= | ||
Line 77: | Line 78: | ||
The permissions to access files in a container are based on the User ID and Group ID. More about Discretionary Access Control is available here: {{Internal|Linux_Security_Concepts#Discretionary_Access_Control|Linux Security Concepts | Discretionary Access Control}} | The permissions to access files in a container are based on the User ID and Group ID. More about Discretionary Access Control is available here: {{Internal|Linux_Security_Concepts#Discretionary_Access_Control|Linux Security Concepts | Discretionary Access Control}} | ||
====<tt>runAsUser</tt>==== | ====<tt>runAsUser</tt>==== | ||
Both [[#Elements_Shared_by_the_Pod_Security_Context_and_Container_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>runAsUser</code>. | Specifies the UID that all processes in a container run with. It is an integer and must not be quoted in the YAML manifest. | ||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
securityContext: | |||
runAsUser: 1000 | |||
[...] | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
runAsUser: 2000 | |||
[...] | |||
</syntaxhighlight> | |||
Any files created will be owned by this UID. If not specified in any context, the container metadata [[Dockerfile#USER|USER]] directive will be used. If no USER metadata is present, the UID will default to root (0). Both [[#Elements_Shared_by_the_Pod_Security_Context_and_Container_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>runAsUser</code>. | |||
For more details on how the <code>runAsUser</code> setting influences mount point permissions, see: {{Internal|Kubernetes_Mounting_Volumes_in_Pods#Permissions|Mounting Volumes in Pods | Permissions}} | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
[...] | |||
runAsUser: | |||
rule: RunAsAny | |||
</syntaxhighlight> | |||
A special runAsUser rule is "MustRunAsNonRoot". When declared, it prevents users from deploying containers that run as root. | |||
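As a minimal sketch of the rule described above, the fragment below shows how it might be declared. The policy name <code>restricted-non-root</code> is a placeholder chosen for illustration, and the other mandatory policy fields are elided with <code>[...]</code>, following the convention used elsewhere on this page.
<syntaxhighlight lang='yaml'>
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-non-root   # hypothetical name
spec:
  [...]
  runAsUser:
    # Reject pods whose containers would run as UID 0
    rule: MustRunAsNonRoot
</syntaxhighlight>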
Also see [[#Rules_and_Constraints|Rules and Constraints]] below. More details on runAsUser pod security policy configuration here: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#users-and-groups}} | |||
====<tt>runAsGroup</tt>==== | ====<tt>runAsGroup</tt>==== | ||
Both [[#Elements_Shared_by_the_Pod_Security_Context_and_Container_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code> runAsGroup</code>. If | {{External|https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#podsecuritycontext-v1-core}} | ||
Provides the [[Linux_Security_Concepts#Primary_Group|primary group ID]] to run the entrypoint of the container process. The GID will also be reported as part of the user's groups. Any files created will be owned by this GID. It is an integer and must not be quoted in the YAML manifest. | |||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
securityContext: | |||
runAsUser: 1000 | |||
runAsGroup: 2000 | |||
[...] | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
runAsUser: 3000 | |||
runAsGroup: 4000 | |||
[...] | |||
</syntaxhighlight> | |||
If not set, the container image value is used, and if that is not set, the primary group ID of the container will be root (0). Both [[#Elements_Shared_by_the_Pod_Security_Context_and_Container_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>runAsGroup</code>. | |||
{{Note|runAsGroup cannot be specified without being accompanied by runAsUser. If only runAsGroup is used, the pod will fail to start with a "runAsGroup is specified without a runAsUser" error message.}} | |||
For more details on how the <code>runAsGroup</code> setting influences mount point permissions, see: {{Internal|Kubernetes_Mounting_Volumes_in_Pods#Permissions|Mounting Volumes in Pods | Permissions}} | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
[...] | |||
runAsGroup: | |||
rule: RunAsAny | |||
</syntaxhighlight> | |||
More details on runAsGroup pod security policy configuration here: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#users-and-groups}} | |||
====<tt>runAsNonRoot</tt>==== | |||
Although containers are mostly isolated from the host system, running their processes as root is considered bad practice. For example, when a host directory is mounted into the container, if the process running in the container is running as root, it has full access to the mounted directory. As such, it is common to prevent running a container process as root, regardless of what the container metadata configuration contains. This can be achieved by setting <code>runAsNonRoot</code> to "true". When set to "true", <code>runAsNonRoot</code> will prevent a container whose user was set to root in the container metadata from running in that configuration. Both [[#Elements_Shared_by_the_Pod_Security_Context_and_Container_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>runAsNonRoot</code>. | |||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
securityContext: | |||
runAsNonRoot: true | |||
[...] | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
runAsNonRoot: true | |||
[...] | |||
</syntaxhighlight> | |||
If <code>runAsNonRoot</code> is set to true and the container attempts to run as root, the pod will end up with a "CreateContainerConfigError" status and an error message along the lines of: | |||
<syntaxhighlight lang='text'> | |||
"Error: container has runAsNonRoot and image will run as root". | |||
</syntaxhighlight> | |||
====<tt>supplementalGroups</tt>==== | ====<tt>supplementalGroups</tt>==== | ||
<code>supplementalGroups</code> it is a [[#supplementalGroups|pod-level | {{External|https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#podsecuritycontext-v1-core}} | ||
<code>supplementalGroups</code> is a [[#Elements_Specific_to_the_Pod_Security_Context|pod-level setting]] that contains a list of groups applied to the first process run in each container, in addition to the container's primary GID. If unspecified, no groups will be added to any container. Also see: {{Internal|Linux_Security_Concepts#Supplementary_Group_List|Linux Security Concepts | Supplementary Group List}} | |||
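A minimal sketch of a pod that requests additional groups is shown below. The pod name, container name, image and the UID/GID values are placeholders chosen for illustration only.
<syntaxhighlight lang='yaml'>
apiVersion: v1
kind: Pod
metadata:
  name: supplemental-groups-example   # hypothetical name
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    # Added to the first process of every container,
    # in addition to the primary GID (2000)
    supplementalGroups: [4000, 5000]
  containers:
  - name: some-container
    image: busybox                     # placeholder image
    command: ["sleep", "3600"]
</syntaxhighlight>
With such a configuration, <code>id</code> executed in the container would be expected to list 4000 and 5000 among the process groups.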
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
[...] | |||
supplementalGroups: | |||
rule: RunAsAny | |||
</syntaxhighlight> | |||
More details on supplementalGroups pod security policy configuration here: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#users-and-groups}} | |||
==File System Access Control== | |||
====<tt>readOnlyRootFilesystem</tt>==== | |||
<code>readOnlyRootFilesystem</code> prevents processes from writing to the container's root filesystem. If set to "true", the container runs with a read-only root filesystem (i.e. no [[Docker_Concepts#Difference_Between_Containers_and_Images_-_a_Writable_Layer|writable layer]]). Mounted volumes can still be written to. This is a common security practice. <code>readOnlyRootFilesystem</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. | |||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
readOnlyRootFilesystem: true | |||
[...] | |||
</syntaxhighlight> | |||
= | This configuration can be enforced in the [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]]: | ||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
spec: | |||
readOnlyRootFilesystem: true | |||
[...] | |||
</syntaxhighlight> | |||
If the container attempts to write, it'll transition to status "CrashLoopBackOff". The cause is described in the container logs: | |||
<syntaxhighlight lang='text'> | |||
[Sat Sep 05 04:07:00.410595 2020] [core:error] [pid 1:tid 140116758865024] (30)Read-only file system: AH00099: could not create /usr/local/apache2/logs/httpd.pid | |||
</syntaxhighlight> | |||
====<tt>fsGroup</tt>==== | ====<tt>fsGroup</tt>==== | ||
{{External|https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#podsecuritycontext-v1-core}} | |||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems}} | {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems}} | ||
<code>fsGroup</code> | |||
<font color=darkgray>Define file group ownership when both runAsGroup and fsGroup are specified.</font> | |||
<code>fsGroup</code> is a [[#Elements_Specific_to_the_Pod_Security_Context|pod-level setting]] that specifies a special supplemental group ID applying to all containers in the pod. It is an integer and must not be quoted in the YAML manifest. | |||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
securityContext: | |||
fsGroup: 3333 | |||
[...] | |||
</syntaxhighlight> | |||
"id" executed from a container that belongs to a pod configured as such returns the fsGroup among its "groups": | |||
<syntaxhighlight lang='text'> | |||
# id | |||
uid=1111 gid=2222 groups=2222,3333 | |||
</syntaxhighlight> | |||
Some volume types allow the Kubelet to change the ownership of that volume, <font color=darkgray>as projected in the pod</font>, to be owned by the pod: | |||
# The owning GID will be the fsGroup | |||
# The setgid bit is set. New files created in the volume will be owned by fsGroup. | |||
# The permission bits are OR'd with rw-rw---- | |||
If not set, the Kubelet will not modify the ownership and permissions of any volume. | |||
When fsGroup is supported, the mounted volume shows that it is owned by the fsGroup group: | |||
<syntaxhighlight lang='text'> | |||
# ls -ld /data | |||
drwxrwsrwx 2 root 3333 4096 Mar 2 21:17 /data | |||
</syntaxhighlight> | |||
When a file is created inside the volume from a pod configured with fsGroup, it is owned by the user running the container process and by the fsGroup group: | |||
<syntaxhighlight lang='text'> | |||
# touch some-file | |||
# ls -l some-file | |||
-rw-r--r-- 1 1111 3333 0 Mar 2 21:29 some-file | |||
</syntaxhighlight> | |||
Note that files created outside the volumes configured with fsGroup belong to the primary group of the user. | |||
For more details on how the <code>fsGroup</code> setting influences mount point permissions, see: {{Internal|Kubernetes_Mounting_Volumes_in_Pods#Permissions|Mounting Volumes in Pods | Permissions}} | |||
Also see: {{Internal|Linux_Security_Concepts#Supplementary_Group_List|Linux Security Concepts | Supplementary Group List}} | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
[...] | |||
fsGroup: | |||
rule: RunAsAny | |||
</syntaxhighlight> | |||
For "RunAsAny", any fsGroup ID can be specified. Alternatives are: | |||
* "MustRunAs", which requires one or more "range"s. Uses the minimum value of the first range as the default. | |||
* "MayRunAs", which requires one or more "range"s. Allows fsGroups to be left unset without providing a default. Validates against all ranges if fsGroups is set. | |||
=====Volume Types that Support fsGroup===== | |||
* emptyDir | |||
* secret (note that for "secret" volumes, fsGroup has implications on how the secrets are projected into the pods; see more about this subject here: [[Kubernetes_Storage_Concepts#secret|'secret' Volumes]]) | |||
* Some volumes exposed via CSI. See https://kubernetes-csi.github.io/docs/support-fsgroup.html | |||
=====Volume Types that Do Not Support fsGroup===== | |||
For the following volumes, setting fsGroup does not have any effect: | |||
* Docker Desktop Kubernetes hostPath: it will create the files with runAsGroup, or root if runAsGroup is not set. | |||
* EKS with EFS exposed as PVs | |||
====<tt>fsGroupChangePolicy</tt>==== | ====<tt>fsGroupChangePolicy</tt>==== | ||
{{External|https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#podsecuritycontext-v1-core}} | |||
{{External| https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods}} | {{External| https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods}} | ||
<code>fsGroupChangePolicy</code> it is a [[# | {{External|https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/}} | ||
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the pod security context's [[#fsGroup|fsGroup]] when that volume is mounted. For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup. <code>fsGroupChangePolicy</code> is a [[#Elements_Specific_to_the_Pod_Security_Context|pod-level setting]] that defines how ownership and permissions of the volume are changed before it is exposed inside the pod. This field will only apply to [[#Volume_Types_that_Support_fsGroup|volume types which support fsGroup based ownership]] (and permissions). It will have no effect on ephemeral volume types such as: secret, configmaps and emptydir. Valid values are "OnRootMismatch" and "Always". If not specified, it defaults to "Always". | |||
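A minimal sketch of a pod security context that uses this setting follows. The <code>fsGroup</code> value is arbitrary and the rest of the manifest is elided with <code>[...]</code>.
<syntaxhighlight lang='yaml'>
kind: Pod
[...]
spec:
  securityContext:
    fsGroup: 3333
    # Skip the recursive ownership/permission change when the root
    # of the volume already has the expected owner and permissions
    fsGroupChangePolicy: "OnRootMismatch"
  [...]
</syntaxhighlight>
"OnRootMismatch" is the value to consider for large volumes, because it avoids walking the whole volume on every mount.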
====<tt> | |||
====<tt>allowedProcMountTypes</tt>==== | |||
==sysctls== | ==sysctls== | ||
====<tt>forbiddenSysctls</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. More details: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl}} | |||
====<tt>allowedUnsafeSysctls</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. More details: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#sysctl}} | |||
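The two sysctl elements might be combined as in the fragment below. This is a sketch only: the sysctl names are examples picked for illustration, and the other policy fields are elided with <code>[...]</code>.
<syntaxhighlight lang='yaml'>
kind: PodSecurityPolicy
[...]
spec:
  [...]
  # Pods requesting any of these sysctls are rejected
  forbiddenSysctls:
  - kernel.shm_rmid_forced
  # Unsafe sysctls that pods are nevertheless allowed to request
  allowedUnsafeSysctls:
  - kernel.msg*
</syntaxhighlight>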
==Privileged Mode== | ==Privileged Mode== | ||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged}} | {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged}} | ||
====<tt>privileged</tt>==== | ====<tt>privileged</tt>==== | ||
<code>privileged</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. | This setting allows running the container in [[Linux_Security_Concepts#Privileged_Mode|privileged mode]], meaning that the container gets full access to the node's kernel. <code>privileged</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. | ||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
privileged: true | |||
[...] | |||
</syntaxhighlight> | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
privileged: true|false | |||
[...] | |||
</syntaxhighlight> | |||
More details on privileged mode: {{Internal|Linux_Security_Concepts#Privileged_Mode|Linux Security Concepts | Privileged Mode}} | |||
====<tt>allowPrivilegeEscalation</tt>==== | ====<tt>allowPrivilegeEscalation</tt>==== | ||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privilege-escalation}} | {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privilege-escalation}} | ||
<code>allowPrivilegeEscalation</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. This setting controls whether a process can gain more privileges than its parent process. The boolean value directly controls whether the <code>no_new_privs</code> | <code>allowPrivilegeEscalation</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. This setting controls whether a process can gain more privileges than its parent process. The boolean value directly controls whether the <code>[[Linux Security Concepts#no_new_privs|no_new_privs]]</code> flag gets set on the container process. <tt>allowPrivilegeEscalation</tt> is always true when the container is run as [[#privileged|privileged]] or has the [[Linux_Capabilities#CAP_SYS_ADMIN|CAP_SYS_ADMIN]] capability. | ||
The configuration is controlled by a field with the same name in the [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]]. | |||
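A minimal sketch of a container that disables privilege escalation is shown below. The pod name, container name and image are placeholders chosen for illustration.
<syntaxhighlight lang='yaml'>
apiVersion: v1
kind: Pod
metadata:
  name: no-escalation-example   # hypothetical name
spec:
  containers:
  - name: some-container
    image: busybox               # placeholder image
    command: ["sleep", "3600"]
    securityContext:
      # Processes in this container cannot gain more privileges
      # than their parent process
      allowPrivilegeEscalation: false
</syntaxhighlight>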
====<tt>defaultAllowPrivilegeEscalation</tt>==== | ====<tt>defaultAllowPrivilegeEscalation</tt>==== | ||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. More details: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privilege-escalation}} | |||
==Linux (Kernel) Capabilities== | ==Linux (Kernel) Capabilities== | ||
Also see: {{Internal|Linux Capabilities|Linux Capabilities}} | {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#capabilities}} | ||
{{External|https://linux-audit.com/linux-capabilities-hardening-linux-binaries-by-removing-setuid/}} | |||
{{External|https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container}} | |||
Linux capabilities are a fine-grained mechanism that allows giving a container access only to the kernel features it requires, instead of giving it unlimited permissions by running it as a [[Kubernetes_Pod_and_Container_Security#Privileged_Mode|privileged]] container. Also see: {{Internal|Linux Capabilities|Linux Capabilities}} | |||
====<tt>capabilities</tt>==== | |||
This setting allows adding or dropping capabilities on a per-container basis. <code>capabilities</code> can only be set at [[#Elements_Specific_to_the_Container_Security_Context|container security context level]]. | |||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
spec: | |||
containers: | |||
- name: some-container | |||
securityContext: | |||
capabilities: | |||
add: | |||
- SYS_TIME | |||
drop: | |||
- CHOWN | |||
[...] | |||
</syntaxhighlight> | |||
{{Note|Linux kernel capabilities are usually prefixed with CAP_ (e.g. CAP_SYS_TIME). However, when specifying them in a pod specification, you must leave out the prefix: SYS_TIME.}} | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] capabilities configuration: | |||
=====<tt>allowedCapabilities</tt>===== | |||
This field defines what capabilities containers are allowed to "add" in their security context [[#capabilities|capabilities]] section. If a pod attempts to add a capability that is not listed here, the pod will be rejected. | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
allowedCapabilities: | |||
- SYS_TIME | |||
[...] | |||
</syntaxhighlight> | |||
=====<tt>defaultAddCapabilities</tt>===== | |||
This field defines what capabilities are automatically added to every container. | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
defaultAddCapabilities: | |||
- CHOWN | |||
[...] | |||
</syntaxhighlight> | |||
If the user does not want certain containers to have these capabilities, they need to explicitly drop them in the specifications of those containers. | |||
=====<tt>requiredDropCapabilities</tt>===== | |||
This field defines capabilities that are automatically dropped from every container. The PodSecurityPolicy admission controller will add them to every container's security context "drop" field. If the user tries to create a pod where they explicitly add one of the capabilities listed here, the pod will be rejected. | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
requiredDropCapabilities: | |||
- SYS_ADMIN | |||
- SYS_MODULE | |||
[...] | |||
</syntaxhighlight> | |||
==SELinux== | ==SELinux== | ||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#selinux}} | |||
More details: {{Internal|Selinux|SELinux}} | More details: {{Internal|Selinux|SELinux}} | ||
====<tt>seLinuxOptions</tt>==== | ====<tt>seLinuxOptions</tt>==== | ||
{{External|https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container}} | {{External|https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#assign-selinux-labels-to-a-container}} | ||
Both [[#Pod_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>seLinuxOptions</code>. | Both [[#Pod_Security_Context|pod security context]] and [[#Container_Security_Context|container security context]] allow declaring <code>seLinuxOptions</code>. To assign SELinux labels, the SELinux security module must be loaded on the host operating system. | ||
<syntaxhighlight lang='yaml'> | |||
kind: Pod | |||
[...] | |||
securityContext: | |||
seLinuxOptions: | |||
level: "s0:c123,c456" | |||
</syntaxhighlight> | |||
Volumes that support SELinux labeling are relabeled to be accessible by the label specified under seLinuxOptions. Usually you only need to set the level section. This sets the Multi-Category Security (MCS) label given to all containers in the pod as well as the volumes. | |||
====<tt>seLinux</tt>==== | |||
The setting is subject to the applicable [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration: | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
seLinux: | |||
rule: RunAsAny | |||
[...] | |||
</syntaxhighlight> | |||
==Seccomp== | |||
{{External|https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-seccomp-profile-for-a-container}} | |||
These settings are used to filter a process' system calls. Also see: {{Internal|Linux Security Concepts#seccomp|Secure Computing Mode (seccomp)}} | |||
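A minimal sketch of how such a filter might be requested from a pod manifest, assuming a cluster recent enough to support the <code>seccompProfile</code> security context field. The pod name, container name and image are placeholders.
<syntaxhighlight lang='yaml'>
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-example          # hypothetical name
spec:
  securityContext:
    seccompProfile:
      # Use the default profile provided by the container runtime
      type: RuntimeDefault
  containers:
  - name: some-container
    image: busybox               # placeholder image
    command: ["sleep", "3600"]
</syntaxhighlight>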
====<tt>annotations</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. More details: {{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp}} | |||
==Access to Host Namespaces== | |||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces}} | |||
The [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] defines the following configuration elements: | |||
====<tt>hostPID</tt>==== | |||
Controls whether the pod containers can share the host process ID namespace. Note that when paired with ptrace this can be used to escalate privileges outside of the container. | |||
====<tt>hostIPC</tt>==== | |||
Controls whether the pod containers can share the host IPC namespace. | |||
===Access to Host Networking and Ports=== | |||
====<tt>hostNetwork</tt>==== | |||
Controls whether the pod may use the node network namespace. Doing so gives the pod access to the loopback device, services listening on localhost and could be used to snoop on network activity of other pods on the same node. | |||
====<tt>hostPorts</tt>==== | |||
Provides a list of ranges of allowable ports in the host network namespace. It is defined as a list of HostPortRange, with min (inclusive) and max (inclusive). | |||
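The host namespace and host port elements described above might be combined as in the following fragment. It is a sketch only: the port range is arbitrary and the other policy fields are elided with <code>[...]</code>.
<syntaxhighlight lang='yaml'>
kind: PodSecurityPolicy
[...]
spec:
  [...]
  # Do not allow pods to share host namespaces
  hostPID: false
  hostIPC: false
  hostNetwork: false
  # Host ports that pods may bind, both bounds inclusive
  hostPorts:
  - min: 8000
    max: 8080
</syntaxhighlight>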
==Specification of Accepted Volume Types and File System Access Control== | |||
{{External|https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems}} | |||
====<tt>volumes</tt>==== | |||
The [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] defines which volume type users can add in their pods. At minimum, emptyDir, configMap, secret, downwardAPI and persistentVolumeClaim volumes should be allowed. | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
volumes: | |||
- emptyDir | |||
- configMap | |||
- secret | |||
- downwardAPI | |||
- persistentVolumeClaim | |||
[...] | |||
</syntaxhighlight> | |||
"*" may be used to allow all volume types. More details about volumes: {{Internal|Kubernetes_Storage_Concepts#Volume_Types|Volume Types}} | |||
====<tt>allowedHostPaths</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. It specifies a list of host paths that are allowed to be used by hostPath volumes. An empty list means no restrictions. This is defined as a list of objects with a single pathPrefix field, which allows hostPath volumes to mount a path that begins with an allowed prefix, and a readOnly field indicating it must be mounted read-only. | |||
{{Warn|There are many ways a container with unrestricted access to the host filesystem can escalate privileges, including reading data from other containers, and abusing the credentials of system services, such as Kubelet.}} | |||
Writeable hostPath directory volumes allow containers to write to the filesystem in ways that let them traverse the host filesystem outside the pathPrefix. <code>readOnly: true</code> must be used on all allowed host paths to effectively limit access to the specified pathPrefix. | |||
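A minimal sketch of a policy fragment that restricts hostPath volumes accordingly is shown below. The path <code>/var/log/app</code> is a hypothetical prefix chosen for illustration, and the other policy fields are elided with <code>[...]</code>.
<syntaxhighlight lang='yaml'>
kind: PodSecurityPolicy
[...]
spec:
  [...]
  volumes:
  - hostPath
  allowedHostPaths:
  # Allows /var/log/app and anything below it, but not /var/log/application
  - pathPrefix: "/var/log/app"   # hypothetical path
    readOnly: true               # only read-only mounts are allowed
</syntaxhighlight>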
====<tt>allowedFlexVolumes</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. | |||
==Specification of Allowed Proc Mount types== | |||
====<tt>allowedProcMountTypes</tt>==== | |||
[[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] configuration element. | |||
==Rules and Constraints== | |||
The following [[Kubernetes_Pod_Security_Policy_Concepts#PodSecurityPolicy|PodSecurityPolicy]] syntax applies to [[#runAsUser|runAsUser]], [[#runAsGroup|runAsGroup]], [[#fsGroup|fsGroup]], [[#supplementalGroups|supplementalGroups]], etc. | |||
<syntaxhighlight lang='yaml'> | |||
kind: PodSecurityPolicy | |||
[...] | |||
spec: | |||
[...] | |||
runAsUser|runAsGroup|fsGroup|supplementalGroups: | |||
rule: MustRunAs | |||
ranges: | |||
- min: 10 | |||
max: 20 | |||
- min: 50 | |||
max: 60 | |||
</syntaxhighlight> |
Latest revision as of 17:39, 9 March 2021
External
- https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
- https://kubernetes.io/docs/concepts/security/pod-security-standards/
- https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#podsecuritycontext-v1-core
- https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#securitycontext-v1-core
Internal
Overview
A container instantiated from its image by a container runtime executes by default with access control settings and privileges defined in the image metadata. For example the user and the group various container processes run under are by default specified with the USER directive in the container image. The processes in the container run by default in unprivileged mode and get by default only a limited set of Linux capabilities. The pod and container security contexts, described below, are a declarative method to modify all these run-time settings and get the containers to run with a different runtime configuration. As the name implies, all configuration elements controlled by security contexts are security sensitive. All privileges and access control settings requested by the security context are subject to verification and override by pod security policies. The cluster admin can restrict the use of the security-related features by creating one or more PodSecurityPolicy resources.
Pod Security Context
The pod security context is a pod-wide section of the pod manifest that defines privileges and access control settings for the pod and all containers running in the pod.
The pod security context holds pod-level security attributes and common container settings that apply to all containers in the pod. Some configuration elements, such as those referring to the pod's volumes, make sense at the pod level only. Other configuration elements, such as the UID or the GID containers run with, are shared with the container security contexts, and when specified in the pod security context, apply to all containers in the pod. Those fields can be overridden by the per-container security context. If the same configuration element is set in both the container security context and the pod security context, the value set in the container security context takes precedence.
kind: Pod
[...]
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    runAsNonRoot: true
    fsGroup: 2000
  [...]
Elements Specific to the Pod Security Context
Container Security Context
Each container may have its own security context definition:
kind: Pod
[...]
spec:
  containers:
  - name: some-container
    securityContext:
      runAsUser: 1000
      runAsGroup: 3000
      runAsNonRoot: true
    [...]
Elements Specific to the Container Security Context
Pod Security Policy
A pod security policy is a cluster-level API resource that specifies required values or limits for security-sensitive aspects for pod and container configurations, as configured by the pod security context and container security context. If those values are not present in the pod configuration, the pod security policy provides default values. For more details on pod security policies, see:
Privileges and Access Control Settings
The following sections document privileges and access control settings that can be set and modified with the pod security context, the container security context and pod security policies.
Discretionary Access Control
The permissions to access files in a container are based on the User ID and Group ID. More about Discretionary Access Control is available here:
runAsUser
Can be used to specify the UID that all processes in a container run with. It is an integer and must not be quoted in the YAML manifest.
kind: Pod
[...]
spec:
  securityContext:
    runAsUser: 1000
  [...]
  containers:
  - name: some-container
    securityContext:
      runAsUser: 2000
    [...]
Any files created will be owned by this UID. If not specified in any context, the container metadata USER directive will be used. If no USER metadata is present, the UID will default to root (0). Both pod security context and container security context allow declaring runAsUser.
For more details on how the runAsUser setting influences mount point permissions, see:
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  [...]
  runAsUser:
    rule: RunAsAny
A special runAsUser rule is "MustRunAsNonRoot". When declared, it prevents users from deploying containers that run as root.
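A minimal sketch of a policy that uses the MustRunAsNonRoot rule is shown below. The policy name is hypothetical, and the RunAsAny values for the other required strategies are assumptions made only to keep the example short, not recommendations.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: non-root-example        # hypothetical name
spec:
  runAsUser:
    rule: MustRunAsNonRoot      # containers that would run as UID 0 are not admitted
  # The strategies below are required fields of the policy spec.
  # RunAsAny is assumed here purely for brevity.
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```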
Also see Rules and Constraints below. More details on runAsUser pod security policy configuration here:
runAsGroup
Provides the primary group ID to run the entrypoint of the container process. The GID will also be reported as part of the user's groups. Any files created will be owned by this GID. It is an integer and must not be quoted in the YAML manifest.
kind: Pod
[...]
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
  [...]
  containers:
  - name: some-container
    securityContext:
      runAsUser: 3000
      runAsGroup: 4000
    [...]
If not set, the container image value is used, and if that is not set, the primary group ID of the container will be root (0). Both pod security context and container security context allow declaring runAsGroup.
runAsGroup cannot be specified without being accompanied by runAsUser. If only runAsGroup is used, the pod will fail to start with a "runAsGroup is specified without a runAsUser" error message.
For more details on how the runAsGroup setting influences mount point permissions, see:
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  [...]
  runAsGroup:
    rule: RunAsAny
More details on runAsGroup pod security policy configuration here:
runAsNonRoot
Although containers are mostly isolated from the host system, running their processes as root is considered bad practice. For example, when a host directory is mounted into the container and the process in the container runs as root, the process has full access to the mounted directory. As such, it is common to prevent running a container process as root, regardless of what the container metadata configuration contains. This can be achieved by setting runAsNonRoot to "true". When set to "true", runAsNonRoot prevents a container whose user was set to root in the container metadata from running in that configuration. Both pod security context and container security context allow declaring runAsNonRoot.
kind: Pod
[...]
spec:
  securityContext:
    runAsNonRoot: true
  [...]
  containers:
  - name: some-container
    securityContext:
      runAsNonRoot: true
    [...]
If runAsNonRoot is set to true and the container attempts to run as root, the pod will end up with a "CreateContainerConfigError" status and an error message along the lines of: "Error: container has runAsNonRoot and image will run as root".
supplementalGroups
supplementalGroups is a pod-level setting that contains a list of groups applied to the first process run in each container, in addition to the container's primary GID. If unspecified, no groups will be added to any container.
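As a sketch of how this setting might be declared, the pod below adds two supplemental groups on top of the primary GID. The pod name, the group IDs, the busybox image and the command are all illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: supplemental-groups-example   # hypothetical name
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000, 5000]  # added in addition to the primary GID 3000
  containers:
  - name: some-container
    image: busybox                    # assumed image; any image that provides "id" would do
    command: ["sh", "-c", "id && sleep 3600"]
```

Running "id" inside such a container should report the additional groups alongside the primary group.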
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  [...]
  supplementalGroups:
    rule: RunAsAny
More details on supplementalGroups pod security policy configuration here:
File System Access Control
readOnlyRootFilesystem
readOnlyRootFilesystem allows configuration that prevents processes from writing to the container's root filesystem. If set to "true", the containers run with a read-only root filesystem (i.e. no writable layer). Mounted volumes can still be written. This is a common security practice. readOnlyRootFilesystem can only be set at container security context level.
kind: Pod
[...]
spec:
  containers:
  - name: some-container
    securityContext:
      readOnlyRootFilesystem: true
    [...]
This configuration can be enforced in the PodSecurityPolicy:
kind: PodSecurityPolicy
spec:
  readOnlyRootFilesystem: true
  [...]
If the container attempts to write, it'll transition to status "CrashLoopBackOff". The cause is described in the container logs:
[Sat Sep 05 04:07:00.410595 2020] [core:error] [pid 1:tid 140116758865024] (30)Read-only file system: AH00099: could not create /usr/local/apache2/logs/httpd.pid
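Since mounted volumes remain writable, one possible way around a failure like the one above is to mount a writable volume over the directory the process needs. The sketch below assumes the httpd image (suggested by the Apache log excerpt) and a hypothetical pod name; other images may need additional writable paths.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: read-only-root-example             # hypothetical name
spec:
  containers:
  - name: some-container
    image: httpd                           # assumed image, matching the log excerpt above
    securityContext:
      readOnlyRootFilesystem: true         # the root filesystem stays read-only
    volumeMounts:
    - name: logs
      mountPath: /usr/local/apache2/logs   # the directory the process failed to write to
  volumes:
  - name: logs
    emptyDir: {}                           # writable scratch volume
```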
fsGroup
Define file group ownership when both runAsGroup and fsGroup are specified.
fsGroup is a pod-level setting that specifies a special supplemental group ID applying to all containers in the pod. It is an integer and must not be quoted in the YAML manifest.
kind: Pod
[...]
spec:
  securityContext:
    fsGroup: 3333
  [...]
"id" executed from a container that belongs to a pod configured as such returns the fsGroup among its "groups":
# id
uid=1111 gid=2222 groups=2222,3333
Some volume types allow the Kubelet to change the ownership of that volume, as projected in the pod, to be owned by the pod:
- The owning GID will be the fsGroup
- The setgid bit is set. New files created in the volume will be owned by fsGroup.
- The permission bits are OR'd with rw-rw----
If not set, the Kubelet will not modify the ownership and permissions of any volume.
When fsGroup is supported, the mounted volume shows that it is owned by the fsGroup group:
# ls -ld /data
drwxrwsrwx 2 root 3333 4096 Mar 2 21:17 /data
A file created inside the volume from a pod configured with fsGroup is owned by the user running the container process and by the fsGroup group:
# touch some-file
# ls -l some-file
-rw-r--r-- 1 1111 3333 0 Mar 2 21:29 some-file
Note that files created outside the volumes configured with fsGroup belong to the primary group of the user.
For more details on how the fsGroup setting influences mount point permissions, see:
Also see:
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  [...]
  fsGroup:
    rule: RunAsAny
For "RunAsAny", any fsGroup ID can be specified. Alternatives are:
- "MustRunAs", which requires one or more "range"s. Uses the minimum value of the first range as the default.
- "MayRunAs", which requires one or more "range"s. Allows fsGroup to be left unset without providing a default. Validates against all ranges if fsGroup is set.
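The range-based rules above might be declared as in the following sketch. The policy name and the 2000-2999 range are hypothetical, and the other required strategies are set to RunAsAny only to keep the example short.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: fsgroup-range-example   # hypothetical name
spec:
  fsGroup:
    rule: MayRunAs              # fsGroup may be left unset; if set, it must fall in a range
    ranges:
    - min: 2000                 # illustrative range
      max: 2999
  # Required strategies, assumed permissive for brevity.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
```

Switching the rule to MustRunAs would, per the description above, make 2000 (the minimum of the first range) the default when a pod does not set fsGroup.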
Volume Types that Support fsGroup
- emptyDir
- secret (note that for "secret" volumes, fsGroup has implications on how the secrets are projected into the pods; see more about this subject here: 'secret' Volumes)
- Some volumes exposed via CSI. See https://kubernetes-csi.github.io/docs/support-fsgroup.html
Volume Types that Do Not Support fsGroup
For the following volumes, setting fsGroup does not have any effect:
- Docker Desktop Kubernetes hostPath: files are created with runAsGroup, or with root if runAsGroup is not set.
- EKS with EFS exposed as PVs
fsGroupChangePolicy
By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the pod security context's fsGroup when that volume is mounted. For large volumes, checking and changing ownership and permissions can take a lot of time, slowing pod startup. fsGroupChangePolicy is a pod-level setting that defines how the ownership and permissions of the volume are changed before it is exposed inside the pod. This field only applies to volume types that support fsGroup-based ownership (and permissions). It has no effect on ephemeral volume types such as secret, configMap and emptyDir. Valid values are "OnRootMismatch" and "Always". If not specified, it defaults to "Always".
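A minimal sketch of a pod that opts into the cheaper behavior follows. With "OnRootMismatch", ownership and permissions are changed only when those of the volume's root directory do not already match what is expected. The pod name, image, mount path and claim name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-change-policy-example   # hypothetical name
spec:
  securityContext:
    fsGroup: 3333
    fsGroupChangePolicy: "OnRootMismatch"   # skip the recursive change when the root already matches
  containers:
  - name: some-container
    image: busybox                      # assumed image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: some-claim             # hypothetical, pre-existing PVC
```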
allowedProcMountTypes
sysctls
forbiddenSysctls
PodSecurityPolicy configuration element. More details:
allowedUnsafeSysctls
PodSecurityPolicy. More details:
Privileged Mode
privileged
This setting allows running the container in privileged mode, meaning that the container gets full access to the node's kernel. privileged can only be set at container security context level.
kind: Pod
[...]
spec:
  containers:
  - name: some-container
    securityContext:
      privileged: true
    [...]
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  privileged: true|false
  [...]
More details on privileged mode:
allowPrivilegeEscalation
allowPrivilegeEscalation can only be set at container security context level. This setting controls whether a process can gain more privileges than its parent process. The boolean value directly controls whether the no_new_privs flag gets set on the container process. allowPrivilegeEscalation is always true when the container runs as privileged or has the CAP_SYS_ADMIN capability.
The configuration is controlled by a field with the same name in the PodSecurityPolicy.
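As an illustration, a container could disable privilege escalation as in the sketch below. The pod name, image and command are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-privilege-escalation-example   # hypothetical name
spec:
  containers:
  - name: some-container
    image: busybox                        # assumed image
    command: ["sleep", "3600"]
    securityContext:
      allowPrivilegeEscalation: false     # no_new_privs gets set on the container process
```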
defaultAllowPrivilegeEscalation
PodSecurityPolicy configuration element. More details:
Linux (Kernel) Capabilities
Linux capabilities are a fine-grained mechanism that allows giving a container access only to the kernel features it requires, instead of giving it unlimited permissions by running it as a privileged container. Also see:
capabilities
This setting allows adding or dropping capabilities on a per-container basis. capabilities can only be set at container security context level.
kind: Pod
[...]
spec:
  containers:
  - name: some-container
    securityContext:
      capabilities:
        add:
        - SYS_TIME
        drop:
        - CHOWN
    [...]
Linux kernel capabilities are usually prefixed with CAP_ (e.g. CAP_SYS_TIME). However, when specifying them in a pod specification, you must leave out the prefix: SYS_TIME.
The setting is subject to the applicable PodSecurityPolicy capabilities configuration:
allowedCapabilities
This field defines what capabilities containers are allowed to "add" in their security context capabilities section. If a pod attempts to add a capability that is not listed here, the pod will be rejected.
kind: PodSecurityPolicy
[...]
spec:
  allowedCapabilities:
  - SYS_TIME
  [...]
defaultAddCapabilities
This field defines what capabilities are automatically added to every container.
kind: PodSecurityPolicy
[...]
spec:
  defaultAddCapabilities:
  - CHOWN
  [...]
If the user does not want certain containers to have these capabilities, they need to explicitly drop them in the specifications of those containers.
requiredDropCapabilities
This field defines capabilities that are automatically dropped from every container. The PodSecurityPolicy admission controller will add them to every container's security context "drop" field. If the user tries to create a pod where they explicitly add one of the capabilities listed here, the pod will be rejected.
kind: PodSecurityPolicy
[...]
spec:
  requiredDropCapabilities:
  - SYS_ADMIN
  - SYS_MODULE
  [...]
SELinux
More details:
seLinuxOptions
Both pod security context and container security context allow declaring seLinuxOptions. To assign SELinux labels, the SELinux security module must be loaded on the host operating system.
kind: Pod
[...]
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"
Volumes that support SELinux labeling are relabeled to be accessible by the label specified under seLinuxOptions. Usually you only need to set the level section. This sets the Multi-Category Security (MCS) label given to all containers in the pod as well as the volumes.
seLinux
The setting is subject to the applicable PodSecurityPolicy configuration:
kind: PodSecurityPolicy
[...]
spec:
  seLinux:
    rule: RunAsAny
  [...]
Seccomp
These settings are used to filter a process' system calls. Also see:
annotations
PodSecurityPolicy configuration element. More details:
Access to Host Namespaces
The PodSecurityPolicy defines the following configuration elements:
hostPID
Controls whether the pod containers can share the host process ID namespace. Note that when paired with ptrace this can be used to escalate privileges outside of the container.
hostIPC
Controls whether the pod containers can share the host IPC namespace.
Access to Host Networking and Ports
hostNetwork
Controls whether the pod may use the node network namespace. Doing so gives the pod access to the loopback device, services listening on localhost and could be used to snoop on network activity of other pods on the same node.
hostPorts
Provides a list of ranges of allowable ports in the host network namespace. It is defined as a list of HostPortRange, with min (inclusive) and max (inclusive).
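The host namespace, networking and port controls described above might be combined as in the following sketch. The policy name and the port range are hypothetical, and the remaining required strategies are set to RunAsAny only to keep the example short.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-host-access-example   # hypothetical name
spec:
  hostPID: false        # pods may not share the host process ID namespace
  hostIPC: false        # pods may not share the host IPC namespace
  hostNetwork: false    # pods may not use the node network namespace
  hostPorts:
  - min: 10000          # illustrative range, both ends inclusive
    max: 11000
  # Required strategies, assumed permissive for brevity.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```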
Specification of Accepted Volume Types and File System Access Control
volumes
The PodSecurityPolicy defines which volume type users can add in their pods. At minimum, emptyDir, configMap, secret, downwardAPI and persistentVolumeClaim volumes should be allowed.
kind: PodSecurityPolicy
[...]
spec:
  volumes:
  - emptyDir
  - configMap
  - secret
  - downwardAPI
  - persistentVolumeClaim
  [...]
"*" may be used to allow all volume types. More details about volumes:
allowedHostPaths
PodSecurityPolicy configuration element. It specifies a list of host paths that are allowed to be used by hostPath volumes. An empty list means no restrictions. This is defined as a list of objects with a single pathPrefix field, which allows hostPath volumes to mount a path that begins with an allowed prefix, and a readOnly field indicating it must be mounted read-only.
There are many ways a container with unrestricted access to the host filesystem can escalate privileges, including reading data from other containers, and abusing the credentials of system services, such as Kubelet.
Writable hostPath directory volumes allow containers to write to the filesystem in ways that let them traverse the host filesystem outside the pathPrefix. readOnly: true must be used on all allowed host paths to effectively limit access to the specified pathPrefix.
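A sketch of a policy that restricts hostPath volumes to a single read-only prefix is shown below. The policy name and the /var/log prefix are hypothetical choices for illustration, and the remaining required strategies are set to RunAsAny for brevity.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allowed-host-paths-example   # hypothetical name
spec:
  volumes:
  - hostPath                         # hostPath volumes must be allowed for the list below to matter
  allowedHostPaths:
  - pathPrefix: /var/log             # hypothetical prefix; paths beginning with it are accepted
    readOnly: true                   # only read-only mounts are allowed
  # Required strategies, assumed permissive for brevity.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```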
allowedFlexVolumes
PodSecurityPolicy configuration element.
Specification of Allowed Proc Mount types
allowedProcMountTypes
PodSecurityPolicy configuration element.
Rules and Constraints
The following PodSecurityPolicy syntax applies to runAsUser, runAsGroup, fsGroup, supplementalGroups, etc.
kind: PodSecurityPolicy
[...]
spec:
  [...]
  runAsUser|runAsGroup|fsGroup|supplementalGroups:
    rule: MustRunAs
    ranges:
    - min: 10
      max: 20
    - min: 50
      max: 60
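To make the generic syntax concrete, the sketch below applies the same ranges to runAsUser only. The policy name is hypothetical, and the other required strategies are set to RunAsAny for brevity.

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: uid-range-example   # hypothetical name
spec:
  runAsUser:
    rule: MustRunAs
    ranges:                 # both ends of each range are inclusive
    - min: 10
      max: 20
    - min: 50
      max: 60
  # Required strategies, assumed permissive for brevity.
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```

Under such a policy, a pod declaring runAsUser: 15 or runAsUser: 55 would fall inside the allowed ranges, while runAsUser: 30 would be rejected.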