1. What is an admission controller

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized.

access control overview

The controllers are compiled into the kube-apiserver binary, and may only be configured by the cluster administrator. There are two special controllers: MutatingAdmissionWebhook and ValidatingAdmissionWebhook. These execute the mutating and validating (respectively) admission control webhooks which are configured in the API.

Admission controllers limit requests to create, delete, modify objects or connect to proxy. They do not limit requests to read objects.

The admission control process proceeds in two phases. In the first phase, mutating admission controllers are run. In the second phase, validating admission controllers are run. Note again that some of the controllers are both.

If any of the controllers in either phase reject the request, the entire request is rejected immediately and an error is returned to the end-user.

1.1. Turn on/off an admission controller

The Kubernetes API server flag enable-admission-plugins takes a comma-delimited list of admission control plugins to invoke prior to modifying objects in the cluster.

kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger ...

The Kubernetes API server flag disable-admission-plugins takes a comma-delimited list of admission control plugins to be disabled, even if they are in the list of plugins enabled by default.

kube-apiserver --disable-admission-plugins=PodNodeSelector,AlwaysDeny ...

To see which admission plugins are enabled by default:

kube-apiserver -h | grep enable-admission-plugins
$ docker run --rm -it k8s.gcr.io/kube-apiserver:v1.22.3 kube-apiserver -h | grep enable-admission-plugins
      --enable-admission-plugins strings       admission plugins that should be enabled in addition to default enabled ones (NamespaceLifecycle, LimitRanger, ServiceAccount, TaintNodesByCondition, PodSecurity, Priority, DefaultTolerationSeconds, DefaultStorageClass, StorageObjectInUseProtection, PersistentVolumeClaimResize, RuntimeClass, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, ResourceQuota). Comma-delimited list of admission plugins: AlwaysAdmit, AlwaysDeny, AlwaysPullImages, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, DefaultStorageClass, DefaultTolerationSeconds, DenyServiceExternalIPs, EventRateLimit, ExtendedResourceToleration, ImagePolicyWebhook, LimitPodHardAntiAffinityTopology, LimitRanger, MutatingAdmissionWebhook, NamespaceAutoProvision, NamespaceExists, NamespaceLifecycle, NodeRestriction, OwnerReferencesPermissionEnforcement, PersistentVolumeClaimResize, PersistentVolumeLabel, PodNodeSelector, PodSecurity, PodSecurityPolicy, PodTolerationRestriction, Priority, ResourceQuota, RuntimeClass, SecurityContextDeny, ServiceAccount, StorageObjectInUseProtection, TaintNodesByCondition, ValidatingAdmissionWebhook. The order of plugins in this flag does not matter.

1.2. Dynamic Admission Control

In addition to compiled-in admission plugins, admission plugins can be developed as extensions and run as webhooks configured at runtime.

Admission webhooks are HTTP callbacks that receive admission requests and do something with them. You can define both validating admission webhook and mutating admission webhook admission webhooks.

The webhook handles the AdmissionReview request sent by the apiservers, and sends back its decision as an AdmissionReview object in the same version it received.

Mutating admission webhooks are invoked first, and can modify objects sent to the API server to enforce custom defaults. After all object modifications are complete, and after the incoming object is validated by the API server, validating admission webhooks are invoked and can reject requests to enforce custom policies.

You can dynamically configure what resources are subject to what admission webhooks via ValidatingWebhookConfiguration or MutatingWebhookConfiguration.

You can use the follow commands to inspect details about each config field:

$ kubectl explain mutatingwebhookconfigurations
$ kubectl explain validatingwebhookconfigurations

The following is an example ValidatingWebhookConfiguration, a mutating webhook configuration is similar.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: "pod-policy.kube-admission.io"
webhooks:
- name: "pod-policy.kube-admission.io"
  rules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE"]
      resources:   ["pods"]
      scope:       "Namespaced"
  clientConfig:
    caBundle: LS0....
    service:
      namespace: "default"
      name: "kube-admission"
      path: /always-allow-delay-5s
  admissionReviewVersions: ["v1"]
  sideEffects: None
  timeoutSeconds: 10
Note: When using clientConfig.service, the server cert must be valid for <svc_name>.<svc_namespace>.svc.

Besides, there’s a sample of admission controller at my GitHub: https://github.com/qqbuby/sample-kube-admission-controller.

2. Pod and Container Security Context

Principle of Least Privilege

In information security, computer science, and other fields, the principle of least privilege (PoLP), also known as the principle of minimal privilege or the principle of least authority, requires that in a particular abstraction layer of a computing environment, every module (such as a process, a user, or a program, depending on the subject) must be able to access only the information and resources that are necessary for its legitimate purpose.

Benefits of the principle include:

  • Better system stability.

    When code is limited in the scope of changes it can make to a system, it is easier to test its possible actions and interactions with other applications. In practice for example, applications running with restricted rights will not have access to perform operations that could crash a machine, or adversely affect other applications running on the same system.

  • Better system security.

    When code is limited in the system-wide actions it may perform, vulnerabilities in one application cannot be used to exploit the rest of the machine. For example, Microsoft states “Running in standard user mode gives customers increased protection against inadvertent system-level damage caused by "shatter attacks" and malware, such as root kits, spyware, and undetectable viruses”.

  • Ease of deployment.

    In general, the fewer privileges an application requires, the easier it is to deploy within a larger environment. This usually results from the first two benefits, applications that install device drivers or require elevated security privileges typically have additional steps involved in their deployment. For example, on Windows a solution with no device drivers can be run directly with no installation, while device drivers must be installed separately using the Windows installer service in order to grant the driver elevated privileges.

A security context defines privilege and access control settings for a Pod or Container. Security context settings include, but are not limited to:

  • Discretionary Access Control: Permission to access an object, like a file, is based on user ID (UID) and group ID (GID).

  • Security Enhanced Linux (SELinux): Objects are assigned security labels.

  • Running as privileged or unprivileged.

  • Linux Capabilities: Give a process some privileges, but not all the privileges of the root user.

  • AppArmor: Use program profiles to restrict the capabilities of individual programs.

  • Seccomp: Filter a process’s system calls.

  • AllowPrivilegeEscalation: Controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag gets set on the container process.

    AllowPrivilegeEscalation is true always when the container is:

    1) run as Privileged

    OR

    2) has CAP_SYS_ADMIN.

  • readOnlyRootFilesystem: Mounts the container’s root filesystem as read-only.

For more information about security mechanisms in Linux, see Overview of Linux Kernel Security Features.

2.1. Set the security context for a Pod

To specify security settings for a Pod, include the securityContext field in the Pod specification.

The securityContext field is a PodSecurityContext object.

The security settings that you specify for a Pod apply to all Containers in the Pod.

pods/security/security-context.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  containers:
  - name: sec-ctx-demo
    image: busybox:1
    stdin: true
    tty: true
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
    securityContext:
      allowPrivilegeEscalation: false

In the configuration file, the runAsUser field specifies that for any Containers in the Pod, all processes run with user ID 1000. The runAsGroup field specifies the primary group ID of 3000 for all processes within any containers of the Pod. If this field is omitted, the primary group ID of the containers will be root(0). Any files created will also be owned by user 1000 and group 3000 when runAsGroup is specified. Since fsGroup field is specified, all processes of the container are also part of the supplementary group ID 2000. The owner for volume /data/demo and any files created in that volume will be Group ID 2000.

$ kubectl apply -f pods/security/security-context.yaml

$ kubectl exec -it security-context-demo -- sh
/ $ id
uid=1000 gid=3000 groups=2000

/ $ ls -l /data
total 4
drwxrwsrwx    2 root     2000          4096 Dec 16 09:14 demo

/ $ touch /data/demo/testfile

/ $ ls -l /data/demo/testfile
-rw-r--r--    1 1000     2000             0 Dec 16 09:15 /data/demo/testfile

/ $ stat /data/demo/
  File: /data/demo/
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d	Inode: 3539320     Links: 2
Access: (2777/drwxrwsrwx)  Uid: (    0/    root)   Gid: ( 2000/ UNKNOWN)
<...>

/ $ cat /etc/passwd
root:x:0:0:root:/root:/bin/sh
<...>
www-data:x:33:33:www-data:/var/www:/bin/false
operator:x:37:37:Operator:/var:/bin/false
nobody:x:65534:65534:nobody:/home:/bin/false

/ $ cat /etc/group
root:x:0:
<...>
nobody:x:65534:

/ $ exit

2.2. Set the security context for a Container

To specify security settings for a Container, include the securityContext field in the Container manifest.

The securityContext field is a SecurityContext object.

Security settings that you specify for a Container apply only to the individual Container, and they override settings made at the Pod level when there is overlap.

Container settings do not affect the Pod’s Volumes.

pods/security/security-context-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo-2
spec:
  securityContext:
    runAsUser: 1000
  containers:
  - name: sec-ctx-demo-2
    image: busybox:1
    stdin: true
    tty: true
    securityContext:
      runAsUser: 2000
      allowPrivilegeEscalation: false
$ kubectl apply -f pods/security/security-context-2.yaml

$ kubectl exec -it security-context-demo-2 -- sh
/ $ id
uid=2000 gid=0(root)

/ $ exit

2.3. Set capabilities for a Container

With Linux capabilities, you can grant certain privileges to a process without granting all the privileges of the root user. To add or remove Linux capabilities for a Container, include the capabilities field in the securityContext section of the Container manifest.

First, see what happens when you don’t include a capabilities field.

pods/security/security-context-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo-3
spec:
  containers:
  - name: sec-ctx-3
    image: k8s.gcr.io/echoserver:1.10
    ports:
    - containerPort: 8080
$ kubectl exec -it security-context-demo-3 -- sh

# id
uid=0(root) gid=0(root) groups=0(root)

# cat /proc/1/status | grep Cap
CapInh:	00000000a80425fb
CapPrm:	00000000a80425fb
CapEff:	00000000a80425fb
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000

# exit

Next, run a Container that is the same as the preceding container, except that it has additional capabilities set.

pods/security/security-context-4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo-4
spec:
  containers:
  - name: sec-ctx-4
    image: k8s.gcr.io/echoserver:1.10
    ports:
    - containerPort: 8080
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_TIME"]
$ kubectl exec -it security-context-demo-4 -- sh

# id
uid=0(root) gid=0(root) groups=0(root)

# cat /proc/1/status | grep Cap
CapInh:	00000000aa0435fb
CapPrm:	00000000aa0435fb
CapEff:	00000000aa0435fb
CapBnd:	00000000aa0435fb
CapAmb:	0000000000000000

# exit

Compare the capabilities of the two Containers:

00000000a80425fb
00000000aa0435fb

In the capability bitmap of the first container, bits 12 and 25 are clear. In the second container, bits 12 and 25 are set. Bit 12 is CAP_NET_ADMIN, and bit 25 is CAP_SYS_TIME. See capability.h for definitions of the capability constants.

Linux capability constants have the form CAP_XXX. But when you list capabilities in your Container manifest, you must omit the CAP_ portion of the constant. For example, to add CAP_SYS_TIME, include SYS_TIME in your list of capabilities.

2.4. Clean up

Delete the Pod:

kubectl delete pod security-context-demo
kubectl delete pod security-context-demo-2
kubectl delete pod security-context-demo-3
kubectl delete pod security-context-demo-4

3. What is a Pod Security Policy?

Kubernetes has officially deprecated PodSecurityPolicy in version 1.21. PodSecurityPolicy will be shut down in version 1.25.

PodSecurityPolicy is being replaced by a new, simplified PodSecurity admission controller.

PodSecurityPolicy is a built-in admission controller that allows a cluster administrator to control security-sensitive aspects of the Pod specification.

A PodSecurityPolicy is a built-in admission controller that allows a cluster administrator to control security-sensitive aspects of the Pod specification to create and update Pods on your cluster.

In most Kubernetes clusters, RBAC (Role-Based Access Control) rules control access to these resources. list, get, create, edit, and delete are the sorts of API operations that RBAC cares about, but RBAC does not consider what settings are being put into the resources it controls.

To control what sorts of settings are allowed in the resources defined in your cluster, you need Admission Control in addition to RBAC.

Kubernetes SIG Security, SIG Auth, and a diverse collection of other community members have been working together for months to ensure that what’s coming next is going to be awesome. We have developed a Kubernetes Enhancement Proposal (KEP 2579) and a prototype for a new feature, currently being called by the temporary name "PSP Replacement Policy."

If your use of PSP is relatively simple, with a few policies and straightforward binding to service accounts in each namespace, you will likely find PSP Replacement Policy to be a good match for your needs. Evaluate your PSPs compared to the Kubernetes Pod Security Standards to get a feel for where you’ll be able to use the Restricted, Baseline, and Privileged policies. Please follow along with or contribute to the KEP and subsequent development, and try out the Alpha release of PSP Replacement Policy when it becomes available.

# policy/privileged-psp.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - '*'
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
# policy/restricted-psp.yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName:  'runtime/default'
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  # Allow core volume types.
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    # Assume that ephemeral CSI drivers & persistentVolumes set up by the cluster admin are safe to use.
    - 'csi'
    - 'persistentVolumeClaim'
    - 'ephemeral'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require the container to run without root privileges.
    rule: 'MustRunAsNonRoot'
  seLinux:
    # This policy assumes the nodes are using AppArmor rather than SELinux.
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      # Forbid adding the root group.
      - min: 1
        max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      # Forbid adding the root group.
      - min: 1
        max: 65535
  readOnlyRootFilesystem: false

3.1. Policy Order

In addition to restricting pod creation and update, pod security policies can also be used to provide default values for many of the fields that it controls. When multiple policies are available, the pod security policy controller selects policies according to the following criteria:

  • PodSecurityPolicies which allow the pod as-is, without changing defaults or mutating the pod, are preferred. The order of these non-mutating PodSecurityPolicies doesn’t matter.

  • If the pod must be defaulted or mutated, the first PodSecurityPolicy (ordered by name) to allow the pod is selected.

During update operations (during which mutations to pod specs are disallowed) only non-mutating PodSecurityPolicies are used to validate the pod.

3.2. Enabling Pod Security Policies

Pod security policy control is implemented as an optional admission controller. PodSecurityPolicies are enforced by enabling the admission controller, but doing so without authorizing any policies will prevent any pods from being created in the cluster.

  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
# ...
    - --enable-admission-plugins=NodeRestriction,PodSecurityPolicy
# ...
$ kubectl create ns psp-test
namespace/psp-test created

$ kubectl create rolebinding -n psp-test default:edit --clusterrole edit --serviceaccount psp-test:default
rolebinding.rbac.authorization.k8s.io/default:edit created

$ kubectl --as system:serviceaccount:psp-test:default create -n psp-test -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pause
spec:
  containers:
    - name: pause
      image: k8s.gcr.io/pause
EOF
Error from server (Forbidden): error when creating "STDIN": pods "pause" is forbidden: PodSecurityPolicy: unable to admit pod: []

$ kubectl delete ns psp-test
namespace "psp-test" deleted

3.3. Authorizing Policies

When a PodSecurityPolicy resource is created, it does nothing. In order to use it, the requesting user or target pod’s service account must be authorized to use the policy, by allowing the use verb on the policy.

Most Kubernetes pods are not created directly by users. Instead, they are typically created indirectly as part of a Deployment, ReplicaSet, or other templated controller via the controller manager. Granting the controller access to the policy would grant access for all pods created by that controller, so the preferred method for authorizing policies is to grant access to the pod’s service account.

RBAC is a standard Kubernetes authorization mode, and can easily be used to authorize use of policies.

First, a Role or ClusterRole needs to grant access to use the desired policies. The rules to grant access look like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <role name>
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  verbs:     ['use']
  resourceNames:
  - <list of policies to authorize>

Then the (Cluster)Role is bound to the authorized user(s):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: <binding name>
roleRef:
  kind: ClusterRole
  name: <role name>
  apiGroup: rbac.authorization.k8s.io
subjects:
# Authorize all service accounts in a namespace (recommended):
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:<authorized namespace>
# Authorize specific service accounts (not recommended):
- kind: ServiceAccount
  name: <authorized service account name>
  namespace: <authorized pod namespace>
# Authorize specific users (not recommended):
- kind: User
  apiGroup: rbac.authorization.k8s.io
  name: <authorized user name>

If a RoleBinding (not a ClusterRoleBinding) is used, it will only grant usage for pods being run in the same namespace as the binding. This can be paired with system groups to grant access to all pods run in the namespace:

# Authorize all service accounts in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts
# Or equivalently, all authenticated users in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:authenticated

3.4. kube-psp-advisor

Kubernetes Pod Security Policy Advisor (a.k.a kube-psp-advisor) is an opensource tool from Sysdig. kube-psp-advisor scans the existing security context from Kubernetes resources like deployments, daementsets, replicasets, etc taken as the reference model we want to enforce and then automatically generates the Pod Security Policy for all the resources in the entire cluster.

$ kubectl krew install advise-psp
Updated the local copy of plugin index.
Installing plugin: advise-psp
Installed plugin: advise-psp
\
 | Use this plugin:
 | 	kubectl advise-psp
 | Documentation:
 | 	https://github.com/sysdiglabs/kube-psp-advisor
/
WARNING: You installed plugin "advise-psp" from the krew-index plugin repository.
   These plugins are not audited for security by the Krew maintainers.
   Run them at your own risk.

$ kubectl advise-psp inspect --namespace default --report
{
  "podSecuritySpecs": {
    "hostIPC": [],
    "hostNetwork": [],
    "hostPID": []
  },
  "podVolumeTypes": {
...

3.5. Example

$ kubectl apply -f - <<EOF
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: psp-hostpath
spec:
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
EOF
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/psp-hostpath created

$ kubectl apply -f - <<EOF
> apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp:hostpath
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs:     ['use']
    resourceNames:
      - psp-hostpath
EOF
clusterrole.rbac.authorization.k8s.io/psp:hostpath unchanged

$ kubectl create ns psp-test
namespace/psp-test created

$ kubectl create rolebinding -n psp-test edit --clusterrole edit --serviceaccount psp-test:default
rolebinding.rbac.authorization.k8s.io/edit created

$ kubectl create rolebinding -n psp-test psp:hostpath --clusterrole psp:hostpath --serviceaccount psp-test:default
rolebinding.rbac.authorization.k8s.io/psp:hostpath created

$ kubectl apply -n psp-test --as system:serviceaccount:psp-test:default -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pause
spec:
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.6
EOF
pod/pause created

$ kubectl apply -n psp-test --as system:serviceaccount:psp-test:default -f - <<EOF
> apiVersion: v1
kind: Pod
metadata:
  name: hostpath
spec:
  containers:
    - name: pause
      image: k8s.gcr.io/pause:3.6
  volumes:
    - name: hostpath
      hostPath:
        path: /tmp
EOF
Error from server (Forbidden): error when creating "STDIN": pods "hostpath" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

$ kubectl delete ns psp-test
namespace "psp-test" deleted

$ kubectl delete psp psp-hostpath
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy "psp-hostpath" deleted

$ kubectl delete clusterrole psp:hostpath
clusterrole.rbac.authorization.k8s.io "psp:hostpath" deleted

4. Pod Security Admission Controller

FEATURE STATE: Kubernetes v1.23 [beta]

The Kubernetes Pod Security Standards define different isolation levels for Pods. These standards let you define how you want to restrict the behavior of pods in a clear, consistent fashion.

Kubernetes offers a built-in Pod Security admission controller, the successor to PodSecurityPolicies.

Pod security restrictions are applied at the namespace level when pods are created.