Giant Swarm's farewell to PSPs

Aug 25, 2022

Pod Security Policies (PSPs) are deprecated and will be removed in Kubernetes version 1.25. Obviously, our customers still want to have security controls for pods in our clusters, so we are replacing the functionality with Kyverno policies, which map to the official Pod Security Standards. The actual policies are very similar to PSPs, covering things like not running as a root user and dropping unneeded capabilities. The largest change for Giant Swarm will be how exceptions are handled. 

Historical context

PSPs were introduced in Kubernetes 1.5 to limit the damage that could be caused by malicious or compromised pods. PSPs work as an admission controller built into the API server, which enforces and, in some cases, defaults certain properties of a pod.

As Kubernetes has evolved and seen wider adoption, and since PSPs never officially left beta, the maintainers decided the PSP subsystem was no longer sufficient for meeting Kubernetes security goals and decided to remove it. It has been deprecated since version 1.21 and will be removed entirely in 1.25.

As a replacement, upstream Kubernetes has created an in-tree admission controller called Pod Security Admission (PSA), which audits and enforces three levels of configuration: privileged (no restrictions), baseline (best practices), and restricted (highest security). These levels are called Pod Security Standards (PSS). In short, the PSA controller applies the PSS to a pod.

The major limitation for Giant Swarm in adopting PSA is that it works only on pods and is currently configurable only for entire namespaces. It is not possible, using PSA, to allow an exception for one workload within a given namespace. Similarly, it is not possible to give a workload only one additional privilege within one of the three PSS levels.

This means, for example, that if a customer is using the restricted policy level but wants to deploy one application with higher privileges (for example, a log shipper that needs access to a sensitive path on the node), the customer has to move that application into its own separate namespace with a less restrictive policy level. To allow a pod to have a HostPath volume type, for example, that pod would need to run in a namespace with the privileged policy level, which enforces no security controls at all, because the next level, baseline, prohibits that type of volume.
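
The effect of this namespace-wide granularity can be illustrated with a small Python model. This is only a sketch, not how PSA is implemented: the `FORBIDDEN` table is a hand-picked, incomplete subset of the Pod Security Standards, and the feature names and the function `strictest_level_allowing` are hypothetical.

```python
# Toy model of PSA's namespace-wide levels (illustrative only).
# Levels are ordered from least to most restrictive.
LEVELS = ["privileged", "baseline", "restricted"]

# Assumption: a simplified, incomplete subset of what each level forbids.
FORBIDDEN = {
    "privileged": set(),
    "baseline": {"hostPath", "hostNetwork", "privileged"},
    "restricted": {"hostPath", "hostNetwork", "privileged", "runAsRoot"},
}

def strictest_level_allowing(features):
    """Return the most restrictive level that permits every requested feature."""
    wanted = set(features)
    # 'privileged' forbids nothing, so a match is always found.
    return next(lvl for lvl in reversed(LEVELS) if not wanted & FORBIDDEN[lvl])

# A log shipper that mounts a node path forces its namespace wide open.
print(strictest_level_allowing({"hostPath"}))   # privileged
print(strictest_level_allowing({"runAsRoot"}))  # baseline
print(strictest_level_allowing(set()))          # restricted
```

Because the level applies to the whole namespace, every other pod in the log shipper's namespace would also run without any enforced controls.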

There have been feature requests to improve the granularity of PSA, but there is no active work in this direction, and the implementation of more granular controls has been left to external admission controllers. PSA may someday be usable for our customers and us, but not in its current form.

Why Kyverno

Because PSA does not meet our needs, we have to use an external admission controller to replace PSPs. There are two main players among the out-of-tree alternatives: OPA Gatekeeper and Kyverno.

We tried and disliked Gatekeeper in the past. More importantly, the design philosophy behind Kyverno aligns better with other areas of the Giant Swarm offering. Kyverno is built specifically for Kubernetes, is configured by custom resources, and similarly outputs its findings back into the cluster as native resources. An upstream working group is standardizing the format and behavior of the policy reports as an in-tree resource type. Kyverno was one of the first adopters of this format, and its maintainers continue to drive the effort.

Gatekeeper, on the other hand, uses the Rego policy language. It is a general-purpose policy language not specific to Kubernetes, and one we’ve found unpleasant to work with in the past. We expect to get PSS policies ‘out of the box’ from either Kyverno or Gatekeeper. When it comes to creating custom policies and managing the provided ones, our perception is that Kyverno policies involve less developer overhead.

So, we’ve decided to replace PSPs with PSS, enforced by Kyverno policies.

Kyverno vs. PSPs

In terms of what they actually enforce at a pod level, PSPs and PSS are very similar. PSS is slightly more demanding because it requires some security controls to be explicitly set where previously they may have been defaulted or ignored.

Differences we’ve encountered

Many of our customers use a workflow aligned with the underlying PSS assumption that cluster policies are tightly controlled by cluster admins. Usually, individual dev teams don’t have the ability to deploy their own PSP with whatever privileges they want. 

That being said, Giant Swarm itself makes extreme use of PSPs. In that regard, the adoption may actually be more painful (or at least less intuitive) for Giant Swarm than for our customers.

We anticipate that the biggest difference between using PSS and using PSPs will be how policy exceptions are handled. Under PSPs, each project specified its own PSP manifest and used it via RBAC bindings. Under PSS, policies are centralized and applied cluster-wide.

Another way to think about it is: instead of one security policy (file) per pod containing all rules (Giant Swarm PSPs), we will now have one security policy (file) per rule affecting all pods. Each rule will be applied to every incoming pod. There is not yet a native mechanism to allow a single pod to exempt itself from a policy, so all exceptions must be included in the central policy. This means that all Giant Swarm pods will either need to be fully compliant with the enforced policies, or be added to a central list of exceptions. We are exploring several ways around that.
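
The "one policy per rule, with a central exception list" model can be sketched in a few lines of Python. This is an illustration of the idea only, not Kyverno's API or data model: the `Rule` class, the `admit` function, the flattened pod dictionaries, and the pod and namespace names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    """One centrally managed policy: a single check plus its exception list."""
    name: str
    violates: Callable[[dict], bool]              # True when the pod breaks the rule
    exceptions: set = field(default_factory=set)  # (namespace, pod name) pairs

def admit(pod, rules):
    """Return the names of all rules the pod violates without an exception."""
    key = (pod["namespace"], pod["name"])
    return [r.name for r in rules if key not in r.exceptions and r.violates(pod)]

# Every rule is evaluated for every pod; exceptions live with the rule,
# not with the workload.
RULES = [
    Rule("disallow-host-path", lambda p: p.get("hostPath", False),
         exceptions={("logging", "log-shipper")}),
    Rule("disallow-privileged-containers", lambda p: p.get("privileged", False)),
]

shipper = {"namespace": "logging", "name": "log-shipper", "hostPath": True}
rogue = {"namespace": "default", "name": "rogue", "hostPath": True, "privileged": True}

print(admit(shipper, RULES))  # []
print(admit(rogue, RULES))    # ['disallow-host-path', 'disallow-privileged-containers']
```

Note that the exception is granted per rule: the log shipper is exempt from the host path check but would still be rejected if it also ran privileged.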

Policy levels

There are three levels of protection prescribed by Pod Security Standards and enforceable with Kyverno:

Privileged

The privileged standard contains no policies and enforces no security controls. It is the 'no-op'/wide open setting used for opting resources out of enforcement.

Baseline

The baseline standard is a set of best practices, including:

  • disallow-capabilities - includes a list of permitted capabilities, and rejects pods with any capabilities not in the list.
  • disallow-host-namespaces - rejects pods that use host networking, host IPC, or host PID.
  • disallow-host-path - rejects pods that use HostPath volumes.
  • disallow-host-ports - rejects pods that use host ports.
  • disallow-host-process - rejects Windows pods that allow privileged access to host processes.
  • disallow-privileged-containers - rejects pods that run in privileged mode, i.e. privileged: true.
  • disallow-proc-mount - rejects pods that attempt to set any non-default procMount.
  • disallow-selinux - rejects pods that set non-standard SELinux types, users, or roles.
  • restrict-apparmor-profiles - rejects pods that set a non-default AppArmor profile.
  • restrict-seccomp - rejects pods that set a non-default Seccomp profile.
  • restrict-sysctls - rejects pods that set sysctls aside from a pre-approved list.

Restricted

The restricted standard extends the baseline standard to be more restrictive. These policies include:

  • disallow-capabilities-strict - like disallow-capabilities, but requires pods to explicitly drop all capabilities and allows only the NET_BIND_SERVICE capability to be set.
  • disallow-privilege-escalation - rejects pods that do not set allowPrivilegeEscalation: false, which prevents processes from gaining privileges at runtime, for example via set-user-ID or set-group-ID binaries.
  • require-run-as-non-root-user - rejects pods that set a numeric user ID of 0 (root) via runAsUser.
  • require-run-as-nonroot - rejects pods that do not explicitly set runAsNonRoot: true.
  • restrict-seccomp-strict - like restrict-seccomp, but requires an approved Seccomp profile to be set.
  • restrict-volume-types - rejects pods that use volume types other than an approved list.
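
To make the "explicitly set" requirement concrete, here is a minimal Python sketch that checks a pod spec, given as a plain dictionary, against four of the restricted rules listed above. It is a simplification under stated assumptions, not the Kyverno policies themselves: the function `restricted_violations` is hypothetical, and it ignores init and ephemeral containers, Seccomp, and volume types.

```python
def restricted_violations(pod_spec):
    """Return (container, rule) pairs for a few restricted-level rules."""
    found = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container["name"]
        sc = container.get("securityContext", {})

        # disallow-privilege-escalation: must be explicitly set to false.
        if sc.get("allowPrivilegeEscalation") is not False:
            found.append((name, "disallow-privilege-escalation"))

        # require-run-as-nonroot: true on the container, or inherited from the pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            found.append((name, "require-run-as-nonroot"))

        # require-run-as-non-root-user: an explicit user ID of 0 is rejected.
        if sc.get("runAsUser", pod_sc.get("runAsUser")) == 0:
            found.append((name, "require-run-as-non-root-user"))

        # disallow-capabilities-strict: drop ALL, add at most NET_BIND_SERVICE.
        caps = sc.get("capabilities", {})
        extra = set(caps.get("add", [])) - {"NET_BIND_SERVICE"}
        if "ALL" not in caps.get("drop", []) or extra:
            found.append((name, "disallow-capabilities-strict"))
    return found

compliant = {
    "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        },
    }],
}
bare = {"containers": [{"name": "app"}]}

print(restricted_violations(compliant))  # []
print(restricted_violations(bare))
```

The `bare` pod sets nothing insecure, yet it fails three of the four checks simply because the fields are missing. This is the sense in which the restricted level is more demanding than what many PSP setups required.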

As expert users of Kubernetes, Giant Swarm will adopt the restricted standard. Exact policy details are available in the policies themselves.

Side notes

  • We are exploring the possibility of allowing a pod to opt itself out of policies via labels/annotations. This is not yet supported upstream, but would make it possible to run a non-compliant pod without adding it to an exception list.

  • For certain policies it's possible to mutate the incoming resource to comply with the policy. We do not recommend starting with this approach because:

    1. We have no guarantee that a particular workload will actually function correctly after mutation. 
    2. This is not a long-term solution, especially in GitOps environments where the source of truth should be the backing repository.

  • Working with policy violations: there are several ways for users and cluster admins to get information about failing policies. This is out of scope for this blog post. Keep your eye on this space, as we will follow up with a post that expands on this topic.

Conclusion

The cloud native ecosystem is a dynamic place. Things are constantly changing. These changes need to be accounted for. Alternatives need to be tested and migrations planned. Giant Swarm stays on top of these things. We ensure that our customers are not left with gaps in security, even when monumental changes take place.

I'd like to thank my colleague Zach Stone for being the inspiration for this piece and keeping me honest during the writing of it. In general, he is my go-to for any security-related question. Contact us if you would like to benefit from Zach's security know-how and learn how we can help you have a safer, more productive cloud native journey.
