Why is requiring that all capabilities be dropped in a Kubernetes PodSecurityPolicy redundant with non-root + disallow privilege escalation?

11/15/2018

The second example policy from the PodSecurityPolicy documentation contains the following snippet:

...
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
  requiredDropCapabilities:
    - ALL
...

Why is dropping all capabilities redundant for non-root + disallow privilege escalation? You can have a non-root container process that cannot escalate privileges but still has effective capabilities, right?

It seems like this is not possible with Docker:

$ docker run --cap-add SYS_ADMIN --user 1000 ubuntu grep Cap /proc/self/status
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000

All effective capabilities have been dropped even when trying to explicitly add them. But other container runtimes could implement it, so is this comment just Docker specific?

-- dippynark
containers
docker
kubernetes
linux-capabilities

2 Answers

11/16/2018

Why is dropping all capabilities redundant for non-root + disallow privilege escalation?

Because you need privilege escalation to be able to gain 'new' capabilities. In effect, allowPrivilegeEscalation: false disables setuid binaries and file capabilities in the execve system call, which prevents a process from acquiring any capabilities it did not already hold.
Also as shown in the docs: "Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset". More info here.

This in combination with privileged: false renders requiredDropCapabilities: [ALL] redundant.

The equivalent Docker options here are:

  • --user=<non-root UID> => runAsUser with rule MustRunAsNonRoot (privileged: false corresponds to not passing --privileged)
  • --security-opt=no-new-privileges => allowPrivilegeEscalation: false
  • --cap-drop=all => requiredDropCapabilities: [ALL]

It seems like this is not possible with Docker

That looks like what Docker is doing: the moment you specify a non-privileged user, all of the effective capabilities are dropped (CapEff: 0000000000000000), even if you specify --cap-add SYS_ADMIN.

Combined with --security-opt=no-new-privileges, this renders --cap-drop=all redundant.

Note that Docker's default capability mask does not include SYS_ADMIN. Without any --cap-add option the mask is 00000000a80425fb:

$ docker run --rm ubuntu grep Cap /proc/self/status
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

Decoding the mask from the question (00000000a82425fb) shows the one extra capability, cap_sys_admin:

$ capsh --decode=00000000a82425fb
0x00000000a82425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap

So --cap-add SYS_ADMIN did reach the inheritable and bounding sets (CapInh and CapBnd) in the question's output; only the permitted and effective sets were emptied by running as a non-root user.
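If capsh is not at hand, the two masks can be compared with a few lines of Python. This is only a minimal sketch: the name table covers bits 0 to 31 as I understand them from linux/capability.h, and decode_caps is a hypothetical helper, not part of any library.

```python
# Capability names indexed by bit number (bits 0-31 only; newer kernels
# define additional capabilities above bit 31).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service",
    "cap_net_broadcast", "cap_net_admin", "cap_net_raw", "cap_ipc_lock",
    "cap_ipc_owner", "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot",
    "cap_sys_ptrace", "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot",
    "cap_sys_nice", "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config",
    "cap_mknod", "cap_lease", "cap_audit_write", "cap_audit_control",
    "cap_setfcap",
]


def decode_caps(mask_hex):
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


# Mask from a plain `docker run` versus the mask seen with --cap-add SYS_ADMIN.
default_caps = set(decode_caps("00000000a80425fb"))
added_caps = set(decode_caps("00000000a82425fb"))

print(sorted(added_caps - default_caps))  # ['cap_sys_admin']
print(decode_caps("0000000000000000"))    # []
```

The only difference between the two masks is bit 21, and an all-zero CapEff decodes to an empty list.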

But other container runtimes could implement it, so is this comment just Docker specific?

I suppose so. You could have a case where privileged: false and allowPrivilegeEscalation: false do not effectively disable capabilities, and those could then be dropped with requiredDropCapabilities (although I don't see why another runtime would want to change the Docker behavior).

-- Rico
Source: StackOverflow

9/27/2019

There are multiple (good) sub-questions inside your question.
I want to focus on the main question:

Why is dropping all capabilities redundant for non-root + disallow privilege escalation?

To make it simpler I think we can focus on the disallow privilege escalation part and simply ask:

What happens behind the scenes when we set allowPrivilegeEscalation: false in a PodSecurityPolicy?

From the K8S docs you can see that "This bool directly controls whether the no_new_privs flag gets set on the container process".

So what happens if this flag is being set?

Quoting from the kernel docs: "When this flag is set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set".

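In other words, setting allowPrivilegeEscalation: false means the process can never gain capabilities through execve. Since a non-root process starts with empty permitted and effective sets, it ends up with no usable capabilities at all.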
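To make the interaction concrete, here is a simplified model of the execve capability rules described in capabilities(7), written as a rough Python sketch. It assumes a non-root process and ignores setuid-root binaries, securebits and LSMs. The function execve_caps and its parameter names are hypothetical, and modelling no_new_privs as "the new permitted set is capped at the old permitted set" is my reading of the kernel behavior, not an official formula.

```python
def execve_caps(p_inh, p_perm, p_bnd, p_amb,
                f_inh=0, f_perm=0, f_eff=False, no_new_privs=False):
    """Return (permitted, effective) after execve for a non-root process.

    p_* are the process sets before execve, f_* are the file capabilities
    of the binary being executed. All sets are integer bitmasks.
    """
    # Ambient capabilities are cleared when the file carries capabilities.
    new_amb = 0 if (f_inh or f_perm) else p_amb
    new_perm = (p_inh & f_inh) | (f_perm & p_bnd) | new_amb
    if no_new_privs:
        # Assumption: with no_new_privs, execve cannot add to the
        # permitted set, so cap the result at the old permitted set.
        new_perm &= p_perm
    new_eff = (new_perm if f_eff else new_amb) & new_perm
    return new_perm, new_eff


CAP_NET_RAW = 1 << 13
BOUNDING = 0xA82425FB  # mask from the question's container

# Non-root process (empty permitted set) executing a binary that has the
# file capability cap_net_raw with the permitted and effective flags set.
perm, eff = execve_caps(BOUNDING, 0, BOUNDING, 0,
                        f_perm=CAP_NET_RAW, f_eff=True)
print(hex(perm), hex(eff))  # 0x2000 0x2000

# Same execve with no_new_privs set: nothing is gained.
perm, eff = execve_caps(BOUNDING, 0, BOUNDING, 0,
                        f_perm=CAP_NET_RAW, f_eff=True, no_new_privs=True)
print(hex(perm), hex(eff))  # 0x0 0x0
```

Under this model, non-root alone still lets a binary with file capabilities raise the permitted set, while non-root together with no_new_privs leaves it empty, which is the situation the policy comment calls redundant.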

This is why adding this part is considered redundant:

 requiredDropCapabilities:
    - ALL

I hope this simplifies things a bit.

I think the answers for the other questions are very clear in the accepted answer, and I have nothing to add to them.


Note: If you're running a kernel >= 4.10, you can see the value of a thread's no_new_privs attribute in the /proc/[pid]/status file, just below the capability fields:

.
.
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
NoNewPrivs: 0 <-----
.
.
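These fields are easy to read programmatically. The following is a small Python sketch; parse_status is a hypothetical helper, and the sample text simply repeats the values shown above. On a Linux host you could pass it the contents of /proc/self/status instead.

```python
def parse_status(text):
    """Extract the Cap* masks and NoNewPrivs flag from /proc/[pid]/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key = key.strip()
        if key.startswith("Cap"):
            fields[key] = int(value.strip(), 16)  # masks are hexadecimal
        elif key == "NoNewPrivs":
            fields[key] = int(value.strip())      # 0 = not set, 1 = set
    return fields


SAMPLE = """\
Name: grep
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
"""

status = parse_status(SAMPLE)
print(status["CapEff"] == 0)   # True
print(status["NoNewPrivs"])    # 0
```

A NoNewPrivs value of 1 would indicate that the flag is set for that thread.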
-- RtmY
Source: StackOverflow