The second example policy in the PodSecurityPolicy documentation contains the following snippet:
...
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
  requiredDropCapabilities:
    - ALL
...
Why is dropping all capabilities redundant for non-root + disallow privilege escalation? Can't a non-root container process that disallows privilege escalation still have effective capabilities?
It seems like this is not possible with Docker:
$ docker run --cap-add SYS_ADMIN --user 1000 ubuntu grep Cap /proc/self/status
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
All effective capabilities have been dropped even when trying to explicitly add them. But other container runtimes could implement it, so is this comment just Docker specific?
Why is dropping all capabilities redundant for non-root + disallow privilege escalation?
Because you need privilege escalation to be able to use 'new' capabilities. Effectively, allowPrivilegeEscalation: false disables setuid in the execve system call, which prevents the process from gaining any new capabilities.
Also, as the docs state: "Once the bit is set, it is inherited across fork, clone, and execve and cannot be unset". More info here.
This, in combination with privileged: false, renders requiredDropCapabilities: [ALL] redundant.
The equivalent Docker options here are:
--user=whatever => privileged: false
--security-opt=no-new-privileges => allowPrivilegeEscalation: false
--cap-drop=all => requiredDropCapabilities: [ALL]
It seems like this is not possible with Docker
That is what Docker appears to be doing: the moment you specify a non-privileged user, all of the effective capabilities are dropped (CapEff: 0000000000000000), even if you specify --cap-add SYS_ADMIN.
This, combined with --security-opt=no-new-privileges, renders --cap-drop=all redundant.
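One way to see why a non-root process ends up with empty permitted and effective sets is to model the execve transformation described in the capabilities(7) man page. The sketch below is a simplified model, not kernel code: the function name `execve_caps` and its parameters are hypothetical, it covers only a non-root caller, and it treats no_new_privs as "execve never grants more than the caller already had".

```python
def execve_caps(inheritable, permitted, bounding, ambient,
                file_inheritable=0, file_permitted=0, file_effective=False,
                no_new_privs=False):
    """Simplified capabilities(7) execve rules for a non-root process.

    All sets are integer bitmasks. Returns (new_permitted, new_effective).
    """
    # A file that carries capabilities is "privileged" and clears the ambient set.
    file_privileged = bool(file_inheritable or file_permitted)
    new_ambient = 0 if file_privileged else ambient

    new_permitted = ((inheritable & file_inheritable)
                     | (file_permitted & bounding)
                     | new_ambient)
    if no_new_privs:
        # Assumption of this model: with no_new_privs set, the new permitted
        # set is capped by what the caller already held.
        new_permitted &= permitted

    new_effective = new_permitted if file_effective else new_ambient
    return new_permitted, new_effective


if __name__ == "__main__":
    mask = 0xa82425fb        # CapInh/CapBnd from the question's output
    cap_net_raw = 1 << 13    # bit number of CAP_NET_RAW

    # Plain binary, non-root caller with empty permitted set: nothing is gained.
    print(execve_caps(mask, 0, mask, 0))
    # Binary with a file capability: gained unless no_new_privs or an empty
    # bounding set prevents it.
    print(execve_caps(mask, 0, mask, 0,
                      file_permitted=cap_net_raw, file_effective=True))
    print(execve_caps(mask, 0, mask, 0, file_permitted=cap_net_raw,
                      file_effective=True, no_new_privs=True))
    print(execve_caps(mask, 0, 0, 0,
                      file_permitted=cap_net_raw, file_effective=True))
```

In this model, file capabilities are the only route by which the non-root process regains anything, and either no_new_privs or an empty bounding set closes that route, which is the sense in which the two settings overlap.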
Note that Docker's default capability mask does not include SYS_ADMIN. Without any --cap-add option the mask is 00000000a80425fb:
$ docker run --rm ubuntu grep Cap /proc/self/status
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Decoding the mask from the question shows the additional cap_sys_admin bit:
$ capsh --decode=00000000a82425fb
0x00000000a82425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_admin,cap_mknod,cap_audit_write,cap_setfcap
So the 00000000a82425fb in the question is the default mask plus the cap_sys_admin bit (0x200000) added by --cap-add SYS_ADMIN. The option does change CapInh and CapBnd, just not CapPrm or CapEff for a non-root user.
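To compare masks such as 00000000a82425fb and 00000000a80425fb without capsh, a small decoder is enough. This is a minimal sketch: the helper names `decode_caps` and `caps_diff` are made up for illustration, and the name table follows the bit numbering in linux/capability.h.

```python
# Capability names indexed by bit number, as numbered in linux/capability.h.
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid", "kill",
    "setgid", "setuid", "setpcap", "linux_immutable", "net_bind_service",
    "net_broadcast", "net_admin", "net_raw", "ipc_lock", "ipc_owner",
    "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace", "sys_pacct",
    "sys_admin", "sys_boot", "sys_nice", "sys_resource", "sys_time",
    "sys_tty_config", "mknod", "lease", "audit_write", "audit_control",
    "setfcap", "mac_override", "mac_admin", "syslog", "wake_alarm",
    "block_suspend", "audit_read", "perfmon", "bpf", "checkpoint_restore",
]


def decode_caps(hex_mask):
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(hex_mask, 16)
    return ["cap_" + name
            for bit, name in enumerate(CAP_NAMES)
            if (mask >> bit) & 1]


def caps_diff(a, b):
    """Return the capabilities present in mask a but absent from mask b."""
    return sorted(set(decode_caps(a)) - set(decode_caps(b)))


if __name__ == "__main__":
    print(",".join(decode_caps("00000000a82425fb")))
    # Prints ['cap_sys_admin']: the two masks differ only in bit 21.
    print(caps_diff("00000000a82425fb", "00000000a80425fb"))
```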
But other container runtimes could implement it, so is this comment just Docker specific?
I suppose so. Another runtime could have a case where privileged: false and allowPrivilegeEscalation: false do not effectively disable capabilities; those could then be dropped with requiredDropCapabilities: (although I don't see why another runtime would want to deviate from Docker's behavior).
There are multiple (good) sub-questions inside your question.
I want to focus on the main question:
Why is dropping all capabilities redundant for non-root + disallow privilege escalation?
To make it simpler I think we can focus on the disallow privilege escalation part and simply ask:
What happens behind the scenes when we set allowPrivilegeEscalation: false in a PodSecurityPolicy?
From the K8S docs you can see that "This bool directly controls whether the no_new_privs flag gets set on the container process".
So what happens when this flag is set?
Quoting from the kernel docs: "When this flag is set, execve promises not to grant the privilege to do anything that could not have been done without the execve call.
For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set".
In other words, with allowPrivilegeEscalation: false, a non-root process, which starts with no permitted or effective capabilities, cannot gain any through execve.
This is why adding this part is considered redundant:
requiredDropCapabilities:
- ALL
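The no_new_privs flag discussed above can also be set and queried from an ordinary process, which makes the "cannot be unset" behaviour easy to try outside a container. The following is a sketch under a few assumptions: it targets Linux with glibc, uses the prctl constants from linux/prctl.h, and the helper names are made up for illustration. The flag is set in a forked child so the calling process is left untouched.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # values from linux/prctl.h
PR_GET_NO_NEW_PRIVS = 39

libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0):
    """Call prctl(2) with unsigned long arguments, as the kernel expects."""
    ul = ctypes.c_ulong
    return libc.prctl(option, ul(arg2), ul(0), ul(0), ul(0))


def get_no_new_privs():
    """Return 1 if no_new_privs is set for the calling thread, else 0."""
    return _prctl(PR_GET_NO_NEW_PRIVS)


def set_no_new_privs():
    """Set no_new_privs; there is no call that clears it again."""
    if _prctl(PR_SET_NO_NEW_PRIVS, 1) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def flag_after_set_in_child():
    """Fork, set the flag in the child only, and return the value it read back."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: set the flag, report it through the pipe, exit immediately.
        try:
            os.close(read_fd)
            set_no_new_privs()
            os.write(write_fd, str(get_no_new_privs()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    data = os.read(read_fd, 16)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    print("parent:", get_no_new_privs())
    print("child after PR_SET_NO_NEW_PRIVS:", flag_after_set_in_child())
```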
I hope this simplifies things a bit.
I think the answers to the other questions are very clear in the accepted answer, and I have nothing to add to them.
Note: If you're running a kernel >= 4.10, you can see the value of a thread's no_new_privs attribute in the /proc/[pid]/status file, below the capability fields:
.
.
CapInh: 00000000a82425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a82425fb
CapAmb: 0000000000000000
NoNewPrivs: 0 <-----
.
.
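Reading those fields programmatically is straightforward. The sketch below parses the Cap* and NoNewPrivs lines of a status file; the function names `parse_status` and `summarize` are hypothetical, and a missing NoNewPrivs line (as on kernels older than 4.10) is reported as None rather than guessed.

```python
def parse_status(text):
    """Extract the Cap* and NoNewPrivs fields from /proc/[pid]/status text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and value.strip() and (key.startswith("Cap") or key == "NoNewPrivs"):
            # Keep only the first token so trailing annotations are ignored.
            fields[key] = value.split()[0]
    return fields


def summarize(fields):
    """Report whether no_new_privs is set and whether CapEff is non-empty."""
    nnp = fields.get("NoNewPrivs")
    return {
        "no_new_privs": None if nnp is None else nnp == "1",
        "has_effective_caps": int(fields.get("CapEff", "0"), 16) != 0,
    }


if __name__ == "__main__":
    # Linux only: inspect the current process.
    with open("/proc/self/status") as status_file:
        print(summarize(parse_status(status_file.read())))
```

Run inside a container started with allowPrivilegeEscalation: false and a non-root user, one would expect no_new_privs to be True and has_effective_caps to be False, though that depends on the runtime.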