I am experiencing a lot of CPU throttling (see nginx graph below, other pods often 25% to 50%) in my Kubernetes cluster (k8s v1.18.12, running 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux).
Due to backports, I do not know whether my cluster contains the Linux kernel bug described in https://lkml.org/lkml/2019/5/17/581. How can I find out? Is there a simple way to check or measure?
If I have the bug, what is the best approach to get the fix? Or should I mitigate otherwise, e.g. not use CFS quota (--cpu-cfs-quota=false or no CPU limits) or reduce cfs_period_us and cfs_quota_us?
CPU Throttling Percentage for nginx (scaling horizontally around 15:00 and removing CPU limits around 19:30):
Since the fix was backported to many older kernel versions, I do not know how to easily look up whether e.g. 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux has the fix.
But you can measure whether your CFS is working smoothly or is throttling too much, as described in https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1:
1) Run the given cfs.go with suitable settings for its sleeps and iterations as well as the CFS settings, e.g. docker run --rm -it --cpu-quota 20000 --cpu-period 100000 -v $(pwd):$(pwd) -w $(pwd) golang:1.9.2 go run cfs.go -iterations 100 -sleep 1000ms
2) Check whether every burn took 5ms. If not, your CFS is throttling too much. This could be due to, e.g., the original bug 198197 (see https://bugzilla.kernel.org/show_bug.cgi?id=198197) or the regression introduced by the fix for bug 198197 (for details see https://lkml.org/lkml/2019/5/17/581).
This measurement approach is also taken in https://github.com/kubernetes/kops/issues/8954, showing that Linux kernel 4.9.0-11-amd64 is throttling too much (however, with an earlier Debian 4.9.189-3+deb9u1 (2019-09-20) than your Debian 4.9.189-3+deb9u2 (2019-11-11)).
The CFS bug was fixed in Linux 5.4. Run kubectl describe nodes | grep Kernel, or log in to any of your Kubernetes nodes and execute uname -sr; either will tell you the kernel release you are running.