I am experiencing a lot of CPU throttling on my k8s cluster -- do I have the high-throttling low-quota Linux kernel bug?

4/29/2021

I am experiencing a lot of CPU throttling (see nginx graph below, other pods often 25% to 50%) in my Kubernetes cluster (k8s v1.18.12, running 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux).

Due to backports, I do not know whether my cluster contains the Linux kernel bug described in https://lkml.org/lkml/2019/5/17/581. How can I find out? Is there a simple way to check or measure?

If I have the bug, what is the best approach to get the fix? Or should I mitigate otherwise, e.g. not use CFS quota (--cpu-cfs-quota=false or no CPU limits) or reduce cfs_period_us and cfs_quota_us?

CPU throttling percentage for nginx (scaled horizontally around 15:00, CPU limits removed around 19:30): [graph not included]
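For readers who want to reproduce a number like the one in the graph, here is a minimal sketch of how a throttling percentage can be derived from the counters in a cgroup's cpu.stat file (nr_periods and nr_throttled). It assumes the percentage means throttled periods divided by total periods over a sampling window; the post does not say how the graph was computed. The function names and the sample values are made up for illustration, and the location of cpu.stat depends on how cgroups are mounted on the node.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttling_percentage(before, after):
    """Share of CFS periods in which the cgroup was throttled, between two samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    if periods <= 0:
        return 0.0  # no full period elapsed between the samples
    return 100.0 * throttled / periods


# Hypothetical samples; on a real node, read the container's cpu.stat twice,
# some seconds apart, instead of using these made-up strings.
BEFORE = "nr_periods 1000\nnr_throttled 100\nthrottled_time 5000000000\n"
AFTER = "nr_periods 1600\nnr_throttled 400\nthrottled_time 9000000000\n"

if __name__ == "__main__":
    pct = throttling_percentage(parse_cpu_stat(BEFORE), parse_cpu_stat(AFTER))
    print(f"{pct:.1f}% of CFS periods were throttled")  # prints 50.0% for the sample
```

A high percentage alone does not tell whether the kernel bug is involved; a pod that genuinely needs more CPU than its limit allows is throttled in the same way.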

-- DaveFar
cpu
cpu-usage
kubernetes
linux-kernel
resources

2 Answers

5/1/2021

Since the fix was backported to many older kernel versions, I do not know of an easy way to look up whether, e.g., 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux has the fix.

But you can measure whether your CFS is working smoothly or throttling too much, as described in https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1:

1) Run the given cfs.go with suitable settings for its sleeps and iterations as well as for CFS, e.g.
   docker run --rm -it --cpu-quota 20000 --cpu-period 100000 -v $(pwd):$(pwd) -w $(pwd) golang:1.9.2 go run cfs.go -iterations 100 -sleep 1000ms
2) Check whether every burn took 5ms. If not, your CFS is throttling too much.

Excessive throttling could be due to, e.g., the original bug 198197 (see https://bugzilla.kernel.org/show_bug.cgi?id=198197) or the regression introduced by the fix for bug 198197 (for details see https://lkml.org/lkml/2019/5/17/581).
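The reasoning behind this test: with --cpu-quota 20000 and --cpu-period 100000 the container may use 20ms of CPU per 100ms period, so a 5ms burn once per second stays far below the quota and should never be throttled. The following is a rough Python sketch of that idea, not a port of cfs.go; the function names, the slack factor of 2 and the default settings are assumptions chosen for illustration. It burns a fixed amount of CPU time, compares it with the elapsed wall time, and counts iterations that took much longer than the burn itself.

```python
import time


def burn(cpu_seconds):
    """Spin until this process has used cpu_seconds of CPU; return elapsed wall time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.process_time() - cpu_start < cpu_seconds:
        pass
    return time.perf_counter() - wall_start


def run(iterations=20, burn_s=0.005, sleep_s=0.1, slack=2.0):
    """Burn, then sleep, repeatedly; return how many burns exceeded burn_s * slack."""
    slow = 0
    for i in range(iterations):
        wall = burn(burn_s)
        if wall > burn_s * slack:
            slow += 1
        print(f"iteration {i}: burn took {wall * 1000:.1f}ms")
        time.sleep(sleep_s)
    return slow


if __name__ == "__main__":
    slow = run()
    print(f"{slow} burns took more than twice the requested CPU time")
```

Run inside a container with the same quota and period as above, a large number of slow burns would point in the same direction as the cfs.go result. Slow burns can also come from ordinary scheduling noise on a busy node, so treat the count as a hint and not as proof.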

This measurement approach is also taken in https://github.com/kubernetes/kops/issues/8954, showing that Linux kernel 4.9.0-11-amd64 is throttling too much (however, with an earlier Debian 4.9.189-3+deb9u1 (2019-09-20) than your Debian 4.9.189-3+deb9u2 (2019-11-11)).

-- DaveFar
Source: StackOverflow

4/29/2021

The CFS bug was fixed in Linux 5.4. Run kubectl describe nodes | grep Kernel, or log in to any of your Kubernetes nodes and run uname -sr; either will tell you the kernel release you are running.
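As a small illustration of that check, the sketch below parses a kernel release string such as the output of uname -sr and compares it with 5.4. The function names are hypothetical. Note that a result of False only means the release is older than 5.4; because of distribution backports, it does not prove that the fix is missing.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a string like 'Linux 4.9.0-11-amd64' or '5.4.0-42-generic'."""
    token = release.split()[-1]  # drop a leading 'Linux' if present
    match = re.match(r"(\d+)\.(\d+)", token)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_5_4_or_newer(release):
    """True if the release is at least 5.4; backported fixes are not detected."""
    return kernel_version(release) >= (5, 4)


if __name__ == "__main__":
    import platform

    release = platform.release()  # same string as `uname -r`
    print(release, "->", "5.4 or newer" if is_5_4_or_newer(release) else "older than 5.4")
```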

-- cperez08
Source: StackOverflow