Kubernetes nodes with high system load but low CPU usage keep breaking the system

5/14/2020

I have a cluster (deployed by Rancher RKE) with 3 masters (HA) and 8 workers, as shown below:

worker7    Ready                      worker              199d    v1.15.5   10.116.18.42    <none>        Red Hat Enterprise Linux Server 7.5 (Maipo)   3.10.0-1062.el7.x86_64   docker://19.3.4
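
(Only one worker row is pasted above for brevity; the listing comes from something like the following.)

# List every node with kubelet version, internal IP, OS image, kernel and container runtime
kubectl get nodes -o wide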

It uses ingress-nginx (image tag 0.25) as the ingress controller and Canal as the network plugin. The cluster normally works well; see the node usage below:

NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
master1   219m         5%     4497Mi          78%
master2   299m         7%     4053Mi          71%
master3   266m         6%     4255Mi          72%
worker1   778m         4%     27079Mi         42%
worker2   691m         4%     43636Mi         67%
worker3   528m         3%     48660Mi         75%
worker4   677m         4%     37532Mi         58%
worker5   895m         5%     51634Mi         80%
worker6   838m         5%     47337Mi         73%
worker7   2388m        14%    47065Mi         73%
worker8   1805m        11%    40601Mi         63%
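
For reference, the usage table above is what the metrics API reports; a minimal way to reproduce it (assuming metrics-server is available, which it appears to be here):

# Per-node CPU and memory usage
kubectl top nodes

# Per-pod usage across all namespaces, to see which workloads are actually busy
kubectl top pods --all-namespaces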

The pods on worker1 are listed below:

Non-terminated Pods:          (10 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                   ----                                         ------------  ----------   ---------------  -------------  ---
  cattle-prometheus           exporter-node-cluster-monitoring-jqqkv       100m (0%)     200m (1%)    30Mi (0%)        200Mi (0%)     197d
  cattle-prometheus           prometheus-cluster-monitoring-1              1350m (8%)    1800m (11%)  5200Mi (8%)      5350Mi (8%)    4d23h
  cattle-system               cattle-node-agent-ml7fl                      0 (0%)        0 (0%)       0 (0%)           0 (0%)         173d
  ingress-nginx               nginx-ingress-controller-hdbjp               0 (0%)        0 (0%)       0 (0%)           0 (0%)         92d
  kube-system                 canal-bpqjl                                  250m (1%)     0 (0%)       0 (0%)           0 (0%)         165d
  sigma-demo                  apollo-configservice-dev-64f54f4b58-8tdm8    0 (0%)        0 (0%)       0 (0%)           0 (0%)         4d23h
  sigma-demo                  ibor-8d9c9d54d-8bmh9                         700m (4%)     1 (6%)       1Gi (1%)         4Gi (6%)       2d16h
  sigma-sit                   ibor-admin-7f886488cb-k4t5p                  100m (0%)     1500m (9%)   1Gi (1%)         4Gi (6%)       2d19h
  sigma-sit                   ibor-collect-5698947546-69zz9                200m (1%)     1 (6%)       1Gi (1%)         2Gi (3%)       2d16h
  utils                       filebeat-filebeat-59hx7                      100m (0%)     1 (6%)       100Mi (0%)       200Mi (0%)     6d13h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2800m (17%)   6500m (40%)
  memory             8402Mi (13%)  15990Mi (24%)
  ephemeral-storage  0 (0%)        0 (0%)
Events:              <none>
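
The per-node view above comes from kubectl describe; a quick way to pull the same information for any worker (worker1 is just the example node here):

# Allocated resources and the non-terminated pods scheduled on the node
kubectl describe node worker1

# Only the pods running on that node, across all namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=worker1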

As you can see, there are not many pods with high resource requests. ibor is a Java program that loads data (it needs a lot of CPU and memory, and it still needs to be optimized), and apollo is a configuration center.

But when I log into the worker1 node and run htop, it reports a high system load that has already climbed past the number of CPU cores:

[htop screenshot showing the high load average on worker1]
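
A quick way to compare that load average against the core count from a shell on the node (standard tools only, nothing cluster-specific):

# 1-, 5- and 15-minute load averages
uptime

# Number of CPU cores; a load average well above this number means more
# runnable (or uninterruptible) tasks than the node can service at once
nproc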

But I do not understand which process drives the system load so high. The load keeps growing to about 30~40 and finally breaks the node. Nothing else in the vmstat output below stands out except the high in and cs values (interrupts and context switches):

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 15227600   3176 40686872    0    0     0    26    1    2  4  1 95  0  0
 0  0      0 15227772   3176 40686952    0    0     0    34 16913 14861  2  2 96  0  0
 1  0      0 15226836   3176 40686976    0    0     0    33 18861 13368  2  2 96  0  0
 0  0      0 15226736   3176 40686984    0    0     0   630 15778 14887  2  1 97  0  0
 0  0      0 15226716   3176 40687196    0    0     0    31 17228 14023  4  2 95  0  0
 0  0      0 15225188   3176 40687224    0    0     0     0 20546 17126  3  2 95  0  0
 0  0      0 15224868   3176 40687240    0    0     0    32 16025 14326  2  1 97  0  0
 2  0      0 15224128   3176 40687544    0    0     0    34 20494 16183  3  2 95  0  0
 0  0      0 15224324   3176 40687548    0    0     0    33 15158 12917  3  1 95  0  0
 0  0      0 15225152   3176 40687572    0    0     0     0 19292 15307  2  2 96  0  0
 2  0      0 15224764   3176 40687576    0    0     0    33 15634 13430  3  1 95  0  0
 1  0      0 15220824   3176 40687768    0    0     0     0 21238 15215 11  2 86  0  0
 2  0      0 15221352   3176 40687776    0    0     0    33 14481 12017  3  1 95  0  0
 2  0      0 15220140   3176 40687796    0    0     0    33 20263 16450  4  3 93  0  0
 1  0      0 15220200   3176 40688108    0    0     0     0 16103 12503  2  1 97  0  0
 1  0      0 15220692   3176 40688116    0    0     0    64 20478 15081  2  2 95  0  0

So, I'd like to ask for help: which process is causing this situation, and how can I check it?
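
For reference, here is a minimal sketch of the per-process checks I could try next on the node (assuming the sysstat package is installed there):

# Processes in uninterruptible sleep (state D) add to the load average
# without using CPU, so list them first
ps -eo state,pid,ppid,wchan:32,comm | awk '$1 == "D"'

# Per-process voluntary/involuntary context switches sampled every 5 seconds,
# to see which processes account for the high cs column in vmstat
pidstat -w 5 3

# Interrupt counters per CPU, to see which IRQ sources drive the high in column
cat /proc/interrupts

# Threads sorted by CPU usage, in case a single multi-threaded process is responsible
top -b -H -n 1 -o %CPU | head -40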

-- Aisuko
kubernetes

0 Answers