Jenkins makes a Kubernetes node get stuck under high CPU usage

4/5/2019

I noticed that when launching some Jenkins builds, the node hosting Jenkins sometimes gets stuck forever. By that I mean the whole node becomes unreachable and all its pods are down (not ready in the dashboard).

To bring things back up I need to remove the node from the cluster and add it again (I'm on GCE, so I have to remove it from the instance group to be able to delete it).

Note: for hours I'm not able to connect to the node through SSH; it is clearly out of service ^^

From my understanding, running out of memory can crash a node, but maxing out the CPU should just slow the server down rather than cause the kind of trouble I'm experiencing. In the worst case, the kubelet should be unavailable until CPU usage drops again.
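To illustrate what I mean: I would expect a kubelet configuration along these lines (illustrative values only, not my actual setup) to keep the kubelet and the system daemons responsive even when pods saturate the node:

# Illustrative kubelet config (kubelet.config.k8s.io/v1beta1), NOT my actual setup:
# reserve CPU/memory for system and Kubernetes daemons so the kubelet stays
# responsive when pods saturate the node.
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
systemReserved:
  cpu: 200m                 # CPU kept aside for OS daemons (sshd, systemd, ...)
  memory: 512Mi
kubeReserved:
  cpu: 200m                 # CPU kept aside for Kubernetes components (kubelet, container runtime)
  memory: 512Mi
evictionHard:
  memory.available: 200Mi   # evict pods before the node runs completely out of memory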

Can someone help me determine the origin of this issue? What could cause such a problem?

(Screenshots: node metrics 1, node metrics 2, Jenkins slave metrics, node metrics from GCE)

Eventually, after waiting for hours, I was able to access the node through SSH and ran sudo journalctl -u kubelet to see what was going on. I don't see anything specific around 7pm, but I do see recurring errors like:

Apr 04 19:00:58 nodes-s2-2g5v systemd[43508]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[43511]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.

Going back through older logs, I found that this kind of message started around 5:30pm:

Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:05.168402    1841 prober.go:111] Readiness probe for "...
Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:04.021125    1841 prober.go:111] Readiness probe for "...
-- Reboot --
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1699]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.

At that point the node reboots, and it corresponds to a Jenkins build with the same pattern of high CPU usage. I don't know why it simply rebooted earlier, while around 7pm the node got completely stuck :/

I'm really sorry, it's a lot of information, but I'm totally lost and it's not the first time this has happened to me ^^

Thank you,

-- Thomas Ramé
jenkins
kops
kubernetes

1 Answer

4/11/2019

As mentioned by @Brandon, it was related to resource limits applied to my Jenkins slaves.

In my case, even though they were specified in my Helm chart YAML file, the values were not actually set. I had to go deeper into the UI to set them manually.
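For the record, the values in question are standard Kubernetes resource requests/limits on the Jenkins slave pod template. The exact key names depend on the chart version, but the resulting block looks roughly like this (example values, not my real ones):

resources:
  requests:
    cpu: 500m      # CPU/memory the slave pod is guaranteed on the node
    memory: 512Mi
  limits:
    cpu: 1         # hard cap so a single build cannot saturate the whole node
    memory: 1Gi    # going above this OOM-kills the container instead of taking down the node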

Since this change, everything has been stable! :)

-- Thomas Ramé
Source: StackOverflow