I'm in desperate need of help. I'm noticing that my Kubernetes minions/nodes are rebooting at what appear to be random intervals a few times a day and I can't figure out why. This is a big problem for me because every reboot causes about 10 minutes of downtime for every app on the node.
When they reboot, I can see a node event like this:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
9m 9m 1 kubelet, kubernetes-minion-group-7j5x Normal Starting Starting kubelet.
9m 9m 1 kubelet, kubernetes-minion-group-7j5x Warning ImageGCFailed unable to find data for container /
9m 9m 2 kubelet, kubernetes-minion-group-7j5x Normal NodeHasSufficientDisk Node kubernetes-minion-group-7j5x status is now: NodeHasSufficientDisk
9m 9m 2 kubelet, kubernetes-minion-group-7j5x Normal NodeHasSufficientMemory Node kubernetes-minion-group-7j5x status is now: NodeHasSufficientMemory
9m 9m 2 kubelet, kubernetes-minion-group-7j5x Normal NodeHasNoDiskPressure Node kubernetes-minion-group-7j5x status is now: NodeHasNoDiskPressure
9m 9m 1 kubelet, kubernetes-minion-group-7j5x Warning Rebooted Node kubernetes-minion-group-7j5x has been rebooted, boot id: bed35a9d-584c-4458-8a04-49725200eb0c
9m 9m 1 kubelet, kubernetes-minion-group-7j5x Normal NodeNotReady Node kubernetes-minion-group-7j5x status is now: NodeNotReady
8m 8m 1 kubelet, kubernetes-minion-group-7j5x Normal NodeReady
When I check the reboot history on the node, the reboots appear to happen fairly randomly:
kubernetes-minion-group-7j5x:~$ last reboot
reboot system boot 3.16.0-4-amd64 Wed Dec 13 00:36 - 01:01 (00:25)
reboot system boot 3.16.0-4-amd64 Tue Dec 12 23:24 - 01:01 (01:37)
reboot system boot 3.16.0-4-amd64 Mon Dec 11 05:43 - 01:01 (1+19:18)
reboot system boot 3.16.0-4-amd64 Sun Dec 10 23:46 - 01:01 (2+01:15)
Since the reboot shows up in the Kubernetes events, does that mean Kubernetes itself is doing the rebooting, or could it be some other process? How should I go about investigating this?
I can't seem to find anything relevant in kube-controller-manager.log, kubelet.log, syslog, messages, kern.log, node-problem-detector.log, auth.log, or unattended-upgrades.log.
I'm running Kubernetes 1.6.0 on Debian:
Linux kubernetes-minion-group-7j5x 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
I'm not sure what the root cause was, but after switching the Kubernetes nodes from the container-vm-v20170214 image to gci-stable-56-9000-84-2, the reboots stopped. I chose gci-stable-56-9000-84-2 because that was what my Kubernetes masters were running, and they had been stable. I'm not sure why Kubernetes 1.6.0 uses different images for masters and nodes by default.
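In case it helps anyone else: if your cluster comes up via the GCE kube-up.sh scripts, the node image can be pinned with environment variables along these lines (the variable names and image project are my best guess from the cluster/gce defaults of this era; double-check them against your cluster/gce/config-default.sh):

# Assumed variable names from cluster/gce/config-default.sh; verify
# against your checkout before relying on them.
export KUBE_GCE_NODE_IMAGE=gci-stable-56-9000-84-2
export KUBE_GCE_NODE_PROJECT=google-containers
export KUBE_NODE_OS_DISTRIBUTION=gci
cluster/kube-up.sh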
Troubleshooting starts with the logs, which give you more information about what is making the node reboot. When a reboot happens, the kubelet process restarts and tries to collect metrics before the first metrics have been gathered, which is why you see the ImageGCFailed warning right after the kubelet starts. This is normally not a problem: the kubelet retries and should succeed once metrics collection is underway.
This warning does not mean the problem is in Kubernetes itself; the node could be rebooting for entirely unrelated reasons. The Rebooted event is simply the kubelet noticing, after it restarts, that the machine's boot ID has changed. A good first step is to look at the instance logs, as the documentation suggests.
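To confirm that the Rebooted event is only an observation, compare the boot ID the kubelet recorded in the node status with the current one on the machine; a quick check, assuming kubectl access and SSH to the node:

# Boot ID as recorded by the kubelet in the node status
kubectl get node kubernetes-minion-group-7j5x -o jsonpath='{.status.nodeInfo.bootID}'
# Boot ID of the current kernel boot, read on the node itself
ssh kubernetes-minion-group-7j5x cat /proc/sys/kernel/random/boot_id

If the two match, the kubelet is merely reporting a reboot that already happened, not causing one.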
You can also connect to the node and look through the /var/log/messages file for any indication of errors. A command like the following shows whether anything was logged near the time the instance restarted:
cat /var/log/messages | egrep -i "warning|error|kernel|restart"
You can also open the file with 'less /var/log/messages' and use '/' to search for the date and time the node rebooted.
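If the node runs systemd (Debian 8 does by default), the journal is another place to look; a sketch, assuming journald keeps logs across boots (by default it often does not; that requires Storage=persistent in /etc/systemd/journald.conf or an existing /var/log/journal directory):

# List the boots the journal knows about, with their time ranges
journalctl --list-boots
# Jump to the end of the previous boot's log, where a panic or
# shutdown reason would appear
journalctl -b -1 -e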
Also look at the VM instance's serial console output: go to Compute Engine > VM instances in the Cloud Console, click the instance to view its details, scroll down to the Logs section, and click 'Serial port 1 (console)'. You will get more low-level logs from the instance this way.
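The same output is available from the command line through gcloud; the zone below is a placeholder for wherever your instance group actually runs:

# Fetch the serial console output for the node VM
gcloud compute instances get-serial-port-output kubernetes-minion-group-7j5x \
    --zone us-central1-b --port 1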
Finally, 1.6.0 is not an up-to-date version of Kubernetes; upgrading might help as well.