Kubernetes Nodes Reboot Randomly

12/13/2017

I'm in desperate need of help. I'm noticing that my Kubernetes minions/nodes are rebooting at what appear to be random intervals a few times a day and I can't figure out why. This is a big problem for me because every reboot causes about 10 minutes of downtime for every app on the node.

When they reboot, I can see the node event like so

Events:
  FirstSeen     LastSeen        Count   From                                            SubObjectPath   Type            Reason                  Message
  ---------     --------        -----   ----                                            -------------   --------        ------                  -------
  9m            9m              1       kubelet, kubernetes-minion-group-7j5x                           Normal          Starting                Starting kubelet.
  9m            9m              1       kubelet, kubernetes-minion-group-7j5x                           Warning         ImageGCFailed           unable to find data for container /
  9m            9m              2       kubelet, kubernetes-minion-group-7j5x                           Normal          NodeHasSufficientDisk   Node kubernetes-minion-group-7j5x status is now: NodeHasSufficientDisk
  9m            9m              2       kubelet, kubernetes-minion-group-7j5x                           Normal          NodeHasSufficientMemory Node kubernetes-minion-group-7j5x status is now: NodeHasSufficientMemory
  9m            9m              2       kubelet, kubernetes-minion-group-7j5x                           Normal          NodeHasNoDiskPressure   Node kubernetes-minion-group-7j5x status is now: NodeHasNoDiskPressure
  9m            9m              1       kubelet, kubernetes-minion-group-7j5x                           Warning         Rebooted                Node kubernetes-minion-group-7j5x has been rebooted, boot id: bed35a9d-584c-4458-8a04-49725200eb0c
  9m            9m              1       kubelet, kubernetes-minion-group-7j5x                           Normal          NodeNotReady            Node kubernetes-minion-group-7j5x status is now: NodeNotReady
  8m            8m              1       kubelet, kubernetes-minion-group-7j5x                           Normal          NodeReady  

When I check the reboot history on the node, the reboots appear to happen fairly randomly.

kubernetes-minion-group-7j5x:~$ last reboot
reboot   system boot  3.16.0-4-amd64   Wed Dec 13 00:36 - 01:01  (00:25)    
reboot   system boot  3.16.0-4-amd64   Tue Dec 12 23:24 - 01:01  (01:37)    
reboot   system boot  3.16.0-4-amd64   Mon Dec 11 05:43 - 01:01 (1+19:18)   
reboot   system boot  3.16.0-4-amd64   Sun Dec 10 23:46 - 01:01 (2+01:15)   

Since the reboot shows up in the Kubernetes events, does that mean Kubernetes is doing the rebooting, or could it be some other process? How can I troubleshoot this? I'm not sure how to go about investigating it.

I can't seem to find anything in the kube-controller-manager.log or the kubelet.log or syslog or messages or kern.log or node-problem-detector.log or auth.log or unattended-upgrades.log.

I'm running Kubernetes 1.6.0 on Debian:

Linux kubernetes-minion-group-7j5x 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
-- Jesse Shieh
debian
google-compute-engine
kubernetes
reboot

2 Answers

12/15/2017

I'm not sure what the problem was, but after switching the Kubernetes nodes from the container-vm-v20170214 image to gci-stable-56-9000-84-2, the reboots seem to have stopped.

I chose gci-stable-56-9000-84-2 because that was what my Kubernetes masters were running and they seemed stable. I'm not sure why Kubernetes 1.6.0 uses different images for masters and nodes by default.
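
For anyone hitting the same thing: if the cluster was created with kube-up.sh on GCE, my understanding is that the node image can be pinned through environment variables before bringing the cluster up. The variable names below are what I believe the cluster/gce scripts of that era read, so verify them against cluster/gce/config-default.sh in your release before relying on them:

# Assumed kube-up.sh (cluster/gce) overrides; verify the names in
# cluster/gce/config-default.sh for your Kubernetes release.
export KUBE_GCE_NODE_IMAGE=gci-stable-56-9000-84-2   # image used for the node instance template
export KUBE_GCE_NODE_PROJECT=google-containers       # project hosting that image (assumption)
export KUBE_NODE_OS_DISTRIBUTION=gci                 # use the GCI startup scripts
./cluster/kube-up.sh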

-- Jesse Shieh
Source: StackOverflow

12/13/2017

Troubleshooting starts with the logs, which will give you more information about what is making the node reboot. When the node reboots, the kubelet process restarts and tries to fetch metrics before the first metrics have been collected; that is why you see the ImageGCFailed warning right after the restart. This is normally not a problem: the kubelet retries and should succeed once metrics collection has started, so the warning is mostly visible just after the kubelet restarts.
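
As far as I understand it, the kubelet raises the Rebooted event because it sees a new boot id for the machine, so you can confirm on the node itself that the whole VM restarted rather than just the kubelet process. A quick check with standard tools:

# The current boot id should match the one quoted in the node's Rebooted event
cat /proc/sys/kernel/random/boot_id

# Show when the kubelet process itself last started, to compare against the boot time
ps -o lstart= -C kubelet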

This warning does not necessarily mean it is a Kubernetes problem, since the node could be rebooting due to other issues. The initial troubleshooting step is to look at the instance logs in Stackdriver, as per the documentation (a command-line alternative is sketched after the list below):

  1. In the Google Cloud Platform console, click on Products & Services, which is the icon with the four bars at the top left-hand corner.
  2. On the menu, go to the 'Stackdriver Monitoring' section, hover over 'Logging' and click on 'Logs'.
  3. In the basic selector menu, hover over the resource that you want to view, e.g. 'GCE VM Instance', and click on the instance that you want to retrieve logs for.
  4. The time-range selector drop-down menus let you filter for specific dates and times in the logs.
  5. The streaming selector, at the top of the page, controls whether new log entries are displayed as they arrive.
  6. The View Options menu, at the far right, has additional display options.
  7. The expander arrow (▸) in front of each log entry lets you look at the full contents of the entry.
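
If you prefer the command line, roughly the same logs can be pulled with the Cloud SDK (depending on your SDK version this may still live under gcloud beta logging; the filter value is a placeholder to adjust):

# Read recent Stackdriver log entries for one GCE instance.
# INSTANCE_ID is the numeric id shown on the VM instance details page.
gcloud logging read \
  'resource.type="gce_instance" AND resource.labels.instance_id="INSTANCE_ID"' \
  --limit 50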

You can also connect to the node and view the /var/log/messages file for any indication of errors. Use a command similar to the following and see if there are any errors near the time the instance restarted:

cat /var/log/messages | egrep -i "warning|error|kernel|restart"

You can also use less as in ‘less /var/log/messages’ and use ‘/’ to search for the date and time the node rebooted.
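
It can also help to establish whether each reboot was orderly; my assumption is that a clean reboot (for example one triggered by a package upgrade or an operator) leaves a shutdown record, whereas a crash or an external reset does not:

# Show reboot and shutdown records together; a reboot with no matching
# shutdown entry usually points to a crash or an external reset.
last -x reboot shutdown

# Look for common crash indicators around the reboot times reported above
egrep -i -B5 -A5 "panic|watchdog|oops" /var/log/kern.log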

Also look at the VM instance serial console output:

Go to ‘Compute Engine’ > ‘VM instances’ and click on the instance to view the VM instance details. Scroll down to the ‘Logs’ section and click on ‘Serial port 1 (console)’. You will get more logs from the instance this way.
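
The same serial console output can be fetched from the command line; the zone below is a placeholder, so substitute the zone your node runs in:

# Dump the node's serial console output (port 1 is the system console)
gcloud compute instances get-serial-port-output kubernetes-minion-group-7j5x \
  --zone ZONE --port 1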

I would also like to point out that you are not using an up-to-date version of Kubernetes and an upgrade might be useful.

-- JMD
Source: StackOverflow