GCP default service account often restarts my cluster

12/12/2017

I'm using GKE for one of my projects. I have a simple cluster with one node, and two applications are deployed on it.

For some reason the default Google Compute service account is restarting my node every day (not at the same hour though).

After the restart (which is actually more of a DELETE, although my node comes back afterwards), the various endpoints stop responding to all external traffic (timing out), even though the health checks are still working.

I have to manually restart the cluster for it to come back to normal.

I'm not sure where to look to track down why the service account does that. From my understanding, it should do so only during maintenance or after a critical error, but I didn't find any errors in the logs.

Any ideas about where I should look?

-- David Medale
gcp
google-kubernetes-engine

1 Answer

3/22/2018

The only cause under normal operation (that I'm aware of) for the behavior you describe is a node that runs as a preemptible virtual machine. The "Preemptible instance limitations" section of the Compute Engine documentation says:

Compute Engine always terminates preemptible instances after they run for 24 hours.
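One way to check this is to look at the `scheduling.preemptible` field that the Compute Engine API reports for each instance. The snippet below is only a rough sketch: it assumes you have saved the output of `gcloud compute instances list --format=json` and pass it in as a string. The function name `preemptible_instances` and the sample instance names are made up for illustration.

```python
import json
from typing import List


def preemptible_instances(instances_json: str) -> List[str]:
    """Return the names of instances flagged as preemptible.

    Expects a JSON array of instance resources, each optionally carrying
    a "scheduling" object with a boolean "preemptible" field.
    """
    instances = json.loads(instances_json)
    return [
        inst["name"]
        for inst in instances
        # Treat a missing scheduling block or flag as "not preemptible".
        if inst.get("scheduling", {}).get("preemptible", False)
    ]


# Hypothetical sample data, trimmed to the fields used above.
SAMPLE = """[
  {"name": "gke-demo-default-pool-a", "scheduling": {"preemptible": true}},
  {"name": "gke-demo-default-pool-b", "scheduling": {"preemptible": false}}
]"""

if __name__ == "__main__":
    print(preemptible_instances(SAMPLE))
```

If the node backing your cluster shows up in that list, the daily deletion is expected behavior rather than a bug.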

If this issue is not caused by preemptible instances, then this sounds like a bug that should be investigated. You might want to use this link to create a private issue in which you can share your project number, cluster name, and other necessary debug information:

https://issuetracker.google.com/issues/new?component=187164

Please add a comment here on your question with the issue tracker ID.

-- Thomas Koch
Source: StackOverflow