Updating a deployment in GCE leads to node restarts

11/21/2016

We have an odd issue happening with GCE. We have two clusters, dev and prod, each consisting of 2 nodes.
Production nodes are n1-standard-2, dev nodes are n1-standard-1. Typically the dev cluster is busier, with more pods eating more resources. We deploy updates mostly with Deployments (a few projects still recreate RCs to update to the latest versions). Normally, the process is: build the project, build the Docker image, docker push, create a new deployment config, and kubectl apply the new config.
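
Roughly, the steps look like this (the project, image name, and tag below are placeholders, not our real ones):

    # Build and push the new image for this release
    docker build -t eu.gcr.io/our-project/our-app:1.2.3 .
    docker push eu.gcr.io/our-project/our-app:1.2.3

    # Bump the image tag in the manifest, then apply it
    kubectl apply -f deployment.yaml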

What's consistently happening on production is that after applying the new config, one or both nodes restart. The cluster does not seem to be starving for memory/CPU, and we could not find anything in the logs that would explain those restarts.
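
For context, the kind of thing we have been checking so far is roughly the following (node/instance names are placeholders), without finding anything suspicious:

    # Cluster events around the time of the deploy
    kubectl get events --all-namespaces

    # Conditions and recent status of an affected node
    kubectl describe node gke-prod-default-pool-xxxxxxxx-yyyy

    # Serial console output of the underlying instance
    gcloud compute instances get-serial-port-output gke-prod-default-pool-xxxxxxxx-yyyy \
        --zone europe-west1-c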

The same procedure on staging never causes nodes to restart.

What can we do to diagnose the issue? Any specific events or logs we should be looking at?

Many thanks for any pointers.

UPDATE: This is still happening, and I found the following in Compute Engine > Operations: repair-1481931126173-543cefa5b6d48-9b052332-dfbf44a1

Operation type: compute.instances.repair.recreateInstance
Status message: Instance Group Manager 'projects/.../zones/europe-west1-c/instanceGroupManagers/gke-...' initiated recreateInstance on instance 'projects/.../zones/europe-west1-c/instances/...'. Reason: instance's intent is RUNNING but instance's health status is TIMEOUT.
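
For reference, this is roughly the command-line equivalent of what the console shows (group/instance names shortened, zone taken from the operation above):

    # Details of the repair operation that recreated the node
    gcloud compute operations describe \
        repair-1481931126173-543cefa5b6d48-9b052332-dfbf44a1 \
        --zone europe-west1-c

    # The instance group manager that owns the GKE nodes
    gcloud compute instance-groups managed describe gke-... \
        --zone europe-west1-c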

We still can't figure out why this is happening and it's having a negative effect on our production environment every time we deploy our code.

-- s3ncha
google-cloud-platform
google-kubernetes-engine
kubernetes

0 Answers