Given is a cluster with rather static workloads deployed to one fixed-size node pool (default). An additional node pool holds elastic workloads; its size varies between 0 and ~10 instances. During scaling, the cluster is unresponsive most of the time:
kubectl get pods -w would disconnect:
E0828 12:36:14.495621 10818 portforward.go:233] lost connection to pod
The connection to the server 184.108.40.206 was refused - did you specify the right host or port?
Metrics such as kube_pod_container_info are missing data during that time.
What I tried so far is switching from a regional to a zonal cluster (no replicated masters?), but that didn't help. Also, the issue does not occur on every scale event of the node pool, but it does in most cases.
So my question is: how do I debug/fix this?
This is expected behavior.
When you create your cluster, the machine type used for the master is chosen based on the node-pool size. When the autoscaler later adds nodes, the master's machine type is changed so it can handle the new number of nodes.
During the period in which the master is updated to the new machine type, you will lose connection to the API and receive the message reported. Also, since communication with the API is broken, you can't see any information related to the cluster in the Cloud Console, as the attached image shows.
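If you need scripts to survive these windows, one option is to wait for the API server to answer again before resuming. Below is a minimal sketch; wait_for_api is a hypothetical helper, and the probe command (e.g. "kubectl get --raw /healthz") is an assumption you would substitute with whatever cheap request fits your setup:

```shell
# Retry a cheap probe until the control plane responds again.
# probe_cmd: any command that exits 0 once the API answers,
#            e.g. "kubectl get --raw /healthz" (assumed).
# retries / delay: how many attempts and how long to sleep between them.
wait_for_api() {
  local probe_cmd=$1 retries=${2:-60} delay=${3:-5}
  local i
  for i in $(seq "$retries"); do
    if $probe_cmd >/dev/null 2>&1; then
      return 0          # API is reachable again
    fi
    sleep "$delay"
  done
  return 1              # gave up after retries * delay seconds
}
```

Usage would be something like `wait_for_api "kubectl get --raw /healthz" 120 5 && kubectl get pods -w` to restart the watch once the master resize finishes.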
You can try to avoid this by changing the minimum number of nodes at creation time. For example, you mentioned limits of 0 and 10, so when the cluster is created you could start at the midpoint, 5, which will likely give you a master sized to support the maximum number of nodes in case the workloads require them.
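As a sketch, assuming the cluster runs on GKE and is managed with gcloud (the pool and cluster names here are placeholders), the elastic pool could be created with a larger initial size while keeping the same autoscaling limits:

```
# Hypothetical names: "elastic-pool", "my-cluster".
# Starting at 5 nodes means the master is provisioned for roughly the
# maximum load up front; the autoscaler can still scale back down to 0.
gcloud container node-pools create elastic-pool \
  --cluster=my-cluster \
  --num-nodes=5 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10
```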