GKE: Kubernetes Master/kubectl unresponsive during node scale


Given is a cluster with rather static workloads deployed to one fixed-size node pool (default). An additional node pool holds elastic workloads; its size varies between 0 and ~10 instances. During scaling, the cluster is unresponsive most of the time:

  1. I can't access some cluster pages in the GKE console, such as Workloads (sorry for the German interface): https://i.stack.imgur.com/MSd3Y.png
  2. kubectl can't connect, and existing connections such as port-forward, but also `get pods -w`, disconnect:
    1. E0828 12:36:14.495621 10818 portforward.go:233] lost connection to pod
    2. The connection to the server was refused - did you specify the right host or port?
  3. Also, I think dependent tools like prom-operator run into issues, as some very standard metrics like kube_pod_container_info are missing data during that time

What I have tried so far is switching from a regional to a zonal cluster (no single-node master?), but that didn't help. Also, the issue does not occur on every scale of the node pool, but in most cases.

So my question is: how do I debug/fix this?

-- Can

1 Answer


This is an expected behavior.

When you create your cluster, the machine type used for the master is chosen based on the node-pool size. When the autoscaler later creates more nodes, the master's machine type is changed so it can handle the new number of nodes.

During the period in which the master is being updated to the new machine type, you will lose connection to the API server and receive the messages you reported. Also, since communication with the API is broken, you can't view any information related to the cluster in the Cloud Console, as the attached image shows.
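To confirm that a master resize coincides with the outage, you can list the cluster's recent operations and watch the API server from kubectl while a scale event happens. This is a debugging sketch; `CLUSTER_NAME` and `ZONE` are placeholders for your own values.

```shell
# List recent operations on the cluster; a control-plane resize/upgrade
# appears here as an operation while the API is briefly unavailable.
# CLUSTER_NAME and ZONE are placeholders.
gcloud container operations list \
    --zone=ZONE \
    --filter="targetLink:CLUSTER_NAME"

# In a second terminal, keep a watch open against the API server during
# the node-pool scale; the watch drops when the control plane goes away.
kubectl get nodes -w
```

If the timestamps of the dropped watch line up with an operation in the list, the master resize is the cause rather than a networking problem on your side.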

You can try to avoid this by setting a larger minimum node count at creation time. For example, you mentioned the limits used are 0 and 10; when the cluster is created, you could start at the midpoint of 5 nodes, so the master is likely sized to support the maximum number of nodes in case the workloads require them.
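The suggestion above can be sketched as a node-pool creation command. This assumes the elastic pool autoscales between 0 and 10 nodes; the pool name, cluster name, and zone are placeholders.

```shell
# Sketch: create the elastic pool starting at 5 nodes (the midpoint of
# the 0-10 autoscaling range) so the master is provisioned for a node
# count closer to the eventual maximum from the start.
# elastic-pool, CLUSTER_NAME, and ZONE are placeholders.
gcloud container node-pools create elastic-pool \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --num-nodes=5 \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=10
```

The autoscaler can still scale the pool down afterwards; the point is that the initial size influences how the control plane is provisioned.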

-- Edgar Gore
Source: StackOverflow