I have a few clusters in my GCP project, each with 3 nodes in its node pools, and auto-upgrade and auto-repair are enabled.
The auto-upgrade began approximately 3 days ago and is still running for GKE version 1.12.10-gke.17.
Now, as my clusters are opted in to auto-upgrade and auto-repair, some clusters are getting upgraded without issues while a few others are running the upgrade with issues.
On my first cluster, a few of my pods went unschedulable, and the possible actions suggested by GCP are to
When I run "gcloud container clusters describe <cluster-name> --zone <zone>", I get the details of the cluster; however, under the nodePools section I see:
status: RUNNING_WITH_ERROR
statusMessage: 'asia-south1-a: Timed out waiting for cluster initialization; cluster
API may not be available: k8sclient: 7 - 404 status code returned. Requested resource
not found.'
version: 1.12.10-gke.17
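For reference, these are roughly the commands I have been using to check the affected node pool and the upgrade operation directly (the cluster, node pool and zone names are placeholders for my actual ones):

# Show the status and version of the affected node pool
gcloud container node-pools describe my-pool --cluster my-cluster --zone asia-south1-a
# List recent operations to find the upgrade that is still running or has failed
gcloud container operations list --zone asia-south1-a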
NOTE:
I also see that GCP suggests to
because the resource requests are low.
Please let me know what other logs I can provide to resolve this issue.
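In case it helps, this is roughly how I have been looking at the unschedulable pods on my side (the namespace and pod names are placeholders):

# List pods stuck in Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Show the scheduler events explaining why a particular pod cannot be placed
kubectl describe pod my-pod -n my-namespace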
UPDATE:
We went through these logs, and Google Support believes the kubelet might be failing to submit a Certificate Signing Request (CSR), or that it might have old, invalid credentials. To assist with the troubleshooting, they asked us to answer these questions:
Any node that starts running 1.13.12-gke.13 fails to connect to the master. Anything else happening to the nodes (e.g. recreation) is because the auto-repair loop is trying to fix them, and it doesn't seem to be causing additional problems.
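For anyone hitting the same thing, these are roughly the checks we ran while looking into the CSR/credentials theory (the node name and zone are placeholders):

# Pending CSRs can indicate nodes that are stuck registering with the master
kubectl get csr
# See which nodes actually joined the cluster and what version they run
kubectl get nodes -o wide
# Inspect kubelet logs on one of the broken nodes over SSH
gcloud compute ssh gke-my-cluster-default-pool-12345678-abcd --zone asia-south1-a --command="sudo journalctl -u kubelet --no-pager | tail -n 100"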
This isn't exactly a solution but a working fix; we were able to narrow it down to this.
On the node pools we had "node-restriction" labels specifying what type of nodes they should be.
Google Support also told us that it is currently not possible to update the labels of an existing node pool once an upgrade has begun, so they suggested creating a new node pool without any of these labels. If we were able to deploy that node pool successfully, we would then have to migrate our workloads to it.
So we removed those two node-selector labels and created a new node pool, and to our surprise it worked. We had to migrate the whole workload, though.
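For reference, the new node pool was created along these lines; the key point is that we did not pass any --node-labels this time (the pool name, machine type and node count are placeholders for our actual values):

gcloud container node-pools create new-pool \
  --cluster my-cluster \
  --zone asia-south1-a \
  --machine-type n1-standard-2 \
  --num-nodes 3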
For the migration we followed this Cloud Migration guide.
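Concretely, the migration boiled down to cordoning and draining the old nodes so everything rescheduled onto the new pool, roughly like this (old-pool, my-cluster and the zone are placeholders for our actual names; on newer kubectl versions the drain flag is --delete-emptydir-data instead of --delete-local-data):

# Stop new pods from being scheduled onto the old pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done
# Evict the running workloads so they land on the new pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
# Once everything is healthy on the new pool, remove the old one
gcloud container node-pools delete old-pool --cluster my-cluster --zone asia-south1-a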