Timed out waiting for cluster initialization; auto-upgrade of nodes fails or runs with error

2/10/2020

I have a few clusters in my GCP project, each with 3 nodes in its node pools, and auto-upgrade and auto-repair are enabled.

The auto-upgrade began approximately three days ago and is still running for GKE version 1.12.10-gke.17.

Now, as my clusters are opted in to auto-upgrade and auto-repair, some clusters are getting upgraded without issues while others are running the upgrade with errors.
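For reference, I have been checking the state of the upgrade operations with the command below (ZONE is a placeholder for my zone; the STATUS and STATUS_MESSAGE columns show which upgrades are stuck):

    # List recent cluster operations; node auto-upgrades show up as UPGRADE_NODES
    gcloud container operations list --zone ZONE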

On my first cluster, a few of my pods became unschedulable, and the possible actions suggested by GCP (the matching gcloud commands are sketched after this list) are to:

  • Enable Autoscaling in one or more node pools that have autoscaling disabled.
  • Increase size of one or more node pools manually.
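For anyone hitting the same suggestions, these are the gcloud commands I believe correspond to those two actions (CLUSTER_NAME, POOL_NAME, ZONE and the node counts are placeholders for your own values):

    # Enable autoscaling on a node pool that has it disabled
    gcloud container clusters update CLUSTER_NAME \
        --zone ZONE --node-pool POOL_NAME \
        --enable-autoscaling --min-nodes 1 --max-nodes 5

    # Or increase the size of a node pool manually
    gcloud container clusters resize CLUSTER_NAME \
        --zone ZONE --node-pool POOL_NAME --num-nodes 5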

When I run "gcloud container clusters describe CLUSTER_NAME --zone ZONE", I get the cluster details; however, under the nodePools section I see:

 status: RUNNING_WITH_ERROR
  statusMessage: 'asia-south1-a: Timed out waiting for cluster initialization; cluster
    API may not be available: k8sclient: 7 - 404 status code returned. Requested resource
    not found.'
  version: 1.12.10-gke.17
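To narrow the output down to the node pools, I run roughly the following (CLUSTER_NAME, POOL_NAME and ZONE are placeholders; the --format projection may need adjusting for your gcloud version):

    # Print only the nodePools section of the cluster description
    gcloud container clusters describe CLUSTER_NAME \
        --zone ZONE --format="yaml(nodePools)"

    # Or describe the failing node pool directly
    gcloud container node-pools describe POOL_NAME \
        --cluster CLUSTER_NAME --zone ZONE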

NOTE:

I also see that GCP suggests to

  • Enable autoscaling in one or more node pools that have autoscaling disabled.
  • Shrink one or more node pools manually.

because resource requests are low.

Please let me know what other logs I can provide to resolve this issue.

[Screenshot: Error Description and Activity]

UPDATE:

We went through these logs, and Google Support believes the kubelet might be failing to submit a Certificate Signing Request (CSR), or that it might have old, invalid credentials. To assist with the troubleshooting, they asked us to collect the following logs from an affected node (a one-shot collection loop is sketched after the list):

  1. sudo journalctl -u kubelet > kubelet.log
  2. sudo journalctl -u kube-node-installation > kube-node-installation.log
  3. sudo journalctl -u kube-node-configuration > kube-node-configuration.log
  4. sudo journalctl -u node-problem-detector > node-problem-detector.log
  5. sudo journalctl -u docker > docker.log
  6. sudo journalctl -u cloud-init > cloud-init.log
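To grab all six logs in one go, I used roughly this sketch (NODE_NAME and ZONE are placeholders; the loop just wraps the commands listed above):

    # SSH to an affected node
    gcloud compute ssh NODE_NAME --zone ZONE

    # Then, on the node itself, dump each unit's journal to a file
    for unit in kubelet kube-node-installation kube-node-configuration \
                node-problem-detector docker cloud-init; do
        sudo journalctl -u "$unit" > "$unit.log"
    done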

Any node that starts running 1.13.12-gke.13 fails to connect to the master. Anything else happening to the nodes (e.g. recreation) is because auto-repair keeps trying to fix them in a repair loop, and that doesn't seem to be causing additional problems.
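Since the suspicion is around the kubelet's CSR, a quick sanity check from a working kubectl context (assuming the cluster API is still reachable) is to look at the pending certificate requests and at which nodes have registered:

    # Pending/denied certificate signing requests from kubelets
    kubectl get csr

    # Which nodes have registered, and on which version
    kubectl get nodes -o wide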

-- Chronograph3r
devops
google-cloud-platform
google-iam
google-kubernetes-engine
kubernetes

1 Answer

3/2/2020

This isn't exactly a solution, but it is a working fix. We were able to narrow it down to this.

On the node pools we had "node-restriction" labels specifying what type of nodes they should be.

Google Support also noted that it is currently not possible to update the labels of an existing node pool once it has begun an upgrade, so they suggested creating a new node pool without any of these labels. If we were able to deploy the new node pool successfully, we would then have to migrate our workloads to it.

So we removed those two node selector labels and created a new node pool, as sketched below. To our surprise, it worked. We had to migrate the whole workload, though.
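Roughly what we ran to create the replacement pool (the pool name, cluster name, zone and size are placeholders for our values; the point is that we omitted the --node-labels flag we had used before):

    # New node pool without the node-restriction labels
    gcloud container node-pools create new-pool \
        --cluster CLUSTER_NAME \
        --zone ZONE \
        --num-nodes 3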

We followed this Cloud Migration guide.
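The migration itself boils down to cordoning and then draining every node in the old pool so the workloads reschedule onto the new one. A sketch of the steps as I remember them from that guide (OLD_POOL_NAME is a placeholder):

    # Stop new pods from being scheduled on the old pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL_NAME -o=name); do
        kubectl cordon "$node"
    done

    # Evict the existing pods so they land on the new pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL_NAME -o=name); do
        kubectl drain "$node" --ignore-daemonsets --delete-local-data
    done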

-- Chronograph3r
Source: StackOverflow