Timed out waiting for cluster initialization; auto-upgrade of nodes fails or runs with error

2/10/2020

I have a few clusters in my GCP project, each with 3 nodes in its node pools, and auto-upgrade and auto-repair are enabled.

The auto-upgrade began approximately three days ago and is still running for GKE version 1.12.10-gke.17.

Now, as my clusters are opted in to auto-upgrade and auto-repair, some clusters are getting upgraded without issues while others are running the upgrade with errors.
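For reference, I have been checking the state of the upgrade operations with the command below (ZONE is a placeholder for my zone; the STATUS and STATUS_MESSAGE columns show which upgrades are stuck):

    # List recent cluster operations; node auto-upgrades show up as UPGRADE_NODES
    gcloud container operations list --zone ZONE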

On my first cluster, a few of my pods became unschedulable, and the possible actions suggested by GCP (the matching gcloud commands are sketched after this list) are to:

  • Enable Autoscaling in one or more node pools that have autoscaling disabled.
  • Increase size of one or more node pools manually.
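For anyone hitting the same suggestions, these are the gcloud commands I believe correspond to those two actions (CLUSTER_NAME, POOL_NAME, ZONE and the node counts are placeholders for your own values):

    # Enable autoscaling on a node pool that has it disabled
    gcloud container clusters update CLUSTER_NAME \
        --zone ZONE --node-pool POOL_NAME \
        --enable-autoscaling --min-nodes 1 --max-nodes 5

    # Or increase the size of a node pool manually
    gcloud container clusters resize CLUSTER_NAME \
        --zone ZONE --node-pool POOL_NAME --num-nodes 5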

When I run "gcloud container clusters describe CLUSTER_NAME --zone ZONE", I get the cluster details; however, under the nodePools section I see:

 status: RUNNING_WITH_ERROR
  statusMessage: 'asia-south1-a: Timed out waiting for cluster initialization; cluster
    API may not be available: k8sclient: 7 - 404 status code returned. Requested resource
    not found.'
  version: 1.12.10-gke.17
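To narrow the output down to the node pools, I run roughly the following (CLUSTER_NAME, POOL_NAME and ZONE are placeholders; the --format projection may need adjusting for your gcloud version):

    # Print only the nodePools section of the cluster description
    gcloud container clusters describe CLUSTER_NAME \
        --zone ZONE --format="yaml(nodePools)"

    # Or describe the failing node pool directly
    gcloud container node-pools describe POOL_NAME \
        --cluster CLUSTER_NAME --zone ZONE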

NOTE:

I also see that GCP suggests to

  • Enable autoscaling in one or more node pools that have autoscaling disabled.
  • Shrink one or more node pools manually.

because resource requests are low.

Please let me know what other logs I can provide to resolve this issue.

[Screenshot: Error Description and Activity]

UPDATE:

We went through these logs, and Google Support believes the kubelet might be failing to submit a Certificate Signing Request (CSR), or that it might have old, invalid credentials. To assist with the troubleshooting, they asked us to collect the following logs from an affected node (a one-shot collection loop is sketched after the list):

  1. sudo journalctl -u kubelet > kubelet.log
  2. sudo journalctl -u kube-node-installation > kube-node-installation.log
  3. sudo journalctl -u kube-node-configuration > kube-node-configuration.log
  4. sudo journalctl -u node-problem-detector > node-problem-detector.log
  5. sudo journalctl -u docker > docker.log
  6. sudo journalctl -u cloud-init > cloud-init.log
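To grab all six logs in one go, I used roughly this sketch (NODE_NAME and ZONE are placeholders; the loop just wraps the commands listed above):

    # SSH to an affected node
    gcloud compute ssh NODE_NAME --zone ZONE

    # Then, on the node itself, dump each unit's journal to a file
    for unit in kubelet kube-node-installation kube-node-configuration \
                node-problem-detector docker cloud-init; do
        sudo journalctl -u "$unit" > "$unit.log"
    done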

Any node that starts running 1.13.12-gke.13 fails to connect to the master. Anything else happening to the nodes (e.g. recreation) is because auto-repair keeps trying to fix them in a repair loop, and that doesn't seem to be causing additional problems.
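Since the suspicion is around the kubelet's CSR, a quick sanity check from a working kubectl context (assuming the cluster API is still reachable) is to look at the pending certificate requests and at which nodes have registered:

    # Pending/denied certificate signing requests from kubelets
    kubectl get csr

    # Which nodes have registered, and on which version
    kubectl get nodes -o wide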

-- Chronograph3r
devops
google-cloud-platform
google-iam
google-kubernetes-engine
kubernetes

1 Answer

3/2/2020

This isn't exactly a solution, but it is a working fix. We were able to narrow it down to this.

On the node pools we had "node-restriction" labels specifying what type of nodes they should be.

Google Support also noted that it is currently not possible to update the labels of an existing node pool once it has begun an upgrade, so they suggested creating a new node pool without any of these labels. If we were able to deploy the new node pool successfully, we would then have to migrate our workloads to it.

So we removed those two node selector labels and created a new node pool, as sketched below. To our surprise, it worked. We had to migrate the whole workload, though.
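Roughly what we ran to create the replacement pool (the pool name, cluster name, zone and size are placeholders for our values; the point is that we omitted the --node-labels flag we had used before):

    # New node pool without the node-restriction labels
    gcloud container node-pools create new-pool \
        --cluster CLUSTER_NAME \
        --zone ZONE \
        --num-nodes 3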

We followed this Cloud Migration guide.
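The migration itself boils down to cordoning and then draining every node in the old pool so the workloads reschedule onto the new one. A sketch of the steps as I remember them from that guide (OLD_POOL_NAME is a placeholder):

    # Stop new pods from being scheduled on the old pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL_NAME -o=name); do
        kubectl cordon "$node"
    done

    # Evict the existing pods so they land on the new pool
    for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL_NAME -o=name); do
        kubectl drain "$node" --ignore-daemonsets --delete-local-data
    done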

-- Chronograph3r
Source: StackOverflow