Kops rolling-update fails with "Cluster did not pass validation" for master node

8/9/2019

For some reason my master node can no longer connect to my cluster after upgrading from Kubernetes 1.11.9 to 1.12.9 via kops (version 1.13.0). In the manifest I'm changing kubernetesVersion from 1.11.9 -> 1.12.9, and this is the only change I'm making. However, when I run kops rolling-update cluster --yes I get the following error:

Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-01234567" has not yet joined cluster.
Cluster did not validate within 5m0s

After that if I run a kubectl get nodes I no longer see that master node in my cluster.

Doing a little bit of debugging, I SSHed into the disconnected master node instance and found the following error in my api-server log by running sudo cat /var/log/kube-apiserver.log:

controller.go:135] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: connect: connection refused

I suspect the issue might be related to etcd, because when I run sudo netstat -nap | grep LISTEN | grep etcd there is no output.
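For reference, on a standard kops master of this vintage (where etcd runs as Docker containers started by protokube) the quick checks look roughly like this; the Docker assumption is mine, not confirmed for every image:

sudo cat /var/log/kube-apiserver.log            # api-server log with the etcd connection-refused error above
sudo netstat -nap | grep LISTEN | grep etcd     # nothing listening on 4001 (etcd) or 4002 (etcd-events)
sudo docker ps | grep etcd                      # are the etcd containers running at all?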

Anyone have any idea how I can get my master node back in the cluster or have advice on things to try?

-- dredbound
amazon-web-services
etcd
kops
kubernetes
linux

1 Answer

8/16/2019

I have done some research and have a few ideas for you:

  1. If there is no output for the etcd grep, it means that your etcd server is down. Check the logs of the 'Exited' etcd container (docker ps -a | grep Exited | grep etcd, then docker logs <etcd-container-id>); see the sketch after this list.

  2. Try these instructions I found (a shell sketch of the whole sequence follows this list):

1 - I removed the old master from the etcd cluster using etcdctl. You will need to connect to the etcd-server container to do this.

2 - On the new master node I stopped kubelet and protokube services.

3 - Empty the etcd data directories (data and data-events).

4 - Edit /etc/kubernetes/manifests/etcd.manifest and etcd-events.manifest, changing ETCD_INITIAL_CLUSTER_STATE from new to existing.

5 - Get the name and PeerURLs from the new master and use etcdctl to add the new master to the cluster (etcdctl member add "name" "PeerURL"). You will need to connect to the etcd-server container to do this.

6 - Start kubelet and protokube services on the new master.

  3. If that is not the case, then you might have a problem with the certs. They are provisioned during the creation of the cluster and some of them include the allowed master endpoints. If that is the case, you'd need to create new certs and roll them out for the api server/etcd clusters; a quick way to inspect a cert's allowed endpoints is sketched below.
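For point 1, a minimal sketch of finding and inspecting the dead etcd container. This assumes a kops-provisioned master where etcd and etcd-events run as Docker containers managed by protokube; container names and log paths may differ in your setup:

# list all containers, including exited ones, and look for etcd
sudo docker ps -a | grep Exited | grep etcd

# print the logs of the exited container to see why it died
sudo docker logs <etcd-container-id>

# on kops masters the etcd manifests usually also redirect output to log files
sudo cat /var/log/etcd.log
sudo cat /var/log/etcd-events.log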
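For point 2, a rough shell walkthrough of those six steps. It assumes the legacy protokube-managed etcd setup (etcd on 127.0.0.1:4001, etcd-events on 127.0.0.1:4002), etcdctl v2 inside the etcd-server container, and placeholder names/paths you would need to replace; treat it as a sketch rather than an exact recipe:

# 1 - from inside a healthy etcd-server container, drop the old master's member entry
sudo docker exec -it <etcd-server-container-id> /bin/sh
etcdctl --endpoints http://127.0.0.1:4001 member list
etcdctl --endpoints http://127.0.0.1:4001 member remove <old-member-id>
exit

# 2 - on the new master, stop kubelet and protokube so nothing restarts etcd behind your back
sudo systemctl stop kubelet
sudo systemctl stop protokube

# 3 - empty the etcd data directories (on kops these live on the mounted master volumes)
sudo rm -rf <etcd-data-dir>/*
sudo rm -rf <etcd-events-data-dir>/*

# 4 - change ETCD_INITIAL_CLUSTER_STATE from "new" to "existing" in both manifests
sudo vi /etc/kubernetes/manifests/etcd.manifest
sudo vi /etc/kubernetes/manifests/etcd-events.manifest

# 5 - register the new master as a member (run inside the etcd-server container again);
#     the name and peer URL must match what is in the new master's manifest
etcdctl --endpoints http://127.0.0.1:4001 member add <new-master-name> <new-master-peer-url>

# 6 - start the services again so the new etcd joins as part of an existing cluster
sudo systemctl start protokube
sudo systemctl start kubelet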
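For point 3, one quick way to see which endpoints a certificate actually allows is to print its subject alternative names with openssl. The path below assumes the usual kops location under /srv/kubernetes; adjust it for your cluster:

# inspect the SANs of the api-server certificate
sudo openssl x509 -in /srv/kubernetes/server.cert -noout -text | grep -A1 'Subject Alternative Name'

# repeat for any other cert you suspect (etcd peer/client certs, etc.)
sudo openssl x509 -in <path-to-cert> -noout -text | grep -A1 'Subject Alternative Name'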

Please let me know if that helped.

-- OhHiMark
Source: StackOverflow