All kubectl commands (e.g. kubectl get pods, kubectl proxy, etc.) are failing with an error stating they can't connect to the Kubernetes API server (api.services.ourdomainname.com).
What might have caused it:
We were trying to add one more node to the cluster to increase capacity. For that, we ran the following commands...
$ kops edit ig --name=ppe.services.ourdomainname.com nodes
$ kops upgrade cluster --name ppe.services.ourdomainname.com --yes
$ kops update cluster ppe.services.ourdomainname.com --yes
$ kops rolling-update cluster --yes
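For context, the edit in the first step only bumped the node count in the instance group spec, along these lines (the machine type, counts, and subnet shown here are illustrative, not our exact values):
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  machineType: t2.medium
  minSize: 4   # was 3; raised by one to add capacity
  maxSize: 4   # was 3
  role: Node
  subnets:
  - us-west-2a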
The issue happened after I ran the rolling update. Essentially, the rolling update failed while updating the master node:
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: etcd-server-events-ip-xx-xx-60-141.us-west-2.compute.internal, etcd-server-ip-xx-xx-60-141.us-west-2.compute.internal, kube-apiserver-ip-xx-xx-60-141.us-west-2.compute.internal, kube-controller-manager-ip-xx-xx-60-141.us-west-2.compute.internal, kube-proxy-ip-xx-xx-60-141.us-west-2.compute.internal, kube-scheduler-ip-xx-xx-60-141.us-west-2.compute.internal
pod "dns-controller-xxxx03014-fq2sj" evicted
pod "masked-tapir-aws-cluster-autoscaler-xxxx6cf8f-fpcqq" evicted
pod "kubernetes-dashboard-3313488171-t578p" evicted
node "ip-xx-xx-60-141.us-west-2.compute.internal" drained
I0501 17:30:23.679575 31176 instancegroups.go:237] Stopping instance "i-024deccc522cc2bf7", node "ip-xxx-xx-60-141.us-west-2.compute.internal", in group "master-us-west-2a.masters.ppe.services.ourdomainname.com".
I0501 17:35:24.345270 31176 instancegroups.go:161] Validating the cluster.
I0501 17:35:54.345805 31176 instancegroups.go:209] Cluster did not validate, will try again in "30s" util duration "5m0s" expires: cannot get nodes for "ppe.services.ourdomainname.com": Get https://api.ppe.services.ourdomainname.com/api/v1/nodes: dial tcp xx.xx.147.151:443: i/o timeout. ...
error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"
After this, kubectl stopped working. Based on some digging, we then ran:
$ kops rolling-update cluster --yes --cloudonly
This removed the old EC2 instances and added new ones, but it didn't fix the issue; it made things worse. Previously our apps were still able to reach our servers, but after running this command even they can't. Essentially it broke the nginx entry point, and the AWS ELB started returning 500s stating it can't connect. Because of this our live services are down! :-(
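At this point about the only troubleshooting we can do is from outside the cluster, with checks along these lines (the load balancer name is a placeholder for the ELB in front of our nginx entry point):
$ dig +short api.ppe.services.ourdomainname.com                # does the API DNS record still resolve, and to what IP?
$ curl -vk https://api.ppe.services.ourdomainname.com/healthz  # is anything answering on the API endpoint at all?
$ aws elb describe-instance-health --load-balancer-name <nginx-entrypoint-elb>   # does the ELB see any healthy backends?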
Any thoughts on how to fix the Kubernetes cluster? Any way to find out why the k8s API server is not reachable? What can we do to get this connectivity back? Many thanks for your help.
Sharing what we learned about the issues and what we did to get out of them...
It looks like the dockerproject repository was taken down in March, which caused the API server to fail to start because Kubernetes was trying to download certain dependencies from there. We also didn't have the SSH key needed to get into these boxes, which made debugging more complex. On top of that, the master was a t2.medium instance, which caused issues because it kept running out of CPU credits.
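For anyone hitting the same symptoms: since we couldn't get onto the box at first, one way to confirm the CPU credit problem from outside is CloudWatch, roughly like this (the instance ID and time window are placeholders):
$ aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=<master-instance-id> \
    --start-time <start> --end-time <end> \
    --period 300 --statistics Minimum
A minimum CPUCreditBalance sitting at or near zero over the outage window matches the master being too starved to bring the API server up.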
What we did: