Kubectl is not able to reach the Kubernetes API; k8s deployment is unreachable

5/2/2020

All kubectl commands (e.g. kubectl get pods, kubectl proxy, etc.) are failing with an error stating they can't connect to the Kubernetes API server (api.services.ourdomainname.com).

What might have caused it:

We were trying to add one more node to the cluster to increase capacity. To do that, we ran the following commands (a sketch of the edit and a dry-run preview follows the list)...

$ kops edit ig --name=ppe.services.ourdomainname.com nodes

$ kops upgrade cluster --name ppe.services.ourdomainname.com --yes 

$ kops update cluster ppe.services.ourdomainname.com --yes 

$ kops rolling-update cluster --yes
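
For context, the capacity change itself is made in the editor that kops edit ig opens. A minimal sketch of the kind of edit involved and how the result could have been previewed before applying it (the size numbers below are illustrative, not our actual values):

# In the editor opened by `kops edit ig nodes`, adding capacity usually means
# raising spec.minSize / spec.maxSize on the "nodes" InstanceGroup, e.g.
#   spec:
#     minSize: 3   # was 2 (illustrative numbers)
#     maxSize: 3   # was 2
$ kops get ig nodes --name ppe.services.ourdomainname.com -o yaml    # inspect the saved spec
$ kops update cluster ppe.services.ourdomainname.com                 # without --yes: dry run, shows pending changes
$ kops rolling-update cluster --name ppe.services.ourdomainname.com  # without --yes: previews which nodes would be replaced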

The issue happened after I ran the rolling update. Essentially the rolling update failed while updating the master node:

 WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: etcd-server-events-ip-xx-xx-60-141.us-west-2.compute.internal, etcd-server-ip-xx-xx-60-141.us-west-2.compute.internal, kube-apiserver-ip-xx-xx-60-141.us-west-2.compute.internal, kube-controller-manager-ip-xx-xx-60-141.us-west-2.compute.internal, kube-proxy-ip-xx-xx-60-141.us-west-2.compute.internal, kube-scheduler-ip-xx-xx-60-141.us-west-2.compute.internal

 pod "dns-controller-xxxx03014-fq2sj" evicted

 pod "masked-tapir-aws-cluster-autoscaler-xxxx6cf8f-fpcqq" evicted

 pod "kubernetes-dashboard-3313488171-t578p" evicted

 node "ip-xx-xx-60-141.us-west-2.compute.internal" drained


 I0501 17:30:23.679575   31176 instancegroups.go:237] Stopping instance "i-024deccc522cc2bf7", node "ip-xxx-xx-60-141.us-west-2.compute.internal", in group "master-us-west-2a.masters.ppe.services.ourdomainname.com". 

 I0501 17:35:24.345270   31176 instancegroups.go:161] Validating the cluster.

 I0501 17:35:54.345805   31176 instancegroups.go:209] Cluster did not validate, will try again in "30s" util duration "5m0s" expires: cannot get nodes for "ppe.services.ourdomainname.com": Get https://api.ppe.services.ourdomainname.com/api/v1/nodes: dial tcp xx.xx.147.151:443: i/o timeout. ... 

 error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"

After this, kubectl stopped working. Based on some digging, we then ran kops rolling-update cluster --yes --cloudonly. This removed the old EC2 nodes and added new ones, but it didn't fix the issue; it made things worse. Previously our apps could still reach our servers, but after this command even our apps can't reach them. Essentially it broke the nginx entry point, and the AWS ELB now returns 500s saying it can't connect to the backend. Because of this, our live services are down! :-(

Any thoughts on how to fix the Kubernetes cluster? Is there any way to find out why the k8s API server is not reachable? What can we do to get this connectivity back?
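
In case it helps with diagnosis, this is roughly what we have available to check where the endpoint breaks (a sketch; the hostname is the API endpoint from the error output above):

$ dig +short api.ppe.services.ourdomainname.com                                   # does the API DNS name still resolve, and to which address?
$ curl -k --connect-timeout 5 https://api.ppe.services.ourdomainname.com/healthz  # is anything answering on 443?
$ kops validate cluster --name ppe.services.ourdomainname.com                     # what does kops itself report as broken?

Many thanks for your help.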

-- mi10
amazon-web-services
kubernetes

1 Answer

5/5/2020

Sharing what we learned about the issues and what we did to get out of them...

It looks like the dockerproject package repository was taken down in March, which caused the API server to fail to start because Kubernetes was trying to download certain dependencies from there. We also didn't have an SSH key to get into these boxes, which made debugging more complex. On top of that, the master was a t2.medium, which caused problems because it kept running out of CPU credits.
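
For anyone debugging something similar: once SSH access works, the failure can be confirmed directly on the master. A minimal sketch of the checks involved (the admin login and log path below are what kops' default Debian images use and may differ on other AMIs; the IP is a placeholder):

$ ssh admin@<master-public-ip>                          # placeholder address; 'admin' is the default user on kops Debian images
$ sudo docker ps -a | grep kube-apiserver               # is the apiserver container running, or repeatedly restarting?
$ sudo tail -n 100 /var/log/kube-apiserver.log          # kops writes control-plane component logs under /var/log
$ sudo journalctl -u kubelet --no-pager | tail -n 100   # kubelet service logs (kops runs the kubelet via systemd)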

What we did:

  • added a new SSH key to the kops secrets and updated the cluster so that the key gets associated with our EC2 nodes and we can SSH into them (see the sketch after this list)
  • upgraded the master and the nodes to m5.large and m5a.large instance types
  • updated Kubernetes (1.16.8), kubectl (1.18.2), kops (1.16.1), Helm, etc. to much newer versions instead of the older ones, and moved the mongo replica set dependencies to a newer version as well (3.15.0)
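
A minimal sketch of the SSH-key and instance-type changes, assuming the cluster and instance-group names from the question (the key path is just an example, and "admin" is kops' conventional name for the SSH public key secret):

$ kops create secret --name ppe.services.ourdomainname.com sshpublickey admin -i ~/.ssh/id_rsa.pub
$ kops edit ig master-us-west-2a --name ppe.services.ourdomainname.com   # set spec.machineType: m5.large
$ kops edit ig nodes --name ppe.services.ourdomainname.com               # set spec.machineType: m5a.large
$ kops update cluster ppe.services.ourdomainname.com --yes
$ kops rolling-update cluster --name ppe.services.ourdomainname.com --yes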
-- mi10
Source: StackOverflow