I have a problem with controller-manager and scheduler not responding that is not related to the GitHub issues I've found (rancher#11496, azure#173, …).
Two days ago, one pod caused a memory overflow on one node of our 3-node HA cluster. After that, the Rancher web app was not accessible. We found the offending pod and scaled it to 0 with kubectl, but it took some time to figure everything out.
Since then the Rancher web app has been working properly, but there are continuous alerts about controller-manager and scheduler not working. The alerts are not consistent: sometimes both components are healthy, and sometimes their health check URLs refuse connections.
controller-manager Unhealthy Get dial tcp connect: connection refused
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
etcd-2 Healthy {"health": "true"}
etcd-1 Healthy {"health": "true"}
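Because the failures are intermittent, a timestamped log of each check may help correlate them with other events on the nodes. Below is a minimal sketch, assuming the status listing above comes from `kubectl get componentstatuses`. The function names `unhealthy_components` and `watch_components` are hypothetical, and the 30-second default interval is an arbitrary choice.

```shell
#!/bin/sh
# Reads "NAME STATUS MESSAGE..." lines on stdin (no header row) and prints
# the name of every component whose STATUS column is not "Healthy".
unhealthy_components() {
  awk '$2 != "Healthy" { print $1 }'
}

# Polls the component statuses every N seconds (default 30, an arbitrary
# choice) and prints one timestamped line per poll.
watch_components() {
  interval="${1:-30}"
  while true; do
    bad="$(kubectl get componentstatuses --no-headers 2>/dev/null \
      | unhealthy_components | tr '\n' ' ')"
    printf '%s unhealthy: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${bad:-none}"
    sleep "$interval"
  done
}
```

Running `watch_components 10 >> cs-watch.log` in a spare terminal would record one line every 10 seconds without touching any running containers.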
Restarting controller-manager and scheduler on the affected node hasn't been effective. Even restarting all of the components with
docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy
wasn’t effective either.
Can someone please help me figure out the steps towards troubleshooting and fixing this issue without downtime on running containers?
Nodes are hosted on DigitalOcean on servers with 4 Cores and 8GB of RAM each (Ubuntu 16, Docker 17.03.3).
Thanks in advance!
The first area to look at would be your logs... Can you export the following logs and attach them?
The controller manager is an endpoint, so you will need to do a "get endpoint". Can you run the following:
kubectl -n kube-system get endpoints kube-controller-manager
kubectl -n kube-system describe endpoints kube-controller-manager
kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
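The last command prints the leader election record as a single JSON string. To see quickly which node holds the lease and when it was last renewed, the fields can be pulled out with a small helper. This is only a sketch: `leader_field` is a hypothetical name, and it assumes the record contains string fields such as `holderIdentity` and `renewTime`, as the endpoints-based leader election record normally does.

```shell
#!/bin/sh
# Prints the value of one string field from the leader election JSON read
# on stdin. $1 is the field name, e.g. holderIdentity or renewTime.
# Only handles simple "key":"value" pairs; it is not a general JSON parser.
leader_field() {
  sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Usage against the cluster (assumes kubectl is configured):
#   kubectl -n kube-system get endpoints kube-controller-manager \
#     -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' \
#     | leader_field holderIdentity
```

If `renewTime` stops advancing, or `holderIdentity` keeps switching between nodes, that would suggest the controller manager is repeatedly losing and re-acquiring its lease. The same check should work for the scheduler, assuming it uses an endpoint named `kube-scheduler` in the same namespace.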