Azure Kubernetes: cattle-cluster-agent in CrashLoopBackOff mode

4/6/2020

We have an Azure environment where we created an AKS cluster with 3 nodes. Everything appears to have been set up successfully. Running kubectl get pods --all-namespaces gives the output below:

NAMESPACE       NAME                                   READY   STATUS             RESTARTS   AGE
cattle-system   cattle-cluster-agent-b84447cd7-m6k5h   0/1     CrashLoopBackOff   823        3d2h
cattle-system   cattle-node-agent-rpcrw                1/1     Running            1          3d2h
cattle-system   cattle-node-agent-sjllb                1/1     Running            0          3d2h
cattle-system   cattle-node-agent-v8c76                1/1     Running            1          3d2h
kube-system     azure-cni-networkmonitor-cpsqx         1/1     Running            0          14d
kube-system     azure-cni-networkmonitor-pmrv4         1/1     Running            1          14d
kube-system     azure-cni-networkmonitor-x25p7         1/1     Running            1          14d
kube-system     azure-ip-masq-agent-8cds2              1/1     Running            0          14d
kube-system     azure-ip-masq-agent-gmnmr              1/1     Running            1          14d
kube-system     azure-ip-masq-agent-mjlh5              1/1     Running            1          14d
kube-system     coredns-6c66fc4fcb-g6ssg               1/1     Running            0          14d
kube-system     coredns-6c66fc4fcb-mkzn9               1/1     Running            1          14d
kube-system     coredns-autoscaler-567dc76d66-5krrx    1/1     Running            0          14d
kube-system     kube-proxy-h9j48                       1/1     Running            1          2d20h
kube-system     kube-proxy-hfqvg                       1/1     Running            0          2d20h
kube-system     kube-proxy-wlbdx                       1/1     Running            1          2d20h
kube-system     kubernetes-dashboard-9f5bf9974-955cp   1/1     Running            0          14d
kube-system     metrics-server-5695787788-pxsl8        1/1     Running            0          14d
kube-system     tunnelfront-746dc8557f-gsw2f           1/1     Running            0          57m

As you can see, the pod "cattle-cluster-agent-b84447cd7-m6k5h" is constantly going into "CrashLoopBackOff".
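
For reference, these are the standard commands for digging into a crashing pod (pod name taken from the output above); the --previous flag returns the logs of the last crashed container, which is usually where the actual error shows up:

> kubectl -n cattle-system describe pod cattle-cluster-agent-b84447cd7-m6k5h
> kubectl -n cattle-system logs cattle-cluster-agent-b84447cd7-m6k5h --previous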

Here is what I have investigated so far:

> kubectl -n cattle-system get pods -l app=cattle-agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE                       NOMINATED NODE   READINESS GATES
cattle-node-agent-rpcrw   1/1     Running   1          2d22h   XX.XXX.XX.1    aks-agentpool-XXXX-1   <none>           <none>
cattle-node-agent-sjllb   1/1     Running   0          2d22h   XX.XXX.XX.X2   aks-agentpool-XXXX-2   <none>           <none>
cattle-node-agent-v8c76   1/1     Running   1          2d22h   XX.XXX.XX.X3   aks-agentpool-XXXX-0   <none>           <none>

and

> kubectl -n cattle-system logs -l app=cattle-cluster-agent
Error from server: Get https://aks-agentpool-XXXX-1:YYYY/containerLogs/cattle-system/cattle-cluster-agent-b84447cd7-m6k5h/cluster-register?tailLines=10: dial tcp XX.XXX.XX.1:YYYY: i/o timeout
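
Since the log request times out against the kubelet port on node 1, it may also be worth checking that node's status and addresses from the API server side (node name redacted the same way as above):

> kubectl get nodes -o wide
> kubectl describe node aks-agentpool-XXXX-1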

and

> kubectl -n kube-system get pods -l k8s-app=kube-dns -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP
NAME                       HOSTIP
coredns-6c66fc4fcb-g6ssg   XX.XXX.XX.X2
coredns-6c66fc4fcb-mkzn9   XX.XXX.XX.X3

From the last command, I can see that CoreDNS is not running on one of the worker nodes. Could this be the cause of the cluster agent going into CrashLoopBackOff? If so, how do I get CoreDNS running on worker node 1? I have exhausted all my options to get this working. Any help would be highly appreciated.
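
For what it's worth, a quick way to check whether in-cluster DNS works at all is to resolve a service name from a throwaway pod (the pod name dns-test is arbitrary):

> kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default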

-- VikramV
azure
azure-aks
azure-kubernetes
devops
kubernetes

1 Answer

4/26/2020

Upgrading the cluster to Kubernetes v1.16.7 fixed the issue.
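
For anyone hitting the same problem: the available versions and the upgrade itself can be driven from the Azure CLI; the resource group and cluster names below are placeholders:

> az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
> az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.16.7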

-- VikramV
Source: StackOverflow