We have an Azure environment where we created an AKS cluster with 3 nodes. Everything appears to have been provisioned successfully. For the command kubectl get pods --all-namespaces
I get the output below:
NAMESPACE       NAME                                   READY   STATUS             RESTARTS   AGE
cattle-system   cattle-cluster-agent-b84447cd7-m6k5h   0/1     CrashLoopBackOff   823        3d2h
cattle-system   cattle-node-agent-rpcrw                1/1     Running            1          3d2h
cattle-system   cattle-node-agent-sjllb                1/1     Running            0          3d2h
cattle-system   cattle-node-agent-v8c76                1/1     Running            1          3d2h
kube-system     azure-cni-networkmonitor-cpsqx         1/1     Running            0          14d
kube-system     azure-cni-networkmonitor-pmrv4         1/1     Running            1          14d
kube-system     azure-cni-networkmonitor-x25p7         1/1     Running            1          14d
kube-system     azure-ip-masq-agent-8cds2              1/1     Running            0          14d
kube-system     azure-ip-masq-agent-gmnmr              1/1     Running            1          14d
kube-system     azure-ip-masq-agent-mjlh5              1/1     Running            1          14d
kube-system     coredns-6c66fc4fcb-g6ssg               1/1     Running            0          14d
kube-system     coredns-6c66fc4fcb-mkzn9               1/1     Running            1          14d
kube-system     coredns-autoscaler-567dc76d66-5krrx    1/1     Running            0          14d
kube-system     kube-proxy-h9j48                       1/1     Running            1          2d20h
kube-system     kube-proxy-hfqvg                       1/1     Running            0          2d20h
kube-system     kube-proxy-wlbdx                       1/1     Running            1          2d20h
kube-system     kubernetes-dashboard-9f5bf9974-955cp   1/1     Running            0          14d
kube-system     metrics-server-5695787788-pxsl8        1/1     Running            0          14d
kube-system     tunnelfront-746dc8557f-gsw2f           1/1     Running            0          57m
As you can see, the pod "cattle-cluster-agent-b84447cd7-m6k5h" is constantly in "CrashLoopBackOff".
Here is what I have investigated so far. First, the node agents themselves are healthy:
> kubectl -n cattle-system get pods -l app=cattle-agent -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE                   NOMINATED NODE   READINESS GATES
cattle-node-agent-rpcrw   1/1     Running   1          2d22h   XX.XXX.XX.1    aks-agentpool-XXXX-1   <none>           <none>
cattle-node-agent-sjllb   1/1     Running   0          2d22h   XX.XXX.XX.X2   aks-agentpool-XXXX-2   <none>           <none>
cattle-node-agent-v8c76   1/1     Running   1          2d22h   XX.XXX.XX.X3   aks-agentpool-XXXX-0   <none>           <none>
and when I try to fetch the cluster agent's logs, the request to the kubelet on node 1 times out:
> kubectl -n cattle-system logs -l app=cattle-cluster-agent
Error from server: Get https://aks-agentpool-XXXX-1:YYYY/containerLogs/cattle-system/cattle-cluster-agent-b84447cd7-m6k5h/cluster-register?tailLines=10: dial tcp XX.XXX.XX.1:YYYY: i/o timeout
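Since even live log retrieval times out, one fallback (my suggestion, not output from the cluster) is kubectl describe, which only talks to the API server and therefore still shows the pod's events and last container state even when the kubelet on node 1 is unreachable:

> kubectl -n cattle-system describe pod cattle-cluster-agent-b84447cd7-m6k5h
> kubectl get nodes -o wide

The second command also shows whether node 1 is still reported Ready and which internal IP it has.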
I also checked which nodes coredns is running on:
> kubectl -n kube-system get pods -l k8s-app=kube-dns -o custom-columns=NAME:.metadata.name,HOSTIP:.status.hostIP
NAME                       HOSTIP
coredns-6c66fc4fcb-g6ssg   XX.XXX.XX.X2
coredns-6c66fc4fcb-mkzn9   XX.XXX.XX.X3
From the last command I see that coredns is not running on one of the worker nodes. Could this be what causes the cluster agent to go into CrashLoopBackOff? If so, how do I get coredns working on worker node 1? A way to at least test DNS from that node is sketched below; beyond that, I have exhausted my options to get this working. Any help would be highly appreciated.
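Note that coredns on AKS is a Deployment (scaled by coredns-autoscaler, visible in the first listing), so it will not necessarily run on every node; what matters is whether pods on node 1 can resolve names. A minimal test sketch, assuming the masked node name from above (the dnstest pod name and busybox:1.28 image are arbitrary choices):

> kubectl run dnstest --image=busybox:1.28 --restart=Never \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"aks-agentpool-XXXX-1"}}' \
    -- nslookup kubernetes.default
> kubectl logs dnstest
> kubectl delete pod dnstest

If the lookup succeeds, DNS itself is reachable from node 1 and the i/o timeout above points more toward node-level connectivity.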
Upgrading the cluster to Kubernetes v1.16.7 fixed the issue.
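For reference, the upgrade itself can be done with the Azure CLI; a minimal sketch, assuming placeholder resource group and cluster names:

> az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
> az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.16.7

az aks get-upgrades lists the versions the cluster can move to; the upgrade then cycles the nodes, which also recreates pods such as the crashing cluster agent.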