I'm facing a strange issue with my EKS cluster.
At random intervals, DNS requests time out for various pods in my cluster. Sometimes my pods cannot access RDS instances because the lookup fails:
dial TCP: lookup myapp.zzzz.eu-west-1.rds.amazonaws.com on 172.20.0.10:53: no such host
And sometimes I can't even resolve GitHub URLs.
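To put a number on how often the lookups fail, a small probe run from inside an affected pod can help. This is only a minimal sketch: the `probe` function and its parameters are hypothetical names of my own, and it assumes Python is available in the pod image. It goes through the system resolver, so it follows the same resolv.conf as the application.

```python
import socket
import time


def probe(host, attempts, resolve=socket.getaddrinfo, delay=0.0):
    """Resolve `host` repeatedly; return (attempt, error, seconds) per failure."""
    failures = []
    for attempt in range(attempts):
        started = time.monotonic()
        try:
            # getaddrinfo uses the system resolver, so it honours the pod's
            # nameserver, search list and ndots settings.
            resolve(host, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((attempt, str(exc), time.monotonic() - started))
        if delay:
            time.sleep(delay)
    return failures
```

Calling something like `probe("github.com", attempts=100, delay=0.1)` and printing the returned list shows which attempts failed and how long each failure took, which makes it easier to tell slow timeouts apart from immediate negative answers.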
I saw that there was a race condition issue some time ago (https://github.com/awslabs/amazon-eks-ami/issues/357), but it got fixed at some point. The resolv.conf file in one of my pods looks like this:
nameserver 172.20.0.10
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
options ndots:5
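For context on what this resolv.conf does to a lookup, here is a rough sketch of the order in which names get tried. It assumes glibc-style behaviour (other resolvers, musl for example, handle the search list differently), and `candidate_queries` is a hypothetical helper, not part of any library.

```python
def candidate_queries(name, search, ndots):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, try the name as-is last.
    return expanded + [absolute]


# Search list and ndots value taken from the resolv.conf above.
search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]

for query in candidate_queries("github.com", search, ndots=5):
    print(query)
```

With ndots:5, a short name like github.com goes through all four search domains before being tried as-is, so one lookup can turn into several queries against 172.20.0.10. The RDS hostname has five dots, so under this assumption it is tried as-is first.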
I'm using Calico as the CNI with the default configuration, and CoreDNS is on its defaults as well. I don't see any timeouts or errors in my CoreDNS logs.
EKS version: 1.21
AMI: amazon-eks-node-1.21-v20210813
Could you point me in the right direction? I don't really know where to look at the moment.
It turned out to be a Calico bug. I created a ticket for it: https://github.com/projectcalico/calico/issues/4866. The "solution" is to downgrade to v3.19.1.