I have an Azure Kubernetes Service (AKS) cluster with all pods and services in the Running state. The issue is that when I curl from pod1 to the service URL of pod2, it fails intermittently with an "Unable to resolve host" error.
To illustrate, I have 3 pods: pod1, pod2, pod3. I get into pod1 using
kubectl exec -it pod1 -- /bin/sh
and run curl against the service URL of pod2. The command succeeds about 6 times out of 10; the remaining 4 it fails with:
curl: (6) Could not resolve host: api-batchprocessing
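A quick way to quantify the failure rate is to loop the same request from inside pod1. This is a sketch; the service name comes from the error above, and the request count is my choice:

```shell
# Run inside pod1: repeat the request and count how often it fails.
# "api-batchprocessing" is the service name from the error message.
fail=0
total=20
for i in $(seq 1 "$total"); do
  curl -s -o /dev/null --max-time 2 http://api-batchprocessing || fail=$((fail + 1))
done
echo "failed $fail/$total requests"
```

With the intermittent behavior described, this typically reports a failure count somewhere in the middle rather than 0 or 20.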
When I try calling another service, running on pod3, with curl, I get the same issue.
I have tried the approaches below without any success:
- deleting the coredns pods in kube-system
- deleting and recreating the Azure Kubernetes cluster
Both seem to resolve it temporarily, but after a few tries I get the same intermittent "could not resolve host" issue again.
Any help/pointers on this issue will be much appreciated.
The problem may lie in the DNS configuration. It looks like coredns uses the DNS server list in a different way than kube-dns did. If you have to resolve both public and private hostnames, check that only private DNS servers are on the list, or find the right configuration to route private DNS queries to your on-premises servers.
Possible steps to find and get rid of the problem:
To enable query logging in coredns, the only thing you need is this YAML file:
apiVersion: v1
data:
  log.override: |
    log
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: coredns-custom
  namespace: kube-system
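Assuming the ConfigMap above is saved as coredns-custom.yaml (the filename is my choice), apply it and restart the coredns pods so they pick up the log override:

```shell
# Apply the custom ConfigMap with the log override.
kubectl apply -f coredns-custom.yaml
# coredns only reads coredns-custom at startup, so force a restart;
# on AKS the coredns pods carry the k8s-app=kube-dns label.
kubectl -n kube-system delete pod -l k8s-app=kube-dns
```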
Use kubectl logs or the VS Code Kubernetes extension to open and review your coredns logs.
Attach to one of your pods and execute some DNS resolution actions, including nslookup and curl. Run some of them as looped queries to put pressure on the DNS and networking components.
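The pressure loop can look like this, run from inside one of the application pods (the service name is taken from the question; the iteration count is arbitrary):

```shell
# Fire a burst of lookups to surface the intermittent failures.
fails=0
for i in $(seq 1 50); do
  nslookup api-batchprocessing > /dev/null 2>&1 || fails=$((fails + 1))
done
echo "$fails/50 lookups failed"
```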
Review the coredns logs.
You will see that curl tries to resolve the name by querying both A and AAAA records, for each of the search domains defined in your pods. In other words, to resolve "api-batchprocessing", curl makes several DNS queries against coredns. However, coredns responds properly with either "NXDOMAIN" (record does not exist) or "NOERROR" (record found). So the problem is elsewhere.
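You can see why a single short name produces several queries by looking at the pod's resolver configuration. The search list shown in the comment is typical for a pod in the default namespace, not taken from the question:

```shell
# Inside the pod: the 'search' line (e.g. default.svc.cluster.local
# svc.cluster.local cluster.local) makes the resolver try
# api-batchprocessing.default.svc.cluster.local, then
# api-batchprocessing.svc.cluster.local, and so on,
# each as both an A and an AAAA query.
cat /etc/resolv.conf
```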
A possible explanation for this random DNS resolution error is that, under high load, coredns spreads queries across all the DNS servers defined at the VNET level. Some queries probably go to the on-prem servers; others go to Google, and Google doesn't know how to resolve your private hostnames.
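To check which DNS servers are configured at the VNET level, you can query the VNET's DHCP options with the Azure CLI (resource group and VNET names below are placeholders):

```shell
# Lists the custom DNS servers configured on the VNET;
# an empty result means the VNET uses Azure-provided DNS.
az network vnet show \
  --resource-group <my-resource-group> \
  --name <my-vnet> \
  --query dhcpOptions.dnsServers
```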
Here you can find more information: random-dns-error.
I hope it helps.