intermittent pod communication issue

12/1/2019

I have an Azure Kubernetes cluster with all pods and services in the running state. The issue I have is that when I do a curl from pod1 to the service URL of pod2, it fails intermittently with an "Unable to resolve host" error.

To illustrate, I have 3 pods - pod1, pod2, pod3. When I get into pod1 using

kubectl exec -it pod1 -- sh

and I run curl using the service URL of pod2:

curl http://api-batchprocessing:3000/result

the command succeeds about 6 out of 10 times; the remaining 4 out of 10 it fails with the error "curl: (6) Could not resolve host: api-batchprocessing".

When I try calling another service running on pod3 using curl, I get the same issue.

I have tried the approaches below without any success:

- delete the coredns pods in kube-system (command shown below)
- delete and recreate the Azure Kubernetes cluster

The above seem to resolve it temporarily, but after a few tries I get the same intermittent 'could not resolve host' issue.
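For reference, the coredns pods were deleted with a label selector along these lines (from memory, so the exact flags may differ):

kubectl delete pods -n kube-system -l k8s-app=kube-dns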

Any help/pointers on this issue will be much appreciated.

-- jack
azure-kubernetes
kubernetes

1 Answer

12/2/2019

The problem may lie in the DNS configuration. It looks like coredns uses the DNS server list in a different way than kube-dns did. If you have to resolve both public and private hostnames, check that only private DNS servers are on the list, or find the right configuration to route private DNS queries to your on-premises servers.

Possible steps to find and get rid of the problem:

  1. Turn coredns logs on.

The only thing you need is this YAML file:

apiVersion: v1
data:
  log.override: |
    log
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: coredns-custom
  namespace: kube-system
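Assuming you save this as a file, e.g. coredns-custom.yaml (the filename is arbitrary), you can apply it with:

kubectl apply -f coredns-custom.yaml

Restarting the coredns pods in kube-system afterwards, as you already tried by deleting them, forces them to pick up this ConfigMap.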
  2. Use kubectl logs or the VSCode Kubernetes extension to open and review your coredns logs.
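For example, to pull the recent log lines from all coredns pods (they carry the k8s-app=kube-dns label, as in the ConfigMap above):

kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100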

  3. Attach to one of your environment pods and execute some DNS resolution actions, including nslookup and curl. Some of the executions can be loop queries to put pressure on the DNS and networking components; a sketch is shown below.
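A minimal loop for that could look like this (run from a shell inside the pod; it assumes nslookup and curl are available in the image, and reuses the api-batchprocessing service from your question):

for i in $(seq 1 100); do
  nslookup api-batchprocessing > /dev/null || echo "attempt $i: nslookup failed"
  curl -s -o /dev/null http://api-batchprocessing:3000/result || echo "attempt $i: curl failed"
done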

  4. Review the coredns logs.

As you will see, curl tries to resolve the name by querying both A and AAAA records, for all the search domains defined in your pods. In other words, to resolve "api-batchprocessing", curl makes several DNS queries against coredns. However, coredns responds properly with either "NXDOMAIN" (record does not exist) or "NOERROR" (record found). So the problem is elsewhere.
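The search domains that drive those extra A/AAAA queries come from the pod's resolver configuration, which you can inspect with, for example:

kubectl exec -it pod1 -- cat /etc/resolv.conf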

  5. Configure the DNS servers at the VNET level.

A possible explanation for this random DNS resolution error is that, under high-load scenarios, coredns uses all the DNS servers defined at the VNET level. Some queries were probably going to the on-prem servers, and others to Google, which doesn't know how to resolve your private hostnames.

  6. Remove the Google DNS servers from the VNET, restart the cluster and run the checks again.
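Changing the DNS servers on the VNET can be done with the Azure CLI; for example (the resource group, VNET name and DNS server IPs below are placeholders for your own values):

az network vnet update \
  --resource-group myResourceGroup \
  --name myVnet \
  --dns-servers 10.0.0.10 10.0.0.11

Note that VMs already connected to the VNET only pick up the new DNS servers after they are restarted, which is why the cluster restart mentioned above is needed.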

Here you can find more information: random-dns-error.

I hope it helps.

-- MaggieO
Source: StackOverflow