How can I root-cause DNS lookup failures in a (local) Kubernetes instance (where SkyDNS is healthy)?

6/15/2017

It appears that local-up-cluster in Kubernetes, on Ubuntu, isn't able to resolve DNS queries when relying on cluster DNS.

setup

I'm running an Ubuntu box, with the environment variables for DNS set in local-up-cluster:

# env  | grep KUBE
KUBE_ENABLE_CLUSTER_DNS=true
KUBE_DNS_SERVER_IP=172.17.0.1

running information

SkyDNS seems happy:

I0615 00:04:13.563037 1 server.go:198] Skydns metrics enabled (/metrics:10055)
I0615 00:04:13.563051 1 dns.go:147] Starting endpointsController
I0615 00:04:13.563054 1 dns.go:150] Starting serviceController
I0615 00:04:13.563125 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.563141 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0615 00:04:13.589840 1 dns.go:264] New service: kubernetes
I0615 00:04:13.589971 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0615 00:04:14.063246 1 dns.go:171] Initialized services and endpoints from apiserver
I0615 00:04:14.063267 1 server.go:129] Setting up Healthz Handler (/readiness)
I0615 00:04:14.063274 1 server.go:134] Setting up cache handler (/cache)
I0615 00:04:14.063288 1 server.go:120] Status HTTP port 8081

kube-proxy seems happy:

I0615 00:03:53.448369 5706 proxier.go:864] Setting endpoints for "default/kubernetes:https" to [172.31.44.133:6443]
I0615 00:03:53.545124 5706 controller_utils.go:1001] Caches are synced for service config controller
I0615 00:03:53.545146 5706 config.go:210] Calling handler.OnServiceSynced()
I0615 00:03:53.545208 5706 proxier.go:979] Not syncing iptables until Services and Endpoints have been received from master
I0615 00:03:53.545125 5706 controller_utils.go:1001] Caches are synced for endpoints config controller
I0615 00:03:53.545224 5706 config.go:110] Calling handler.OnEndpointsSynced()
I0615 00:03:53.545274 5706 proxier.go:309] Adding new service port "default/kubernetes:https" at 10.0.0.1:443/TCP
I0615 00:03:53.545329 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:53.993514 5706 proxier.go:991] Syncing iptables rules
I0615 00:03:54.008738 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s
I0615 00:04:24.008904 5706 proxier.go:991] Syncing iptables rules
I0615 00:04:24.023057 5706 bounded_frequency_runner.go:221] sync-runner: ran, next possible in 0s, periodic in 30s

result

However, I don't seem to be able to resolve anything inside the cluster; I get the same result with docker exec or kubectl exec:

➜  kubernetes git:(master) kc exec --namespace=kube-system kube-dns-2673147055-4j6wm -- nslookup kubernetes.default.svc.cluster.local
Defaulting container name to kubedns.
Use 'kubectl describe pod/kube-dns-2673147055-4j6wm' to see all of the containers in this pod.
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'kubernetes.default.svc.cluster.local': Name does not resolve

question

What's the simplest way to further debug a system created using local-up-cluster where the DNS pods are running, but kubernetes.default.svc.cluster.local is not resolved? Note that all other aspects of this cluster appear to be working perfectly.

System info : Linux ip-172-31-44-133 4.4.0-1018-aws #27-Ubuntu SMP Fri May 19 17:20:58 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Here's an example of the resolv.conf that is being placed in my containers:

/etc/cfssl # cat /etc/resolv.conf
nameserver 172.17.0.1
search default.svc.cluster.local svc.cluster.local cluster.local dc1.lan
options ndots:5  
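For reference, this is the kind of direct query I can run against that nameserver from the host as a sanity check (assuming dig is installed from dnsutils/bind-utils):

# query the nameserver from resolv.conf directly, bypassing any container resolver
dig @172.17.0.1 kubernetes.default.svc.cluster.local A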
-- jayunit100
dns
kubernetes

2 Answers

6/16/2017

I figured I'd post a systematic answer that usually works for me. I was hoping for something more elegant, and this isn't ideal, but I think it's the best place to start.

1) Make sure your DNS nanny and your SkyDNS containers are running. The nanny and SkyDNS should both show in their Docker logs that they've bound to a port.

2) When you create new services, make sure that SkyDNS is writing them to its logs and showing the creation of SRV records and so on.

3) Look in /etc/resolv.conf in your Docker containers. Make sure the nameserver looks like something on your internal Docker IP addresses (e.g. 10.... in a regular docker0 config on Fedora). A rough sketch of these checks follows below.
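Something like this, assuming a Docker-based local-up-cluster; the container names/IDs are placeholders, and the grep patterns are taken from the SkyDNS log lines shown in the question:

# 1) the DNS nanny (dnsmasq) and SkyDNS (kubedns) containers should be up,
#    and their logs should show them binding and becoming ready
docker ps | grep -E 'kubedns|dnsmasq'
docker logs <skydns-container-id> 2>&1 | grep -i 'ready for queries'

# 2) after creating a service, SkyDNS should log the new service / SRV record
docker logs <skydns-container-id> 2>&1 | grep -E 'New service|Added SRV'

# 3) check what resolv.conf the containers actually get
docker exec <some-app-container-id> cat /etc/resolv.conf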

There are specific environment variables you need to export correctly: API_HOST=true and KUBE_ENABLE_CLUSTER_DNS=true.
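Exporting them before launching the cluster might look like this (hack/local-up-cluster.sh is the script path in the Kubernetes source tree; the KUBE_DNS_SERVER_IP value is taken from the question's setup, and the API_HOST value is as stated above):

# run from a checkout of the kubernetes source tree
export KUBE_ENABLE_CLUSTER_DNS=true
export KUBE_DNS_SERVER_IP=172.17.0.1   # value from the question's setup
export API_HOST=true                   # value as stated in this answer
./hack/local-up-cluster.sh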

There are a lot of deeper tools you can use, like route -n and so on, to debug container networking even further, but local-up-cluster should generally 'just work', and if the above steps surface something suspicious, it's worth mentioning in the Kubernetes community as a possible bug.
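A sketch of that deeper poking, assuming the container image ships the usual busybox/net-tools binaries and that kube-proxy is running in iptables mode:

# inspect the container's routing table and confirm it can reach the DNS server IP
docker exec <some-app-container-id> route -n
docker exec <some-app-container-id> ping -c 1 172.17.0.1

# on the host, check that kube-proxy installed NAT rules for the DNS service
sudo iptables -t nat -L KUBE-SERVICES -n | grep -i dns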

-- jayunit100
Source: StackOverflow

6/15/2017

I can't comment on your post, so I'll attempt to answer this.

First of all, certain Alpine-based images have trouble resolving names with nslookup. DNS might in fact be working normally in your cluster.

To validate this, read the logs of the pods (e.g. traefik, heapster, calico) that communicate with kube-apiserver. If no errors are observed, what you have is probably a non-problem.
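For instance, something like this (the pod name and grep patterns are just placeholders for whatever add-ons happen to be running in your cluster):

# scan an add-on pod's logs for DNS/connection errors against the apiserver
kubectl logs -n kube-system <heapster-pod-name> | grep -iE 'no such host|lookup|timeout'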

If you want to be doubly sure, deploy a non-Alpine pod and try nslookup.
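A quick way to do that (the image choice here is an assumption; any Debian/Ubuntu-based image with dnsutils installed will do):

# run a one-off Debian-based pod that ships nslookup/dig, then query cluster DNS from it
kubectl run dns-test --image=tutum/dnsutils --restart=Never -- sleep 3600
kubectl exec dns-test -- nslookup kubernetes.default.svc.cluster.local

# clean up afterwards
kubectl delete pod dns-test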

If it really is a DNS issue, I would debug in this sequence (a sketch of the commands follows the list).

  1. kubectl exec into the kube-dns pod. Run nslookup kubernetes.default.svc.cluster.local localhost. If this works, DNS is in fact running. If it doesn't, kube-dns should have entered a CrashLoopBackOff state by now.
  2. kubectl exec into a deployed pod. Run nslookup kubernetes.default.svc.cluster.local <cluster-ip>. If this works, you're good to go. If it doesn't, something is up with the pod network. Without details, I can't recommend further steps.
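Roughly, using the pod name from the question; the service IP in step 2 is whatever your cluster DNS service uses, which in the question's setup is KUBE_DNS_SERVER_IP=172.17.0.1:

# 1. ask the DNS server from inside the kube-dns pod itself
kubectl exec --namespace=kube-system kube-dns-2673147055-4j6wm -c kubedns -- nslookup kubernetes.default.svc.cluster.local localhost

# 2. ask the cluster DNS service IP from an ordinary deployed pod
kubectl exec <your-app-pod> -- nslookup kubernetes.default.svc.cluster.local 172.17.0.1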

Bonne chance!

-- Eugene Chow
Source: StackOverflow