GKE in-cluster DNS resolution stopped working

10/25/2019

So this has been working forever. I have a few simple services running in GKE and they refer to each other via the standard service.namespace DNS names.

Today all DNS name resolution stopped working. I haven't changed anything, although this may have been triggered by a master upgrade.

/ambassador # nslookup ambassador-monitor.default
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'ambassador-monitor.default': Try again


/ambassador # cat /etc/resolv.conf  
search default.svc.cluster.local svc.cluster.local cluster.local c.snowcloud-01.internal google.internal 
nameserver 10.207.0.10 
options ndots:5

Master version 1.14.7-gke.14

I can talk cross-service using their IP addresses, it's just DNS that's not working.

Not really sure what to do about this...

-- user2013791
dns
gke-networking
google-cloud-platform
google-kubernetes-engine
kubernetes

3 Answers

10/26/2019

Start by debugging your Kubernetes services [1]. This will tell you whether it is a k8s resource issue or Kubernetes itself is failing. Once you understand that, you can proceed to fix it. You can post the results here if you want to follow up.
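A concrete first step from that guide is to spin up a throwaway pod and test resolution directly. The pod name and image below are only an example (busybox 1.28 is commonly used because its nslookup works properly):

# Example: test DNS from a temporary pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup ambassador-monitor.default

If kubernetes.default resolves but your own service does not, the issue is likely with the Service/Endpoints; if neither resolves, kube-dns itself is the problem.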

[1] https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

-- Rodrigo Loza
Source: StackOverflow

10/26/2019

The easiest way to verify whether there is a problem with your kube-dns is to look at the Stackdriver logs [https://cloud.google.com/logging/docs/view/overview].

You should be able to find DNS resolution failures in the logs for the pods, with a filter such as the following:

resource.type="container"
("UnknownHost" OR "lookup fail" OR "gaierror")

Be sure to check the logs for each container. Because the exact names and number of containers can change with the GKE version, you can list them like so:

kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
  jsonpath='{range .items[*].spec.containers[*]}{.name}{"\n"}{end}' | sort -u

kubectl get pods -n kube-system -l k8s-app=kube-dns

Has the pod been restarted frequently? Look for OOMs in the node console. The node for each pod can be found like so:

kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
  jsonpath='{range .items[*]}{.spec.nodeName} pod={.metadata.name}{"\n"}{end}'
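To spot OOM kills without opening the node console, one option (my own sketch, not part of the original steps) is to read the restart counts and last termination reasons straight from the pod status; an OOM-killed container shows up as "OOMKilled":

# Sketch: per-container restart counts and last termination reason
kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
  jsonpath='{range .items[*].status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'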

The kube-dns pod contains four containers:

  • kube-dns watches the Kubernetes master for changes in Services and Endpoints, and maintains in-memory lookup structures to serve DNS requests;
  • dnsmasq adds DNS caching to improve performance;
  • sidecar provides a single health check endpoint while performing dual health checks (for dnsmasq and kubedns). It also collects dnsmasq metrics and exposes them in the Prometheus format;
  • prometheus-to-sd scrapes the metrics exposed by sidecar and sends them to Stackdriver.
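To confirm which of these containers are present in your GKE version, a quick check (again, just a sketch) is to print each container's name and image from one of the kube-dns pods:

# Sketch: container names and images of the first kube-dns pod
kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
  jsonpath='{range .items[0].spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'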

By default, the dnsmasq container accepts 150 concurrent requests. Requests beyond this limit are simply dropped and result in failed DNS resolution, including resolution for the metadata server. To check for this, view the logs with the following filter:

resource.type="container"
resource.labels.cluster_name="<cluster-name>"
resource.labels.namespace_id="kube-system"
logName="projects/<project-id>/logs/dnsmasq"
"Maximum number of concurrent DNS queries reached"

If legacy Stackdriver logging is disabled for the cluster, use the following filter:

resource.type="k8s_container"
resource.labels.cluster_name="<cluster-name>"
resource.labels.namespace_name="kube-system"
resource.labels.container_name="dnsmasq"
"Maximum number of concurrent DNS queries reached"

If Stackdriver logging is disabled, execute the following:

kubectl logs --tail=1000 --namespace=kube-system -l k8s-app=kube-dns -c dnsmasq | grep 'Maximum number of concurrent DNS queries reached'
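If that message does show up frequently, one common mitigation on GKE (my suggestion, not something this answer requires) is to run more kube-dns replicas by tuning the kube-dns-autoscaler ConfigMap; the values below are purely illustrative:

# Sketch: lower the ratios so the autoscaler creates more kube-dns replicas
kubectl edit configmap kube-dns-autoscaler -n kube-system
#   data:
#     linear: '{"coresPerReplica":128,"nodesPerReplica":4,"preventSinglePointFailure":true}'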

Additionally, you can run dig ambassador-monitor.default @10.207.0.10 from each node to verify whether this is only impacting one node. If it is, you can simply re-create the impacted node.
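If dig is not available on the nodes themselves (it often is not on Container-Optimized OS), an alternative sketch is to run the lookup from a pod pinned to a specific node; the node name below is a placeholder:

# Sketch: test DNS against the kube-dns service IP from one particular node
kubectl run dns-node-check --rm -it --restart=Never --image=busybox:1.28 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<node-name>"}}' -- \
  nslookup ambassador-monitor.default 10.207.0.10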

-- Ilia Borovoi
Source: StackOverflow

10/27/2019

It appears that I hit a bug that caused the gke-metadata server to start crash looping (which in turn prevented kube-dns from working).

Creating a new pool with a previous version (1.14.7-gke.10) and migrating to it fixed everything.
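For anyone hitting the same thing, the workaround looked roughly like this (cluster, zone, and pool names are placeholders, and the exact flags may differ):

# Sketch: create a pool on the older version, then drain the affected pool
gcloud container node-pools create fallback-pool \
  --cluster <cluster-name> --zone <zone> \
  --node-version 1.14.7-gke.10 --num-nodes 3

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=<old-pool-name> -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done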

I am told a fix has already been submitted.

Thank you for your suggestions.

-- user2013791
Source: StackOverflow