DNS resolution problem in Kubernetes cluster

12/14/2019

We have a Kubernetes cluster consisting of four worker nodes and one master node. On worker1 and worker2 we can't resolve DNS names, but on the other two nodes everything is fine. I followed the instructions in the official documentation here and realized that queries from worker1 and worker2 are not received by the coredns pods.
To repeat: everything is fine on worker3 and worker4; the problem is only with worker1 and worker2. For example, when I run a busybox container on worker1 and do nslookup kubernetes.default, it doesn't return anything, but when it runs on worker3 DNS resolution works fine.
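
For reference, the test pod can be pinned to a specific node roughly like this (a minimal sketch; the node name worker1 and the busybox image tag are only examples):

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  nodeName: worker1            # example node name; schedules the pod directly onto the node under test
  restartPolicy: Never
  containers:
  - name: busybox
    image: busybox:1.28        # image tag is an example
    command: ["sleep", "3600"]

$ kubectl apply -f busybox-worker1.yaml
$ kubectl exec -ti busybox -- nslookup kubernetes.default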

Cluster information:

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:43:08Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
coredns-576cbf47c7-6dtrc                1/1     Running   5          82d
coredns-576cbf47c7-jvx5l                1/1     Running   6          82d
etcd-master                             1/1     Running   35         298d
kube-apiserver-master                   1/1     Running   14         135m
kube-controller-manager-master          1/1     Running   42         298d
kube-proxy-22f49                        1/1     Running   9          91d
kube-proxy-2s9sx                        1/1     Running   34         298d
kube-proxy-jh2m7                        1/1     Running   5          81d
kube-proxy-rc5r8                        1/1     Running   5          63d
kube-proxy-vg8jd                        1/1     Running   6          104d
kube-scheduler-master                   1/1     Running   39         298d
kubernetes-dashboard-65c76f6c97-7cwwp   1/1     Running   45         293d
tiller-deploy-779784fbd6-dzq7k          1/1     Running   5          87d
weave-net-556ml                         2/2     Running   12         66d
weave-net-h9km9                         2/2     Running   15         81d
weave-net-s88z4                         2/2     Running   0          145m
weave-net-smrgc                         2/2     Running   14         63d
weave-net-xf6ng                         2/2     Running   15         82d

$ kubectl logs coredns-576cbf47c7-6dtrc -n kube-system | tail -20
10.44.0.28:32837 - [14/Dec/2019:12:22:51 +0000] 2957 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000661167s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 46278 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000440918s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 47697 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.00059741s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 33222 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.00044739s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 52126 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000310494s
10.44.0.28:39392 - [14/Dec/2019:12:29:11 +0000] 41041 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000481309s
10.44.0.28:40999 - [14/Dec/2019:12:29:11 +0000] 695 "AAAA IN spark-master.svc.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd,ra 141 0.000247078s
10.44.0.28:54835 - [14/Dec/2019:12:29:12 +0000] 59604 "AAAA IN spark-master. udp 30 false 512" NXDOMAIN qr,rd,ra 106 0.020408006s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 53244 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000209231s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 23079 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,rd,ra 149 0.000191722s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 15451 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000383919s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 45086 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.001197812s
10.40.0.34:54678 - [14/Dec/2019:12:52:31 +0000] 6509 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000522769s
10.40.0.34:60234 - [14/Dec/2019:12:52:31 +0000] 15538 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000851171s
10.40.0.34:43989 - [14/Dec/2019:12:52:31 +0000] 2712 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000306038s
10.40.0.34:59265 - [14/Dec/2019:12:52:31 +0000] 23765 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000274748s
10.40.0.34:45622 - [14/Dec/2019:13:26:31 +0000] 38766 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000436681s
10.40.0.34:42759 - [14/Dec/2019:13:26:31 +0000] 56753 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000706638s
10.40.0.34:39563 - [14/Dec/2019:13:26:31 +0000] 37876 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000445999s
10.40.0.34:57224 - [14/Dec/2019:13:26:31 +0000] 33157 "A IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000536896s
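
(The source IPs in these logs can be mapped back to pods and nodes to confirm which workers' queries actually arrive at CoreDNS; a quick way, using 10.44.0.28 from above as an example:)

$ kubectl get pods --all-namespaces -o wide | grep 10.44.0.28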

$ kubectl get svc -n kube-system
NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
kube-dns               ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP   298d
kubernetes-dashboard   ClusterIP   10.96.204.236   <none>        443/TCP         298d
tiller-deploy          ClusterIP   10.110.41.66    <none>        44134/TCP       123d

$ kubectl get ep kube-dns --namespace=kube-system
NAME       ENDPOINTS                                               AGE
kube-dns   10.32.0.98:53,10.44.0.21:53,10.32.0.98:53 + 1 more...   298d

When busybox is on worker1:

$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

But when busybox is on worker3:

$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
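
(A follow-up test that might help narrow this down, not run yet: from the busybox on worker1, query one of the CoreDNS endpoint IPs listed above directly, bypassing the kube-dns service VIP. If that works while the VIP does not, it points at kube-proxy rather than the pod network. 10.32.0.98 is one of the kube-dns endpoints shown above.)

$ kubectl exec -ti busybox -- nslookup kubernetes.default.svc.cluster.local 10.32.0.98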

All nodes are Ubuntu 16.04.

The content of /etc/resolv.conf is the same for all pods.
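
(Checked with something like the following; the search domains below are the typical kubeadm defaults, and the nameserver matches the kube-dns ClusterIP shown above:)

$ kubectl exec -ti busybox -- cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5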

The only difference I can find is in the kube-proxy logs:

kube-proxy logs on a working node:

$ kubectl logs kube-proxy-vg8jd -n kube-system

W1214 06:12:19.201889       1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1214 06:12:19.321747       1 server_others.go:148] Using iptables Proxier.
W1214 06:12:19.332725       1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1214 06:12:19.332949       1 server_others.go:178] Tearing down inactive rules.
I1214 06:12:20.557875       1 server.go:447] Version: v1.12.1
I1214 06:12:20.601081       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1214 06:12:20.601393       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1214 06:12:20.601958       1 conntrack.go:83] Setting conntrack hashsize to 32768
I1214 06:12:20.602234       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1214 06:12:20.602300       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1214 06:12:20.602544       1 config.go:202] Starting service config controller
I1214 06:12:20.602561       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1214 06:12:20.602585       1 config.go:102] Starting endpoints config controller
I1214 06:12:20.602619       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1214 06:12:20.702774       1 controller_utils.go:1034] Caches are synced for service config controller
I1214 06:12:20.702827       1 controller_utils.go:1034] Caches are synced for endpoints config controller

kube-proxy logs on a non-working node:

$ kubectl logs kube-proxy-fgzpf -n kube-system

W1215 12:47:12.660749       1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1215 12:47:12.679348       1 server_others.go:148] Using iptables Proxier.
W1215 12:47:12.679538       1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1215 12:47:12.679665       1 server_others.go:178] Tearing down inactive rules.
E1215 12:47:12.760702       1 proxier.go:529] Error removing iptables rules in ipvs proxier: error deleting chain "KUBE-MARK-MASQ": exit status 1: iptables: Too many links.
I1215 12:47:12.799926       1 server.go:447] Version: v1.12.1
I1215 12:47:12.832047       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1215 12:47:12.833067       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1215 12:47:12.833266       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1215 12:47:12.833498       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1215 12:47:12.833934       1 config.go:202] Starting service config controller
I1215 12:47:12.834061       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1215 12:47:12.834253       1 config.go:102] Starting endpoints config controller
I1215 12:47:12.834338       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1215 12:47:12.934408       1 controller_utils.go:1034] Caches are synced for service config controller
I1215 12:47:12.934564       1 controller_utils.go:1034] Caches are synced for endpoints config controller

The fifth line (the "Error removing iptables rules in ipvs proxier" message) doesn't appear in the first log. I don't know whether it is related to the issue or not.
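
(If it helps, one way to compare what kube-proxy has actually programmed on a working node versus a broken one would be something like the following, run on each node; the KUBE-* chain names are kube-proxy's defaults, and ipvsadm only applies if it is installed:)

$ sudo iptables-save | grep -c KUBE-SVC      # count of service chains programmed
$ sudo iptables-save | grep 10.96.0.10       # rules for the kube-dns ClusterIP
$ sudo ipvsadm -Ln                           # leftover IPVS virtual servers, if any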

Any suggestions are welcome.

-- Majid Rajabi
coredns
dns
kubeadm
kubernetes

1 Answer

12/14/2019

The double svc.svc in kubernetes.default.svc.svc.cluster.local looks strange. Check whether the same thing appears in the coredns-576cbf47c7-6dtrc pod.

Shut down the coredns-576cbf47c7-6dtrc pod to guarantee that the single remaining DNS instance will be answering the DNS queries from all worker nodes.
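
For example (assuming the default kubeadm deployment name coredns), scaling the deployment down leaves a single instance; deleting the pod directly would just let the deployment recreate it:

$ kubectl -n kube-system scale deployment coredns --replicas=1
$ kubectl -n kube-system get pods -l k8s-app=kube-dns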

According to the docs, problems like this "... indicate a problem with the coredns/kube-dns add-on or associated Services". Restarting coredns may solve the issue.
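
A simple way to restart coredns would be to delete its pods and let the deployment recreate them (assuming the default k8s-app=kube-dns label that kubeadm applies):

$ kubectl -n kube-system delete pod -l k8s-app=kube-dns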

I'd add checking and comparing /etc/resolv.conf on the nodes themselves to the list of things to look into.
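
For example, comparing a working and a non-working node side by side (assuming ssh access to the nodes; worker1 and worker3 are the node names from the question):

$ diff <(ssh worker1 cat /etc/resolv.conf) <(ssh worker3 cat /etc/resolv.conf)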

-- apisim
Source: StackOverflow