We have a Kubernetes cluster consisting of one master and four worker nodes. On worker1 and worker2 we can't resolve DNS names, but on the other two nodes everything is fine. I followed the official DNS debugging documentation and realized that queries from worker1 and worker2 are not received by the CoreDNS pods.
To repeat: everything works on worker3 and worker4; the problem is only with worker1 and worker2. For example, when I run a busybox container on worker1 and do nslookup kubernetes.default, it doesn't return anything, but when it runs on worker3 the DNS resolution is fine.
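For reference, this is roughly how the test pod can be pinned to a specific worker (a sketch; busybox:1.28 is the image recommended by the DNS debugging guide, and nodeName is simply set to the node under test):
# Run a busybox pod directly on the node being tested (here worker1) and query DNS from it.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  nodeName: worker1          # schedule onto the node under test
  containers:
  - name: busybox
    image: busybox:1.28      # newer busybox images have a flaky nslookup
    command: ["sleep", "3600"]
  restartPolicy: Always
EOF
kubectl exec -ti busybox -- nslookup kubernetes.default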
Cluster information:
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:43:08Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-576cbf47c7-6dtrc 1/1 Running 5 82d
coredns-576cbf47c7-jvx5l 1/1 Running 6 82d
etcd-master 1/1 Running 35 298d
kube-apiserver-master 1/1 Running 14 135m
kube-controller-manager-master 1/1 Running 42 298d
kube-proxy-22f49 1/1 Running 9 91d
kube-proxy-2s9sx 1/1 Running 34 298d
kube-proxy-jh2m7 1/1 Running 5 81d
kube-proxy-rc5r8 1/1 Running 5 63d
kube-proxy-vg8jd 1/1 Running 6 104d
kube-scheduler-master 1/1 Running 39 298d
kubernetes-dashboard-65c76f6c97-7cwwp 1/1 Running 45 293d
tiller-deploy-779784fbd6-dzq7k 1/1 Running 5 87d
weave-net-556ml 2/2 Running 12 66d
weave-net-h9km9 2/2 Running 15 81d
weave-net-s88z4 2/2 Running 0 145m
weave-net-smrgc 2/2 Running 14 63d
weave-net-xf6ng 2/2 Running 15 82d
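For completeness, the node each of these pods runs on can be listed with the wide output, which helps match the weave-net and kube-proxy pods to worker1 and worker2:
$ kubectl get pod -n kube-system -o wide
# The NODE column shows which weave-net/kube-proxy pod runs on each worker,
# so the logs below can be matched to the broken nodes.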
$ kubectl logs coredns-576cbf47c7-6dtrc -n kube-system | tail -20
10.44.0.28:32837 - [14/Dec/2019:12:22:51 +0000] 2957 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000661167s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 46278 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000440918s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 47697 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.00059741s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 33222 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.00044739s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 52126 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000310494s
10.44.0.28:39392 - [14/Dec/2019:12:29:11 +0000] 41041 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000481309s
10.44.0.28:40999 - [14/Dec/2019:12:29:11 +0000] 695 "AAAA IN spark-master.svc.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd,ra 141 0.000247078s
10.44.0.28:54835 - [14/Dec/2019:12:29:12 +0000] 59604 "AAAA IN spark-master. udp 30 false 512" NXDOMAIN qr,rd,ra 106 0.020408006s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 53244 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000209231s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 23079 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,rd,ra 149 0.000191722s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 15451 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000383919s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 45086 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.001197812s
10.40.0.34:54678 - [14/Dec/2019:12:52:31 +0000] 6509 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000522769s
10.40.0.34:60234 - [14/Dec/2019:12:52:31 +0000] 15538 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000851171s
10.40.0.34:43989 - [14/Dec/2019:12:52:31 +0000] 2712 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000306038s
10.40.0.34:59265 - [14/Dec/2019:12:52:31 +0000] 23765 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000274748s
10.40.0.34:45622 - [14/Dec/2019:13:26:31 +0000] 38766 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000436681s
10.40.0.34:42759 - [14/Dec/2019:13:26:31 +0000] 56753 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000706638s
10.40.0.34:39563 - [14/Dec/2019:13:26:31 +0000] 37876 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000445999s
10.40.0.34:57224 - [14/Dec/2019:13:26:31 +0000] 33157 "A IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000536896s
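To confirm whether lookups from worker1 ever show up here, both CoreDNS pods can be followed while a query is issued from the busybox pod on that node (a sketch using the pod names above):
# Terminal 1 and 2: follow each CoreDNS pod
kubectl logs -n kube-system -f coredns-576cbf47c7-6dtrc
kubectl logs -n kube-system -f coredns-576cbf47c7-jvx5l
# Terminal 3: trigger a lookup from the pod running on worker1
kubectl exec -ti busybox -- nslookup kubernetes.default
# If no corresponding query line appears in either log, the packets never reach CoreDNS.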
$ kubectl get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 298d
kubernetes-dashboard ClusterIP 10.96.204.236 <none> 443/TCP 298d
tiller-deploy ClusterIP 10.110.41.66 <none> 44134/TCP 123d
$ kubectl get ep kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.32.0.98:53,10.44.0.21:53,10.32.0.98:53 + 1 more... 298d
When busybox runs on worker1:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
But when busybox runs on worker3:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
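To narrow things down, one of the CoreDNS endpoint IPs from the kubectl get ep output above can also be queried directly from the busybox pod on worker1, bypassing the 10.96.0.10 service VIP (a hedged check; the endpoint IPs may of course change over time):
$ kubectl exec -ti busybox -- nslookup kubernetes.default 10.32.0.98
# If this succeeds while lookups via 10.96.0.10 fail, the ClusterIP translation
# (kube-proxy/iptables) on worker1 is suspect; if it also fails, pod-to-pod
# traffic over weave between the nodes is suspect.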
All nodes run Ubuntu 16.04.
The content of /etc/resolv.conf is the same for all pods.
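(One way to check that, with the test pod scheduled on each node in turn:)
$ kubectl exec -ti busybox -- cat /etc/resolv.conf
# The nameserver entry should point at the kube-dns ClusterIP, 10.96.0.10, in every pod.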
The only difference I can find is in the kube-proxy logs.
kube-proxy logs on a working node:
$ kubectl logs kube-proxy-vg8jd -n kube-system
W1214 06:12:19.201889 1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1214 06:12:19.321747 1 server_others.go:148] Using iptables Proxier.
W1214 06:12:19.332725 1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1214 06:12:19.332949 1 server_others.go:178] Tearing down inactive rules.
I1214 06:12:20.557875 1 server.go:447] Version: v1.12.1
I1214 06:12:20.601081 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1214 06:12:20.601393 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1214 06:12:20.601958 1 conntrack.go:83] Setting conntrack hashsize to 32768
I1214 06:12:20.602234 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1214 06:12:20.602300 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1214 06:12:20.602544 1 config.go:202] Starting service config controller
I1214 06:12:20.602561 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1214 06:12:20.602585 1 config.go:102] Starting endpoints config controller
I1214 06:12:20.602619 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1214 06:12:20.702774 1 controller_utils.go:1034] Caches are synced for service config controller
I1214 06:12:20.702827 1 controller_utils.go:1034] Caches are synced for endpoints config controller
kube-proxy logs on a non-working node:
$ kubectl logs kube-proxy-fgzpf -n kube-system
W1215 12:47:12.660749 1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1215 12:47:12.679348 1 server_others.go:148] Using iptables Proxier.
W1215 12:47:12.679538 1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1215 12:47:12.679665 1 server_others.go:178] Tearing down inactive rules.
E1215 12:47:12.760702 1 proxier.go:529] Error removing iptables rules in ipvs proxier: error deleting chain "KUBE-MARK-MASQ": exit status 1: iptables: Too many links.
I1215 12:47:12.799926 1 server.go:447] Version: v1.12.1
I1215 12:47:12.832047 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1215 12:47:12.833067 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1215 12:47:12.833266 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1215 12:47:12.833498 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1215 12:47:12.833934 1 config.go:202] Starting service config controller
I1215 12:47:12.834061 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1215 12:47:12.834253 1 config.go:102] Starting endpoints config controller
I1215 12:47:12.834338 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1215 12:47:12.934408 1 controller_utils.go:1034] Caches are synced for service config controller
I1215 12:47:12.934564 1 controller_utils.go:1034] Caches are synced for endpoints config controller
The fifth line of the second log (the "Error removing iptables rules in ipvs proxier" message) doesn't appear in the first one. I don't know whether that is related to the issue or not.
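In case it matters, here is what I can inspect on worker1 itself regarding the rules kube-proxy programs for kube-dns (a diagnostic sketch, assuming shell access to the node; ipvsadm may not be installed):
$ sudo iptables-save -t nat | grep 10.96.0.10
# Expect KUBE-SERVICES entries for 10.96.0.10 on 53/udp and 53/tcp pointing at KUBE-SVC-... chains.
$ sudo ipvsadm -Ln
# Only relevant if leftover IPVS state (hinted at by the "Too many links" error above) is suspected.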
Any suggestions are welcome.
The double svc.svc in kubernetes.default.svc.svc.cluster.local looks strange. Check whether it is the same in the coredns-576cbf47c7-6dtrc pod.
Shut down the coredns-576cbf47c7-6dtrc pod to guarantee that the single remaining DNS instance answers the DNS queries from all worker nodes.
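A sketch of how that could be done (assuming CoreDNS is managed by the standard coredns Deployment in kube-system):
$ kubectl -n kube-system scale deployment coredns --replicas=1
# Scales CoreDNS down to a single replica so all queries hit one instance
# (which of the two existing pods is kept is not guaranteed).
$ kubectl -n kube-system delete pod coredns-576cbf47c7-6dtrc
# Alternatively, delete just this pod; the Deployment recreates it, which also
# effectively restarts that CoreDNS instance.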
According to the docs, problems like this "... indicate a problem with the coredns/kube-dns add-on or associated Services". Restarting CoreDNS may solve the issue.
I'd add checking and comparing /etc/resolv.conf on the nodes to the list of things to look into.
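A quick way to do that comparison (a sketch, assuming SSH access to the nodes under these hostnames):
for node in worker1 worker2 worker3 worker4; do
  echo "== $node =="
  ssh "$node" cat /etc/resolv.conf
done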