Debugging DNS resolution in Kubernetes

2/8/2019

I initialized a Kubernetes v1.13.1 cluster on Ubuntu 16.04 using the command below:

sudo kubeadm init --token-ttl=0 --apiserver-advertise-address=192.168.88.142

and installed weave using:

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

I have 10 Raspberry Pis acting as worker nodes and connected to the cluster. All of them are running the deployment fine. These nodes run pods which try to connect to the IoT hub visdwk.azure-devices.net and publish some data. Out of the 10 nodes, only a few are able to connect; the others throw an "unable to connect to IoT hub" error. I did a ping test from the failing pods and found that they could not ping Google by hostname, while they could ping Google's public IP address.

This made me think that something is wrong with the CoreDNS pods. I followed this documentation and ran the tests below.

The pod has the following contents in /etc/resolv.conf:

nameserver 10.96.0.10
search visdwk.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

which looks normal to me. All the CoreDNS pods are running fine:

coredns-86c58d9df4-42xqc               1/1     Running   8         1d11h
coredns-86c58d9df4-p6d98               1/1     Running   7         1d6h
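
For reference, the search and ndots:5 entries in the resolv.conf shown earlier affect how an external name such as visdwk.azure-devices.net is looked up: it contains fewer than five dots, so a glibc-style resolver tries it with each search suffix first and only then as an absolute name. The sketch below is a simplified model of that ordering, not the resolver's actual code; the function name candidate_names is made up for illustration.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    absolute = name + "."
    suffixed = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + suffixed
    # Fewer dots than ndots: the search list is tried before the absolute name.
    return suffixed + [absolute]


# Search list copied from the resolv.conf in the question.
search = ["visdwk.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("visdwk.azure-devices.net", search):
    print(candidate)
# visdwk.azure-devices.net.visdwk.svc.cluster.local.
# visdwk.azure-devices.net.svc.cluster.local.
# visdwk.azure-devices.net.cluster.local.
# visdwk.azure-devices.net.
```

Under this model, three cluster-internal lookups are attempted before the real name is queried, so NXDOMAIN answers for the suffixed names are expected and do not by themselves indicate a fault.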

I also ran nslookup kubernetes.default from the busybox container and got the proper response. Below are the logs of coredns-86c58d9df4-42xqc:

.:53
2019-02-08T08:40:10.038Z [INFO] CoreDNS-1.2.6
2019-02-08T08:40:10.039Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769

The logs above also look normal.

I also doubt that the failure to resolve the IoT hub is caused by an error in Weave: if Weave were throwing errors, I believe the pod would never start and would always be in a failed state, but in fact the pod stays in the Running state. Please correct me if I am wrong.

The DNS service also seems to be running:

NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
kube-dns               ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP   1d6h

But I still cannot figure out why a few nodes in the cluster are not able to resolve the IoT hub. Can anyone please give me some suggestions? Thanks.

Logs from failed pod:

 1550138544: New connection from 127.0.0.1 on port 1883.
1550138544: New client connected from 127.0.0.1 as 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (c1, k60).
1550138544: Sending CONNACK to 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (0, 0)
1550138544: Received PUBLISH from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138544: Received DISCONNECT from 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504
1550138544: Client 6f1e2c4f-c44d-4c27-b9a9-0fb91f816504 disconnected.
1550138547: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138547: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138547: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138552: Error creating bridge: Try again.
1550138566: New connection from 127.0.0.1 on port 1883.
1550138566: New client connected from 127.0.0.1 as afb6cc2a-ee78-482e-aff0-fc595e06f86a (c1, k60).
1550138566: Sending CONNACK to afb6cc2a-ee78-482e-aff0-fc595e06f86a (0, 0)
1550138566: Received PUBLISH from afb6cc2a-ee78-482e-aff0-fc595e06f86a (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (1211 bytes))
1550138566: Received DISCONNECT from afb6cc2a-ee78-482e-aff0-fc595e06f86a
1550138566: Client afb6cc2a-ee78-482e-aff0-fc595e06f86a disconnected.
1550138567: New connection from 127.0.0.1 on port 1883.
1550138567: New client connected from 127.0.0.1 as 01b9e135-fbc8-4d67-9962-356e8cf9f080 (c1, k60).
1550138567: Sending CONNACK to 01b9e135-fbc8-4d67-9962-356e8cf9f080 (0, 0)
1550138567: Received PUBLISH from 01b9e135-fbc8-4d67-9962-356e8cf9f080 (d0, q0, r0, m0, 'devices/machine6/messages/events/', ... (755 bytes))
1550138567: Received DISCONNECT from 01b9e135-fbc8-4d67-9962-356e8cf9f080
1550138567: Client 01b9e135-fbc8-4d67-9962-356e8cf9f080 disconnected.
1550138578: Saving in-memory database to /mqtt/data/mosquitto.db.
1550138583: Bridge local.machine6 doing local SUBSCRIBE on topic devices/machine6/messages/events/#
1550138583: Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
1550138588: Error creating bridge: Try again.

The pod is running a mosquitto container which tries to connect to visdwk.azure-devices.net and throws this error:

Connecting bridge iothub-bridge (visdwk.azure-devices.net:8883)
Error creating bridge: Try again.
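
The "Try again" text is consistent with a temporary name-resolution failure (EAI_AGAIN), although that mapping is an assumption here and not something the logs confirm. A minimal sketch for reproducing the same lookup from inside an affected pod, assuming a Python interpreter is available there; check_resolution and its resolver parameter are hypothetical names used only for this example:

```python
import socket


def check_resolution(host, port, resolver=socket.getaddrinfo):
    """Resolve host the way a TCP client would and report the outcome."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # gaierror carries the resolver's error code and message.
        return False, f"{host}: {exc.strerror} (code {exc.errno})"
    # Each entry is (family, type, proto, canonname, sockaddr); keep the IPs.
    addresses = sorted({info[4][0] for info in infos})
    return True, f"{host}: {', '.join(addresses)}"


# Example call, to be run inside the pod:
#   print(check_resolution("visdwk.azure-devices.net", 8883))
```

Running this on a working node and on a failing node should show whether the difference is in name resolution or in the connection that follows it.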
-- S Andrew
coredns
dns
docker
kubernetes
ubuntu

1 Answer

2/18/2019

It would appear that one of your DNS Pods is not providing DNS services.

The evidence is in the statement that "only few nodes are able to connect and other throws error unable to connect to iot hub".

This is a classic symptom of load-balancing with a failed node in the loop.
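
As a toy illustration of that symptom, the sketch below rotates lookups strictly across two DNS backends, one of which never answers. The backend names are invented, and strict round-robin is a simplifying assumption; the way the cluster actually picks a backend may differ.

```python
from itertools import cycle


def simulate_lookups(backends, healthy, lookups):
    """Send each lookup to the next backend in turn; True means it was answered."""
    rotation = cycle(backends)
    return [next(rotation) in healthy for _ in range(lookups)]


# Two hypothetical DNS pods, only one of which is serving.
results = simulate_lookups(["dns-a", "dns-b"], {"dns-a"}, 6)
print(results)  # [True, False, True, False, True, False]
```

With one of two backends down, half of the lookups in this model fail even though the service as a whole looks healthy.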

Try:

  1. Remove the DNS server pod that gave the message: visdwk.azure-devices.net.visdwknamespace.svc.cluster.local. udp 82 false 512" NXDOMAIN qr,aa,rd,ra 175 0.000651078s where visdwk.azure-devices.net
  2. Wait for the changes to propagate through the cluster.
  3. Test the connections.

If this diagnosis is correct, all the nodes should then connect.

To confirm, add the pod back and remove the other one, then retest: they should all fail to connect.
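
A less disruptive way to compare the two DNS pods is to send the same query directly to each pod's IP address instead of going through the 10.96.0.10 service. The following is a minimal sketch using only the Python standard library; the function names are made up for this example, and the pod IPs have to be looked up in your own cluster (none are given in the question).

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 (A), QCLASS 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def response_rcode(packet):
    """Extract the response code from the low four bits of the flags field."""
    flags = struct.unpack("!H", packet[2:4])[0]
    return flags & 0x000F


def query_server(server_ip, name, timeout=2.0):
    """Ask one DNS server for name over UDP and describe the result."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server_ip, 53))
            data, _ = sock.recvfrom(512)
        except OSError as exc:  # includes timeouts
            return f"no answer ({exc})"
    rcode = response_rcode(data)
    return {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}.get(rcode, f"rcode {rcode}")


# Example call, with the IP replaced by each CoreDNS pod IP in turn:
#   print(query_server("<coredns-pod-ip>", "visdwk.azure-devices.net"))
```

If one pod answers NOERROR and the other times out or returns an error for the same name, that points at the pod to remove in step 1.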

-- Strom
Source: StackOverflow