Network connectivity/DNS issues on a GKE 1.10 Kubernetes cluster

7/29/2018

I'm running into DNS issues on a GKE 1.10 Kubernetes cluster. Occasionally a pod starts without any network connectivity. Restarting the pod tends to fix the issue.

Here's the output of the same few commands run inside a container without network connectivity, and inside one with it.

BROKEN:

kc exec -it -n  iotest app1-b67598997-p9lqk  -c userapp sh

/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve

/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5

/app $ curl -I 10.63.240.10
curl: (7) Failed to connect to 10.63.240.10 port 80: Connection refused

/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:8001          0.0.0.0:*               LISTEN      1/python
tcp        0      0 ::1:50051               :::*                    LISTEN      1/python
tcp        0      0 ::ffff:127.0.0.1:50051  :::*                    LISTEN      1/python

WORKING:

kc exec -it -n  iotest app1-7d985bfd7b-h5dbr -c userapp sh

/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      www.google.com
Address 1: 74.125.206.147 wk-in-f147.1e100.net
Address 2: 74.125.206.105 wk-in-f105.1e100.net
Address 3: 74.125.206.99 wk-in-f99.1e100.net
Address 4: 74.125.206.104 wk-in-f104.1e100.net
Address 5: 74.125.206.106 wk-in-f106.1e100.net
Address 6: 74.125.206.103 wk-in-f103.1e100.net
Address 7: 2a00:1450:400c:c04::68 wk-in-x68.1e100.net

/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5

/app $ curl -I 10.63.240.10
HTTP/1.1 404 Not Found
date: Sun, 29 Jul 2018 15:13:47 GMT
server: envoy
content-length: 0

/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:15001           0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:8001          0.0.0.0:*               LISTEN      1/python
tcp        0      0 10.60.2.6:56508         10.60.48.22:9091        ESTABLISHED -
tcp        0      0 127.0.0.1:57768         127.0.0.1:50051         ESTABLISHED -
tcp        0      0 10.60.2.6:43334         10.63.255.44:15011      ESTABLISHED -
tcp        0      0 10.60.2.6:15001         10.60.45.26:57160       ESTABLISHED -
tcp        0      0 10.60.2.6:48946         10.60.45.28:9091        ESTABLISHED -
tcp        0      0 127.0.0.1:49804         127.0.0.1:50051         ESTABLISHED -
tcp        0      0 ::1:50051               :::*                    LISTEN      1/python
tcp        0      0 ::ffff:127.0.0.1:50051  :::*                    LISTEN      1/python
tcp        0      0 ::ffff:127.0.0.1:50051  ::ffff:127.0.0.1:49804  ESTABLISHED 1/python
tcp        0      0 ::ffff:127.0.0.1:50051  ::ffff:127.0.0.1:57768  ESTABLISHED 1/python

These pods are identical; the only difference is that one was restarted.
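One difference that does show up in the two netstat captures is the set of listening sockets: the working pod listens on 127.0.0.1:15000 and 0.0.0.0:15001, and the broken pod does not. A minimal sketch for pulling that difference out of two saved captures follows. The function names and the idea of saving each capture to a file are assumptions made for illustration, not part of the original setup.

```shell
# listeners: read `netstat -antp` output on stdin and print the local
# address of every socket in LISTEN state, sorted and de-duplicated.
listeners() {
    awk '$6 == "LISTEN" { print $4 }' | sort -u
}

# missing_listeners BROKEN_FILE WORKING_FILE (hypothetical helper):
# print listeners that appear in the working capture but not in the
# broken one. Both files hold saved `netstat -antp` output.
missing_listeners() {
    broken=$(mktemp) && working=$(mktemp) || return 1
    listeners < "$1" > "$broken"
    listeners < "$2" > "$working"
    # comm -13 keeps only the lines unique to the second file.
    comm -13 "$broken" "$working"
    rm -f "$broken" "$working"
}
```

Saving each netstat output to a file and running `missing_listeners broken.txt working.txt` lists the sockets that only the working pod has open. Which process owns them cannot be read from the captures, since the PID column shows `-` for those entries.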

Does anyone have advice about how to analyse and fix this issue?

-- MarkNS
google-kubernetes-engine
kubernetes

2 Answers

7/30/2018

Some steps to try:

1) Run ifconfig eth0 (or whatever the primary interface is). Is the interface up? Are the TX and RX packet counts increasing?

2) If the interface is up, run tcpdump while you run the nslookup command that you posted. See whether the DNS request packets are being sent out.

3) Check which node the pod is scheduled on when network connectivity breaks. Is it the same node every time? If so, are other pods on that node running into a similar problem?
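The three steps above could be sketched roughly as follows. The namespace, pod and container names are copied from the question. The function names are made up for illustration, and the sketch assumes that ifconfig and tcpdump are available inside the container image, which may not be the case.

```shell
# Values taken from the question; adjust to the pod being investigated.
NS=iotest
POD=app1-b67598997-p9lqk
CONTAINER=userapp

# Step 1: show interface state and TX/RX packet counters.
# Assumes ifconfig exists in the image and the interface is eth0.
check_interface() {
    kubectl exec -n "$NS" "$POD" -c "$CONTAINER" -- ifconfig eth0
}

# Step 2: print DNS traffic (port 53) while nslookup runs in another
# shell. Assumes tcpdump exists in the image.
watch_dns() {
    kubectl exec -n "$NS" "$POD" -c "$CONTAINER" -- tcpdump -n -i eth0 port 53
}

# Step 3a: print one "POD NODE" line per pod in the namespace.
pods_with_nodes() {
    kubectl get pods -n "$NS" --no-headers \
        -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
}

# Step 3b: read "POD NODE" lines on stdin and print a count per node,
# highest count first, to see whether one node stands out.
count_by_node() {
    awk '{ n[$2]++ } END { for (k in n) print n[k], k }' | sort -rn
}
```

For example, given a hypothetical file broken-pods.txt with one affected pod name per line, `pods_with_nodes | grep -F -f broken-pods.txt | count_by_node` shows how the affected pods are spread across nodes.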

-- prameshj
Source: StackOverflow

8/7/2018

I also faced the same problem, and I simply worked around it for now by switching to the 1.9.x GKE version (after spending many hours trying to debug why my app wasn't working).

Hope this helps!

-- Subhash Ramesh
Source: StackOverflow