Our team recently started using GKE, but we have hit an intermittent problem with some of our pods that serve HTTP on port 8080. Other pods in the cluster, even ones on the same node, get a "connection refused" response when they try to connect to an affected pod using its cluster IP:
$ kubectl run -i --tty busybox --image=busybox --restart=Never -- sh
/ # ping 10.28.2.141
PING 10.28.2.141 (10.28.2.141): 56 data bytes
64 bytes from 10.28.2.141: seq=0 ttl=62 time=2.212 ms
64 bytes from 10.28.2.141: seq=1 ttl=62 time=1.993 ms
64 bytes from 10.28.2.141: seq=2 ttl=62 time=4.662 ms
^C
--- 10.28.2.141 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.993/2.955/4.662 ms
/ # wget http://10.28.2.141:8080/health-check
Connecting to 10.28.2.141:8080 (10.28.2.141:8080)
wget: can't connect to remote host (10.28.2.141): Connection refused
However, the service is indeed running and listening on that port: if I exec into the pod and run the same command, it succeeds.
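For reference, the check from inside the pod looks roughly like the sketch below. The pod name `my-affected-pod` and the helper name `check_from_inside` are placeholders for illustration, not names from our cluster, and the sketch assumes `wget` is available in the container image.

```shell
# Placeholder: substitute the name of a pod that is refusing connections.
POD="${POD:-my-affected-pod}"

# Run the same health-check request from inside the given pod.
# Assumes the container image ships wget; -q silences progress output
# and -O - writes the response body to stdout.
check_from_inside() {
  kubectl exec "$1" -- wget -q -O - http://10.28.2.141:8080/health-check
}

# Usage: check_from_inside "$POD"
```

Run this way, the request goes out from the pod's own network namespace, unlike the busybox pod, which connects from elsewhere in the cluster.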
For other, almost identical pods, this connectivity works correctly, but some fraction of pods (maybe 10-20%) intermittently end up in this state.
There are no errors in the pod logs.
This is a freshly provisioned GKE cluster on version 1.11.6-gke.3 with two nodes, no network policies, and Istio is not installed.
Any ideas on what the problem might be, or how to diagnose further? Happy to add any other information if it would be useful.