Occasionally pods are created with no network, which results in the pod failing repeatedly with CrashLoopBackOff

2/14/2017

Occasionally, I see an issue where a pod will start up without network connectivity. Because of this, the pod goes into CrashLoopBackOff and is unable to recover. The only way I am able to get the pod running again is by running kubectl delete pod and waiting for it to be rescheduled. Here's an example of a liveness probe failing due to this issue:

Liveness probe failed: Get http://172.20.78.9:9411/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
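For reference, the workaround is nothing more than deleting the pod and letting its controller replace it, roughly like this (the pod name is a placeholder):

kubectl delete pod <pod-name>
kubectl get pods -w    # watch for the replacement pod to be scheduled and become Running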

I've also noticed that there are no iptables entries for the pod IP when this happens. When the pod is deleted and rescheduled (and is in a working state), the iptables entries are present.
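This is roughly how I check for those entries on the node, using the pod IP from the probe error above as an example; when the pod is in the broken state the grep returns nothing:

sudo iptables-save | grep 172.20.78.9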

If I turn off the liveness probe in the container and exec into it, I can confirm it has no network connectivity to the cluster, the local network, or the internet.
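That check is as simple as the following (the pod name is a placeholder); every ping from inside the pod fails:

kubectl exec -it <pod-name> -- sh
# inside the pod: ping the node IP, the kubernetes service, and an external host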

I'd like to hear any suggestions as to what this could be, or what else I can look into to troubleshoot this scenario further.

Currently running:

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.7",
GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0", 
GitTreeState:"clean", BuildDate:"2016-12-10T04:49:33Z", 
GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.7",  
GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0", 
GitTreeState:"clean", BuildDate:"2016-12-10T04:43:42Z", 
GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

OS:

NAME=CoreOS
ID=coreos
VERSION=1235.0.0
VERSION_ID=1235.0.0
BUILD_ID=2016-11-17-0416
PRETTY_NAME="CoreOS 1235.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
-- jswoods
kubelet
kubernetes

4 Answers

2/17/2017

I don't have enough points to comment, so this answer is in response to Prashanth B (https://stackoverflow.com/users/5446771/prashanth-b).

Let me describe "without network connectivity" in more detail. When I exec into one of the pods suffering from the originally described symptoms, these are the kinds of network issues I see.

In this example we have a pod that appears to have no network connectivity at all.

First I ping the routable IP of the physical node (the eth0 interface) from the pod. This works from pods on the same node that are functioning normally.

# ping 10.30.8.66
PING 10.30.8.66 (10.30.8.66): 56 data bytes
92 bytes from tv-dmx-prototype-3638746950-l8fgu (172.20.68.16): 
Destination Host Unreachable
^C

Next I try internal and external DNS resolution. I don't expect the pings to work, but ping is the only tool available in the container for name resolution, and I can't install anything else because there is no networking.

# ping kubernetes
^C
# ping www.google.com
^C
#

From another pod in the same cluster, on the same physical node as the broken pod, I attempt to connect to a port that is open on the pod.

/ # telnet 172.20.68.16 80
telnet: can't connect to remote host (172.20.68.16): Host is unreachable
/ #

From the physical node I cannot connect to the pod IP on port 80.

core@ip-10-30-8-66 ~ $ curl 172.20.68.16:80
curl: (7) Failed to connect to 172.20.68.16 port 80: No route to host

I looked through the troubleshooting guide at https://kubernetes.io/docs/user-guide/debugging-services/ but that guide is targeted at diagnosing problems connecting a Kubernetes service to one or more pods. In my scenario we experience unpredictable behavior in the creation of a pod, and it is not specific to any service. For example, we are seeing this 1-3 times a week across 3 different clusters, spanning dozens of deployments. It's never the same deployment that has the problem, and our only recourse is to delete the pod, after which it gets instantiated correctly.

I have gone through the relevant pieces of the troubleshooting guide and posted them here.

Here we see that kubelet and kube-proxy are running

root       7186   7167  2 Jan19 ?        15:14:25 /hyperkube proxy          --master=https://us-east-1-services-kubernetes.XXXXX.com 
 --proxy-mode=iptables --kubeconfig=/var/lib/kube-proxy/kubeconfig
core      25646  26300  0 19:22 pts/0    00:00:00 grep --colour=auto -i hyperkube


kubelet --address=0.0.0.0 --pod-manifest-path=/etc/kubernetes/manifests --enable-server 
 --logtostderr=true --port=10250 --allow-privileged=True --max-pods=110 --v=2 
 --api_servers=https://us-east-1-services-kubernetes.XXXXXX.com --enable-debugging-handlers=true 
 --cloud-provider=aws --cluster_dns=172.16.0.10 --cluster-domain=cluster.local 
 --kubeconfig=/var/lib/kubelet/kubeconfig 
 --node-labels=beta.kubernetes.io/instance-type=c4.8xlarge,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1d,kubernetes.io/hostname=ip-10-30-8-66.ec2.internal,public-hostname=ec2-52-207-185-19.compute-1.amazonaws.com,instance-id=i-03074c6772d89ede8

I've verified kube-proxy is proxying by hitting other pods on this same node.

core@ip-10-30-8-66 ~ $ curl 172.20.68.12 80
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.11.4</center>
</body>
</html>
curl: (7) Couldn't connect to server

A new version of the app was just deployed and I lost the pod I was troubleshooting with. I will start preparing some additional commands to run when this symptom occurs again. I will also run a high volume of deployment creations, since the number of bad pods we get correlates with the volume of new pods being created.
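The churn test I have in mind is something simple like this (the deployment names and image are placeholders):

# create a batch of throwaway deployments to generate pod churn
for i in $(seq 1 50); do kubectl run churn-test-$i --image=nginx; done

# spot-check the new pods (and their IPs/nodes), then clean up
kubectl get pods -o wide | grep churn-test
for i in $(seq 1 50); do kubectl delete deployment churn-test-$i; done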

-- Guido Pepper
Source: StackOverflow

2/17/2017

In response to freehan (https://stackoverflow.com/users/7577983/freehan)

We are using the default network plugin, which, as you pointed out, is the native Docker one.

Regarding the suggestion to use tcpdump to capture the packets' path: do you know an easy way to determine which veth is associated with a given pod?

I plan on running a container that has tcpdump installed and watching the traffic on the veth associated with the problem pod, while initiating outbound network traffic from the pod (e.g. ping, dig, curl, or whatever is available in the given pod).
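The rough sequence I have in mind for finding the veth on the node is something like this (a sketch only, assuming the default docker bridge setup and that the container's interface is eth0; the pod name is a placeholder):

# find one of the pod's containers and its PID on the node
CID=$(docker ps | grep <pod-name> | awk '{print $1}' | head -n1)
PID=$(docker inspect -f '{{.State.Pid}}' $CID)

# eth0 inside the pod's netns shows "eth0@if<N>", where <N> is the host-side veth's ifindex
IDX=$(sudo nsenter -t $PID -n ip link show eth0 | sed -n 's/.*@if\([0-9]*\).*/\1/p')

# match that index against the host's interfaces to get the veth name
ip -o link | grep "^${IDX}:"

# then capture on that veth while generating traffic from inside the pod
sudo tcpdump -i <veth-name> -nn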

Let me know if you had something else in mind and I will try that.

-- Guido Pepper
Source: StackOverflow

2/21/2017

I am thinking that we are hitting this bug: https://github.com/coreos/bugs/issues/1785. I've verified that I can reproduce the bug as described on our version of docker/coreos. I will update coreos/docker and verify.

-- Guido Pepper
Source: StackOverflow

2/16/2017

It looks like your network driver is not working properly. Without more information about your setup, I can only suggest the following:

  1. Find out which network driver is used. You can tell by checking the kubelet --network-plugin flag; if no network plugin is specified, it is using the native Docker network (see the example after this list).
  2. Given the network driver, examine the pod's network setup and see what is missing. Use tcpdump to see where the packets go.
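For example, a quick way to check the flag on a node is something like this (a sketch; the exact output depends on how kubelet is launched). If it prints nothing, no plugin flag was passed and the native Docker network is in use:

ps -ef | grep [k]ubelet | grep -o -- '--network-plugin[^ ]*'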
-- freehan
Source: StackOverflow