Kubernetes pod network hang up

10/22/2019

I'm running a Kubernetes cluster on Google Cloud (version 1.13.7-gke.24). The same code has been running in the cluster for more than 3 months without any problems. Today I found that one of the pods had been disconnected from the network for more than 24 hours.

First, I checked whether the pod had internet connectivity (normally it does). I used curl to query some well-known websites - all of them were unreachable. The same thing happened when I tried to run apt-get update or apt-get upgrade.

Second, I checked my application's logs and found exceptions like this:

Unable to log to provider GoogleStackdriverLogProvider, ex: Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Connect Failed")
   at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at ***.LogService.Providers.GoogleStackdriverLogProvider.WriteAsync(IEnumerable`1 entries) in LogService/Providers/GoogleStackdriverLogProvider.cs:line 71

Those errors come from code I run that sends new log entries to Google Stackdriver. Note that those logs are stored in the same datacenter - no internet access is needed to send them out - and still the application couldn't reach the destination.

Last, and this is the strange part: connectivity to the queue system kept working. Unfortunately, the application continued to download new messages from the queue, but all of them ended in failure because of the network problem.

Summary:

Internet connectivity - NO
VPC connectivity - YES
GCP services connectivity - YES

Other notes:

  • I was able to ssh into the problematic pod.
  • Restarting the pod fixed the issue.
  • This never happened before. I've been running this deployment for more than a year.
  • The problematic pod was four and a half days old when I killed it.
  • Only one pod was affected by this problem. All the other pods (100+) were running without any problems.

What can I do to prevent this problem in the future?

-- No1Lives4Ever
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

10/27/2019

This sounds like a transient issue, possibly due to a failure of the virtual network interface created for the pod. These types of failure are rare and hard to prevent. However, you can make your deployment more resilient by using livenessProbes, so that this type of error causes the container to fail its probe and be restarted (see the sketch below).
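A minimal sketch of what that could look like, assuming the container image ships curl and that reaching an external endpoint such as https://www.google.com is a reasonable health signal for this workload (the deployment name, image and thresholds below are made up - adapt them to your own spec):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                   # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: gcr.io/my-project/my-app:latest   # hypothetical image
        livenessProbe:
          exec:
            # Fail the probe when outbound connectivity is gone;
            # curl must exist in the image for this check to work.
            command:
            - sh
            - -c
            - curl -fsS --max-time 5 https://www.google.com > /dev/null
          initialDelaySeconds: 30
          periodSeconds: 60
          timeoutSeconds: 10
          failureThreshold: 3

With periodSeconds: 60 and failureThreshold: 3, roughly three minutes of failed connectivity checks would get the container restarted automatically, instead of the pod sitting broken for more than 24 hours.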

Unfortunately, if restarting the container is not enough, the pod will go into the CrashLoopBackOff state. You could set up alerts that notify you when pods go into this state so that the affected pod can be deleted (a sketch of such an alert follows).
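As a rough sketch, assuming you alert through Cloud Monitoring (Stackdriver) and that the GKE system metric kubernetes.io/container/restart_count is collected for your cluster; the field names follow the Monitoring AlertPolicy API, and the thresholds are made up, so verify both against your own setup:

# Alert when a container's restart count keeps climbing -
# the usual symptom of CrashLoopBackOff.
displayName: "Container restarting repeatedly"
combiner: OR
conditions:
- displayName: "restart_count delta too high"
  conditionThreshold:
    filter: metric.type="kubernetes.io/container/restart_count" AND resource.type="k8s_container"
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_DELTA   # restarts per 5-minute window
    comparison: COMPARISON_GT
    thresholdValue: 2
    duration: 600s

Once the alert fires, a human (or an automated responder) can simply delete the affected pod; the Deployment controller will recreate it with a fresh network interface.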

Though the failure itself may not be preventable, you can automate recovery from it.

-- Patrick W
Source: StackOverflow