I'm running a Kubernetes cluster on Google Cloud (version 1.13.7-gke.24). The same code has been running on the cluster for more than 3 months without any problems. Today I found that one of the pods had been disconnected from the network for more than 24 hours.
First, I checked whether the pod had internet connectivity (normally it does). I used `curl` to query some well-known websites - all of them were unreachable. The same thing happened when I tried to run `apt-get update` or `apt-get upgrade`.
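For reference, these are roughly the checks I ran from inside the pod (the pod, namespace, and service names below are placeholders):

```bash
# Open a shell inside the affected pod (names are placeholders)
kubectl exec -it my-pod -n my-namespace -- sh

# Public internet targets - all of these timed out
curl -v --max-time 10 https://www.google.com
apt-get update

# In-cluster / VPC targets still responded
curl -v --max-time 10 http://my-other-service.my-namespace.svc.cluster.local
```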
Second, I checked my application's logs and found exceptions like this:
Unable to log to provider GoogleStackdriverLogProvider, ex: Grpc.Core.RpcException: Status(StatusCode=Unavailable, Detail="Connect Failed")
at Google.Api.Gax.Grpc.ApiCallRetryExtensions.<>c__DisplayClass0_0`2.<<WithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at ***.LogService.Providers.GoogleStackdriverLogProvider.WriteAsync(IEnumerable`1 entries) in LogService/Providers/GoogleStackdriverLogProvider.cs:line 71
Those exceptions come from code I run that sends new log entries to Google Stackdriver. Note that those logs are stored in the same datacenter - no internet access is needed to send them - and still the application couldn't reach the destination.
Last, and this is the strange part, connectivity to the queue system kept working. The application continued to download new messages from the queue, but all of them ended in failure because of the network problem.
Summary:
Internet connectivity - NO
VPC connectivity - YES
GCP services connectivity - YES
Other notes: `ssh` into the problematic pod.

What can I do to prevent this problem in the future?
This sounds like a transient issue, possibly a failure of the virtual network interface created for the pod. These types of failure are rare and hard to prevent. However, you can make your deployment more resilient by using a livenessProbe, so that this kind of error causes the container to fail its health check and be restarted, as in the sketch below.
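For example, a probe that checks outbound connectivity from inside the container. The endpoint, timings, and image here are placeholders, and the probe assumes curl is available in the image; adjust it to whatever health signal makes sense for your application:

```yaml
# Sketch of a container spec with a connectivity-based livenessProbe.
# The image, URL, and thresholds are placeholders; curl must exist in the image.
containers:
- name: my-app
  image: gcr.io/my-project/my-app:latest
  livenessProbe:
    exec:
      command:
      - sh
      - -c
      - curl -sf --max-time 5 https://www.google.com > /dev/null
    initialDelaySeconds: 30
    periodSeconds: 60
    failureThreshold: 3   # restart the container after ~3 consecutive failed checks
```

If your application already exposes an HTTP health endpoint that depends on its outbound connections, an httpGet probe against that endpoint is usually a cleaner choice than calling an external site.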
Unfortunately, if restarting the container is not enough, the pod will go into the CrashLoopBackOff state. You could set up alerts that notify you when pods go into this state so you can trigger pod deletion.
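For example, if you run Prometheus with kube-state-metrics (an assumption on my part; Cloud Monitoring can alert on the same condition), a rule along these lines fires when a pod stays in that state:

```yaml
# Sketch of a Prometheus alerting rule; assumes kube-state-metrics is installed.
groups:
- name: pod-health
  rules:
  - alert: PodCrashLooping
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is stuck in CrashLoopBackOff"
```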
Though this kind of failure may not be preventable, you can automate the recovery.
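A minimal sketch of that automation, assuming it is safe in your setup to delete such pods so that their Deployment/ReplicaSet recreates them (run it from a CronJob or a cleanup script with RBAC permissions to list and delete pods):

```bash
#!/bin/bash
# Find pods whose containers are waiting in CrashLoopBackOff and delete them
# so that their controller schedules fresh replacements.
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | awk '$3 == "CrashLoopBackOff" {print $1, $2}' \
  | while read ns pod; do
      echo "Deleting $ns/$pod (CrashLoopBackOff)"
      kubectl delete pod "$pod" -n "$ns"
    done
```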