Google Kubernetes Engine outage every 6 hours like clockwork?

5/7/2018

We have hit a strange issue in GKE on GCP: every 6h10m, give or take a couple of minutes, we see a few seconds to a minute of intermittent HTTP 500/520/525 errors when trying to access our API, and our logs haven't given us much to go on yet.

Our pipeline looks like:

user request -> CloudFlare -> GKE nginx LoadBalancer (ssl termination) -> GKE router pod -> API

Hitting CloudFlare or the GKE load balancer directly shows the same errors, so the issue seems to be somewhere within our GCP setup.

In the past I've run into a CloudSQL Proxy issue where it renewed an SSL cert every hour, causing very predictable, very brief outages.

Does GKE have a similar system we might be running into, where it does something every 6h that could be causing these errors?

Pingdom report: brief outage every 6h10m

-- xref
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

5/17/2018

The problem turned out to be that only 1 of the 2 required health check IP ranges for internal load balancing was whitelisted. I'm not sure why that made the errors so clockwork, but updating our firewall rules has stopped the issue. Hope that helps someone in the future!
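For anyone hitting the same thing: a minimal sketch of a firewall rule that allows both GCP health check source ranges (130.211.0.0/22 and 35.191.0.0/16, per Google's documentation). The rule name, network name, and open ports below are placeholders, not from the original answer; adjust them to your setup.

```shell
# Sketch: allow ingress from both Google Cloud health check ranges.
# "allow-gcp-health-checks" and --network=default are placeholder values.
gcloud compute firewall-rules create allow-gcp-health-checks \
  --network=default \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp \
  --source-ranges=130.211.0.0/22,35.191.0.0/16
```

If only one of the two ranges is allowed, health checks can pass or fail depending on which range a given probe comes from, which matches the intermittent-errors symptom described in the question.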

-- xref
Source: StackOverflow