Unexplained outage due to faulty (?) Readiness check results

10/30/2018

TL;DR: At a seemingly random point in time, all the "web"-pods handling traffic coming from our ingress became unhealthy. After roughly an hour everything was healthy again, and none of my actions appear to have caused the recovery. I'm trying to figure out what happened: about an hour of downtime that I can't explain is scary (luckily it is not production...yet).


I'll describe the situation as best as I can. At a moment when no significant changes were being made to the platform or its sources, the external uptime check for our Kubernetes cluster (GKE) alerted me that the platform was unreachable. Sure enough: a request to the endpoint returned an HTTP 502 error.
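
If this happens again, one thing worth checking is whether the GKE ingress itself considers its backends healthy. A rough sketch (the ingress name web-ingress is just a placeholder; on GKE the describe output should include a backends annotation with per-backend health):

  # Placeholder ingress name; look for the backends annotation
  # reporting HEALTHY/UNHEALTHY per backend service.
  kubectl get ingress
  kubectl describe ingress web-ingress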

When describing one of the web-pods, I noticed the health-check was failing:

  Warning  Unhealthy              11m (x3 over 11m)  kubelet, gke-test-001-web-6a31v8hq-p5ff  Readiness probe failed: Get http://10.52.25.179:3000/healthz/full: dial tcp 10.52.25.179:3000: getsockopt: connection refused

Investigating further, it became clear that the Readiness probes on all web-pods were failing; this was causing the outage.
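
Since unready pods are removed from the Service's endpoints, this would also explain the 502 at the ingress. A rough way to confirm the scope of the problem (the label selector app=web and the service name web below are placeholders for our actual names):

  # All web pods should show 0/1 in the READY column while the probe fails.
  kubectl get pods -l app=web -o wide

  # With every pod unready, the service is left with no ready addresses;
  # the pod IPs show up under notReadyAddresses instead.
  kubectl get endpoints web -o yaml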

Another weird thing to note: the Readiness and Liveness probes for these web-pods are currently configured exactly the same. Yet while the Readiness checks were consistently marked as Failed, the Liveness probes never failed.
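
To rule out a subtle difference between the two probe definitions, both can be dumped straight from the Deployment spec. A minimal sketch, assuming a Deployment called web with a single container:

  # Print both probe specs for a side-by-side comparison.
  kubectl get deployment web \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}{.spec.template.spec.containers[0].livenessProbe}{"\n"}'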

I investigated further and noticed that the endpoint configured for the Readiness checks responded perfectly fine from the following locations:

From the POD:

root@webpod-76c8ctc6t8-2prjz:/var/lib/webapp# curl -iL 10.52.25.179:3000/healthz/full.json
HTTP/1.1 200 OK
Connection: close
Content-Type: application/json; charset=utf-8

{"healthy":true,"message":"success"}

From the NODE:

root@gke-test-001-web-6a31v8hq-p5ff ~ $ curl -iL http://10.52.25.179:3000/healthz/full.json
HTTP/1.1 200 OK
Connection: close
Content-Type: application/json; charset=utf-8

{"healthy":true,"message":"success"}

This was happening while the health checks kept coming back as Failed. Somehow I was getting different results than what the kubelet on each of these nodes was getting?
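
To get closer to what the kubelet actually sees, the probe can be imitated from the node using its defaults (1 second timeout unless overridden) and the kubelet's own log can be checked. A rough sketch, run on the node:

  # Hit the exact probe URL with the default probe timeout of 1 second.
  curl -i --max-time 1 http://10.52.25.179:3000/healthz/full

  # The kubelet should log probe failures; on GKE it runs as a systemd unit.
  sudo journalctl -u kubelet --since "30 min ago" | grep -iE 'prober|readiness'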

What I did notice is the following:

[Mon Oct 29 16:06:57 2018] cbr0: port 16(veth34a6a4ce) entered disabled state

That looks to me like a port (veth) on the pod network's bridge got disabled, but if that were really causing an issue I wouldn't expect to be able to reach the pod's IP at all...
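
The bridge and the pod's veth can be inspected directly on the node with iproute2 (the veth name is taken from the kernel message above). A rough sketch:

  # State of the cbr0 bridge itself.
  ip -d link show cbr0

  # Per-port view of the bridge; a disabled port would show up here.
  bridge link show | grep veth34a6a4ce

  # The veth's own flags and which bridge it is enslaved to.
  ip link show veth34a6a4ce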

I tried the following things:

  • Validate the deemed "unhealthy" pods (they were healthy according to me)
  • ifup and ifdown the cbr0 interface
  • Kill the kubelet on one of the nodes and check if that fixed the respective Readiness checks (it didn't; rough commands for this and the node recycling are sketched below this list)
  • Reboot a node and check if that fixed the respective Readiness checks (it didn't)
  • Delete all nodes in the node pool assigned to the web-pods and see if fresh ones fixed the issue (it didn't)
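
For reference, these are roughly the commands involved in the kubelet restart and the node recycling (the zone below is a placeholder); restarting through systemd is cleaner than killing the process, and on GKE a deleted instance is recreated by its managed instance group:

  # Restart the kubelet cleanly instead of just killing the process.
  sudo systemctl restart kubelet

  # Recycle a node: drain it first, then delete the instance;
  # GKE's managed instance group brings up a replacement.
  kubectl drain gke-test-001-web-6a31v8hq-p5ff --ignore-daemonsets
  gcloud compute instances delete gke-test-001-web-6a31v8hq-p5ff --zone europe-west1-b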

And as suddenly as it came, before I was able to determine the cause and after almost exactly an hour, my pods became healthy again and the platform was functioning fine...

Does anyone know what happened here? Any tips on what I can do if this occurs again?

(Please note that the times in the snippets may differ greatly, as they were taken at different points in time; timestamps are UTC.)

-- st0ne2thedge
google-kubernetes-engine
kubernetes
kubernetes-health-check
kubernetes-pod

1 Answer

11/7/2018

This issue turned out to be unrelated to the failing readiness checks.

The actual cause was a ConfigMap that, due to human error, did not get loaded in the correct location!
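
For anyone hitting something similar: the mismatch can be spotted by comparing the ConfigMap with what is actually mounted inside the pod. A rough sketch, where the ConfigMap name, mount path and deployment name are placeholders (the pod name is one of ours from above):

  # What the cluster thinks the ConfigMap contains.
  kubectl get configmap webapp-config -o yaml

  # Where the Deployment actually mounts it.
  kubectl get deployment web \
    -o jsonpath='{.spec.template.spec.volumes}{"\n"}{.spec.template.spec.containers[0].volumeMounts}{"\n"}'

  # What the running container actually sees at that (placeholder) path.
  kubectl exec webpod-76c8ctc6t8-2prjz -- ls -l /var/lib/webapp/config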

-- st0ne2thedge
Source: StackOverflow