Kubernetes outage, pods just vanished, refused to start

4/4/2019

I am after some advice please.

We had a Kubernetes (1.8.x) cluster running on AWS, set up with KOPS: one master and two nodes.

Over the weekend, half of our pods vanished and refused to start. The deployments still existed but the pods would not run. I tried terminating the nodes in AWS; they were replaced automatically, but the pods were still not reinstated.

This was a production application, and so after leaving it for about 8 hours to recover by itself (it didn't), I deleted the cluster using KOPS and recreated the whole thing successfully using a newer version of Kubernetes.

This whole experience was quite troubling, especially because I couldn't find out what was wrong with the cluster.

I would like some help with the following:

  1. What could/should I have checked in order to diagnose the issue?
  2. What could have conceivably caused the issue in the first place? I realise it's impossible to pinpoint it now, but please feel free to conjecture.
  3. How can I mitigate the future risk of this happening?

Thanks very much for any and all responses.

-- arrkaye
amazon-ec2
kops
kubernetes

1 Answer

4/5/2019

What could/should I have checked in order to diagnose the issue?

Run journalctl -u kubelet.service and/or docker logs --tail=150 ${anything_that_talks_to_the_apiserver} and look for error messages. Based on your experience with the x509 certificate expiry, I would guess the entire cluster was awash with error messages.
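
For example, a pass like the following would surface those errors (a sketch: the kubelet unit name and Docker-based control plane match a typical kops-provisioned host, and <container-name> is a placeholder):

    # Scan kubelet logs on each host for TLS/certificate errors
    journalctl -u kubelet.service --no-pager --since "2 hours ago" \
      | grep -iE 'x509|certificate|unauthorized'

    # Find and tail the control-plane containers that talk to the apiserver
    docker ps --format '{{.Names}}' | grep -E 'apiserver|controller|scheduler'
    docker logs --tail=150 <container-name>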

It is also very likely that your Nodes went NotReady once the kubelet failed to check in with the apiserver for a fixed duration. If you're using an SDN that communicates with the apiserver, such as some flannel or calico setups, then Pod networking will start to fail too, which is a cluster bloodbath.
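
If kubectl can still reach the apiserver at all, a quick check like this shows both symptoms (node and pod names are placeholders):

    # Node conditions; "Kubelet stopped posting node status" points at
    # kubelet <-> apiserver communication failure
    kubectl get nodes -o wide
    kubectl describe node <node-name>

    # SDN pods (flannel, calico, ...) usually live in kube-system
    kubectl get pods -n kube-system -o wide
    kubectl logs -n kube-system <sdn-pod-name> --tail=50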

What could have conceivably caused the issue in the first place? I realise it's impossible to pinpoint it now, but please feel free to conjecture.

Certificates always have a lifespan: a start time and an end time. That end time can be far off (10 years, 100 years, whatever), but it does exist, and once it passes the certificate is invalid and anything that performs certificate validation will reject it.
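
You can check that end time directly with openssl; for example (host, port, and file path are placeholders for your apiserver endpoint and on-disk certificates):

    # Expiry of the certificate the apiserver is actually serving
    echo | openssl s_client -connect <api-server-host>:443 2>/dev/null \
      | openssl x509 -noout -dates

    # Expiry of a certificate file on disk (e.g. a kubelet client cert)
    openssl x509 -in <path-to-cert>.pem -noout -dates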

How can I mitigate the future risk of this happening?

There are actually several ways to monitor the expiry of the important certificates in your system, including a handy Prometheus exporter (the blackbox exporter) that returns probe_ssl_earliest_cert_expiry, allowing you to set an alert on that metric. Modern Kubernetes versions (which 1.8 is not) can rotate the cluster's own certificates, conceptually side-stepping this mess entirely.
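
As a rough sketch of the monitoring approach (assuming a blackbox exporter already running on localhost:9115; the TLS-capable module name tcp_tls is illustrative, not a built-in default):

    # The exporter reports the earliest certificate expiry of the probed
    # endpoint as a unix timestamp
    curl -s 'http://localhost:9115/probe?module=tcp_tls&target=<api-server-host>:443' \
      | grep probe_ssl_earliest_cert_expiry

    # A typical Prometheus alert fires when expiry is, say, under 14 days away:
    #   probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600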

-- mdaniel
Source: StackOverflow