I am after some advice please.
We had a Kubernetes (1.8.x) cluster running on AWS, set up with KOPS, with 1 master and 2 nodes.
Over the weekend, half of our pods vanished and refused to start. The deployments still existed, but the pods would not run. I tried terminating the nodes in AWS and they were automatically replaced, but the pods still were not reinstated.
This was a production application, and so after leaving it for about 8 hours to recover by itself (it didn't), I deleted the cluster using KOPS and recreated the whole thing successfully using a newer version of Kubernetes.
This whole experience was quite troubling, especially as I couldn't find out what was wrong with the cluster.
I would like some help with the following questions. Thanks very much for any and all responses.
What could/should I have checked in order to diagnose the issue?
journalctl -u kubelet.service
and/or docker logs --tail=150 ${anything_that_talks_to_the_apiserver}
to look for error messages. Based on your experience with the x509 certificate expiry, I would guess the entire cluster would have been awash with error messages. It is also very likely that your Nodes went NotReady as the kubelet failed to check in with the apiserver after a fixed duration. If you're using an SDN that communicates with the apiserver, such as some flannel or calico setups, then Pod networking will start to fail too, which is a cluster bloodbath.
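Concretely, the checks I have in mind look something like the following sketch. It assumes you still have kubectl access to the cluster; ${node_name}, the one-hour window, and the grep pattern are just illustrative placeholders:

# Are the Nodes still checking in? NotReady here points at kubelet <-> apiserver trouble.
kubectl get nodes
kubectl describe node ${node_name}

# Cluster-level events often spell out the failure directly.
kubectl get events --all-namespaces

# On a node: look for x509 / certificate errors in the kubelet logs.
journalctl -u kubelet.service --since "1 hour ago" | grep -i -E 'x509|certificate'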
What could have conceivably caused the issue in the first place? I realise it's impossible to pinpoint it now, but please feel free to conjecture.
Certificates always have a lifespan, with a start time (notBefore) and an end time (notAfter); that end time can be far in the future -- 10 years, 100 years, whatever -- but it does exist, and once it passes the certificate is invalid and anyone who performs certificate validation will reject it.
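If you want to see those two timestamps for yourself, openssl will print them for the certificate the apiserver actually serves; the hostname below is a hypothetical stand-in for your cluster's apiserver endpoint:

# Print the notBefore/notAfter dates of the cert presented by the apiserver.
echo | openssl s_client -connect api.internal.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates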
How can I mitigate the future risk of this happening?
There are actually several ways you can monitor the certificate expiry of important certs in your system, including a handy Prometheus exporter (the blackbox exporter) that returns probe_ssl_earliest_cert_expiry, allowing you to set an alert based on that metric. Modern Kubernetes versions -- of which 1.8 is not one -- allow the cluster to rotate its own certificates, conceptually side-stepping this mess entirely.
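For illustration, here is a minimal sketch of such an alert. It assumes the blackbox exporter is already probing the apiserver endpoint and Prometheus is scraping it; the file name and the 14-day threshold are arbitrary choices of mine:

# Write a Prometheus alerting rule and validate it with promtool before loading it.
cat > cert-expiry-rules.yml <<'EOF'
groups:
- name: certificate-expiry
  rules:
  - alert: CertificateExpiringSoon
    # Fire when the earliest cert seen by the probe expires within 14 days.
    expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "A monitored TLS certificate expires in under 14 days"
EOF
promtool check rules cert-expiry-rules.yml

Wire the rule file into your Prometheus configuration and route the alert somewhere your team will actually see it. On newer clusters, the kubelet's --rotate-certificates flag also handles client-certificate renewal for you, which reduces how much of this you have to babysit.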