Last night my Kubernetes cluster terminated 2 of my nodes, and I can't figure out the details of what happened.
kubectl describe nodes
gives the following for the nodes that failed:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 04 Sep 2018 21:57:00 +0000 Tue, 04 Sep 2018 21:57:00 +0000 RouteCreated RouteController created a route
OutOfDisk False Wed, 05 Sep 2018 12:12:33 +0000 Tue, 04 Sep 2018 21:56:27 +0000 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 05 Sep 2018 12:12:33 +0000 Tue, 04 Sep 2018 21:56:27 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 05 Sep 2018 12:12:33 +0000 Tue, 04 Sep 2018 21:56:27 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Wed, 05 Sep 2018 12:12:33 +0000 Tue, 04 Sep 2018 21:57:01 +0000 KubeletReady kubelet is posting ready status
So I know that OutOfDisk, MemoryPressure and DiskPressure
were all in an error state at some point last night, but what caused that to happen?
I also checked kubectl get events --all-namespaces
and it returned nothing.
Finally, kubectl describe pods
simply gave me this unhelpful information:
State: Running
Started: Tue, 04 Sep 2018 22:03:47 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 30 Aug 2018 14:36:48 +0000
Finished: Tue, 04 Sep 2018 21:25:16 +0000
Is there a way to do a post-mortem on this? I'd like to know more than just that it was out of disk space.
Try using this Grafana dashboard: https://grafana.com/grafana/dashboards/11802
At the node level you can find the following details, which can help you correlate the events:
- Uptime
- Node readiness
- CPU, memory and load on the node
- Kubelet errors, which can be related to PLEG
- Pod count on the node, by namespace
- Memory/Disk/PID pressure
- Top 5 memory-guzzling pods
- NTP time deviation
- Kubelet eviction stats
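If that dashboard's data source is already in place, you can also query the underlying metrics directly. A minimal sketch, assuming Prometheus and kube-state-metrics are running in the cluster; the Prometheus hostname below is a placeholder:

# Ask Prometheus which nodes reported MemoryPressure=true
# (kube_node_status_condition comes from kube-state-metrics)
curl -s 'http://prometheus.example.local:9090/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="MemoryPressure",status="true"}'

# The same pattern works for condition="DiskPressure"; use the /api/v1/query_range
# endpoint with start/end timestamps to see when the condition flipped overnight.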
I would also recommend reading the following documentation: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
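That page describes the kubelet eviction thresholds that flip the MemoryPressure/DiskPressure conditions. A minimal sketch for checking what your nodes are actually running with; NODE_NAME is a placeholder, and this assumes the kubelet's configz endpoint is reachable through the API server proxy:

# Dump the running kubelet configuration for one node, including the evictionHard thresholds
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/configz" | python3 -m json.tool

# Pod evictions also show up as events (reason=Evicted), but events are
# garbage-collected after a short TTL (1 hour by default), which is likely
# why "kubectl get events" came back empty the next morning.
kubectl get events --all-namespaces --field-selector reason=Evicted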
The first thought that comes to mind is examining the logs of your nodes/pods with
kubectl logs
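Since the container in your describe output was restarted after an error, there are a couple of places to look. A minimal sketch; POD_NAME and CONTAINER_NAME are placeholders, and the journalctl commands assume you can SSH to a systemd-based node:

# Logs from the previous (terminated) container instance, i.e. the one that exited with code 1
kubectl logs POD_NAME -c CONTAINER_NAME --previous

# On the node itself, the kubelet log records pressure conditions and evictions
journalctl -u kubelet --since "2018-09-04 21:00" --until "2018-09-04 22:30"

# Kernel log for OOM-killer activity, which often accompanies MemoryPressure
journalctl -k | grep -i "out of memory"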