Debug Kubernetes node termination

9/5/2018

Last night my Kubernetes cluster terminated 2 of my nodes, and I can't figure out the details of what happened.

kubectl describe nodes gives the following for the nodes that failed:

Conditions:
Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
----                 ------  -----------------                 ------------------                ------                       -------
NetworkUnavailable   False   Tue, 04 Sep 2018 21:57:00 +0000   Tue, 04 Sep 2018 21:57:00 +0000   RouteCreated                 RouteController created a route
OutOfDisk            False   Wed, 05 Sep 2018 12:12:33 +0000   Tue, 04 Sep 2018 21:56:27 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
MemoryPressure       False   Wed, 05 Sep 2018 12:12:33 +0000   Tue, 04 Sep 2018 21:56:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
DiskPressure         False   Wed, 05 Sep 2018 12:12:33 +0000   Tue, 04 Sep 2018 21:56:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
Ready                True    Wed, 05 Sep 2018 12:12:33 +0000   Tue, 04 Sep 2018 21:57:01 +0000   KubeletReady                 kubelet is posting ready status

So I know that OutOfDisk, MemoryPressure and DiskPressure were all in an error state at some point last night, but what caused that to happen?

I also checked kubectl get events --all-namespaces and got nothing.
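Note that events are only retained for a limited window (the kube-apiserver's --event-ttl defaults to 1h), so anything from last night has likely expired. While node events still exist, they can be filtered with something like this:

kubectl get events --all-namespaces --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp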

Finally, kubectl describe pods simply gave me this unhelpful information:

State:          Running
  Started:      Tue, 04 Sep 2018 22:03:47 +0000
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Thu, 30 Aug 2018 14:36:48 +0000
  Finished:     Tue, 04 Sep 2018 21:25:16 +0000

Is there a way to do a post-mortem on this? I'd like to know more than just that it was out of disk space.

-- Nate Bosscher
kubernetes

2 Answers

3/7/2020

Try using this Grafana dashboard: https://grafana.com/grafana/dashboards/11802

At the node level, you can find the following details, which can help you correlate the events:

- Uptime
- Node readiness
- CPU, memory and load on the node
- Kubelet errors, which can be related to PLEG
- Pod count on the node, by namespace
- Memory/Disk/PID pressure
- Top 5 memory-guzzling pods
- NTP time deviation
- Kubelet eviction stats
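If you don't run Grafana, much of the same kubelet data can be sampled directly through the API server proxy. A rough sketch (the node name is a placeholder, and your user needs access to the nodes/proxy resource):

# Node-level memory and filesystem stats as reported by the kubelet
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.memory, .node.fs'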

-- Mukund Sharma
Source: StackOverflow

9/5/2018

I would recommend reading the following documentation: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

The first thought that comes to mind is to examine the logs of your nodes/pods.

kubectl logs
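For the containers that show Last State: Terminated, the --previous flag retrieves the logs of the prior instance, and the kubelet's journal on the node itself usually records pressure and eviction decisions. A sketch, assuming a systemd-managed kubelet and placeholder names:

# Logs from the container instance that exited before the restart
kubectl logs <pod-name> --previous

# On the affected node: the kubelet's view of resource pressure and evictions
journalctl -u kubelet --since "2018-09-04 21:00:00" | grep -iE 'evict|pressure'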
-- nzoueidi
Source: StackOverflow