How can Kubernetes recover Pods stuck in Error or Terminating?

10/23/2018

I have a cluster where the free memory on the nodes recently dipped to 5%. When this happens, the node's CPU load spikes while it tries to free up memory from cache/buffers. One consequence of the high load and low memory is that I sometimes end up with Pods that get into an Error state or get stuck in Terminating. These Pods sit around until I manually intervene, which can further exacerbate the low-memory issue that caused them.

My question is: why does Kubernetes leave these Pods stuck in this state? My hunch is that Kubernetes didn't get the right feedback from the Docker daemon and never tries again. I need to know how to have Kubernetes clean up or repair Error and Terminating Pods. Any ideas?
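(For reference, the manual intervention is typically a force delete along these lines; the pod name and namespace here are placeholders.)

~ # kubectl delete pod stuck-pod-123abc -n my-namespace --grace-period=0 --force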

I'm currently on:

~ # kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

UPDATE: Here are some of the Events listed on the Pods. You can see that some of them sit around for days. You will also see that one shows a Warning, but the others show Normal.

Events:
  Type     Reason         Age                  From                 Message
  ----     ------         ----                 ----                 -------
  Warning  FailedKillPod  25m                  kubelet, k8s-node-0  error killing pod: failed to "KillContainer" for "kubectl" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Normal   Killing        20m (x2482 over 3d)  kubelet, k8s-node-0  Killing container with id docker://docker:Need to kill Pod
  Normal   Killing        15m (x2484 over 3d)  kubelet, k8s-node-0  Killing container with id docker://maven:Need to kill Pod
  Normal   Killing        8m (x2487 over 3d)   kubelet, k8s-node-0  Killing container with id docker://node:Need to kill Pod
  Normal   Killing        4m (x2489 over 3d)   kubelet, k8s-node-0  Killing container with id docker://jnlp:Need to kill Pod

Events:
  Type    Reason   Age                 From                 Message
  ----    ------   ----                ----                 -------
  Normal  Killing  56m (x125 over 5h)  kubelet, k8s-node-2  Killing container with id docker://owasp-zap:Need to kill Pod
  Normal  Killing  47m (x129 over 5h)  kubelet, k8s-node-2  Killing container with id docker://jnlp:Need to kill Pod
  Normal  Killing  38m (x133 over 5h)  kubelet, k8s-node-2  Killing container with id docker://dind:Need to kill Pod
  Normal  Killing  13m (x144 over 5h)  kubelet, k8s-node-2  Killing container with id docker://maven:Need to kill Pod
  Normal  Killing  8m (x146 over 5h)   kubelet, k8s-node-2  Killing container with id docker://docker-cmds:Need to kill Pod
  Normal  Killing  1m (x149 over 5h)   kubelet, k8s-node-2  Killing container with id docker://pmd:Need to kill Pod

Events:
  Type    Reason   Age                  From                 Message
  ----    ------   ----                 ----                 -------
  Normal  Killing  56m (x2644 over 4d)  kubelet, k8s-node-0  Killing container with id docker://openssl:Need to kill Pod
  Normal  Killing  40m (x2651 over 4d)  kubelet, k8s-node-0  Killing container with id docker://owasp-zap:Need to kill Pod
  Normal  Killing  31m (x2655 over 4d)  kubelet, k8s-node-0  Killing container with id docker://pmd:Need to kill Pod
  Normal  Killing  26m (x2657 over 4d)  kubelet, k8s-node-0  Killing container with id docker://kubectl:Need to kill Pod
  Normal  Killing  22m (x2659 over 4d)  kubelet, k8s-node-0  Killing container with id docker://dind:Need to kill Pod
  Normal  Killing  11m (x2664 over 4d)  kubelet, k8s-node-0  Killing container with id docker://docker-cmds:Need to kill Pod
  Normal  Killing  6m (x2666 over 4d)   kubelet, k8s-node-0  Killing container with id docker://maven:Need to kill Pod
  Normal  Killing  1m (x2668 over 4d)   kubelet, k8s-node-0  Killing container with id docker://jnlp:Need to kill Pod
-- Daniel Watrous
cpu-usage
docker
kubernetes
memory-management
out-of-memory

3 Answers

3/11/2020

A workaround is to remove the finalizers by running kubectl patch. This can happen to different types of resources, such as a PersistentVolume or a Deployment; in my experience it is most common with PVs/PVCs.

# for pods
$ kubectl patch pod pod-name-123abc -p '{"metadata":{"finalizers":null}}' -n your-app-namespace

# for pvc
$ kubectl patch pvc pvc-name-123abc -p '{"metadata":{"finalizers":null}}' -n your-app-namespace
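
If many pods are stuck at once, the same patch can be applied in a loop. This is only a rough sketch, assuming the namespace is your-app-namespace as above and that the stuck pods show Terminating in the STATUS column of kubectl get pods:

# rough sketch: clear finalizers on every pod currently shown as Terminating
# (STATUS is field 3 of the default `kubectl get pods` output)
for pod in $(kubectl get pods -n your-app-namespace | awk '$3 == "Terminating" {print $1}'); do
  kubectl patch pod "$pod" -n your-app-namespace -p '{"metadata":{"finalizers":null}}'
done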
-- alltej
Source: StackOverflow

2/6/2020

I had to restart all the nodes. I noticed one minion was slow and unresponsive; that one was probably the culprit. After the restart, all the Terminating pods disappeared.
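A slightly gentler alternative to an outright reboot is to cordon and drain the slow node first, reboot it, and then let it take pods again. A rough sketch (the node name is a placeholder):

$ kubectl drain k8s-node-2 --ignore-daemonsets --delete-local-data
$ # ... reboot the node, then ...
$ kubectl uncordon k8s-node-2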

-- Tudor
Source: StackOverflow

10/23/2018

This is typically related to the metadata.finalizers on your objects (Pod, Deployment, etc.).

You can also read more about Foreground Cascading Deletion and how it uses metadata.finalizers.
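As a quick check, you can print the finalizers on a stuck object with a jsonpath query; a sketch, with the pod name and namespace as placeholders:

$ kubectl get pod pod-name-123abc -n your-app-namespace -o jsonpath='{.metadata.finalizers}'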

If not, it could be a networking issue. You can check the kubelet logs, typically:

journalctl -xeu kubelet 

You can also check the docker daemon logs, typically:

cat /var/log/syslog | grep dockerd
-- Rico
Source: StackOverflow