Kubernetes Node NotReady: ContainerGCFailed / ImageGCFailed context deadline exceeded

3/7/2019

Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:

ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded

Environment:

Ubuntu, 16.04 LTS

Kubernetes version: v1.13.3

Docker version: 18.06.1-ce

There is a closed issue about this on the Kubernetes GitHub tracker (k8 git), which was closed on the grounds of being a Docker problem.

Steps done to troubleshoot the issue:

  1. kubectl describe node - the error in question was found (the root cause isn't clear).
  2. journalctl -u kubelet - shows this related message:

    skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]

    This is related to the open Kubernetes issue Ready/NotReady with PLEG issues.

  3. Checked node health on AWS with CloudWatch - everything seems to be fine.

  4. journalctl -fu docker.service : checked Docker for errors/issues - the output doesn't show any errors related to this.

  5. systemctl restart docker - after restarting Docker, the node gets into "Ready" state, but in 3-5 minutes becomes "NotReady" again.
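The troubleshooting steps above can be sketched as a set of commands. This is a sketch, assuming kubectl is run from a control machine and the journalctl/systemctl commands are run on the node itself; the node name worker-1 is a placeholder.

```shell
# Placeholder for the affected node's name.
NODE=worker-1

# 1. Look for GC / runtime errors in the node's conditions and events.
kubectl describe node "$NODE" | grep -iE 'ContainerGCFailed|ImageGCFailed|DeadlineExceeded'

# 2. Check kubelet logs (on the node) for PLEG health messages.
journalctl -u kubelet --no-pager | grep -i 'PLEG is not healthy' | tail -n 5

# 4. Check the Docker daemon logs (on the node) for errors.
journalctl -u docker.service --no-pager | grep -iE 'error|timeout' | tail -n 20

# 5. Restart Docker as a temporary remedy, then watch whether the
#    node flips back to NotReady after a few minutes.
sudo systemctl restart docker
kubectl get node "$NODE" -w
```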

It all seems to have started when I deployed more pods to the node (close to its resource capacity, though I don't think there is a direct dependency) or stopped/started instances (after a restart the node is OK, but after some time it becomes NotReady again).

Questions:

What is the root cause of the error?

How to monitor that kind of issue and make sure it doesn't happen?

Are there any workarounds to this problem?

-- Alexz
kubernetes

1 Answer

6/12/2019

What is the root cause of the error?

From what I was able to find, it seems the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.

How to monitor that kind of issue and make sure it doesn't happen?

There seems to be no established mitigation or monitoring for this. The best approach appears to be making sure your node is not overloaded with pods. I have seen that the problem is not always reflected in the node's disk or memory pressure conditions - it is likely a matter of Docker not having enough resources and failing to respond in time. The proposed solution is to set resource limits for your pods to prevent overloading the node.
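As a sketch of setting such limits, kubectl can patch requests and limits onto an existing workload; the deployment name my-app and the specific values here are hypothetical and should be tuned to your workload.

```shell
# Hypothetical example: cap a deployment's pods so they cannot
# overload the node ("my-app" is a placeholder deployment name).
kubectl set resources deployment my-app \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi

# Verify the applied values on the pod template.
kubectl get deployment my-app \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```

With requests set, the scheduler stops packing pods onto a node past its declared capacity, which is the overload scenario described above.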

In the case of managed Kubernetes on GKE (not sure, but other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and recreate the node(s).

If you already have resource requests and limits set, the best way to make sure this does not happen is to increase the memory requests for your pods. This means fewer pods per node, and the actual memory used on each node should be lower.

Another way of monitoring/recognizing this is to SSH into the node, check its memory, inspect processes with ps, monitor the syslog, and run docker stats --all.
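Those manual checks can be sketched as follows; this assumes an Ubuntu node (syslog at /var/log/syslog, as in the environment above) and a shell session on the node itself.

```shell
# On the node (after SSH): quick checks for an overloaded Docker daemon.
free -h                           # overall memory usage
ps aux --sort=-%mem | head -n 10  # top memory consumers
tail -n 50 /var/log/syslog        # recent syslog entries
docker stats --all --no-stream    # one-shot per-container CPU/memory usage
```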

-- aurelius
Source: StackOverflow