How to troubleshoot long pod kill time for GKE?

10/24/2019

When using helm upgrade --install, I occasionally run into timeouts. The error I get is:

UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
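
For reference, the full command is along these lines (the release and chart names here are just placeholders):

helm upgrade --install my-release ./my-chart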

If I look in the GKE cluster logs on GCP, I see that when this happens it's because this step takes an unusually long time to execute:

Killing container with id docker://{container-name}:Need to kill Pod

I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.

Any suggestions on how to keep troubleshooting this?

-- stumpbeard
google-cloud-platform
google-kubernetes-engine
kubernetes-helm

1 Answer

10/24/2019

You could refer to this troubleshooting guide for general issues with Google Kubernetes Engine.

As mentioned there, you may need to use the 'Troubleshooting Application' guide for further debugging of the application pods or their controller objects.

I am assuming that you have already checked the logs (1) of the container in the affected pod or described (2) the pod (look at the reason for termination) using the commands below. If not, you can try these as well to get more useful information.

1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
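
Since the slow step is the pod kill itself, it may also help to look at the pod's termination settings and the cluster events around that time. A rough sketch with the same placeholders as above (these extra checks are my addition; the grace period and any preStop hook determine how long the kubelet waits before force-killing the container):

3. kubectl get pod POD_NAME -o jsonpath='{.spec.terminationGracePeriodSeconds}'
4. kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].lifecycle.preStop}'
5. kubectl get events --sort-by=.metadata.creationTimestamp | grep POD_NAME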

Note: I saw a similar discussion thread on github.com about a helm upgrade failure. You can have a look at that as well.
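
As a stopgap while you investigate, you could also give Helm more time before it rolls back, so that a slow pod termination does not fail the whole upgrade. A rough sketch, assuming placeholder release and chart names (on Helm 2 the --timeout flag takes seconds, on Helm 3 a duration such as 10m0s):

helm upgrade --install RELEASE_NAME CHART_PATH --timeout 600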

-- Digil
Source: StackOverflow