Controlling pod recovery from "Error: ImagePullBackOff" when Container Registry is also inaccessible

2/24/2022

We had a major outage when both our container registry and the entire K8S cluster lost power. The cluster recovered faster than the container registry, and my pod (part of a StatefulSet) got stuck in Error: ImagePullBackOff.

Is there a config setting to retry downloading the image from the CR periodically or recover without manual intervention?

I looked at imagePullPolicy, but that does not apply to a situation where the CR is unavailable.

-- ucipass
container-registry
kubernetes
kubernetes-pod

1 Answer

2/25/2022

The BackOff part of the ImagePullBackOff status means that Kubernetes keeps trying to pull the image from the registry, with an exponential back-off delay (10s, 20s, 40s, …). The delay between attempts increases until it reaches a compiled-in limit of 300 seconds (5 minutes); there is more on this in the Kubernetes docs. Since the retries continue, the pull should succeed on a later attempt once the registry is reachable again, without manual intervention.
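
To make the retry timing concrete, here is a minimal sketch of the schedule described above: start at 10 seconds, double each time, and cap at 300 seconds. It only illustrates the arithmetic and is not the kubelet's actual code; the function and parameter names are hypothetical.

```python
def backoff_schedule(initial=10, cap=300, factor=2, attempts=8):
    """Return the delay in seconds before each retry.

    Hypothetical helper that models the schedule described in the answer
    (10s, 20s, 40s, ... capped at 300s). Not taken from Kubernetes source.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        # Grow the delay for the next attempt, but never past the cap.
        delay = min(delay * factor, cap)
    return delays


if __name__ == "__main__":
    print(backoff_schedule())  # [10, 20, 40, 80, 160, 300, 300, 300]
```

Under these assumptions, once the cap is reached the pod should wait at most about 5 minutes between pull attempts, so a pod stuck in ImagePullBackOff would be expected to retry within roughly that window after the registry comes back.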

The backOffPeriod parameter for image pulls is a hard-coded constant in Kubernetes and, unfortunately, is not tunable at present, because changing it can affect node performance. The only way to adjust it is to change it in the source code and build your own custom kubelet binary. There is still an open issue about making it adjustable.

-- anarxz
Source: StackOverflow