Kubernetes pods occasionally throw "ImagePullBackOff" or "ErrImagePull"

8/3/2019

I understand that ImagePullBackOff or ErrImagePull happens when Kubernetes cannot pull a container image, but I do not think that is the case here. I say this because the error is thrown at random by only some of the pods as my service scales, while others come up perfectly fine with an OK status.

For instance, within a single replica set, some pods come up fine while others fail with these errors.

I retrieved the events from one such failed pod.

Events:
  Type     Reason     Age                   From                                                          Message
  ----     ------     ----                  ----                                                          -------
  Normal   Scheduled  3m45s                 default-scheduler                                             Successfully assigned default/storefront-jtonline-prod-6dfbbd6bd8-jp5k5 to gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl
  Normal   Pulling    2m8s (x4 over 3m44s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Failed to pull image "gcr.io/square1-2019/storefront-jtonline-prod:latest": rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
  Warning  Failed     2m7s (x4 over 3m43s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ErrImagePull
  Normal   BackOff    113s (x6 over 3m42s)  kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Back-off pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
  Warning  Failed     99s (x7 over 3m42s)   kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl  Error: ImagePullBackOff

The events tell me the image pull failed because of invalid credentials, which seems... confusing? This pod was created automatically during autoscaling, exactly like the others.

I have a feeling this might have to do with resourcing. I see a much higher rate of these errors when the cluster spins up new nodes quickly due to a spike in traffic, or when I set lower resource requests in my deployment configurations.

How do I go about debugging this error, and what could be a possible reason this is happening?
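
For context, a rough way to see which nodes the failing pods land on (a sketch using the pod and node names from the events above; the node-pool label is the standard GKE one) would be:

# Which node was the failing pod scheduled on?
kubectl get pod storefront-jtonline-prod-6dfbbd6bd8-jp5k5 -o wide

# Compare node pools: the failing pods appear to land on the auto-created "nap" nodes
kubectl get nodes -L cloud.google.com/gke-nodepool

# Full event history for one failing pod
kubectl describe pod storefront-jtonline-prod-6dfbbd6bd8-jp5k5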

Here is my configuration:

apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  name: "storefront-_STOREFRONT-_ENV"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: "storefront-_STOREFRONT-_ENV"
  template:
    metadata:
      labels:
        app: "storefront-_STOREFRONT-_ENV"
    spec:
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /?healthz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 1
        imagePullPolicy: Always
---
apiVersion: "autoscaling/v2beta1"
kind: "HorizontalPodAutoscaler"
metadata:
  name: "storefront-_STOREFRONT-hpa"
  namespace: "default"
  labels:
    app: "storefront-_STOREFRONT-_ENV"
spec:
  scaleTargetRef:
    kind: "Deployment"
    name: "storefront-_STOREFRONT-_ENV"
    apiVersion: "apps/v1beta1"
  minReplicas: 10
  maxReplicas: 1000
  metrics:
  - type: "Resource"
    resource:
      name: "cpu"
      targetAverageUtilization: 75

EDIT: I have been able to verify that this is in fact an auth issue. It only happens for "some" pods because it only occurs on pods scheduled to nodes that were created automatically as the cluster scaled out. I do not know how to fix this yet, though.
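
For anyone trying to reproduce this: the auto-created nodes belong to a separate, auto-provisioned node pool, so comparing that pool's service account and OAuth scopes with the original pool seems like a reasonable next step (a sketch; <cluster-name>, <zone> and <auto-created-pool> are placeholders, and pulling from gcr.io requires storage read access on the node's service account):

# List all node pools in the cluster, including any auto-provisioned ones
gcloud container node-pools list --cluster <cluster-name> --zone <zone>

# Inspect the service account and scopes of the pool the failing pods run on
gcloud container node-pools describe <auto-created-pool> \
  --cluster <cluster-name> --zone <zone> \
  --format="value(config.serviceAccount, config.oauthScopes)"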

-- Naved Khan
kubernetes

1 Answer

8/26/2019

As we can read in the Kubernetes docs regarding images, there is no need to do anything if you are running your cluster on GKE:

Note: If you are running on Google Kubernetes Engine, there will already be a .dockercfg on each node with credentials for Google Container Registry. You cannot use this approach.

But it's also stated that:

Note: This approach is suitable if you can control node configuration. It will not work reliably on GCE, and any other cloud provider that does automatic node replacement.

Also, in the section Specifying ImagePullSecrets on a Pod:

Note: This approach is currently the recommended approach for Google Kubernetes Engine, GCE, and any cloud-providers where node creation is automated.

It's recommended to create a Secret with a Docker config.

This can be done in the following way:

kubectl create secret docker-registry <name> \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL
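
For gcr.io specifically, a common variant (a sketch; the secret name gcr-pull-secret, the key file key.json and the email are placeholders, and the service account behind key.json needs read access to the registry's storage) uses a service account JSON key with the literal username _json_key:

kubectl create secret docker-registry gcr-pull-secret \
  --docker-server=https://gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  --docker-email=any@example.com

The Deployment's pod template then has to reference that secret, roughly like this, otherwise the pull still relies only on the node's own credentials:

    spec:
      imagePullSecrets:
      - name: gcr-pull-secret   # must match the secret created above
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
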
-- Crou
Source: StackOverflow