Google GKE workloads suddenly show offline with error: ErrImagePull

9/18/2019

I have a GKE cluster which has been running fine up until recently. Now I see a whole bunch of Kubernetes workloads showing as offline with the following error message:

  Type     Reason          Age                    From                                                          Message
  ----     ------          ----                   ----                                                          -------
  Normal   Scheduled       6m23s                  default-scheduler
  Warning  Failed          5m39s (x3 over 6m22s)  kubelet, gke-platsol-bots-staging-default-pool-f489f2f3-rjrq  Error: ErrImagePull
  Normal   BackOff         5m2s (x7 over 6m21s)   kubelet, gke-platsol-bots-staging-default-pool-f489f2f3-rjrq  Back-off pulling image "us.gcr.io/project/poc-app-bot@sha256:b99b5fb1b77407ade49d9bf42a94919e90422fee26c1a46ec6247370bd96c4d8"
  Normal   Pulling         4m49s (x4 over 6m22s)  kubelet, gke-platsol-bots-staging-default-pool-f489f2f3-rjrq  pulling image "us.gcr.io/project/poc-app-bot@sha256:b99b5fb1b77407ade49d9bf42a94919e90422fee26c1a46ec6247370bd96c4d8"
  Warning  Failed          81s (x22 over 6m21s)   kubelet, gke-platsol-bots-staging-default-pool-f489f2f3-rjrq  Error: ImagePullBackOff
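Since the pod references the image by digest rather than by tag, one thing worth checking is whether that digest still exists in GCR. A quick sketch of how to verify this (assuming the `gcloud` CLI is authenticated against the project; the repository and digest below are the ones from the events):

```shell
# List the digests currently stored for the repository. If the digest from
# the pod events is no longer listed, the kubelet cannot pull it.
gcloud container images list-tags us.gcr.io/project/poc-app-bot \
    --format='get(digest)'

# Or inspect the specific digest directly; this command fails if the
# image has been deleted from the registry.
gcloud container images describe \
    us.gcr.io/project/poc-app-bot@sha256:b99b5fb1b77407ade49d9bf42a94919e90422fee26c1a46ec6247370bd96c4d8
```

If the digest is gone, redeploying with a digest (or tag) that still exists in the registry is the fix, regardless of any credential configuration.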

Not sure what could have changed to cause this issue.

This is the output of `kubectl describe pod`:

Name:               project-5dddbd66b5-vpw8q
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-platsol-bots-staging-default-pool-f489f2f3-rjrq/10.x.x.x
Start Time:         Wed, 18 Sep 2019 16:48:23 +0100
Labels:             app=bot
                    pod-template-hash=5dddbd66b5
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container project
Status:             Pending
IP:                 10.20.1.9
Controlled By:      ReplicaSet/bot-5dddbd66b5
Containers:
  project:
    Container ID:
    Image:          us.gcr.io/project/project@sha256:b99b5fb1b77407ade49d9bf42a94919e90422fee26c1a46ec6247370bd96c4d8
    Image ID:
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-99cns:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-99cns
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason  Age                     From                                                          Message
  ----     ------  ----                    ----                                                          -------
  Warning  Failed  4m38s (x793 over 3h4m)  kubelet, gke-platsol-bots-staging-default-pool-f489f2f3-rjrq  Error: ImagePullBackOff

Below is what I have in my YAML definition for the deployment. I have not defined an image pull secret, as one was not required to pull the image from Google Container Registry.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "3"
        kubectl.kubernetes.io/last-applied-configuration: |
<redacted annotations>
      creationTimestamp: 2019-06-06T08:37:01Z
      generation: 3
      labels:
        app: project
      name: bot
      namespace: default
      resourceVersion: "68945490"
      selfLink: /apis/apps/v1/namespaces/default/deployments/bot
      uid: 412ce711-8836-11e9-905f-42010a8e016c
        image: us.gcr.io/project/app-bot@sha256:b99b5fb1b77407ade49d9bf42a94919e90422fee26c1a46ec6247370bd96c4d8
        imagePullPolicy: IfNotPresent

Okay, so I followed this guide to patch the service account with a secret for pulling images from GCR: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
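For reference, the approach from that guide boils down to two commands: creating a `docker-registry` secret and attaching it to the service account the pods run as. A sketch, assuming a service-account key file at the placeholder path `key.json` (the secret name `gcr-pull-secret` is also a placeholder; on GKE the node service account's scopes normally cover GCR, so this is only needed when those defaults don't apply):

```shell
# Create a docker-registry secret from a GCP service-account key file.
# For GCR, the username is the literal string "_json_key" and the
# password is the key file's JSON contents.
kubectl create secret docker-registry gcr-pull-secret \
    --docker-server=us.gcr.io \
    --docker-username=_json_key \
    --docker-password="$(cat key.json)" \
    --docker-email=any@example.com

# Attach the secret to the default service account so pods in this
# namespace pick it up without per-deployment imagePullSecrets.
kubectl patch serviceaccount default \
    -p '{"imagePullSecrets": [{"name": "gcr-pull-secret"}]}'
```

Note that pods only read the service account's `imagePullSecrets` at creation time, so existing pods need to be recreated (e.g. by deleting them or restarting the deployment) for the patch to take effect.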

I SSHed onto a single node and can pull the image for one application successfully:

vinay@cloudshell:~ (project-id)$ docker pull us.gcr.io/project-id/project2-bot@sha256:9817462c743a93bb9206e4b86855322f731a768dca18e26b8bfc39b0cc886d31
sha256:9817462c743a93bb9206e4b86855322f731a768dca18e26b8bfc39b0cc886d31: Pulling from project-id/project2-bot
092586df9206: Pull complete
ef599477fae0: Pull complete
4530c6472b5d: Pull complete
d34d61487075: Pull complete
272f46008219: Pull complete
12ff6ccfe7a6: Pull complete
f26b99e1adb1: Pull complete
bb50901cd579: Pull complete
64a286652062: Pull complete
283785ced197: Pull complete
ed5a2062edd6: Pull complete
Digest: sha256:9817462c743a93bb9206e4b86855322f731a768dca18e26b8bfc39b0cc886d31
Status: Downloaded newer image for us.gcr.io/project-id/project2-bot@sha256:9817462c743a93bb9206e4b86855322f731a768dca18e26b8bfc39b0cc886d31
us.gcr.io/project-id/project2-bot@sha256:9817462c743a93bb9206e4b86855322f731a768dca18e26b8bfc39b0cc886d31

But this application throws an error:

vinay@cloudshell:~ (project-id)$ docker pull us.gcr.io/project-id/project1-plug@sha256:c53ac1c536a1187ce940f9221730cc0eae3103f4313033659e2162a70bc66c59
sha256:c53ac1c536a1187ce940f9221730cc0eae3103f4313033659e2162a70bc66c59: Pulling from project-id/project1-plug
a4d8138d0f6b: Pulling fs layer
dbdc36973392: Pulling fs layer
f59d6d019dd5: Pulling fs layer
aaef3e026258: Waiting
5e86b04a4500: Waiting
1a6643a2873a: Waiting
2ad1e30fc17c: Waiting
ddb5baaf3393: Waiting
0a7edc889b3c: Waiting
31a1f16c256b: Waiting
172a500f7b4d: Waiting
error pulling image configuration: unknown blob
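The `unknown blob` error suggests the manifest for this digest still resolves, but one of the layer blobs it references is missing from the registry, which typically means the image was partially deleted or the storage behind GCR was modified. One way to investigate (assuming `gcloud`/`gsutil` access to the project; the bucket name follows GCR's documented naming scheme for `us.gcr.io`):

```shell
# Confirm whether the registry still considers this digest valid.
gcloud container images describe \
    us.gcr.io/project-id/project1-plug@sha256:c53ac1c536a1187ce940f9221730cc0eae3103f4313033659e2162a70bc66c59

# GCR stores image data in a GCS bucket named
# us.artifacts.<project-id>.appspot.com for us.gcr.io repositories.
# Lifecycle rules or manual deletions on that bucket can remove layer
# blobs out from under existing manifests, producing exactly this error.
gsutil ls gs://us.artifacts.project-id.appspot.com/containers/images | head
```

If blobs were deleted from the bucket, the only recovery is to rebuild and push the image again; the existing manifest cannot be repaired.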
-- Vin
google-kubernetes-engine

1 Answer

9/18/2019

ErrImagePull is quite possibly the most common image pull error, and it is fortunately straightforward to debug and diagnose. You'll see ErrImagePull as the status message when this occurs, indicating that Kubernetes was not able to retrieve the image you specified in the manifest (perhaps the image was deleted from the registry).

You can immediately get more detailed information about why this error occurred using the `kubectl describe pod <pod-name>` command. It's not entirely an error condition: Kubernetes is technically in a waiting state, hoping that the image will become available.
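Concretely, with the pod from the question (substitute your own pod name and namespace), the events section of `describe` shows the pull failures, and the same events can be queried directly:

```shell
# Full pod details, including the Events section with the pull errors.
kubectl describe pod project-5dddbd66b5-vpw8q -n default

# Or query just the events for that pod via a field selector.
kubectl get events -n default \
    --field-selector involvedObject.name=project-5dddbd66b5-vpw8q
```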

-- Ernesto U
Source: StackOverflow