I understand that ImagePullBackOff or ErrImagePull happens when Kubernetes cannot pull a container image, but I don't think that is the case here. I say this because the error is thrown seemingly at random by only some of the pods as my service scales, while others come up perfectly fine with an OK status.
For instance, consider one such replica set. I retrieved the events from one of its failed pods.
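For reference, the events below came from describing the failing pod, along the lines of (the pod name is the one visible in the Scheduled event):

kubectl describe pod storefront-jtonline-prod-6dfbbd6bd8-jp5k5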
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m45s default-scheduler Successfully assigned default/storefront-jtonline-prod-6dfbbd6bd8-jp5k5 to gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl
Normal Pulling 2m8s (x4 over 3m44s) kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
Warning Failed 2m7s (x4 over 3m43s) kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl Failed to pull image "gcr.io/square1-2019/storefront-jtonline-prod:latest": rpc error: code = Unknown desc = Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
Warning Failed 2m7s (x4 over 3m43s) kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl Error: ErrImagePull
Normal BackOff 113s (x6 over 3m42s) kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl Back-off pulling image "gcr.io/square1-2019/storefront-jtonline-prod:latest"
Warning Failed 99s (x7 over 3m42s) kubelet, gke-square1-prod-clu-nap-n1-highcpu-2-82b95c00-p5gl Error: ImagePullBackOff
The events tell me the pull failed because of invalid credentials, which seems... confusing? This pod was created automatically during autoscaling, exactly like the others.
I have a feeling this might have to do with resourcing. I have seen a much higher rate of these errors when the cluster spins up new nodes very quickly due to a spike in traffic, or when I set lower resource requests in my deployment configuration.
How do I go about debugging this error, and what could be a possible reason this is happening?
Here is my configuration:
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
name: "storefront-_STOREFRONT-_ENV"
namespace: "default"
labels:
app: "storefront-_STOREFRONT-_ENV"
spec:
replicas: 10
selector:
matchLabels:
app: "storefront-_STOREFRONT-_ENV"
template:
metadata:
labels:
app: "storefront-_STOREFRONT-_ENV"
spec:
containers:
- name: "storefront-_STOREFRONT-_ENV"
image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /?healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 1
imagePullPolicy: Always
apiVersion: "autoscaling/v2beta1"
kind: "HorizontalPodAutoscaler"
metadata:
name: "storefront-_STOREFRONT-hpa"
namespace: "default"
labels:
app: "storefront-_STOREFRONT-_ENV"
spec:
scaleTargetRef:
kind: "Deployment"
name: "storefront-_STOREFRONT-_ENV"
apiVersion: "apps/v1beta1"
minReplicas: 10
maxReplicas: 1000
metrics:
- type: "Resource"
resource:
name: "cpu"
targetAverageUtilization: 75
EDIT: I have been able to verify that this is in fact an auth issue. It only happens for "some" pods because it only occurs for pods scheduled on nodes that were created automatically by the cluster autoscaler. I do not know how to fix this yet, though.
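One way to check whether this is a node-scope problem is to compare the OAuth scopes of the auto-provisioned node pool against a manually created one; pulling from gcr.io requires at least the https://www.googleapis.com/auth/devstorage.read_only scope. A sketch (the pool name is a guess based on the node name in the events above, and the cluster/zone are placeholders):

gcloud container node-pools describe nap-n1-highcpu-2 \
  --cluster <cluster-name> --zone <zone> \
  --format="value(config.oauthScopes)"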
As we can read in the Kubernetes docs regarding images, there is no need to do anything if you are running your cluster on GKE:
Note: If you are running on Google Kubernetes Engine, there will already be a .dockercfg on each node with credentials for Google Container Registry. You cannot use this approach.
But it's also stated that:
Note: This approach is suitable if you can control node configuration. It will not work reliably on GCE, and any other cloud provider that does automatic node replacement.
Also, in the section Specifying imagePullSecrets on a Pod:
Note: This approach is currently the recommended approach for Google Kubernetes Engine, GCE, and any cloud-providers where node creation is automated.
Since nodes created automatically by the autoscaler will not reliably have those credentials, it's recommended to create a Secret with a Docker config. This can be done in the following way:
kubectl create secret docker-registry <name> \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL
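For GCR in particular, a common pattern (sketched here with assumed names: a secret called gcr-pull-secret and a service account key file sa-key.json) is to use a service account JSON key as the password, with the literal username _json_key:

kubectl create secret docker-registry gcr-pull-secret \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat sa-key.json)" \
  --docker-email=anything@example.com

The secret is then referenced from the pod template, so every pod pulls with those credentials regardless of which node it lands on:

spec:
  template:
    spec:
      imagePullSecrets:
      - name: gcr-pull-secret
      containers:
      - name: "storefront-_STOREFRONT-_ENV"
        image: "gcr.io/square1-2019/storefront-_STOREFRONT-_ENV"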