Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0" while installing Velero in GKE Cluster

8/14/2020

I'm trying to install and configure Velero for Kubernetes backups. I followed the linked guide to configure it in my GKE cluster. The installation went fine, but Velero is not working.

I am running all my commands from Google Cloud Shell (I have installed and configured the Velero client there).

On further inspection of the Velero deployment and pods, I found that the cluster is not able to pull the image from the Docker repository.

kubectl get pods -n velero
NAME                      READY   STATUS              RESTARTS   AGE
velero-5489b955f6-kqb7z   0/1     Init:ErrImagePull   0          20s

Error from velero pod (kubectl describe pod) (output redacted for readability - only relevant info shown below)

Events:
  Type     Reason     Age               From                                                  Message
  ----     ------     ----              ----                                                  -------
  Normal   Scheduled  38s               default-scheduler                                     Successfully assigned velero/velero-5489b955f6-kqb7z to gke-gke-cluster1-default-pool-a354fba3-8674
  Warning  Failed     22s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Failed     22s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Error: ErrImagePull
  Normal   BackOff    21s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Back-off pulling image "velero/velero-plugin-for-gcp:v1.1.0"
  Warning  Failed     21s               kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Error: ImagePullBackOff
  Normal   Pulling    8s (x2 over 37s)  kubelet, gke-gke-cluster1-default-pool-a354fba3-8674  Pulling image "velero/velero-plugin-for-gcp:v1.1.0"

Command used to install velero: (some of the values are given as variables)

velero install \
     --provider gcp \
     --plugins velero/velero-plugin-for-gcp:v1.1.0 \
     --bucket $storagebucket \
     --secret-file ~/velero-backup-storage-sa-key.json

Velero Version

velero version
Client:
        Version: v1.4.2
        Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
<error getting server version: timed out waiting for server status request to be processed>

GKE version

v1.15.12-gke.2
-- srsn
backup
google-cloud-platform
google-kubernetes-engine
kubernetes
velero

1 Answer

8/14/2020

Isn't this a private cluster? – mario 31 mins ago

@mario this is a private cluster, but I can deploy other services without any issues (for example, I have deployed nginx successfully) – Sreesan 15 mins ago

Well, this is a known limitation of GKE private clusters. As you can read in the documentation:

Can't pull image from public Docker Hub

Symptoms

A Pod running in your cluster displays a warning in kubectl describe such as Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Potential causes

Nodes in a private cluster do not have outbound access to the public internet. They have limited access to Google APIs and services, including Container Registry.

Resolution

You cannot fetch images directly from Docker Hub. Instead, use images hosted on Container Registry. Note that while Container Registry's Docker Hub mirror is accessible from a private cluster, it should not be exclusively relied upon. The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub.

You can also compare it with this answer.
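The documented resolution (use images hosted on Container Registry) could look roughly like the sketch below. It is untested with Velero and rests on several assumptions: the helper names gcr_ref and mirror_image are made up, $PROJECT_ID stands for your own project ID, the server image is assumed to be velero/velero:v1.4.2 to match your client version, and the node service account is assumed to be allowed to read from your project's registry.

```shell
# Hypothetical helper: map a Docker Hub reference such as
# velero/velero-plugin-for-gcp:v1.1.0 to a path in your project's registry.
gcr_ref() {
  project="$1"   # your GCP project ID (placeholder)
  image="$2"     # Docker Hub reference, e.g. velero/velero:v1.4.2
  echo "gcr.io/${project}/${image}"
}

# Hypothetical helper: copy one image from Docker Hub into Container Registry.
# Run it where Docker Hub is reachable (Cloud Shell, for example),
# after `gcloud auth configure-docker`.
mirror_image() {
  target="$(gcr_ref "$1" "$2")"
  docker pull "$2" &&
    docker tag "$2" "$target" &&
    docker push "$target"
}

# Usage, assuming the project ID is in $PROJECT_ID:
#   mirror_image "$PROJECT_ID" velero/velero:v1.4.2
#   mirror_image "$PROJECT_ID" velero/velero-plugin-for-gcp:v1.1.0
#   velero install \
#        --provider gcp \
#        --image "$(gcr_ref "$PROJECT_ID" velero/velero:v1.4.2)" \
#        --plugins "$(gcr_ref "$PROJECT_ID" velero/velero-plugin-for-gcp:v1.1.0)" \
#        --bucket $storagebucket \
#        --secret-file ~/velero-backup-storage-sa-key.json
```

Both images are mirrored because the Velero server image also comes from Docker Hub by default, so pointing only the plugin at GCR would probably just move the pull error to the main container.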

You can easily verify this limitation yourself with a simple experiment. Try to run two different nginx deployments: the first based on the image nginx (equivalent to nginx:latest) and the second based on nginx:1.14.2.

The first scenario works because the nginx:latest image can be pulled from Container Registry's Docker Hub mirror, which is accessible from a private cluster. Any attempt to pull nginx:1.14.2, however, will fail, as you'll see in the Pod events. This happens because the kubelet cannot find this version of the image in GCR and tries to pull it from the public Docker registry (https://registry-1.docker.io/v2/), which is not possible in private clusters. As the docs put it: "The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub."
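To make the experiment concrete, here is a minimal sketch. The helper try_image and the deployment names (nginx-latest, nginx-pinned) are made up for illustration. Which tags the mirror has cached can change over time, so treat the outcome described above as likely rather than guaranteed.

```shell
# Hypothetical helper: create a throwaway deployment from the given image
# and list its pods so the pull status can be inspected.
try_image() {
  name="$1"    # deployment name (placeholder, pick anything)
  image="$2"   # image reference, e.g. nginx or nginx:1.14.2
  kubectl create deployment "$name" --image="$image" || return 1
  # kubectl create deployment labels the pods with app=<name>
  kubectl get pods -l "app=$name"
}

# Usage (run from Cloud Shell against the private cluster):
#   try_image nginx-latest nginx
#   try_image nginx-pinned nginx:1.14.2
#   kubectl describe pods -l app=nginx-pinned             # pull errors show under Events
#   kubectl delete deployment nginx-latest nginx-pinned   # clean up afterwards
```

Right after creation the pods will most likely still be in ContainerCreating, so check again after half a minute or so before comparing the two.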

If you still have doubts, just ssh into your node and try to run the following commands:

curl https://cloud.google.com/container-registry/

curl https://registry-1.docker.io/v2/

While the first one works perfectly, the second one will eventually fail:

curl: (7) Failed to connect to registry-1.docker.io port 443: Connection timed out

Reason? "Nodes in a private cluster do not have outbound access to the public internet."

Solution?

You can search what is currently available in GCR here.

In many cases you should be able to get the required image if you don't specify its exact version (by default, the latest tag is used). While this can help with nginx, unfortunately no version of velero/velero-plugin-for-gcp is currently available in Google Container Registry's Docker Hub mirror.

Granting private nodes outbound internet access by using Cloud NAT seems to be the only reasonable solution that can be applied in your case.
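A minimal sketch of the Cloud NAT setup, assuming you know the VPC network and region your cluster uses. The helper name setup_cloud_nat and the resource names nat-router and nat-config are placeholders, not anything GKE requires.

```shell
# Hypothetical helper: create a Cloud Router and a NAT gateway on it so that
# private nodes in the given VPC network and region get outbound access.
setup_cloud_nat() {
  network="$1"   # VPC network the cluster uses (placeholder)
  region="$2"    # region of the cluster's subnet (placeholder)
  gcloud compute routers create nat-router \
    --network "$network" --region "$region" || return 1
  gcloud compute routers nats create nat-config \
    --router nat-router --region "$region" \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
}

# Usage:
#   setup_cloud_nat my-vpc us-central1
#   kubectl get pods -n velero   # the kubelet retries the pull on its own
```

The --nat-all-subnet-ip-ranges flag is the broadest option. If you would rather limit outbound access to the cluster's subnet, the NAT configuration can be restricted to specific subnets instead.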

-- mario
Source: StackOverflow