My EKS cluster in us-east-1 stopped working: all nodes are NotReady because kubelet cannot pull the pause container. This is the kubelet command that runs at boot:
/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --network-plugin cni --node-ip=10.0.21.107 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --node-labels=kubernetes.io/lifecycle=spot
The problem is pulling this image:
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
Other required images are also unavailable, for example:
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.14.6
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/coredns:v1.3.1
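The failure can be reproduced outside of kubelet by pulling manually from one of the worker nodes. This is only a sketch; it assumes shell access to a node, a recent AWS CLI that supports get-login-password, and that the node's instance role has the usual ECR read permissions:

# Authenticate the local Docker daemon against the regional EKS registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 602401143452.dkr.ecr.us-east-1.amazonaws.com

# This pull fails from inside the VPC
docker pull 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1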
On the other hand, the same images can be pulled from ECR in other regions, just not from the region the cluster is in.
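For comparison, the equivalent image in another region pulls fine from the same node (sketch; us-west-2 is used here purely as an example, after logging in to that regional registry the same way):

aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 602401143452.dkr.ecr.us-west-2.amazonaws.com
docker pull 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause-amd64:3.1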
Kubernetes events also mention "cni plugin not initialized"; that is expected, since the aws-node pods do not start either.
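A quick way to confirm that picture (sketch; it assumes the aws-node DaemonSet still carries its default k8s-app=aws-node label):

# aws-node (the VPC CNI plugin) pods are stuck because their image lives in the same regional ECR
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# Recent events show the failed pause-image pulls and the CNI message
kubectl -n kube-system get events --sort-by=.lastTimestamp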
The VPC where the worker nodes live has a PrivateLink endpoint for ECR. That endpoint, together with the private DNS entries that come with it, makes the ECR domains in the same region resolve to private IPs. That's why docker pull was failing only for ECR registries in the same region.
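Both the DNS behaviour and the endpoint itself can be checked from a worker node (sketch; the describe call assumes credentials with ec2:DescribeVpcEndpoints):

# With private DNS enabled on the interface endpoint, this resolves to private VPC IPs,
# so pulls go to the endpoint ENIs instead of the public ECR endpoint
dig +short 602401143452.dkr.ecr.us-east-1.amazonaws.com

# Look up the ECR Docker endpoint and note the security group attached to it
aws ec2 describe-vpc-endpoints \
  --filters Name=service-name,Values=com.amazonaws.us-east-1.ecr.dkr \
  --query 'VpcEndpoints[].{Id:VpcEndpointId,SGs:Groups[].GroupId,PrivateDns:PrivateDnsEnabled}'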
The security group of the worker nodes needs to allow outbound HTTPS (TCP 443) traffic to the security group attached to the PrivateLink endpoint.
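A minimal sketch of that fix with the AWS CLI, where sg-nodes is the worker node security group and sg-vpce is the one on the ECR endpoint (both IDs are placeholders):

# Allow the nodes to reach the endpoint on 443
aws ec2 authorize-security-group-egress \
  --group-id sg-nodes \
  --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,UserIdGroupPairs=[{GroupId=sg-vpce}]'

# The endpoint security group also has to accept that traffic
aws ec2 authorize-security-group-ingress \
  --group-id sg-vpce \
  --protocol tcp --port 443 --source-group sg-nodes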