EKS worker nodes NotReady and ECR registries not reachable

10/10/2019

My EKS cluster in us-east-1 stopped working: all nodes are NotReady because kubelet cannot pull the pause container. This is the kubelet command that gets executed at boot:

/usr/bin/kubelet --cloud-provider aws --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime docker --network-plugin cni --node-ip=10.0.21.107 --pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1 --node-labels=kubernetes.io/lifecycle=spot

The problem is with pulling the image:

602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1

Other required container images are also unavailable, for example:

602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.14.6
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/coredns:v1.3.1

On the other hand, the same container images can be pulled from the ECR registries of other regions, just not from the region where the cluster runs.
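
One rough way to confirm this from a worker node (a sketch, assuming curl is available; the registry answers 401 without credentials, which is enough to prove connectivity) is to hit the Docker Registry v2 endpoint in both regions:

curl -sv https://602401143452.dkr.ecr.us-east-1.amazonaws.com/v2/ -o /dev/null   # same region: connection hangs or times out
curl -sv https://602401143452.dkr.ecr.us-west-2.amazonaws.com/v2/ -o /dev/null   # other region: TLS handshake completes, HTTP 401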

Kubernetes events mention "cni plugin not initialized". That is expected, since the aws-node pods do not start either.

-- Miguel Ferreira
amazon-eks
amazon-web-services
kubernetes

1 Answer

10/10/2019

The VPC where the worker nodes live has a PrivateLink (VPC interface) endpoint for ECR. That endpoint, together with the private DNS entries that come with it, makes the ECR domains of the same region resolve to a private IP inside the VPC. That is why docker pull was failing only for the ECR registries in the same region.
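
A quick way to see this from a worker node (a sketch, assuming standard DNS tooling is installed) is to resolve the registry hostname for the cluster's region and compare it with a region that has no endpoint:

nslookup 602401143452.dkr.ecr.us-east-1.amazonaws.com   # resolves to private IPs from the VPC CIDR (e.g. 10.0.x.x)
nslookup 602401143452.dkr.ecr.us-west-2.amazonaws.com   # resolves to public IPs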

The security groups of the worker nodes need to allow outbound HTTPS (port 443) traffic to the security group attached to the PrivateLink endpoint.
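
A minimal sketch of that egress rule with the AWS CLI, where sg-worker stands for the worker node security group and sg-vpce for the security group on the ECR endpoint (both IDs are placeholders):

aws ec2 authorize-security-group-egress \
    --group-id sg-worker \
    --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,UserIdGroupPairs=[{GroupId=sg-vpce}]'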

-- Miguel Ferreira
Source: StackOverflow