AWS EKS NodeGroup "Create failed": Instances failed to join the kubernetes cluster

10/24/2020

I am able to create an EKS cluster, but when I try to add a node group, I receive a "Create failed" error with the details: "NodeCreationFailure": Instances failed to join the kubernetes cluster

I tried a variety of instance types and larger volume sizes (60 GB) without luck. Looking at the EC2 instances, I only see the problem below. However, it is difficult to do anything about it since I'm not launching the EC2 instances directly (the EKS node group UI wizard is doing that).

How would one move forward, given that the failure happens even before I can jump into the EC2 machines and "fix" them?

Amazon Linux 2

Kernel 4.14.198-152.320.amzn2.x86_64 on an x86_64

ip-187-187-187-175 login: [   54.474668] cloud-init[3182]: One of the configured repositories failed (Unknown),
[   54.475887] cloud-init[3182]: and yum doesn't have enough cached data to continue. At this point the only
[   54.478096] cloud-init[3182]: safe thing yum can do is fail. There are a few ways to work "fix" this:
[   54.480183] cloud-init[3182]:  1. Contact the upstream for the repository and get them to fix the problem.
[   54.483514] cloud-init[3182]:  2. Reconfigure the baseurl/etc. for the repository, to point to a working
[   54.485198] cloud-init[3182]:     upstream. This is most often useful if you are using a newer
[   54.486906] cloud-init[3182]:     distribution release than is supported by the repository (and the
[   54.488316] cloud-init[3182]:     packages for the previous distribution release still work).
[   54.489660] cloud-init[3182]:  3. Run the command with the repository temporarily disabled
[   54.491045] cloud-init[3182]:         yum --disablerepo=<repoid> ...
[   54.491285] cloud-init[3182]:  4. Disable the repository permanently, so yum won't use it by default. Yum
[   54.493407] cloud-init[3182]:     will then just ignore the repository until you permanently enable it
[   54.495740] cloud-init[3182]:     again or use --enablerepo for temporary usage:
[   54.495996] cloud-init[3182]:         yum-config-manager --disable <repoid>

-- CoderOfTheNight
amazon-ec2
amazon-eks
amazon-web-services
kubernetes

7 Answers

10/15/2021

Try adding a tag to your private subnets where the worker nodes are deployed.

kubernetes.io/cluster/<cluster_name> = shared
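If you would rather script it than click through the console, a rough AWS CLI equivalent looks like this (the cluster name and subnet IDs are placeholders for your own):

# Tag each private subnet hosting worker nodes so the cluster can discover it
CLUSTER_NAME=my-cluster
for SUBNET in subnet-0aaa1111bbb22222c subnet-0ddd3333eee44444f; do
  aws ec2 create-tags \
    --resources "$SUBNET" \
    --tags "Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=shared"
done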

-- 2stacks
Source: StackOverflow

4/18/2022

We need to check what type of NAT gateway is configured. It should be a public one, but in my case I had configured it as private.

Once I changed it from private to public, the issue was resolved.
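A quick way to sanity-check this from the AWS CLI is to list the NAT gateways with the subnet each one lives in, then confirm that subnet's route table has a 0.0.0.0/0 route pointing at an internet gateway (the subnet ID below is a placeholder):

# Show each NAT gateway and the subnet it was created in
aws ec2 describe-nat-gateways \
  --query 'NatGateways[].{Id:NatGatewayId,Subnet:SubnetId,State:State}' --output table

# A public subnet's route table should have 0.0.0.0/0 -> igw-..., not nat-...
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0aaa1111bbb22222c \
  --query 'RouteTables[].Routes[]' --output table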

-- iloveindia
Source: StackOverflow

5/10/2021

I noticed there was no answer here, but about 2k visits to this question over the last six months. There seem to be a number of reasons why you could be seeing these failures. To regurgitate the AWS documentation found here: https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html

  • The aws-auth-cm.yaml file does not have the correct IAM role ARN for your nodes. Ensure that the node IAM role ARN (not the instance profile ARN) is specified in your aws-auth-cm.yaml file. For more information, see Launching self-managed Amazon Linux nodes. (A quick way to check and fix this mapping is sketched below.)

  • The ClusterName in your node AWS CloudFormation template does not exactly match the name of the cluster you want your nodes to join. Passing an incorrect value to this field results in an incorrect configuration of the node's /var/lib/kubelet/kubeconfig file, and the nodes will not join the cluster.

  • The node is not tagged as being owned by the cluster. Your nodes must have the following tag applied to them, where <cluster-name> is replaced with the name of your cluster.

    Key:   kubernetes.io/cluster/<cluster-name>
    Value: owned
  • The nodes may not be able to access the cluster using a public IP address. Ensure that nodes deployed in public subnets are assigned a public IP address. If not, you can associate an Elastic IP address to a node after it's launched. For more information, see Associating an Elastic IP address with a running instance or network interface. If the public subnet is not set to automatically assign public IP addresses to instances deployed to it, then we recommend enabling that setting. For more information, see Modifying the public IPv4 addressing attribute for your subnet. If the node is deployed to a private subnet, then the subnet must have a route to a NAT gateway that has a public IP address assigned to it.

  • The STS endpoint for the Region that you're deploying the nodes to is not enabled for your account. To enable the region, see Activating and deactivating AWS STS in an AWS Region.

  • The worker node does not have a private DNS entry, resulting in the kubelet log containing a node "" not found error. Ensure that the VPC where the worker node is created has values set for domain-name and domain-name-servers as Options in a DHCP options set. The default values are domain-name:<region>.compute.internal and domain-name-servers:AmazonProvidedDNS. For more information, see DHCP options sets in the Amazon VPC User Guide.
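On that last point, a rough way to check the VPC's DHCP options from the AWS CLI (the VPC ID is a placeholder); you should see domain-name set to <region>.compute.internal and domain-name-servers set to AmazonProvidedDNS:

# Look up the DHCP options set attached to the VPC, then inspect its values
VPC_ID=vpc-0123456789abcdef0
DHCP_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC_ID" \
  --query 'Vpcs[0].DhcpOptionsId' --output text)
aws ec2 describe-dhcp-options --dhcp-options-ids "$DHCP_ID" \
  --query 'DhcpOptions[0].DhcpConfigurations'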

I myself had an issue with the tagging where I needed an uppercase letter. In reality, if you can use another avenue to deploy your EKS cluster I would recommend it (eksctl, aws cli, terraform even).
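Going back to the first bullet, a quick way to inspect the aws-auth mapping and, if the node role is missing, add it without hand-editing the ConfigMap (the cluster name, region, and role ARN below are placeholders):

# See which IAM roles are currently mapped into the cluster
kubectl -n kube-system get configmap aws-auth -o yaml

# Map the node instance role using eksctl
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --region us-east-1 \
  --arn arn:aws:iam::111122223333:role/my-node-instance-role \
  --username 'system:node:{{EC2PrivateDNSName}}' \
  --group system:bootstrappers \
  --group system:nodes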

-- Gregory Martin
Source: StackOverflow

8/22/2021

Adding another reason to the list:

In my case, the nodes were running in private subnets and I hadn't configured a private endpoint under "API server endpoint access".

After updating that setting, the node groups weren't updated automatically, so I had to recreate them.
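For anyone hitting the same thing: the endpoint access setting can be changed on an existing cluster from the AWS CLI (the cluster name and region below are placeholders); as noted above, existing node groups may still need to be recreated afterwards.

# Enable private access to the API server endpoint (public access left enabled here)
aws eks update-cluster-config \
  --region eu-north-1 \
  --name my-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=true

# Once the update finishes, confirm the new settings
aws eks describe-cluster --region eu-north-1 --name my-cluster \
  --query 'cluster.resourcesVpcConfig.{private:endpointPrivateAccess,public:endpointPublicAccess}'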

-- RtmY
Source: StackOverflow

11/3/2021

In my case, the problem was that I was deploying my node group in a private subnet, but this private subnet had no NAT gateway associated, hence no internet access. What I did was:

  1. Create a NAT gateway

  2. Create a new route table with the following routes (the second one is the internet-access route, through the NAT):

  • Destination: VPC-CIDR-block Target: local
  • Destination: 0.0.0.0/0 Target: NAT-gateway-id
  3. Associate the private subnet with the route table created in the second step.

After that, the node groups joined the cluster without a problem.
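For reference, the same three steps with the AWS CLI look roughly like this (all IDs are placeholders):

# 1. Allocate an Elastic IP and create the NAT gateway in a *public* subnet
EIP_ALLOC=$(aws ec2 allocate-address --domain vpc --query AllocationId --output text)
NAT_ID=$(aws ec2 create-nat-gateway --subnet-id subnet-0aaa1111bbb22222c \
  --allocation-id "$EIP_ALLOC" --query 'NatGateway.NatGatewayId' --output text)
aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_ID"

# 2. New route table; the local VPC route is created automatically,
#    so only the default route through the NAT needs to be added
RTB_ID=$(aws ec2 create-route-table --vpc-id vpc-0123456789abcdef0 \
  --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id "$RTB_ID" \
  --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$NAT_ID"

# 3. Associate the private subnet (where the node group lives) with that route table
aws ec2 associate-route-table --route-table-id "$RTB_ID" --subnet-id subnet-0ddd3333eee44444f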

-- manavellam
Source: StackOverflow

2/14/2022

I will try to keep this answer short by highlighting a few things that commonly go wrong up front.

1. Add the IAM role that is attached to the EKS worker nodes to the aws-auth ConfigMap in the kube-system namespace (see the references below).

2. Log in to the worker node that was created but failed to join the cluster. Try connecting to the API server from inside using nc, e.g.: nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443

3. If you are not using the default EKS node setup from the drop-down in the AWS Console (which means you are using your own launch template or launch configuration in EC2), don't forget to add the user data section to the launch template (see the references below):

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}

4. Check the IAM role attached to the EKS worker nodes and confirm it has the appropriate policies. AmazonEKS_CNI_Policy is a must (typically alongside AmazonEKSWorkerNodePolicy and AmazonEC2ContainerRegistryReadOnly).

5. Your nodes must have the following tag applied to them, where <cluster-name> is replaced with the name of your cluster: kubernetes.io/cluster/<cluster-name> = owned. (One way to apply this through the node group's Auto Scaling group is sketched below.)
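One way to apply that tag so it also propagates to newly launched instances, assuming the node group is backed by an Auto Scaling group (the ASG and cluster names are placeholders):

# Tag the Auto Scaling group; PropagateAtLaunch copies the tag onto every new instance
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-eks-node-asg,ResourceType=auto-scaling-group,Key=kubernetes.io/cluster/my-cluster,Value=owned,PropagateAtLaunch=true"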

I hope your problem lies within this list.

Refs:
https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-eks-node-failures/

-- Kishor U
Source: StackOverflow

9/8/2021

Firstly, I had the NAT gateway in my private subnet. Then I moved the NAT gateway to a public subnet, which worked fine.

Terraform code is as follows:

resource "aws_internet_gateway" "gw" {
  vpc_id = aws_vpc.dev-vpc.id
  tags = {
    Name = "dev-IG"
  }
}

resource "aws_eip" "lb" {
  depends_on    = [aws_internet_gateway.gw]
  vpc           = true
}

resource "aws_nat_gateway" "natgw" {
  allocation_id = aws_eip.lb.id
  subnet_id     = aws_subnet.dev-public-subnet.id
  depends_on = [aws_internet_gateway.gw]
  tags = {
    Name = "gw NAT"
  }
}
-- Shashwot Risal
Source: StackOverflow