I have an EKS setup (v1.16) with two ASGs: one for compute ("c5.9xlarge") and the other for GPU ("p3.2xlarge"). Both are configured as Spot and set with desiredCapacity 0.
The Kubernetes Cluster Autoscaler works as expected and scales out each ASG when necessary. The issue is that the newly created GPU instance is not recognized by the master, and running kubectl get nodes does not show it.
I can see that the EC2 instance is in the Running state, and I can also SSH into the machine.
I double-checked the labels and tags and compared them to the "compute" nodegroup. Both are configured almost identically; the only difference is that the GPU nodegroup has a few additional tags.
Since I'm using the eksctl tool (v0.35.0) and the GPU nodeGroup is basically a copy-paste of the compute nodeGroup, I can't figure out what the problem could be.
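For reference, the shape of the config is roughly the following (cluster name, region, and the min/max sizes are illustrative, not my exact file):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # illustrative name
  region: us-east-1       # illustrative region
  version: "1.16"
nodeGroups:
  - name: compute
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    instancesDistribution:                    # 100% Spot
      instanceTypes: ["c5.9xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
  - name: gpu
    desiredCapacity: 0
    minSize: 0
    maxSize: 4
    instancesDistribution:                    # 100% Spot
      instanceTypes: ["p3.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0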
UPDATE: After SSHing into the instance, I can see the following error in /var/log/messages:
failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
As a result, the kubelet service crashes.
Could it be that my GPU nodegroup is using the wrong AMI (amazon-eks-gpu-node-1.18-v20201211)?
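For anyone hitting the same symptom, this is roughly how I confirmed the mismatch on the node over SSH (plain Docker/systemd commands, nothing EKS-specific; the kubelet.yaml path is the one eksctl writes):

# Show the cgroup driver Docker is actually using
docker info | grep -i "cgroup driver"

# Show the cgroup driver configured for the kubelet by eksctl
grep -i cgroupDriver /etc/eksctl/kubelet.yaml

# Inspect why the kubelet service keeps crashing
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 50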
There is a known issue with EKS 1.16 where even Graviton-based instances won't join the cluster. To fix it, first try upgrading your CNI version. Please refer to the documentation here:
https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
If that doesn't work, upgrade your EKS cluster to the latest available version and it should work.
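As a starting point, you can check which VPC CNI version is currently installed (this check is taken from the linked AWS documentation; the upgrade manifest itself is version-specific, so take the exact URL from that page):

# Print the image tag of the aws-node (VPC CNI) DaemonSet
kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
# then apply the CNI manifest for the target version, using the URL from the AWS docs above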
I've found the issue. It appears to be a misalignment between eksctl (v0.35.0) and the AL2 GPU AMI.
The AWS team changed the Docker cgroup driver to "systemd" instead of "cgroupfs" (see the change on GitHub), while the eksctl version I used had not absorbed that change.
A temporary workaround is to edit the /etc/eksctl/kubelet.yaml file using preBootstrapCommands.
As a simple example, you can add the following preBootstrapCommands to the nodegroup in the eksctl YAML config file:
- name: test-node-group
  preBootstrapCommands:
    - "sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' /etc/eksctl/kubelet.yaml"