I have an EKS cluster with a node group running 6 nodes. For some reason, nodes get marked as unschedulable
at random, roughly once every week or two, and they stay that way. When I notice, I uncordon them manually and everything works fine.
Why does this happen, and how can I debug it, prevent it, or configure the cluster to fix it automatically?
In my case the problem was the AWS Node Termination Handler daemonset that was running in the cluster.
It was outdated and not really used, and after I removed it the nodes stopped getting marked Unschedulable.
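
If you want to catch this earlier than "when I notice", here is a rough sketch of a check that lists cordoned nodes. It relies on the fact that a cordoned node has `spec.unschedulable: true` in its Node object. The function names (`cordoned_nodes`, `fetch_node_list`) are just ones I made up for illustration, and it assumes `kubectl` is installed and already pointed at the cluster.

```python
import json
import subprocess


def cordoned_nodes(node_list):
    """Return the names of nodes whose spec.unschedulable flag is true."""
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item.get("spec", {}).get("unschedulable", False)
    ]


def fetch_node_list():
    # Assumption: kubectl is on PATH and the current context is the EKS cluster.
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Print one cordoned node name per line; prints nothing if all are schedulable.
    for name in cordoned_nodes(fetch_node_list()):
        print(name)
```

This only tells you which nodes are cordoned, not who cordoned them. To find the culprit, something like `kubectl get daemonsets --all-namespaces` is a reasonable place to look for a controller that is allowed to cordon nodes, which is how I would have found the termination handler sooner.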