On our EKS cluster running Kubernetes 1.20, I hit an incident in which the Scheduler tried to place a pod on a node that the Control Plane had just removed.
Events log:
Node ip-10-65-94-144.eu-west-1.compute.internal event: Removing Node ip-10-65-94-144.eu-west-1.compute.internal from Controller
AttachVolume.Attach failed for volume "pvc-d80d24be-653f-45f0-90ca-321bda1ef0ab" : error finding instance ip-AA-BB-CC-DD.eu-west-1.compute.internal: "instance not found"
Control Plane log:
controller_utils.go:185] Recording status change NodeNotReady event message for node ip-AA-BB-CC-DD.eu-west-1.compute.internal
This suggests that the Scheduler and the rest of the Control Plane were out of sync about the node's state: the pod was still assigned to the node after it had been removed.
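To make the sequence concrete, here is a minimal sketch of how the two event messages above could be correlated in an exported event log. It is only an illustration: the function name `find_attach_after_removal` is hypothetical, the regular expressions are modeled on the wording quoted above (which may differ between Kubernetes versions), and it assumes the lines are already in chronological order.

```python
import re

# Patterns modeled on the two event messages quoted above.
# Assumption: the message wording is stable enough to match this way.
REMOVED_RE = re.compile(r"Removing Node (\S+) from Controller")
ATTACH_FAILED_RE = re.compile(
    r'AttachVolume\.Attach failed for volume "([^"]+)" : '
    r"error finding instance (\S+?): "
)


def find_attach_after_removal(event_lines):
    """Return (volume, node) pairs whose attach failed after the node was removed.

    Assumes event_lines is an iterable of event messages in chronological order.
    """
    removed = set()  # nodes already reported as removed
    hits = []
    for line in event_lines:
        match = REMOVED_RE.search(line)
        if match:
            removed.add(match.group(1))
            continue
        match = ATTACH_FAILED_RE.search(line)
        if match and match.group(2) in removed:
            hits.append((match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the cluster.
    sample = [
        "Node node-a event: Removing Node node-a from Controller",
        'AttachVolume.Attach failed for volume "pvc-1" : '
        'error finding instance node-a: "instance not found"',
    ]
    print(find_attach_after_removal(sample))
```

Any pair it returns is a volume attach that was attempted against a node already reported as removed, which is the ordering I saw in the incident.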
Has anyone seen this before?
Any suggestions on how to prevent it, other than upgrading to a more recent K8s version?