K8s: 1.18.18
A while back we ran into a situation where, if a node dies, K8s takes ~15 minutes to reschedule the pods that were running on that node onto a new node.
In an attempt to address this, our research pointed us to both taint-based evictions and extensions to the K8s API to increase 'node awareness'. Unfortunately, neither has been reliable.
Has anyone who's run into this been able to overcome it successfully?
TIA!
Unfortunately, there are no built-in solutions other than the ones you mentioned.
You can shorten the reschedule period with taint-based evictions by setting tolerationSeconds on the relevant tolerations in your pod spec:
spec:
  tolerations:
  # Evict this pod 2 seconds after the node is marked unreachable
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 2
  # Evict this pod 2 seconds after the node is marked not-ready
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 2
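These tolerations go in the pod template spec of whatever controller manages the pods. A minimal sketch of a Deployment for illustration (the name and image below are placeholders, not from your setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name for illustration
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      tolerations:
      # Without explicit tolerations, the DefaultTolerationSeconds admission
      # plugin adds these same keys with tolerationSeconds: 300 (~5 minutes).
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      containers:
      - name: my-app
        image: nginx:1.19     # placeholder image

Keep in mind that 2 seconds is very aggressive; a brief network blip will bounce the pods, so pick a value your workload can tolerate.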
Additionally, you can set the --pod-eviction-timeout flag on the kube-controller-manager to a shorter value (the default is 5 minutes).
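If the cluster was set up with kubeadm, that flag lives in the kube-controller-manager static pod manifest. A sketch of the relevant part, assuming the kubeadm default path and an illustrative image tag:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm default path, assumed)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.18.18   # illustrative tag
    command:
    - kube-controller-manager
    - --pod-eviction-timeout=30s   # shorter than the 5m default
    # ...keep the rest of your existing flags unchanged

The kubelet watches the static pod manifest directory, so saving the file restarts the controller manager with the new flag.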