Kubernetes Dead Node Awareness

7/21/2021

K8s: 1.18.18

A while back we ran into a situation where, if a node dies while pods are running on it, K8s takes ~15 minutes to spin those pods up on a new node.

In an attempt to address this, our research has pointed us to both taint-based evictions and extensions to the K8s API to increase 'node awareness'. Unfortunately, neither has been reliable.

Has anyone who's run into this been able to overcome it successfully?

TIA!

-- thepip3r
kubernetes

1 Answer

7/22/2021

Unfortunately, there are no built-in solutions other than the ones you mentioned.

You can shorten the reschedule period with taint-based evictions (TaintBasedEvictions) by setting tolerations on the pod:

    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
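
To show where those tolerations sit in a full manifest, here is a minimal sketch of a Deployment. The name, labels, replica count, and image are placeholders I made up for illustration; only the tolerations come from the snippet above.

```yaml
# Hypothetical Deployment; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      # Tolerations belong to the pod spec (spec.template.spec),
      # not to the top-level Deployment spec.
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2   # evict 2s after the taint is applied
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 2
      containers:
      - name: app
        image: nginx:1.21      # placeholder image
```

Keep in mind that tolerationSeconds only starts counting once the node has actually been tainted, so the total delay also includes the time the control plane needs to notice the node is gone. A very low value such as 2 can also evict pods during a brief network blip.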

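Additionally, you can set the --pod-eviction-timeout flag to a shorter value (default is 5 minutes). This is a kube-controller-manager flag, not a per-node setting. Note that on clusters where taint-based evictions are active, which should be the case by default on 1.18, the tolerationSeconds values above determine the eviction delay and this flag may have no effect.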
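
For reference, --pod-eviction-timeout is passed to kube-controller-manager. A rough sketch of where it would go, assuming a kubeadm-style cluster where the controller manager runs as a static pod; the file path and the 30s value are my assumptions, not something from the question:

```yaml
# Fragment of /etc/kubernetes/manifests/kube-controller-manager.yaml
# (kubeadm layout assumed; other installers keep this elsewhere).
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # ...leave the existing flags as they are...
    - --pod-eviction-timeout=30s   # illustrative value; default is 5m0s
```

On a managed control plane you usually cannot change this flag at all, in which case the per-pod tolerations are the only knob available.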

-- p10l
Source: StackOverflow