I have set up a simple 1-master, 3-worker-node cluster running on Ubuntu, based on the book "Kubernetes: Up & Running" in combination with the official documentation.
It basically works until I shut down one of the worker nodes. After a few seconds the node's status switches to Unknown, but the pods keep reporting the status Running, even the pods located on the offline node.
Shouldn't Kubernetes move these pods to a different, healthy host? Am I missing something?
Thanks in advance!
I was able to work around this using a script that force-drains any node that has been in NotReady status for longer than 5 minutes (adjustable) and then uncordons the node after it comes back.
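A minimal sketch of that idea, assuming kubectl is already configured on the machine running it; the node handling, the 300-second threshold, and the drain flags are placeholders you may need to adapt (for example, adding --delete-emptydir-data, or --delete-local-data on older versions, if your pods use emptyDir volumes):

#!/usr/bin/env bash
# Naive watchdog: drain nodes that stay NotReady longer than THRESHOLD seconds,
# then uncordon them once they report Ready again.
THRESHOLD=300
declare -A notready_since   # node name -> epoch seconds when first seen NotReady

while true; do
  while read -r name status; do
    now=$(date +%s)
    if [[ "$status" == NotReady* ]]; then
      # remember when we first saw this node as NotReady
      : "${notready_since[$name]:=$now}"
      if (( now - notready_since[$name] > THRESHOLD )); then
        kubectl drain "$name" --ignore-daemonsets --force --timeout=60s
      fi
    elif [[ "$status" == Ready* ]]; then
      # node is back: uncordon it if we had been tracking it, then forget it
      if [[ -n "${notready_since[$name]:-}" ]]; then
        kubectl uncordon "$name"
        unset "notready_since[$name]"
      fi
    fi
  done < <(kubectl get nodes --no-headers | awk '{print $1, $2}')
  sleep 30
done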
With Kubernetes version 1.13 and higher, pod eviction on node failures/not-ready conditions is actually controlled by taints and tolerations; the --pod-eviction-timeout parameter is not used anymore.
When a node goes down or is not ready, the node controller/kubelet adds the taints node.kubernetes.io/unreachable and node.kubernetes.io/not-ready to the node. All pods tolerate these taints for 300 seconds by default. You can control this toleration time cluster-wide for all pods with flags to kube-apiserver, and also per pod using a tolerations object in the pod spec.
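You can see these taints being applied by describing the node once it has gone NotReady (worker-1 is a placeholder node name):

# the failed node gets NoExecute taints added to it
kubectl describe node worker-1 | grep -i taints
# expect entries such as node.kubernetes.io/unreachable:NoExecute
# and/or node.kubernetes.io/not-ready:NoExecute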
Cluster-wide configuration:
You can modify the toleration time cluster-wide using the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags to kube-apiserver.
From docs:
--default-not-ready-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.
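For example, on a kubeadm-style control plane (an assumption; the manifest path may differ in your setup) the flags can be added to the kube-apiserver static pod manifest, and the kubelet then restarts the API server with them:

# on the master node, edit the static pod manifest for kube-apiserver
sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
# add the two flags to the container's command list, e.g.:
#   - --default-not-ready-toleration-seconds=60
#   - --default-unreachable-toleration-seconds=60
# the kubelet watches this directory and recreates the kube-apiserver pod,
# after which newly created pods get 60-second tolerations injected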
Per pod configuration:
You can also modify the toleration time per pod using the following configuration.
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions
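Whether you use the cluster-wide defaults or the per-pod values above, you can check what actually ended up on a pod (my-pod is a placeholder name):

# the tolerations are added to the pod object automatically if not set explicitly
kubectl describe pod my-pod | grep -A 2 Tolerations
# look for lines similar to:
#   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
#   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s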
By default, pods won't be moved for 5 minutes, which is configurable via the --pod-eviction-timeout duration flag on the controller manager.
If after 5 minutes it is still not happening (e.g. for StatefulSets), you need to delete the node using kubectl delete node, which triggers a reschedule of the pods that were on that node.
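A minimal sequence for that manual cleanup (worker-1 is a placeholder node name):

# confirm which node is stuck in NotReady/Unknown
kubectl get nodes
# deleting the node object removes it from the cluster; pods that were bound
# to it (including StatefulSet pods) are then rescheduled onto healthy nodes
kubectl delete node worker-1
# once the machine is healthy again, its kubelet typically re-registers the node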
From Kubernetes version 1.13 and higher, pod eviction on node failures/not-ready conditions is controlled by taints and tolerations. The --pod-eviction-timeout parameter is ignored.
The cluster-wide behaviour can be configured via kube-apiserver flags:
--default-not-ready-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.
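To see which values your API server is currently running with, you can inspect its command line on the control-plane node (no output means the 300-second defaults apply):

# on the master, list the kube-apiserver process and filter for the flags
ps -ef | grep '[k]ube-apiserver' | tr ' ' '\n' | grep toleration-seconds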
If you want to manage this at the pod level, you can add tolerations to the pod spec.
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
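A quick way to watch the 30-second value take effect, assuming a pod using the spec above is already running on a worker (worker-1 and SSH access to it are assumptions about your environment):

# watch pod placement in one terminal
kubectl get pods -o wide -w
# in another terminal, simulate a node failure by stopping the kubelet on that node
ssh worker-1 'sudo systemctl stop kubelet'
# after the node turns NotReady and the 30-second toleration expires, the pod is
# evicted and (if it is managed by a Deployment/ReplicaSet) recreated elsewhere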
Check out the related documentation on taint-based evictions:
https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions