Kubernetes recreate pod if node becomes offline timeout

12/5/2018

I've started working with Docker images and set up Kubernetes. Everything else works, but I am having problems with how long pod recreation takes.

If a pod is running on a particular node and I shut that node down, it takes ~5 minutes for the pod to be recreated on another online node.

I've checked all the config files I could find and set the pod-eviction-timeout, horizontal-pod-autoscaler-downscale, and horizontal-pod-autoscaler-downscale-delay flags, but it still doesn't work.

Current kube controller manager config:

spec:
 containers:
 - command:
   - kube-controller-manager
   - --address=192.168.5.135
   - --allocate-node-cidrs=false
   - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
   - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
   - --client-ca-file=/etc/kubernetes/pki/ca.crt
   - --cluster-cidr=192.168.5.0/24
   - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
   - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
   - --controllers=*,bootstrapsigner,tokencleaner
   - --kubeconfig=/etc/kubernetes/controller-manager.conf
   - --leader-elect=true
   - --node-cidr-mask-size=24
   - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
   - --root-ca-file=/etc/kubernetes/pki/ca.crt
   - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
   - --use-service-account-credentials=true
   - --horizontal-pod-autoscaler-downscale-delay=20s
   - --horizontal-pod-autoscaler-sync-period=20s
   - --node-monitor-grace-period=40s
   - --node-monitor-period=5s
   - --pod-eviction-timeout=20s
   - --use-service-account-credentials=true
   - --horizontal-pod-autoscaler-downscale-stabilization=20s
   image: k8s.gcr.io/kube-controller-manager:v1.13.0

Thank you.

-- Jure Potocnik
kube-controller-manager
kubernetes

2 Answers

12/6/2018

This is what happens when a node dies or goes offline:

  1. The kubelet posts its status to the masters every --node-status-update-frequency=10s.
  2. The node goes offline.
  3. kube-controller-manager checks all the nodes every --node-monitor-period=5s.
  4. kube-controller-manager sees that the node is unresponsive and waits for the grace period --node-monitor-grace-period=40s before it considers the node unhealthy. PS: This parameter should be N x node-status-update-frequency.
  5. Once the node is marked unhealthy, kube-controller-manager removes the pods based on --pod-eviction-timeout=5m.

Now, if you set pod-eviction-timeout to, say, 30 seconds, eviction will still take a total of 70 seconds with these settings:

 node-status-update-frequency: 10s
 node-monitor-period: 5s
 node-monitor-grace-period: 40s
 pod-eviction-timeout: 30s

The total is 70 seconds because node-status-update-frequency and node-monitor-period are already counted within node-monitor-grace-period, so only the 40s grace period and the 30s eviction timeout add up. You can tune these variables as well to lower your total eviction time further.
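The arithmetic above can be written out as a short sketch. It assumes, as this answer describes, that the status-update and monitor periods elapse inside the grace period, so only the grace period and the eviction timeout are summed. The function name is hypothetical and not part of any Kubernetes tool.

```python
def eviction_delay_seconds(node_monitor_grace_period: int,
                           pod_eviction_timeout: int) -> int:
    """Approximate seconds from node failure to pod eviction.

    Assumption: node-status-update-frequency and node-monitor-period are
    polling intervals that elapse within the grace period, so they are
    not added separately.
    """
    return node_monitor_grace_period + pod_eviction_timeout


# Values from this answer: 40s grace period, 30s eviction timeout.
tuned = eviction_delay_seconds(40, 30)
# Default values: 40s grace period, 5m (300s) eviction timeout.
default = eviction_delay_seconds(40, 300)
print(f"tuned: {tuned}s, default: {default}s")
```

With the defaults the same sum comes to 340 seconds, which matches the roughly 5 minutes reported in the question.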

This is my kube-controller-manager.yaml file (located in /etc/kubernetes/manifests for kubeadm):

containers:
  - command:
    - kube-controller-manager
    - --controllers=*,bootstrapsigner,tokencleaner
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --pod-eviction-timeout=30s
    - --address=127.0.0.1
    - --use-service-account-credentials=true
    - --kubeconfig=/etc/kubernetes/controller-manager.conf

I am effectively seeing my pods get evicted in 70s once I turn off my node.

EDIT2:

Run the following command on the master and check that --pod-eviction-timeout shows up as 20s.

[root@ip-10-0-1-12]# docker ps --no-trunc | grep "kube-controller-manager"

9bc26f99dcfe6ac0e7b2abf22bff67af6060561ee8c0cdff08e11c3a479f182c   sha256:40c8d10b2d11cbc3db2e373a5ffce60dd22dbbf6236567f28ac6abb7efbfc8a9                     
"kube-controller-manager --leader-elect=true --use-service-account-credentials=true --root-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key \
**--pod-eviction-timeout=30s** --address=127.0.0.1 --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --service-account-private-key-file=/etc/kubernetes/pki/sa.key --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --allocate-node-cidrs=true --cluster-cidr=192.168.13.0/24 --node-cidr-mask-size=24"        

If --pod-eviction-timeout shows 5m here instead of 20s, your changes were not applied properly.
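Reading a long command line by eye is error-prone, so here is a minimal sketch that pulls the flags out of such a string. The helper name and the shortened sample string are illustrative assumptions; in practice you would paste in the command line printed by docker ps.

```python
import shlex


def parse_flags(command_line: str) -> dict:
    """Map each --flag=value token in a command line to its value."""
    flags = {}
    for token in shlex.split(command_line):
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    return flags


# Shortened, cleaned-up sample of the command line shown above.
sample = (
    "kube-controller-manager --leader-elect=true "
    "--pod-eviction-timeout=30s --address=127.0.0.1 "
    "--controllers=*,bootstrapsigner,tokencleaner"
)
flags = parse_flags(sample)
# If the flag is absent, the running process is using its built-in default.
effective = flags.get("pod-eviction-timeout", "not set (default applies)")
print(effective)
```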

-- Prafull Ladha
Source: StackOverflow

9/5/2019

If taint-based evictions are in effect for the pod, the controller manager will not be able to evict a pod that tolerates the taint. Even if you don't define an eviction policy in your configuration, the pod gets a default one, since the Default Toleration Seconds admission controller plugin is enabled by default.

The Default Toleration Seconds admission controller plugin configures your pod as follows:

tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300

You can verify this by inspecting the definition of your pod:

kubectl get pods -o yaml -n <namespace> <pod-name>

With the tolerations above, it takes more than 5 minutes to recreate the pod on another ready node, since the pod can tolerate the not-ready taint for up to 5 minutes. In this case, even if you set --pod-eviction-timeout to 20s, there is nothing the controller manager can do because of the tolerations.

But why does it take more than 5 minutes? Because the node is only considered down after --node-monitor-grace-period, which defaults to 40s. Only then does the pod's toleration timer start.
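As a rough sketch of that check, the snippet below reads a pod definition in JSON form, picks out the node-failure tolerations, and adds the grace period to each. The sample pod is hand-written to mimic the defaults described here, and the function and constant names are hypothetical.

```python
import json

# Taint keys applied to a node that is not ready or unreachable.
NODE_FAILURE_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)
NODE_MONITOR_GRACE_PERIOD = 40  # seconds, the default quoted in this answer


def failure_tolerations(pod: dict) -> dict:
    """Return tolerationSeconds per node-failure taint key found in the pod.

    A value of None means the toleration carries no tolerationSeconds, so
    the pod tolerates that taint indefinitely.
    """
    found = {}
    for tol in pod.get("spec", {}).get("tolerations", []):
        if tol.get("key") in NODE_FAILURE_KEYS and tol.get("effect") == "NoExecute":
            found[tol["key"]] = tol.get("tolerationSeconds")
    return found


# Sample input mimicking the default tolerations (an assumption, not real output).
pod = json.loads("""
{"spec": {"tolerations": [
  {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
   "effect": "NoExecute", "tolerationSeconds": 300},
  {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
   "effect": "NoExecute", "tolerationSeconds": 300}
]}}
""")

for key, seconds in failure_tolerations(pod).items():
    if seconds is None:
        print(f"{key}: tolerated indefinitely")
    else:
        total = NODE_MONITOR_GRACE_PERIOD + seconds
        print(f"{key}: evicted about {total}s after node failure")
```

To run this against a real pod, the JSON could come from kubectl get pod with -o json instead of the inline sample.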


Recommended Solution

If you want your cluster to react faster to node outages, use taints and tolerations rather than modifying these options. For example, you can define your pod as follows:

tolerations:
- key: node.kubernetes.io/not-ready
  effect: NoExecute
  tolerationSeconds: 0
- key: node.kubernetes.io/unreachable
  effect: NoExecute
  tolerationSeconds: 0

With the tolerations above, your pod will be recreated on a ready node right after the current node is marked as not ready. This should take less than a minute, since --node-monitor-grace-period defaults to 40s.
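For pods managed by a Deployment, the tolerations belong in the pod template (spec.template.spec.tolerations). Here is a hedged sketch that builds such a patch as JSON. The function name is hypothetical, and operator: Exists is added here as an assumption to mirror the default tolerations.

```python
import json


def toleration_patch(seconds: int) -> dict:
    """Build a merge patch that sets node-failure tolerations on a pod template."""
    keys = (
        "node.kubernetes.io/not-ready",
        "node.kubernetes.io/unreachable",
    )
    tolerations = [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in keys
    ]
    return {"spec": {"template": {"spec": {"tolerations": tolerations}}}}


# tolerationSeconds: 0 evicts the pod as soon as the node is tainted.
patch = toleration_patch(0)
print(json.dumps(patch))
```

The printed JSON could be passed to kubectl patch deployment <deployment-name> -n <namespace> --type merge -p '<json>'. Note that a merge patch replaces the whole tolerations list, so any other tolerations the pod needs would have to be included as well.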

Available Options

If you want to control these timings yourself, there are plenty of options listed below. However, modifying them should be avoided: tight timings can create overhead on etcd, since every node will try to update its status very often.

In addition, it is currently not clear how to propagate changes to the controller manager, API server, and kubelet configuration to all nodes in a live cluster. Please see Tracking issue for changing the cluster and Dynamic Kubelet Configuration. As of this writing, reconfiguring a node's kubelet in a live cluster is in beta.

You can configure the control plane and the kubelet during the kubeadm init or join phase. Please refer to Customizing control plane configuration with kubeadm and Configuring each kubelet in your cluster using kubeadm for more details.

Assuming you have a single node cluster:

  • controller manager includes:
    • --node-monitor-grace-period default 40s
    • --node-monitor-period default 5s
    • --pod-eviction-timeout default 5m0s
  • api server includes:
    • --default-not-ready-toleration-seconds default 300
    • --default-unreachable-toleration-seconds default 300
  • kubelet includes:
    • --node-status-update-frequency default 10s

If you set up the cluster with kubeadm you can modify:

  • /etc/kubernetes/manifests/kube-controller-manager.yaml for controller manager options.
  • /etc/kubernetes/manifests/kube-apiserver.yaml for api server options.

Note: Modifying these files will reconfigure and restart the respective pod on the node.

To modify the kubelet config, add the line below to /etc/default/kubelet (for DEBs) or /etc/sysconfig/kubelet (for RPMs):

KUBELET_EXTRA_ARGS="--node-status-update-frequency=10s"

Then restart the kubelet service:

sudo systemctl daemon-reload && sudo systemctl restart kubelet

-- Root G
Source: StackOverflow