How Can I Reduce Node Failure Detection Time on Kubernetes?

4/22/2019

I have a Kubernetes cluster with 1 master and 2 worker nodes. When a node goes down, it takes approximately 5 minutes for Kubernetes to notice the failure. I am using dynamic provisioning for volumes, and this delay is too long for me. How can I reduce the failure detection time? I found a post about it: https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/

At the bottom of the post, it says we can reduce the detection time by changing these parameters:

kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)

I can change the node-status-update-frequency parameter on the kubelet, but I don't have any controller manager program or command on the CLI. How can I change these parameters? Any other suggestions for reducing failure detection time would be appreciated.

-- Adi Soyadi
kubernetes

2 Answers

4/22/2019

The component is actually called kube-controller-manager. You may also decrease --attach-detach-reconcile-sync-period from 1m to 15 or 30 seconds for kube-controller-manager, which speeds up volume attach and detach operations. How you change these parameters depends on how you set up the cluster.
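
As one illustration of "depends on how you set up the cluster": if the cluster was created with kubeadm (an assumption, the question does not say), the flags can be passed through the controllerManager.extraArgs map of a ClusterConfiguration file. The sketch below assumes the kubeadm.k8s.io/v1beta1 API, where extraArgs is a simple key/value map; other kubeadm versions may use a different apiVersion or layout, so check the reference for your release. The values are the ones from the question plus a 30s reconcile period.

```yaml
# Sketch of a kubeadm ClusterConfiguration (assumes a kubeadm-managed cluster
# and the v1beta1 config API; adjust apiVersion to match your kubeadm release).
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    # How often the controller manager checks node status (default 5s).
    node-monitor-period: "2s"
    # How long a node may be unresponsive before it is marked unhealthy (default 40s).
    node-monitor-grace-period: "16s"
    # How long to wait before evicting pods from an unhealthy node (default 5m).
    pod-eviction-timeout: "30s"
    # How often volume attach/detach state is reconciled (default 1m).
    attach-detach-reconcile-sync-period: "30s"
```

Keys are the flag names without the leading dashes, and values are quoted so they are read as strings.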

-- Vasily Angapov
Source: StackOverflow

4/22/2019

..but I don't have any controller manager program or command on the CLI. How can I change these parameters?

You can change or add that parameter in the controller-manager systemd unit file and restart the daemon. Please check the man pages for controller-manager.

If you deploy controller-manager as a pod, check the manifest file for that pod and change the parameters in the container's command section.
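
A minimal sketch of what the edited command section might look like, using the values from the question. The file path in the comment is the usual kubeadm location and is an assumption here; your manifest may live elsewhere and will contain other fields (image, volumes, existing flags) that should be left as they are.

```yaml
# Fragment of a kube-controller-manager static Pod manifest.
# Assumed path (kubeadm convention): /etc/kubernetes/manifests/kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    # Keep the image and all other fields from your existing manifest.
    command:
    - kube-controller-manager
    # Keep the flags already present in your manifest, then add or change:
    - --node-monitor-period=2s          # default 5s
    - --node-monitor-grace-period=16s   # default 40s
    - --pod-eviction-timeout=30s        # default 5m
```

When the manifest is a static Pod file, the kubelet watches the manifest directory and recreates the pod after the file is saved, so no separate restart command should be needed. On clusters that use taint-based evictions, the eviction delay may be governed by each pod's tolerationSeconds rather than by --pod-eviction-timeout, so verify the effect on your version.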

-- Veerendra Kakumanu
Source: StackOverflow