I have a cluster of 3 nodes that I'd like to recover quickly after the loss of a single node. By recovery I mean that communication with my service resumes within a reasonable, preferably configurable, amount of time.
Here are the relevant details:
k8s version:
    Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T10:00:30Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.7", GitCommit:"8eb75a5810cba92ccad845ca360cf924f2385881", GitTreeState:"clean", BuildDate:"2017-04-27T09:42:05Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
I have a service distributed over all 3 nodes. When one node fails I observe the following behavior: the service becomes unreachable at `10.100.0.1` (its cluster IP), and `kubectl get ep --namespace=kube-system` shows no ready addresses for all endpoints. The service has both readiness and liveness probes, and only a single instance is ready at any given time, with all instances being live. I have checked that the instance that is supposed to be available actually is available, i.e. it is both ready and live.
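For reference, the checks look roughly like this (the `app=my-service` label is a placeholder for my service's actual selector):

    # the service's endpoints show no ready addresses during the outage
    kubectl get ep --namespace=kube-system

    # the surviving instance still reports READY 1/1 and STATUS Running
    kubectl get pods --namespace=kube-system -l app=my-service -o wide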
This continues for more than 15 minutes, until the service Pod that was running on the lost node receives a `NodeLost` status; at that point the endpoints are re-populated and I can access the service as usual.
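For what it's worth, I watch the node state change like this (the node name is a placeholder):

    # the lost node's STATUS flips to NotReady once the control plane notices,
    # and `describe` shows its Ready condition as Unknown
    kubectl get nodes -w
    kubectl describe node <lost-node-name>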
I have tried fiddling with the `pod-eviction-timeout` and `node-monitor-grace-period` settings to no avail: the recovery time is always roughly the same.
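Concretely, I set the flags on the controller manager roughly like this (the values shown are just one of the combinations I tried; the defaults are `--node-monitor-grace-period=40s` and `--pod-eviction-timeout=5m0s`):

    # kube-controller-manager invocation, other flags omitted for brevity
    kube-controller-manager \
      --node-monitor-grace-period=20s \
      --pod-eviction-timeout=30s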
Hence, my questions: