I have been experimenting with Kubernetes recently, trying to test failover between pods by using a replication controller whose containers crash as soon as they are used (thus triggering a restart).
I have adapted the bashttpd project for this: https://github.com/Chronojam/bashttpd
(I have set it up so that it serves the container's hostname, then exits.)
This works great, except the restart is far too slow for what I am trying to do: it works for the first couple of requests, then stops for a while, then starts working again once the pods have been restarted. (Ideally I'd like to see no interruption at all when accessing the service.)
I think (but am not sure) that the backoff delay mentioned here is to blame: https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/pod-states.md#restartpolicy
some output:
#] kubectl get pods
NAME                         READY     STATUS    RESTARTS   AGE
chronojam-blog-a23ak         1/1       Running   0          6h
chronojam-blog-abhh7         1/1       Running   0          6h
chronojam-serve-once-1cwmb   1/1       Running   7          4h
chronojam-serve-once-46jck   1/1       Running   7          4h
chronojam-serve-once-j8uyc   1/1       Running   3          4h
chronojam-serve-once-r8pi4   1/1       Running   7          4h
chronojam-serve-once-xhbkd   1/1       Running   4          4h
chronojam-serve-once-yb9hc   1/1       Running   7          4h
chronojam-tactics-is1go      1/1       Running   0          5h
chronojam-tactics-tqm8c      1/1       Running   0          5h
#] curl http://serve-once.chronojam.co.uk
<h3> chronojam-serve-once-j8uyc </h3>
#] curl http://serve-once.chronojam.co.uk
<h3> chronojam-serve-once-r8pi4 </h3>
#] curl http://serve-once.chronojam.co.uk
<h3> chronojam-serve-once-yb9hc </h3>
#] curl http://serve-once.chronojam.co.uk
<h3> chronojam-serve-once-46jck </h3>
#] curl http://serve-once.chronojam.co.uk
#] curl http://serve-once.chronojam.co.uk
You'll also note that even though there should still be two healthy pods at that point, the service stops returning anything after the fourth request.
So my question is twofold:
1) Can I tweak the backoff delay?
2) Why does my service not send my requests to the healthy containers?
I think it might be that the webserver itself cannot start serving requests that quickly, so Kubernetes is recognizing those pods as healthy and sending requests to them (but getting nothing back because the process hasn't started yet?).
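For reference, the replication controller and service look roughly like this (a sketch reconstructed from the output above; the image name and port are placeholders, not my exact manifest):

apiVersion: v1
kind: ReplicationController
metadata:
  name: chronojam-serve-once
spec:
  replicas: 6
  selector:
    app: serve-once
  template:
    metadata:
      labels:
        app: serve-once
    spec:
      containers:
      - name: serve-once
        image: chronojam/bashttpd   # placeholder image name
        ports:
        - containerPort: 8080       # placeholder port
---
apiVersion: v1
kind: Service
metadata:
  name: serve-once
spec:
  selector:
    app: serve-once
  ports:
  - port: 80
    targetPort: 8080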
I filed an issue to document the recommended practice. I put a sketch of the approach in the issue:
https://github.com/kubernetes/kubernetes/issues/20473
Container restarts, especially when they pull images, are fairly expensive for the system. The Kubelet backs off restarts of crashing containers in order to degrade gracefully without DoSing Docker, the registry, the apiserver, etc.
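On the first question: as far as I know, the backoff delays are hardcoded in the Kubelet and not exposed as a per-pod setting. On the second, the usual way to keep the service from routing to a pod whose server isn't up yet (or has already exited) is a readiness probe, so a pod only receives traffic while its container actually accepts connections. A minimal sketch of the container spec, assuming the server listens on port 8080; note that in this serve-once setup an httpGet probe would itself consume the pod's single request, and depending on how bashttpd handles a bare connection even a TCP check might trip the exit:

spec:
  containers:
  - name: serve-once
    image: chronojam/bashttpd     # placeholder image name
    ports:
    - containerPort: 8080         # assumed port
    readinessProbe:
      tcpSocket:                  # TCP connect only, to avoid issuing an
        port: 8080                # HTTP request that would crash the pod
      initialDelaySeconds: 1
      periodSeconds: 1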