OpenShift takes 20 seconds after a pod is killed before balancing to the other available pod when using sticky sessions

5/20/2019

I have come across a strange behaviour in OpenShift's routing:

  • I have deployed an application with 2 instances/pods. By default, OpenShift uses sticky sessions to send each user's traffic to the same pod.

  • The application uses an external, cookie-based HTTP session store. The idea is that if the pod dies and the other one takes over, the user session will still be available with its previous state.

  • To validate this, I kill the pod that is currently assigned to a user. The next request should be sent to the other available pod. However, the request is held for ~20s before it is finally processed by the remaining pod (a reproduction sketch follows this list).
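
For reference, this is roughly how the failover can be reproduced from the command line. It is a minimal sketch: the hostname and pod name are placeholders, and the cookie jar just preserves the router's sticky-session cookie so that consecutive requests target the same pod.

# 1. First request: the router sets a sticky-session cookie pinning us to a pod.
> curl -c cookies.txt http://myapp-route.example.com/

# 2. Kill the pod we are pinned to (pod name is illustrative).
> oc delete pod myapp-1-abcde

# 3. Replay with the saved cookie: this is the request that stalls for ~20s
#    before the remaining pod finally answers it.
> time curl -b cookies.txt http://myapp-route.example.com/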

Why does it take OpenShift's router so long to realize that the previously assigned pod is down and to send traffic to the other one?

Can this be tweaked to make it faster?
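
For context, one knob that may be relevant here: the HAProxy-based router regenerates and reloads its configuration when endpoints change, and that reload is rate-limited. A minimal sketch of lowering the rate limit, assuming an OpenShift 3.x router deployed as dc/router in the default namespace:

# RELOAD_INTERVAL rate-limits how often the router may reload HAProxy after
# endpoint changes; lowering it can shorten the failover window.
> oc -n default set env dc/router RELOAD_INTERVAL=1s

# Per-route HAProxy timeouts can also be tuned through annotations:
> oc annotate route myapp haproxy.router.openshift.io/timeout=5s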

UPDATE:

I've checked the state of the endpoints as suggested by Graham Dumpleton:

> oc get endpoints myapp --watch
myapp   172.26.23.93:8080,172.26.32.244:8080   361d   //1
myapp   172.26.32.244:8080                     361d   //2
myapp   172.26.32.244:8080                     361d
...
  • Step //1: There are two available endpoints before killing any pod.
  • Step //2: Right after killing the pod (~1s later), I get an update showing only the remaining endpoint.

So even though the endpoints are updated right away, the request still needs almost 20s to finish. Ideas?
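
One way to narrow this down might be to check whether the router itself has caught up, since the router keeps its own generated HAProxy configuration separate from the endpoints object. A sketch, where the namespace, router pod name, and config path are assumptions based on a default OpenShift 3.x router:

# If the dead pod's IP is still listed as a backend server in the router's
# generated config, the lag is in the router reload, not the endpoints update.
> oc -n default get pods -l deploymentconfig=router
> oc -n default rsh router-1-xyz12 grep 172.26.23.93 /var/lib/haproxy/conf/haproxy.config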

-- codependent
haproxy
kubernetes
openshift
openshift-origin

0 Answers