OpenShift: Pod terminated prematurely as "not alive" during application shutdown with long grace period

3/2/2019

The Context

I am maintaining a couple of Spring Boot web service applications (war), currently running on four identical Tomcat instances.

A load balancer in front makes sure traffic is spread across the four instances.

We do manual rolling deployment.

Before taking an instance down for upgrade, we divert new traffic away from it. We then give active requests a grace period of two minutes before terminating the applications.

The Problem

Now I am in the process of migrating these applications to OpenShift. This is all going very well, except that I have a hard time making the rolling deployment work to my satisfaction.

Googling for help, I have arrived at a solution based on:

  • Readiness and liveness probes based on the actuator/health endpoint.
  • A custom HealthIndicator bean allowing me to programmatically toggle the actuator/health endpoint to respond with HTTP 503 (OUT_OF_SERVICE) - see the sketch after this list.
  • A ShutdownHook which, when invoked, will:
    • Toggle the HealthIndicator to OUT_OF_SERVICE.
    • Wait 30 seconds to allow Kubernetes to notice the OUT_OF_SERVICE status and divert new traffic.
    • Pause the Tomcat connector and give active requests a grace period of two minutes.
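
For reference, here is a minimal sketch of the kind of toggleable HealthIndicator described above. The class and field names are my own, and it assumes Spring Boot 2.x, where OUT_OF_SERVICE is mapped to HTTP 503 on the health endpoint by default.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ShutdownHealthIndicator implements HealthIndicator {

    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    /** Called from the shutdown logic to make actuator/health report OUT_OF_SERVICE. */
    public void markOutOfService() {
        shuttingDown.set(true);
    }

    @Override
    public Health health() {
        return shuttingDown.get()
                ? Health.outOfService().withDetail("reason", "shutting down").build()
                : Health.up().build();
    }
}
```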

At first this seemed to work, but it turns out that the liveness probe sometimes kicks in and kills the pod, even if the ShutdownHook hasn't finished yet.

If I remove the liveness probe it works, but I don't see that as a real solution.

Experiments have revealed that once the ShutdownHook pauses the Tomcat connector, the actuator/health endpoint responds with "connection refused" - which makes sense, but is not what I need, because it makes the liveness probe deem the application dead.
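
To make the sequence concrete, this is roughly what my shutdown logic does, sketched here as a ContextClosedEvent listener rather than a raw JVM shutdown hook (the two play the same role for this purpose). It reuses the hypothetical ShutdownHealthIndicator from the earlier sketch and assumes Spring Boot 2.x with embedded Tomcat; the class name and the exact timings are illustrative.

```java
import java.util.concurrent.TimeUnit;

import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatWebServer;
import org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdownListener implements ApplicationListener<ContextClosedEvent> {

    private final ShutdownHealthIndicator healthIndicator;

    public GracefulShutdownListener(ShutdownHealthIndicator healthIndicator) {
        this.healthIndicator = healthIndicator;
    }

    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        try {
            // 1. Make actuator/health report OUT_OF_SERVICE (HTTP 503).
            healthIndicator.markOutOfService();

            // 2. Give Kubernetes time to see the failing readiness probe
            //    and stop routing new traffic to this pod.
            TimeUnit.SECONDS.sleep(30);

            // 3. Pause the Tomcat connectors so no new requests are processed,
            //    then give in-flight requests a two-minute grace period.
            ServletWebServerApplicationContext ctx =
                    (ServletWebServerApplicationContext) event.getApplicationContext();
            TomcatWebServer webServer = (TomcatWebServer) ctx.getWebServer();
            for (Connector connector : webServer.getTomcat().getService().findConnectors()) {
                connector.pause();
            }
            TimeUnit.MINUTES.sleep(2);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Once the connectors are paused in step 3, the liveness probe can no longer reach actuator/health, which is exactly the problem described above.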

I have tried moving the actuator endpoints to another port number, but this is even worse, as they now stop responding immediately when the shutdown starts.

I assume this is caused by the actuator endpoints now belonging to a Tomcat connector different from my main connector, and not under the control of my main Spring application context.

Can any of you tell me how to stall the shutdown of the actuator endpoints when on a separate port number?

Or any other suggestion really - allowing me to:

  • Divert new traffic.
  • Give active requests a grace period of 2 minutes.
  • And at the same time allow a liveness probe to know that the application is shutting down, but is not dead.
-- Jens Krogsboell
java
kubernetes
openshift
spring-boot

1 Answer

3/5/2019

Given that you just want to prevent traffic from going to your pod while it performs a graceful shutdown, you could use a low readiness probe timeout that, upon failure, removes your pod from the list of serviceable pods. Then increase your liveness probe timeout to give your pod plenty of time to shut down gracefully, while still having a fallback in case your pod truly is stuck.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes

-- Will Gordon
Source: StackOverflow