We have a situation where we have an abundance of Spring boot applications running in containers (on OpenShift) that access centralized infrastructure (external to the pod) such as databases, queues, etc.
If a piece of central infrastructure is down, the health check returns "unhealthy" (rightfully so). The problem is that the liveliness check sees this, and restarts the pod (the readiness check then sees it's down too, so won't start the app). This is fine when only a few are available, but if many (potentially hundreds) of applications are using this, it forces restarts on all of them (crash loop).
I understand that central infrastructure being down is a bad thing. It "should" never happen. But... if it does (Murphy's law), it throws containers into a frenzy. Just seems like we're either doing something wrong, or we should reconfigure something.
A couple questions:
Using actuator checks for liveness/readiness is the de-facto way to check for healthy app in Spring Boot Pod. Your application, once up, should ideally not go down or become unhealthy if a central piece, such as DB or Queueing service goes down , ideally you should add some sort of a resiliency which will either connect to alternate DR site or wait for certain time period for central service to come back up and app to reconnect. This is more of a technical failure on backend side causing a functional failure of your Application after it was started up cleanly.
Yes , both liveness and readiness is required as they both serve different purposes. Read this
In one of my previous projects, the settings used for readiness was around 30 seconds and liveness at around 90, but to be honest this is completely dependent on your application , if your app takes 1 minute to start , that is what your readiness time should be configured at , and your liveness should factor in the same along with any time required for making failover switch of your backend services.
Hope this helps.