How to signal "bad" but not "fatal" health check from spring boot to Kubernetes?

11/15/2019

What we're looking for is a way for an actuator health check to signal some intention like "I am limping but not dead. If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp."

We have a rest service hosted in clustered Kubernetes containers that periodically call out to fetch fresh data from an external resource. Occasionally we have failures reaching those external resources, and sometimes, but not every time, a restart of the pod will resolve the issue.

The services can operate just fine on possibly stale data. Although we wouldn't want to continue operating on stale data, that's preferable to just going down entirely.

In the interim, we're planning on having a node unilaterally decide not to report any problems through actuator until X amount of time has passed since the last successful sync, but that really only delays the point at which all nodes would still report failure.

-- Brian Deacon
kubernetes
kubernetes-health-check
spring-boot-actuator

1 Answer

11/15/2019

In Kubernetes you can use LivenessProbe and ReadinessProbe to let a controller to heal your service, but some situations is better handled with HTTP response codes or alternative degraded service.

LivenessPobe

Use a LivenessProbe to resolve a deadlock situation. When your pod does not respond on a LivenessProbe, it will be killed and a new pod will replace it.

ReadinessProbe

Use a ReadinessProbe when your pod is not prepared for serving requests, e.g. if your pod need to read some files or need to connect to an external service before serving requests.

Fault affecting all replicas

If you have a problem that all your replicas depends on, e.g. an external service is down, then you can not solve it by restarting your pods. You may use an OpsToogle or a circuit breaker in this situation and notifying other services that you are degraded or show a message about temporary error.

For your situations

If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp.

You can not delegate that logic to Kubernetes. Your application need to understand each fault situation, e.g. if an error was a transient network error or if your error will affect all replicas.

-- Jonas
Source: StackOverflow