Is there a way to configure a liveness probe to stop running when a Pod has successfully completed?
I'm using a liveness probe to ensure that batch Jobs (which are expected to run to completion over the course of a few minutes to weeks) are responsive and running properly. However, when a Pod completes successfully, there is a delay between when the Pod stops serving the liveness probe (in this case, touching a file) and when the Pod is deleted. During this delay, the liveness probe fails enough times to trigger Kubernetes to restart the container.
Aside from increasing the liveness probe's failure threshold or period, or decreasing the Pod's termination grace period, I haven't come across any mitigations, and no robust solutions, for this issue. In fact, I haven't found any mention in the Kubernetes docs of using a liveness probe in a batch Job.
The events log from kubectl describe pod <pod> is below. Of particular interest, and what leads me to think the liveness probe is failing while the Pod completes, is the message: Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown.
Events:
  Type     Reason     Age                From              Message
  ----     ------     ---                ----              -------
  Warning  Unhealthy  55m                kubelet, pascal0  Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
  Normal   Created    55m (x8 over 21h)  kubelet, pascal0  Created container
  Normal   Pulled     55m (x7 over 18h)  kubelet, pascal0  Container image "<image>" already present on machine
  Normal   Started    55m (x8 over 21h)  kubelet, pascal0  Started container
Some relevant Job configuration values are included below.
backoffLimit: 10
restartPolicy: OnFailure
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # run the check through a shell so the $(...) substitutions expand;
    # exec-form arguments are not processed by a shell on their own
    - test $(stat -c %Y /tmp/healthy) -gt $(($(date +%s) - 10))
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 3
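To make the intent of that probe concrete, here is a hedged sketch of the worker side of the pattern (the do_work function and the 10-second freshness window are purely illustrative): the job refreshes /tmp/healthy between units of work, so the file's mtime stays recent only while the job is making progress.

```shell
# Hypothetical heartbeat sketch; names and timings are illustrative.
do_work() { sleep 1; }        # stand-in for one unit of real work

touch /tmp/healthy            # initial heartbeat
do_work
touch /tmp/healthy            # refresh after each unit of work

# The probe's check, run in a shell so the substitutions expand:
if test "$(stat -c %Y /tmp/healthy)" -gt "$(($(date +%s) - 10))"; then
  echo alive
fi
```

Once the process exits, the file stops being refreshed, and any later probe attempt fails twice over: the mtime check would fail, and exec into the stopped container fails first, which is exactly the event shown above.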
Aside from increasing the liveness probe's failure threshold or period, or decreasing the Pod's termination grace period, I haven't come across any possible mitigations, and no robust solutions, for this issue.
There is nothing wrong with tweaking those parameters to meet your needs. The default Pod graceful termination period is 30 seconds, so if your container needs more time to terminate, you should adjust the probe timings accordingly. Or perhaps I've missed the main reason this is an issue in your case.
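As a sketch of that tuning (all values below are illustrative, not recommendations), the idea is to make periodSeconds x failureThreshold comfortably longer than the window between the container completing and the Pod being deleted:

```yaml
spec:
  terminationGracePeriodSeconds: 30   # default; lower it if your job exits quickly
  containers:
  - name: worker                      # illustrative name
    livenessProbe:
      periodSeconds: 120              # was 60
      failureThreshold: 5             # was 3; 5 x 120s far exceeds the deletion window
```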
In fact, I haven't found any mention in the Kubernetes docs of using a liveness probe in a batch Job.
Me neither. Apparently it's not a very popular approach, and it is probably therefore not tested well enough.
Thinking about workarounds, I was about to suggest using a preStop hook, but after reading the whole story, I found an alternative suggestion made by srikumarb in issue #55807:
I ended up using livenessProbe with a timestamp file to know the liveliness of the container from sidecar container. Hope that helps as a workaround for you also
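A hedged sketch of that sidecar workaround (all container, volume, and image names are illustrative): the worker writes its timestamp file into a shared emptyDir, and the sidecar carries the liveness probe, so the kubelet never has to exec into a container that has already stopped.

```yaml
spec:
  volumes:
  - name: heartbeat
    emptyDir: {}
  containers:
  - name: worker
    image: <image>
    volumeMounts:
    - name: heartbeat
      mountPath: /tmp
  - name: liveness-sidecar
    image: busybox                    # any small image with a shell
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    volumeMounts:
    - name: heartbeat
      mountPath: /tmp
    livenessProbe:
      exec:
        command:
        - sh
        - -c
        - test $(stat -c %Y /tmp/healthy) -gt $(($(date +%s) - 10))
      periodSeconds: 60
```

If the probe fails, only the sidecar container is restarted, which leaves the completed worker container alone.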
You may also consider configuring a different kind of liveness probe, e.g. checking the uptime (or anything else not related to the filesystem).
Alternatively, you can try using an emptyDir volume as a placeholder for your probe file.
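A minimal, illustrative fragment of that idea (names are hypothetical): backing /tmp with an emptyDir keeps the probe file on a Pod-level volume rather than the container's writable layer.

```yaml
spec:
  volumes:
  - name: probe-dir
    emptyDir: {}
  containers:
  - name: worker
    volumeMounts:
    - name: probe-dir
      mountPath: /tmp
```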