Liveness probe on batch Job restarts Pod after completion

1/16/2019

Is there a way to configure a liveness probe to stop running when a Pod has successfully completed?

I'm using a liveness probe to ensure that batch Jobs (which are expected to run to completion over the course of a few minutes to weeks) are responsive and running properly. However, when a Pod completes successfully, there seems to be a delay between when the container stops serving the liveness probe (in this case, by touch-ing a file) and when the Pod is actually torn down. During that delay, the liveness probe fails enough times to trigger Kubernetes to restart the Pod.

Aside from increasing the liveness probe's failure threshold or period, or decreasing the Pod's termination grace period, I haven't come across any other mitigations for this issue, let alone a robust solution. In fact, I haven't found any mention in the Kubernetes docs of using a liveness probe in a batch Job.

The events log from kubectl describe pod <pod> is below. Of particular interest to me, and what's guiding my thinking that the liveness probe is failing during the Pod's completion, is the message Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown.

Events:
  Type     Reason     Age                From              Message
  ----     ------     ----               ----              -------
  Warning  Unhealthy  55m                kubelet, pascal0  Liveness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
  Normal   Created    55m (x8 over 21h)  kubelet, pascal0  Created container
  Normal   Pulled     55m (x7 over 18h)  kubelet, pascal0  Container image "<image>" already present on machine
  Normal   Started    55m (x8 over 21h)  kubelet, pascal0  Started container

Some relevant Job configuration values are included below.

backoffLimit: 10
restartPolicy: OnFailure
livenessProbe:
  exec:
    command:
      - /bin/sh       # wrap in a shell so the $( ) substitutions are actually evaluated
      - -c
      - test $(stat -c %Y /tmp/healthy) -gt $(($(date +%s) - 10))
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 1
  successThreshold: 1
  failureThreshold: 3
-- shappenny
kubernetes

1 Answer

3/12/2019

Aside from increasing the liveness probe's failure threshold or period, or decreasing the Pod's termination grace period, I haven't come across any other mitigations for this issue, let alone a robust solution.

There is nothing wrong with tweaking those parameters to meet your needs. The default Pod graceful termination period is 30 seconds, so if your container needs more time to shut down, you should adjust the probe timings accordingly. Or perhaps I've missed the main reason why this is an issue in your case.
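
For instance, a minimal sketch of that kind of tweak on the Job's Pod template (the numbers are purely illustrative, not values taken from the question):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # the default; lower it if the container exits quickly
      containers:
        - name: batch-worker              # hypothetical container name
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - test $(stat -c %Y /tmp/healthy) -gt $(($(date +%s) - 10))
            periodSeconds: 60
            failureThreshold: 10          # raised from 3 so failures during shutdown don't add up to a restart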

In fact, I haven't found any mention in the Kubernetes docs of using a liveness probe in a batch Job.

Me neither. Apparently it's not a very popular approach, and probably therefore not well tested.

Thinking about workarounds, I was about to suggest a preStop hook, but after reading the whole story I found an alternative suggestion made by srikumarb in issue #55807:

I ended up using livenessProbe with a timestamp file to know the liveliness of the container from sidecar container. Hope that helps as a workaround for you also
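
A rough sketch of that idea, assuming the worker touches a timestamp file on a shared emptyDir volume and a long-running sidecar carries the probe (all names, images and the done-file convention are illustrative, not taken from the question or the linked issue):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-with-probe-sidecar
spec:
  backoffLimit: 10
  template:
    spec:
      restartPolicy: OnFailure
      volumes:
        - name: health
          emptyDir: {}                    # scratch space shared by both containers
      containers:
        - name: worker                    # the batch workload; touches /health/healthy while it runs
          image: <image>                  # and creates /health/done when it finishes
          volumeMounts:
            - name: health
              mountPath: /health
        - name: probe-sidecar             # keeps running after the worker completes, so the exec
          image: busybox                  # probe never hits a stopped container
          command:
            - sh
            - -c
            - while [ ! -f /health/done ]; do sleep 5; done
          volumeMounts:
            - name: health
              mountPath: /health
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - test $(stat -c %Y /health/healthy) -gt $(($(date +%s) - 10))
            initialDelaySeconds: 30
            periodSeconds: 60

Note that the sidecar has to exit on its own once the worker is done (here by watching for the done file), otherwise the Pod stays Running and the Job never completes.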

You may also think about configuring a different kind of liveness probe, e.g. checking the uptime of the main process (or anything else that isn't tied to the filesystem).
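
For example, a minimal sketch of a probe that only checks whether the main process is still alive, assuming pgrep is available in the image (the process name my-batch-worker is hypothetical; being an exec probe it can still fail once the container has already stopped, it merely drops the timestamp-file dependency):

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pgrep -x my-batch-worker > /dev/null   # succeeds as long as the worker process exists
  initialDelaySeconds: 30
  periodSeconds: 60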

Alternatively, you can try to use an emptyDir volume as a placeholder for your probe file (the sidecar sketch above already uses one).

-- VAS
Source: StackOverflow