Should I run liveness and readiness probes every second?


In my K8s workloads, I use Readiness and Liveness probes for pod health checks.

I'm wondering whether I should set the interval (periodSeconds) as low as 1 second, since that will consume more resources, right?

Are there best practices for pod health checks?

Firstly, it is important to understand the difference between Liveness and Readiness. The tl;dr is: Liveness is about whether K8s should kill and restart the container; Readiness is about whether the container is able to accept requests. It is likely that you want different parameters for each.

Whether K8s takes any action based on the outcome of the probe depends on the failureThreshold. This is the number of times in a row the probe has to fail before K8s does something. If you combine this with periodSeconds you can tune the sensitivity of your probes.
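
As a rough sketch of how the two settings combine: the worst-case time between the service failing and K8s acting is approximately periodSeconds × failureThreshold. This ignores timeoutSeconds and initialDelaySeconds, and the function name is just illustrative.

```python
def seconds_until_action(period_seconds: int, failure_threshold: int) -> int:
    """Approximate upper bound on the time from a service failing to K8s acting.

    Simplification: ignores timeoutSeconds and initialDelaySeconds.
    """
    return period_seconds * failure_threshold


# Kubernetes defaults are periodSeconds=10 and failureThreshold=3.
print(seconds_until_action(10, 3))  # 30
# Probing every second with the default threshold reacts much faster.
print(seconds_until_action(1, 3))   # 3
```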

In general you want to balance:

  • the time it takes K8s to take action against how quickly your service can be expected to recover once it does
  • the "cost" of the probes. For example, if your Readiness probe connects to a database, then with a periodSeconds of 1 you are adding 1 query per second (QPS) of load to your database per replica (with 100 replicas, you would be generating 100 QPS just through probes!)
  • the reliability of your probe, also known as "flakiness". What is the false negative rate, i.e. what proportion of the time does the probe report failure while the service is actually running within expected performance bounds?
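
To put numbers on the "cost" point, here is a minimal sketch that assumes every probe issues exactly one query against the shared dependency. The function name is made up for illustration.

```python
def probe_qps(replicas: int, period_seconds: float) -> float:
    """Queries per second hitting a shared dependency from probes alone.

    Assumes each probe performs exactly one query per execution.
    """
    return replicas / period_seconds


print(probe_qps(100, 1))   # 100.0
print(probe_qps(100, 10))  # 10.0
```

Stretching periodSeconds from 1 to 10 cuts the probe-induced load by a factor of ten, at the price of slower detection.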

Here is one way of thinking about it:

  • Work out how long your service can be in the failed state before K8s should take action. This should be based on how long it would take to recover (e.g. restart in the case of Liveness)
  • If a probe is "expensive", have a longer periodSeconds and smaller failureThreshold
  • If a probe is "flaky" (i.e. occasionally reports failed and then reports working very quickly afterwards), have a shorter periodSeconds and a larger failureThreshold.
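
The trade-off between the last two cases can be sketched numerically. The values below are hypothetical, and the false-action estimate assumes each probe flakes independently with the same probability, which real probes may not do.

```python
from dataclasses import dataclass


@dataclass
class ProbeConfig:
    period_seconds: int     # maps to periodSeconds
    failure_threshold: int  # maps to failureThreshold

    def window_seconds(self) -> int:
        # Approximate duration of continuous failure before K8s acts.
        return self.period_seconds * self.failure_threshold

    def probes_per_minute(self) -> float:
        # How often the probe runs, i.e. its cost.
        return 60 / self.period_seconds

    def false_action_probability(self, flake_rate: float) -> float:
        # Chance that a run of failure_threshold consecutive probes all flake,
        # assuming independent flakes with probability flake_rate.
        return flake_rate ** self.failure_threshold


# Hypothetical settings that both tolerate roughly 60 seconds of failure.
expensive = ProbeConfig(period_seconds=30, failure_threshold=2)
flaky = ProbeConfig(period_seconds=5, failure_threshold=12)

for name, cfg in (("expensive", expensive), ("flaky", flaky)):
    print(name, cfg.window_seconds(), cfg.probes_per_minute(),
          cfg.false_action_probability(0.05))
```

Both settings give K8s about the same time to react, but the first runs far fewer probes, while the second makes it much less likely that a string of spurious failures triggers an action.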
