Readiness & liveness probes failing at random intervals

1/28/2020

We are running Flask microservices on GKE. The main application, which accepts all traffic and routes it to the other services, keeps restarting.

The pod's readiness & liveness probes start timing out at random intervals, even though we are running 5 pods of this particular service and it's a stateless application. One thing I have noticed is that memory also keeps increasing over time.

Is it because the Docker python-slim image can't handle the application beyond a certain load? And regarding the continuously increasing memory in the pod, is it that the python-slim image is not releasing memory?

Note: this behavior occurs only in production, not in staging (which runs a single application pod).

What could be the reason behind this? Please help. Thanks.

Update: liveness & readiness probe config

readinessProbe:
  httpGet:
    path: /k8/readiness
    port: 9595
  initialDelaySeconds: 25
  periodSeconds: 8
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /k8/liveness
    port: 9595
  initialDelaySeconds: 30
  periodSeconds: 8
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 30
-- Harsh Manvar
google-cloud-platform
google-kubernetes-engine
kubernetes
microservices
python

1 Answer

1/30/2020

While it's a bit difficult to provide an answer without looking at the manifests, events, or other traces from your cluster, I've often seen this happen when people misunderstand or misconfigure readiness and liveness probes and/or don't scale correctly.

For example, it is possible that you have a probe:

readinessProbe:
  httpGet:
    path: /healthz
    port: 443
  failureThreshold: 1
  periodSeconds: 10

This means: every 10 seconds, check whether an HTTP GET to /healthz:443 is OK; if it fails once, stop sending traffic to the pod (since it's a readiness probe only).

If you don't set timeoutSeconds, by default the value is 1 second.
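
To make that visible, here is the same example probe with the field written out explicitly (mirroring the illustrative /healthz:443 probe above, not your actual config):

readinessProbe:
  httpGet:
    path: /healthz
    port: 443
  failureThreshold: 1
  periodSeconds: 10
  timeoutSeconds: 1   # this is what Kubernetes assumes when the field is omitted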

What often happens under load is that the /healthz:443 endpoint takes longer and longer to respond if additional pods are not added and latency keeps increasing.

Eventually, once response time hovers around 1 second, a single timeout will cause readiness to fail - and that's the best-case scenario.

If your liveness probe is configured this way, you would get pod restarts, which is far worse.
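
If you do want a liveness probe, a more tolerant configuration only restarts the pod after several consecutive slow or failed responses. This is just a sketch; the path, port, and numbers are placeholders, not values tuned to your service:

livenessProbe:
  httpGet:
    path: /healthz
    port: 443
  periodSeconds: 10
  timeoutSeconds: 5      # allow slower responses before counting a failure
  failureThreshold: 3    # restart only after 3 consecutive failures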

This article does a great job of explaining why it's not always wise to use liveness probes (unless you have a very specific check for it).

If load is the cause, you might actually be OK with a 1-second timeout (the default value), but you might want to use something like a Horizontal Pod Autoscaler (HPA) to add pods when your latency is increasing.
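
As a minimal sketch, a CPU-based HPA could look like the following. The Deployment name main-app and the replica/utilization numbers are assumptions; adjust them for your workload:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: main-app-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: main-app          # hypothetical name of the gateway Deployment
  minReplicas: 5
  maxReplicas: 15
  targetCPUUtilizationPercentage: 70

Scaling on a latency metric (via custom metrics) would track the symptom more directly, but CPU-based scaling is the simplest starting point.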

-- Eytan Avisror
Source: StackOverflow