We are running Flask microservices on GKE. The main application, which accepts all traffic and routes it to the other services, keeps restarting.
The pod's readiness and liveness probes start timing out at random intervals, even though we are running 5 pods of this particular service and it is a stateless application. One thing I have noticed is that memory also keeps increasing over time.
Is it that the Docker python-slim image at a certain point is not able to handle the application? And is the continuous memory increase in the pod caused by the python-slim base image not releasing memory?
Note: this behavior occurs only in production, not in staging (which runs a single application pod).
What could be the reason behind this? Please help. Thanks.
Update: liveness & readiness probe config:
readinessProbe:
  httpGet:
    path: /k8/readiness
    port: 9595
  initialDelaySeconds: 25
  periodSeconds: 8
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /k8/liveness
    port: 9595
  initialDelaySeconds: 30
  periodSeconds: 8
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 30
While it's a bit difficult to provide an answer without looking at the manifests, events, or other traces from your cluster, I've often seen this happen when people misunderstand or misconfigure readiness and liveness probes, and/or don't scale correctly.
For example, it is possible that you have a probe:
readinessProbe:
  httpGet:
    path: /healthz
    port: 443
  failureThreshold: 1
  periodSeconds: 10
This means: every 10 seconds, check whether an HTTP GET to /healthz:443 returns OK; if it fails once, stop sending traffic to the pod (since it's a readiness probe only).
If you don't set timeoutSeconds, the default value is 1 second.
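For reference, here is the same probe with that implicit default written out explicitly; the path, port, and thresholds are just the ones from the example above:
readinessProbe:
  httpGet:
    path: /healthz
    port: 443
  failureThreshold: 1
  periodSeconds: 10
  timeoutSeconds: 1   # the default Kubernetes applies when the field is omitted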
What can often happen under load is that the /healthz:443 endpoint takes longer and longer to respond if additional pods are not added and latency keeps increasing.
Eventually, once response times hover around 1 second, a single timeout will cause the readiness check to fail; that is the best-case scenario.
If your liveness probe is configured this way, you will also get pod restarts, which is far worse.
This article does a great job of explaining why it's not always wise to use liveness probes (unless you have a very specific check for it).
If load is indeed the cause, you might actually be fine with the 1-second timeout (the default value), but you might want to use something like an HPA (HorizontalPodAutoscaler) to add pods when latency increases.
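As a rough illustration, here is a minimal HPA sketch; the Deployment name "gateway" and the target numbers are hypothetical, and it scales on CPU utilization because scaling directly on latency would require custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gateway-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gateway                # hypothetical Deployment running the 5 pods
  minReplicas: 5
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
With something like this in place, the probe endpoint is less likely to slow down to the point where the 1-second timeout starts failing.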