I have a really simple Flask application running on Kubernetes (GKE). The pods get a fair amount of traffic (~60 req/s) and run under an autoscaling group with a minimum of 4 active pods and a maximum of 10.
Every 4-5 hours the liveness probe starts failing and all pods get restarted. I sometimes find that my pods were restarted 11-12 times during a single night. When I describe the pods, I get the same error on each one:
Liveness probe failed: Get http://10.12.5.23:5000/_status/healthz/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
All pods have the same number of restarts, so it's not a load issue (and I also have autoscaling).
The _status/healthz/ endpoint is as simple as it gets:
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/')
@app.route('/_status/healthz/')
def healthz():
    return jsonify({
        "success": True
    })
I have one other route on this application, which connects to MySQL and verifies some data. I ran the same application on DigitalOcean droplets under much higher load for months without any issues.
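For context, that other route looks roughly like this (a simplified sketch; the route path, table, and connection details are placeholders, not the real ones):

import pymysql
from flask import jsonify

@app.route('/_status/data/')   # placeholder path
def verify_data():
    # One connection per request; credentials come from config in the real app
    conn = pymysql.connect(host="mysql-host", user="app", password="...",
                           database="app")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM some_table")  # placeholder query
            (count,) = cur.fetchone()
        return jsonify({"success": True, "rows": count})
    finally:
        conn.close()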
I can't figure out why the liveness checks start failing all at once and my pods get restarted.
The allocated resources are also decent and really close to what I had on the DigitalOcean droplets:
"resources": {
"requests": {
"cpu": "500m",
"memory": "1024Mi"
},
"limits": {
"cpu": "800m",
"memory": "1024Mi"
}
}
I have also run the same pods with CPU limits of 100m and of 900m. Same result: every few hours, all pods restart.
Liveness settings:
"livenessProbe": {
"initialDelaySeconds": 30,
"httpGet": {
"path": "/_status/healthz/",
"port": 5000
},
"timeoutSeconds": 5
},
UPDATE: I added a readiness probe and increased the CPU; same result, 7 restarts on each of the 4 pods.
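For reference, the readiness probe is set up the same way as the liveness probe (a sketch of the shape, not the exact manifest):

"readinessProbe": {
    "initialDelaySeconds": 30,
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "timeoutSeconds": 5
}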