Kubernetes pods failing all at once

2/14/2017

I have a really simple Flask application running on Kubernetes (GKE). The pods get a fair amount of traffic (around 60 req/s) and run under an autoscaler with a minimum of 4 pods and a maximum of 10.

Every 4-5 hours the liveness probe starts failing and all pods get restarted. I sometimes find that my pods were restarted 11-12 times during a single night. When I describe the pods, I see the same error on all of them:

Liveness probe failed: Get http://10.12.5.23:5000/_status/healthz/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

All pods have the same number of restarts, so it's not a load issue (and I also have autoscaling).

The _status/healthz/ endpoint is as simple as it gets:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/')
@app.route('/_status/healthz/')
def healthz():
    # Static response, no I/O or external dependencies
    return jsonify({
        "success": True
    })

I have one other route on this application that connects to MySQL and verifies some data. I ran the same application on DigitalOcean droplets under much higher load for months without any issues.
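
For reference, the other route looks roughly like this (simplified; the route name, table, and connection details are just placeholders):

import pymysql

@app.route('/_status/data/')
def verify_data():
    # Simplified sketch: one connection per request, simple verification query
    conn = pymysql.connect(host='mysql', user='app', password='...', database='app')
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM some_table")
            (count,) = cur.fetchone()
        return jsonify({"success": True, "rows": count})
    finally:
        conn.close()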

I can't figure out why the liveness checks start failing all at once and my pods get restarted.

The allocated resources are also decent and really close to what I had on the DigitalOcean droplets:

"resources": {
    "requests": {
        "cpu": "500m",
        "memory": "1024Mi"
    },
    "limits": {
        "cpu": "800m",
        "memory": "1024Mi"
    }
}

I have also run the same pods with CPU limits of 100m and of 900m. Same result: every few hours all pods restart.

Liveness settings:

"livenessProbe": {
    "initialDelaySeconds": 30,
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "timeoutSeconds": 5
},

UPDATE: I added a readiness probe and increased the CPU; same result, 7 restarts on each of the 4 pods.
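
For completeness, the readiness probe I added points at the same endpoint and mirrors the liveness settings, roughly like this (the exact timings here are placeholders):

"readinessProbe": {
    "initialDelaySeconds": 30,
    "httpGet": {
        "path": "/_status/healthz/",
        "port": 5000
    },
    "timeoutSeconds": 5
},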

-- Romeo Mihalcea
google-kubernetes-engine
kubernetes
kubernetes-health-check

0 Answers