How to determine a failed Kubernetes deployment?

12/27/2017

I create a Deployment with a replica count of, say, 2, running an application (a simple web server) whose command is expected to run forever. However, due to misconfiguration the command sometimes exits, and the Pod is then terminated.

Due to the default restartPolicy of Always, the container is restarted and the Pod status eventually becomes CrashLoopBackOff.

If I do a kubectl describe deployment, it shows the Conditions as Progressing=True and Available=False.

This looks fine. The question is: how do I mark my deployment as 'failed' in the above case?

Adding a spec.progressDeadlineSeconds doesn't seem to have any effect.

Will simply setting restartPolicy to Never in the Pod specification be enough?

A related question: is there a way to get this information as a trigger/webhook, without doing a kubectl rollout status watch?

-- gabhijit
kubernetes

1 Answer

12/27/2017

There is no Kubernetes concept of a "failed" deployment. Editing a Deployment registers your intent that a new ReplicaSet be created, and k8s will repeatedly try to make that intent happen. Any errors hit along the way will cause the rollout to block, but they will not cause k8s to abort the deployment.

AFAIK, the best you can do (as of 1.9) is to apply a progress deadline to the Deployment (spec.progressDeadlineSeconds), which adds a Condition you can inspect to detect when a rollout gets stuck; see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment and https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds.
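As a rough sketch of what that detection could look like (the helper name is mine, and it assumes a pykube Deployment object like the code further down), you can scan the Deployment's status conditions for the marker Kubernetes sets once the deadline elapses:

def _deployment_progress_failed(deployment):
    # Once spec.progressDeadlineSeconds elapses without progress, Kubernetes
    # sets the Progressing condition to status False with reason
    # ProgressDeadlineExceeded; look through status.conditions for that.
    for cond in deployment.obj.get("status", {}).get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False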

It's possible to overlay your own definition of failure on top of the statuses that k8s provides, but this is quite difficult to do in a generic way; see this issue for a (long!) discussion of the current state of things: https://github.com/kubernetes/kubernetes/issues/1899

Here's some Python code (using pykube) that I wrote a while ago to implement my own definition of 'ready'; I abort my deploy script if this condition does not hold after 5 minutes.

import logging

from pykube.objects import Pod

_log = logging.getLogger(__name__)


def _is_deployment_ready(d, deployment):
    # `d` is a context object carrying the pykube HTTPClient (`d.api`) and
    # the target namespace (`d.namespace`); `deployment` is a pykube
    # Deployment already fetched from the API server.
    if not deployment.ready:
        _log.debug('Deployment not completed.')
        return False

    # More replicas reported than desired means old pods are still draining.
    if deployment.obj["status"]["replicas"] > deployment.replicas:
        _log.debug('Old replicas not terminated.')
        return False

    selector = deployment.obj['spec']['selector']['matchLabels']
    # Materialize the query so the emptiness check below is reliable.
    pods = list(Pod.objects(d.api).filter(namespace=d.namespace, selector=selector))
    if not pods:
        _log.info('No pods found.')
        return False

    # The deployment-level check alone is not enough; verify each pod
    # individually (see the note below).
    for pod in pods:
        _log.info('Is pod %s ready? %s.', pod.name, pod.ready)
        if not pod.ready:
            _log.debug('Pod status: %s', pod.obj['status'])
            return False
    _log.info('All pods ready.')
    return True

Note the individual pod check, which is required because a deployment seems to be deemed 'ready' when the rollout completes (i.e. all pods are created), not when all of the pods are ready.
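For completeness, here's a minimal sketch of the poll loop around that check (the wrapper name, timeout, and interval are my own for illustration, not from the original deploy script):

import time

def wait_for_deployment_ready(d, deployment, timeout=300, interval=5):
    # Poll until _is_deployment_ready() holds; give up after `timeout`
    # seconds so the caller can abort the deploy.
    deadline = time.time() + timeout
    while time.time() < deadline:
        deployment.reload()  # refresh deployment.obj from the API server
        if _is_deployment_ready(d, deployment):
            return True
        time.sleep(interval)
    return False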

-- Symmetric
Source: StackOverflow