NGINX 502 Bad Gateway when using a single replica in Kubernetes

7/31/2019

I have a requirement to deploy an HTTP application in K8s with zero downtime. I also have the restriction of using a single pod (replicas=1). The problem is that when I make changes to the K8s pod, some of the HTTP requests get a 502 Bad Gateway.

I referred to the following two issues [1] [2]; the approaches there work fine when I have more than a single replica. With a single replica, NGINX ingress still has a slight downtime, which is less than 1 millisecond.

The lifecycle spec and rolling update spec of my deployment are set as below, according to the answers given in the above issues [1] [2].

spec:
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate
  ...
  template:
    ...
    spec:
      containers:
      - ...
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "30"

Note that I have ConfigMaps mounted to this deployment; I'm not sure whether that affects the downtime or not.
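To give an idea, the ConfigMap is mounted more or less like this (the names below are just placeholders, not my real resources):

spec:
  containers:
  - name: my-app                       # placeholder container name
    volumeMounts:
    - name: app-config
      mountPath: /etc/app/config       # configuration files read by the application
  volumes:
  - name: app-config
    configMap:
      name: my-app-config              # placeholder ConfigMap name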

I also referred to these two blog posts [3] [4], but they did not solve my problem either. Blog post [4] shows that K8s can achieve zero downtime even with a single replica; unfortunately, the author did not use the ingress-nginx controller there.

In brief, I want to know: is it possible to achieve zero downtime with ingress-nginx and a single replica of a pod?

References

[1] https://github.com/kubernetes/ingress-nginx/issues/489

[2] https://github.com/kubernetes/ingress-nginx/issues/322

[3] https://blog.sebastian-daschner.com/entries/zero-downtime-updates-kubernetes

[4] http://rahmonov.me/posts/zero-downtime-deployment-with-kubernetes/

-- Buddhi
kubernetes
kubernetes-ingress
nginx
nginx-ingress

2 Answers

7/31/2019

Nginx as a reverse proxy can handle zero downtime as long as the IP address of the backend doesn't change. In your case, though, I think the combination of a single replica and mounted volumes always makes the switch-over a bit slower. Zero downtime is not possible because, if you mount the same volume on the new pod, it has to wait for the old pod to be destroyed and release the volume before it can start up.

In the blog post you referenced that explains how to achieve this, the example doesn't use volumes and uses a very small image, which makes the image pull and start-up very fast.

I recommend you review your volume requirements and try to keep them from blocking the start-up of the new pod.
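As an illustration of what I mean (the resource names are only examples), a PersistentVolumeClaim with ReadWriteOnce access can be attached to only one node at a time, so the new pod created by the rolling update may stay Pending until the old pod releases it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data                 # example name
spec:
  accessModes:
  - ReadWriteOnce                # exclusive: the new pod must wait for the old one to release it
  resources:
    requests:
      storage: 1Gi

If your volumes support ReadWriteMany, or are only ConfigMaps (which are not exclusive), the new pod does not need to wait for the old one.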

-- wolmi
Source: StackOverflow

7/31/2019

I suppose your single-pod restriction applies at runtime and not during the upgrade; otherwise, you can't achieve your goal.

In my opinion your rolling upgrade strategy is good. You can add a PodDisruptionBudget to manage disruptions and be sure that at least 1 pod is available.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: sample-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      <your_app_label_key>: <your_app_label_value>

Another very important thing is the probes; according to the documentation:

The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

You should set the liveness probe, and above all the readiness probe, to return a success response only when your new pod is really ready to accept connections; otherwise k8s thinks the new pod is up and destroys the old one before the new one can accept connections.
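A minimal sketch of the container section (the path, port and timings are assumptions, adapt them to your application):

        readinessProbe:
          httpGet:
            path: /healthz             # should return 200 only when the app can really serve traffic
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10

With maxUnavailable: 0 and a readiness probe like this, the old pod is removed only after the new one reports Ready.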

-- Federico Bevione
Source: StackOverflow