Kubernetes Zero Downtime deployment not working - Gives 503 Service Temporarily Unavailable

3/29/2019

I am trying to achieve zero-downtime deployments in Kubernetes, but every time I upgrade the deployment to a new image I see 2-3 seconds of downtime. I am testing this with a simple Hello-World style application and still cannot achieve it. I deploy the application using Helm charts.

Following online blogs and resources, I am using a readiness probe and a RollingUpdate strategy in my Deployment.yaml file, but without success. I created a /health endpoint that simply returns a 200 status code for the readiness probe to check. I expected that with readiness probes and the RollingUpdate strategy I would get zero downtime for my service when I upgrade the container image. Requests to my service go through an Amazon ELB.

Deployment.yaml file is as below:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: wine-deployment
  labels:
    app: wine-store
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: wine-store
  replicas: 2
  template:
    metadata:
      labels:
        app: wine-store
    spec:
      containers:
        - name: {{ .Chart.Name }}
          resources:
            limits:
              cpu: 250m
            requests:
              cpu: 200m
          image: "my-private-image-repository-with-tag-and-version-goes-here-which-i-have-hidden-here"
          imagePullPolicy: Always
          env:
          - name: GET_HOSTS_FROM
            value: dns
          ports:
          - containerPort: 8089
            name: testing-port
          readinessProbe:
            httpGet:
              path: /health
              port: 8089
            initialDelaySeconds: 3
            periodSeconds: 3 

Service.yaml file:

apiVersion: v1
kind: Service
metadata:
  name: wine-service
  labels:
    app: wine-store
spec:
  ports:
    - port: 80
      targetPort: 8089
      protocol: TCP
  selector:
    app: wine-store

Ingress.yaml file:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: wine-ingress
  annotations:
     kubernetes.io/ingress.class: public-nginx
spec:
  rules:
    - host: my-service-my-internal-domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: wine-service
              servicePort: 80

I expect zero downtime when I upgrade the image using the helm upgrade command. While the upgrade is in progress I continuously hit my service with curl, and for 2-3 seconds it returns 503 Service Temporarily Unavailable errors before the service comes back up. I expect this downtime not to happen.

-- mohd shoaib
kubernetes

3 Answers

4/26/2019

This issue is caused by the Service VIP using iptables. You haven't done anything wrong - it's a limitation of current Kubernetes.

When the readiness probe on the new pod passes, the old pod is terminated and kube-proxy rewrites the iptables rules for the service. However, a request can hit the service after the old pod is terminated but before iptables has been updated, resulting in a 503.

A simple workaround is to delay termination by using a preStop lifecycle hook:

lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "sleep 10"]

It's probably not relevant in this case, but implementing graceful termination in your application is a good idea: intercept the TERM signal and finish handling any requests your application has already received rather than exiting immediately.
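If the preStop sleep plus your application's drain time approaches 30 seconds, you will also need to raise the pod's terminationGracePeriodSeconds, since 30 seconds is the default budget before Kubernetes sends SIGKILL. A sketch of how the pieces could sit together in the pod template (the 60 is an arbitrary example, not a recommendation):

spec:
  terminationGracePeriodSeconds: 60   # default is 30s; must cover the preStop sleep plus time to finish in-flight requests
  containers:
    - name: {{ .Chart.Name }}
      lifecycle:
        preStop:
          exec:
            command: ["/bin/bash", "-c", "sleep 10"]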

Alternatively, more replicas, a low maxUnavailable and a high maxSurge will all reduce the probability of requests hitting a terminating pod.
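For illustration, those knobs might look roughly like this in the Deployment spec (the numbers are arbitrary examples, not recommendations):

spec:
  replicas: 4                # more replicas mean each terminating pod carries a smaller share of traffic
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never take a ready pod away before its replacement is ready
      maxSurge: 2            # allow extra pods during the rollout so replacements come up sooner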

For more info:
https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

Another answer mistakenly suggests you need a liveness probe. While it's a good idea to have one, it won't affect the issue you are experiencing. With no liveness probe defined, the default state is Success.

In the context of a rolling deployment a liveness probe is irrelevant: once the readiness probe on the new pod passes, the old pod is sent the TERM signal and iptables is updated. Since the old pod is already terminating, a liveness probe serves no purpose; its only function is to restart a pod whose probe fails.

A liveness probe on the new pod is equally irrelevant here. When a pod first starts it is considered live by default; only after the probe's initialDelaySeconds does it start being checked, and only if it failed would the container be restarted.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

-- Hamish
Source: StackOverflow

3/29/2019

The problem you describe indicates an issue with your readiness probe. It is important to understand the difference between liveness and readiness probes, and you should implement and configure both!

A liveness probe checks whether the container has started and is alive. If it is not, Kubernetes will eventually restart the container.

A readiness probe, in turn, should also check dependencies such as database connections or other services your container depends on to do its work. As a developer you have to invest more implementation effort here than for the liveness probe: you need to expose an endpoint that also checks those dependencies when queried.

Your current configuration uses a health endpoint of the kind usually used for liveness probes. It probably does not check whether your service is really ready to take traffic.

Kubernetes relies on the readiness probe: during a rolling update it keeps the old container up and running until the new one declares that it is ready to take traffic. Therefore the readiness probe has to be implemented correctly.
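As an illustration of that separation, here is a sketch of the probe section of the container spec, assuming your application exposes a hypothetical /ready endpoint that also verifies its dependencies (the /health path and port 8089 come from the question; /ready and the timing values are examples):

livenessProbe:
  httpGet:
    path: /health          # cheap check: is the process alive?
    port: 8089
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready           # deeper check: are dependencies (database, downstream services) reachable?
    port: 8089
  initialDelaySeconds: 3
  periodSeconds: 3
  failureThreshold: 2      # take the pod out of the Service endpoints after two consecutive failures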

-- Randy
Source: StackOverflow

3/29/2019

Consider a blue-green deployment instead, because even when new pods are up it may take time for kube-proxy to start forwarding requests to the new pod IPs.
So create a second Deployment, wait until all of its pods are up, and then switch the Service selector over to the new pods' labels, as sketched below. Follow: https://kubernetes.io/blog/2018/04/30/zero-downtime-deployment-kubernetes-jenkins/
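A rough sketch of the idea, assuming you add a hypothetical track label to the pod templates to tell the two Deployments apart (the blue/green values are just examples):

# The "blue" Deployment's pod template carries the labels
#   app: wine-store, track: blue
# while the "green" Deployment (with the new image) uses track: green.
apiVersion: v1
kind: Service
metadata:
  name: wine-service
spec:
  ports:
    - port: 80
      targetPort: 8089
      protocol: TCP
  selector:
    app: wine-store
    track: blue        # once all green pods are Ready, patch this to track: green

Once the selector points at the green pods and traffic looks healthy, the blue Deployment can be scaled down or deleted.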

-- Akash Sharma
Source: StackOverflow