Why is there downtime while rolling out an update to a deployment or even while scaling down a replicaset

8/30/2019

According to the official Kubernetes documentation:

Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones

I was trying to perform a zero-downtime update using the RollingUpdate strategy, which is the recommended way to update an application in a Kubernetes cluster. Official reference:

https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/

But I was a little bit confused by that definition while performing the update: the application still suffers downtime. Here is my cluster info at the beginning, as shown below:

liguuudeiMac:~ liguuu$ kubectl get all
NAME                                     READY   STATUS    RESTARTS   AGE
pod/ubuntu-b7d6cb9c6-6bkxz               1/1     Running   0          3h16m
pod/webapp-deployment-6dcf7b88c7-4kpgc   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-4vsch   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-7xzsk   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-jj8vx   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-qz2xq   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-s7rtt   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-s88tb   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-snmw5   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-v287f   1/1     Running   0          3m52s
pod/webapp-deployment-6dcf7b88c7-vd4kb   1/1     Running   0          3m52s
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP          3h16m
service/tc-webapp-service   NodePort    10.104.32.134   <none>        1234:31234/TCP   3m52s
NAME                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ubuntu              1/1     1            1           3h16m
deployment.apps/webapp-deployment   10/10   10           10          3m52s
NAME                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/ubuntu-b7d6cb9c6               1         1         1       3h16m
replicaset.apps/webapp-deployment-6dcf7b88c7   10        10        10      3m52s

deployment.apps/webapp-deployment is a Tomcat-based web application, and the Service tc-webapp-service maps to the Pods running the Tomcat containers (the full Deployment and Service config files are at the end of this question). deployment.apps/ubuntu is just a standalone app in the cluster that keeps sending HTTP requests to tc-webapp-service, so that I can trace the status of the so-called rolling update of webapp-deployment. The command run in the ubuntu container is shown below (an infinite loop of curl, one request every 0.01 seconds):

for ((;;)); do curl -sS -D - http://tc-webapp-service:1234 -o /dev/null | grep HTTP; date +"%Y-%m-%d %H:%M:%S"; echo ; sleep 0.01 ; done;

And the output of the ubuntu app (everything is fine):

...
HTTP/1.1 200 
2019-08-30 07:27:15
...
HTTP/1.1 200 
2019-08-30 07:27:16
...

Then I try to change the tag of the tomcat image from 8-jdk8 to 8-jdk11. Note that the rolling update strategy of deployment.apps/webapp-deployment has been set explicitly, with maxSurge 0 and maxUnavailable 9 (the result was the same when these two attributes were left at their defaults):

...
    spec:
      containers:          
      - name: tc-part
        image: tomcat:8-jdk8 -> tomcat:8-jdk11
...
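
(For reference, the image change can be applied either by editing the manifest above and re-applying it with kubectl apply, or directly with kubectl set image, and the rollout can then be watched with kubectl rollout status. The commands below are just one way to do it; all the names come from the config at the end of this question.)

# update the tc-part container of webapp-deployment to the new image tag
kubectl set image deployment/webapp-deployment tc-part=tomcat:8-jdk11
# watch the rolling update until it completes
kubectl rollout status deployment/webapp-deployment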

Then, the output of the ubuntu app:

HTTP/1.1 200 
2019-08-30 07:47:43
curl: (56) Recv failure: Connection reset by peer
2019-08-30 07:47:43
HTTP/1.1 200 
2019-08-30 07:47:44

As shown above, some HTTP requests failed, and this is without doubt an interruption of the application while performing a rolling update in the cluster. However, I can also reproduce the same interruption when scaling down, with the command shown below (from 10 replicas to 2):

kubectl scale deployment.apps/webapp-deployment --replicas=2

After performing the above tests, I was left wondering what so-called zero downtime actually means. Although the way of mocking HTTP requests is a little bit artificial, the situation is perfectly normal for applications that are designed to handle thousands, or millions, of requests per second.

Environment:

liguuudeiMac:cacheee liguuu$ minikube version
minikube version: v1.3.1
commit: ca60a424ce69a4d79f502650199ca2b52f29e631
liguuudeiMac:cacheee liguuu$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T09:15:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Deployment & Service Config:

# Service
apiVersion: v1
kind: Service
metadata:
  name: tc-webapp-service
spec:
  type: NodePort
  selector:
    appName: tc-webapp
  ports:
  - name: tc-svc
    protocol: TCP
    port: 1234
    targetPort: 8080
    nodePort: 31234
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 10
  selector:
    matchLabels:
      appName: tc-webapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 9
  # Pod Templates
  template:
    metadata:
      labels:
        appName: tc-webapp
    spec:
      containers:          
      - name: tc-part
        image: tomcat:8-jdk8
        ports:
        - containerPort: 8080
        livenessProbe:
          tcpSocket:
            port: 8080            
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            scheme: HTTP
            port: 8080
            path: /
          initialDelaySeconds: 5
          periodSeconds: 1 
-- liguuu
deployment
kubernetes
replicaset

1 Answer

8/30/2019

To deploy an application that can really be updated with zero downtime, the application should meet some requirements. To mention a few of them:

  • the application should handle graceful shutdown
  • the application should implement readiness and liveness probes correctly

For example, when a shutdown signal is received, the application should stop responding with 200 to new readiness probes, but it should still respond with 200 to liveness probes until all of the old requests have been processed.
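
For the Tomcat Pod in the question, a minimal sketch of what this can look like is a preStop hook plus a longer termination grace period, so the Pod keeps serving while its endpoint is being removed from the Service (the sleep duration and grace period below are illustrative assumptions, not values from this answer):

    spec:
      # allow time for in-flight requests to finish before the container is killed
      terminationGracePeriodSeconds: 60
      containers:
      - name: tc-part
        image: tomcat:8-jdk11
        lifecycle:
          preStop:
            exec:
              # keep serving briefly while the endpoint is removed from the Service
              command: ["sh", "-c", "sleep 10"]
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          periodSeconds: 1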

-- Alik Khilazhev
Source: StackOverflow