RollingUpdate on Kubernetes does not prevent Gateway Timeout

8/8/2018

I've followed the http://rahmonov.me/posts/zero-downtime-deployment-with-kubernetes/ blog post and created two Docker images with an index.html returning 'Version 1 of an app' and 'Version 2 of an app'. What I want to achieve is a zero-downtime release. I'm running:

kubectl apply -f mydeployment.yaml

with image: mynamespace/nodowntime-test:v1 inside, to deploy the v1 version to k8s, and then running:

while true
    do
            printf "\n---------------------------------------------\n"
            curl "http://myhosthere"
            sleep 1s
    done

So far everything works: after a short time curl returns 'Version 1 of an app'. Then I apply the same k8s deployment file with image: mynamespace/nodowntime-test:v2. It works, but there is one (always exactly one) Gateway Timeout response between v1 and v2. So it's not really a zero-downtime release ; ) It is much better than without RollingUpdate, but not perfect.

I'm using the RollingUpdate strategy and a readinessProbe:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodowntime-deployment
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: nodowntime-test
  template:
    metadata:
      labels:
        app: nodowntime-test
    spec:
      containers:
      ...
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 5

Can I do better? Is it an issue with syncing all of that with the ingress controller? I know I can tweak it by using minReadySeconds so that the old and new pods overlap for some time, but is that the only solution?
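For reference, the minReadySeconds tweak I have in mind would look something like this (the 10-second value is just a number I picked for illustration):

spec:
  replicas: 1
  minReadySeconds: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1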

-- freakman
kubernetes
kubernetes-ingress
traefik-ingress

1 Answer

8/9/2018

I've recreated the mentioned experiment and raised the request rate to roughly 30 per second by starting three simultaneous instances of the following loop:

while true
    do
        curl -s https://<NodeIP>:<NodePort>/ -m 0.1 --connect-timeout 0.1 | grep Version || echo "fail"
    done

After editing the deployment and changing the image version several times, there were no lost requests at all during the transition. I even caught a short moment when both images were serving requests at the same time:

  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 1 of my awesome app! Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!
  Version 2 of my awesome app! More Money is pouring in!

Therefore, if you send requests to the Service directly, it works as expected.

The "Gateway Timeout" error is a reply from the Traefik proxy. Traefik opens a TCP connection to the backend through a set of iptables rules.
During a RollingUpdate the iptables rules change, but Traefik doesn't know that, so from Traefik's point of view the old connection is still open. After the first unsuccessful attempt to send a request through the now-nonexistent iptables rule, Traefik reports "Gateway Timeout" and closes the TCP connection. On the next try, it opens a new connection to the backend through the new iptables rule, and everything works again.

This can be fixed by enabling retries in Traefik:

# Enable retry sending request if network error
[retry]

# Number of attempts
#
# Optional
# Default: (number servers in backend) -1
#
# attempts = 3

Update:

We finally worked around it without using the 'retry' feature of Traefik, which could potentially require idempotent request processing in all services (good to have anyway, but we could not afford to force all projects to do that). What you need is the Kubernetes RollingUpdate strategy + a configured readinessProbe + graceful shutdown implemented in your app.
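A minimal sketch of the Deployment pieces for that last part, assuming the app itself handles SIGTERM by finishing in-flight requests before exiting (the preStop sleep and grace period values below are illustrative assumptions, not the original setup):

spec:
  template:
    spec:
      # time the app gets to drain in-flight requests after SIGTERM (assumed value)
      terminationGracePeriodSeconds: 30
      containers:
      - name: nodowntime-test
        image: mynamespace/nodowntime-test:v2
        lifecycle:
          preStop:
            exec:
              # keep the pod serving for a few seconds so the endpoint removal
              # propagates to the proxy before the container receives SIGTERM
              command: ["sh", "-c", "sleep 5"]
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5

The preStop sleep delays SIGTERM until the pod has most likely been removed from the endpoints list, and the readinessProbe ensures the new pod only receives traffic once it actually serves requests.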

-- VAS
Source: StackOverflow