I've followed the http://rahmonov.me/posts/zero-downtime-deployment-with-kubernetes/ blog post and created two Docker images whose index.html returns 'Version 1 of an app' and 'Version 2 of an app'. What I want to achieve is a zero-downtime release. To deploy v1 to Kubernetes I run
kubectl apply -f mydeployment.yaml
with image: mynamespace/nodowntime-test:v1
inside, and then start:
while true
do
    printf "\n---------------------------------------------\n"
    curl "http://myhosthere"
    sleep 1s
done
So far everything works. After a short time curl starts returning 'Version 1 of an app'. Then I apply the same deployment file with image: mynamespace/nodowntime-test:v2. It works, but there is one (always exactly one) Gateway Timeout response between v1 and v2, so it's not really a zero-downtime release ; ) It is much better than without RollingUpdate, but not perfect.
I'm using the RollingUpdate strategy and a readinessProbe:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodowntime-deployment
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: nodowntime-test
  template:
    metadata:
      labels:
        app: nodowntime-test
    spec:
      containers:
      - ...
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 5
Can I do better? Is it an issue with syncing all of that with the ingress controller? I know I can tweak it by using minReadySeconds so the old and new pods overlap for some time, but is that the only solution?
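For reference, this is roughly where minReadySeconds would go if I tried that route; the value below is only an illustration, not something from the blog post:

# sketch only: minReadySeconds sits at the same level as replicas/strategy;
# the 10-second value is an arbitrary example, not a recommendation
spec:
  replicas: 1
  minReadySeconds: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1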
I've recreated the experiment and raised the request rate to roughly 30 per second by starting three simultaneous processes of the following loop:
while true
do
    curl -s https://<NodeIP>:<NodePort>/ -m 0.1 --connect-timeout 0.1 | grep Version || echo "fail"
done
After editing the deployment and changing the image version several times, there was no packet loss at all during the transition. I even caught a short moment in which both images were serving requests at the same time:
Version 1 of my awesome app! Money is pouring in!
Version 1 of my awesome app! Money is pouring in!
Version 1 of my awesome app! Money is pouring in!
Version 2 of my awesome app! More Money is pouring in!
Version 1 of my awesome app! Money is pouring in!
Version 1 of my awesome app! Money is pouring in!
Version 2 of my awesome app! More Money is pouring in!
Version 2 of my awesome app! More Money is pouring in!
Version 2 of my awesome app! More Money is pouring in!
Therefore, if you send the requests to the service directly, it works as expected.
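In case you want to reproduce this against a NodePort service, the placeholders can be looked up like this (the service name nodowntime-service is only an assumption; use whatever your Service is called):

# NodeIP is in the INTERNAL-IP / EXTERNAL-IP columns of the node listing
kubectl get nodes -o wide
# NodePort is the second number in the PORT(S) column, e.g. 80:3xxxx/TCP
kubectl get svc nodowntime-service -o wide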
The “Gateway Timeout” error is a reply from the Traefik proxy. Traefik opens a TCP connection to the backend through a set of iptables rules.
When you do a RollingUpdate, the iptables rules change, but Traefik doesn't know that, so from Traefik's point of view the connection is still open. After the first unsuccessful attempt to go through the now-nonexistent iptables rule, Traefik reports "Gateway Timeout" and closes the TCP connection. On the next try it opens a new connection to the backend through the new iptables rule, and everything works again.
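You can watch the backend set (and therefore what kube-proxy rewrites into iptables) change while the rollout happens; the service name below is again an assumption:

# watch the Service's endpoints being swapped during the rolling update
kubectl get endpoints nodowntime-service -w
# on a node, the corresponding kube-proxy rules can be inspected with:
sudo iptables-save | grep nodowntime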
It can be fixed by enabling retries in Traefik:
# Enable retry sending request if network error
[retry]
# Number of attempts
#
# Optional
# Default: (number servers in backend) -1
#
# attempts = 3
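If Traefik itself runs inside the cluster, one way to ship that setting is a ConfigMap mounted as traefik.toml and passed via --configFile; this is just a sketch for Traefik 1.x, and the resource names are made up:

# hypothetical ConfigMap carrying Traefik's static config with retry enabled
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
data:
  traefik.toml: |
    [retry]
    attempts = 2

Mount it into the Traefik pod (e.g. at /config) and start Traefik with --configFile=/config/traefik.toml so the [retry] section is picked up.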
Update:
We finally worked around it without using Traefik's retry feature, which could potentially require idempotent request handling in all services (good to have anyway, but we could not afford to force every project to implement it). What you need is the Kubernetes RollingUpdate strategy plus a configured readinessProbe and graceful shutdown implemented in your app.
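To illustrate the graceful-shutdown part, the pod spec can delay and soften termination so in-flight requests finish before the old pod disappears. This is only a sketch: the sleep length and container details are assumptions, and the application itself must still stop accepting new connections and drain existing ones when it receives SIGTERM.

    spec:
      terminationGracePeriodSeconds: 30      # give the app time to drain after SIGTERM
      containers:
      - name: nodowntime-test                # assumed container name
        image: mynamespace/nodowntime-test:v2
        lifecycle:
          preStop:
            exec:
              # small pause so the endpoint is removed from the Service/ingress
              # before the container receives SIGTERM (the sleep value is arbitrary)
              command: ["sh", "-c", "sleep 5"]
        readinessProbe:
          httpGet:
            path: /
            port: 80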