I am trying to achieve zero-downtime deployment using Kubernetes, but every time I upgrade the deployment to a new image, I see 2-3 seconds of downtime. I am testing this with a Hello-World style application and still cannot achieve it. I deploy my application using Helm charts.
Following online blogs and resources, I am using a readiness probe and the RollingUpdate strategy in my Deployment.yaml file, but without success. I have created a /health endpoint which simply returns a 200 status code as the check for the readiness probe. I expected that with readiness probes and the RollingUpdate strategy I would achieve zero downtime for my service when I upgrade the container image. Requests to my service go through an Amazon ELB.
Deployment.yaml file is as below:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: wine-deployment
  labels:
    app: wine-store
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: wine-store
  replicas: 2
  template:
    metadata:
      labels:
        app: wine-store
    spec:
      containers:
        - name: {{ .Chart.Name }}
          resources:
            limits:
              cpu: 250m
            requests:
              cpu: 200m
          image: "my-private-image-repository-with-tag-and-version-goes-here-which-i-have-hidden-here"
          imagePullPolicy: Always
          env:
            - name: GET_HOSTS_FROM
              value: dns
          ports:
            - containerPort: 8089
              name: testing-port
          readinessProbe:
            httpGet:
              path: /health
              port: 8089
            initialDelaySeconds: 3
            periodSeconds: 3
Service.yaml file:
apiVersion: v1
kind: Service
metadata:
  name: wine-service
  labels:
    app: wine-store
spec:
  ports:
    - port: 80
      targetPort: 8089
      protocol: TCP
  selector:
    app: wine-store
Ingress.yaml file:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: wine-ingress
  annotations:
    kubernetes.io/ingress.class: public-nginx
spec:
  rules:
    - host: my-service-my-internal-domain.com
      http:
        paths:
          - path: /
            backend:
              serviceName: wine-service
              servicePort: 80
I expect the downtime to be zero when I upgrade the image using the helm upgrade command. While the upgrade is in progress, I continuously hit my service using a curl command. This curl command gives me 503 Service Temporarily Unavailable errors for 2-3 seconds, and then the service is up again. I expect this downtime not to happen.
This issue is caused by the Service VIP using iptables. You haven't done anything wrong - it's a limitation of current Kubernetes.
When the readiness probe on the new pod passes, the old pod is terminated and kube-proxy rewrites the iptables rules for the service. However, a request can hit the service after the old pod is terminated but before the iptables rules have been updated, resulting in a 503.
A simple workaround is to delay termination by using a preStop lifecycle hook:
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "sleep 10"]
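To show where this hook sits, here is a sketch of the relevant part of the pod template from the question's Deployment. The 10-second sleep and the explicit terminationGracePeriodSeconds are illustrative values (30 seconds is the Kubernetes default), the container name is a placeholder, and /bin/sh is an assumption used in place of /bin/bash in case the image ships without bash:

```yaml
# Fragment of the Deployment's pod template (spec.template.spec).
spec:
  # Upper bound on the preStop hook plus the application's own shutdown.
  # 30 is the default; it must be longer than the sleep below.
  terminationGracePeriodSeconds: 30
  containers:
    - name: wine-store  # placeholder; the chart uses {{ .Chart.Name }}
      lifecycle:
        preStop:
          exec:
            # Keep serving while endpoints and iptables rules catch up;
            # the TERM signal is only sent after this command returns.
            command: ["/bin/sh", "-c", "sleep 10"]
```

The sleep only needs to outlast the delay between the pod being marked as terminating and the proxies dropping it, so the right value depends on the cluster.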
It's probably not relevant in this case, but implementing graceful termination in your application is a good idea. Intercept the TERM signal and wait for your application to finish handling any requests that it has already received rather than just exiting immediately.
Alternatively, more replicas, a low maxUnavailable and a high maxSurge will all reduce the probability of requests hitting a terminating pod.
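As a rough sketch of that alternative, the strategy section of the question's Deployment could be adjusted as below. The specific numbers are assumptions chosen for illustration, not recommended values:

```yaml
# Fragment of the Deployment spec; values are illustrative only.
spec:
  replicas: 4            # more pods, so each terminating pod carries a smaller share of traffic
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never drop below the desired replica count
      maxSurge: 2        # start up to two extra pods before old ones are removed
```

This lowers the odds of a request reaching a terminating pod but does not remove the race itself.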
For more info:
https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-iptables
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
Another answer mistakenly suggests you need a liveness probe. While it's a good idea to have a liveness probe, it won't affect the issue that you are experiencing. With no liveness probe defined, the default state is Success.
In the context of a rolling deployment, a liveness probe is irrelevant: once the readiness probe on the new pod passes, the old pod is sent the TERM signal and iptables is updated. Now that the old pod is terminating, any liveness probe is irrelevant, as its only function is to cause a container to be restarted if the liveness probe fails.
Any liveness probe on the new pod is likewise irrelevant. When the pod is first started, it is considered live by default. Only after the initialDelaySeconds of the liveness probe would it start being checked and, if it failed, the container would be restarted.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
The problem you describe indicates an issue with readiness probes. It is important to understand the differences between liveness and readiness probes. First of all, you should implement and configure both!
The liveness probes check whether the container is started and alive. If this isn't the case, Kubernetes will eventually restart the container.
The readiness probes, in turn, also check dependencies such as database connections or other services your container depends on to do its work. As a developer, you have to invest more time in implementing these than the liveness probes. You have to expose an endpoint which also checks the mentioned dependencies when queried.
Your current configuration uses a health endpoint, which is usually used by liveness probes. It probably doesn't check whether your service is really ready to take traffic.
Kubernetes relies on the readiness probes. During a rolling update, it will keep the old container up and running until the new container declares that it is ready to take traffic. Therefore, the readiness probes have to be implemented correctly.
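A minimal sketch of configuring both probes on the question's container might look like this. The /ready path is hypothetical (the application would have to implement it and check its dependencies there), and the timing values and container name are assumptions:

```yaml
# Fragment of the pod template's container list.
containers:
  - name: wine-store  # placeholder; the chart uses {{ .Chart.Name }}
    ports:
      - containerPort: 8089
    livenessProbe:
      httpGet:
        path: /health   # existing endpoint: only says the process is alive
        port: 8089
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready    # hypothetical endpoint that also checks dependencies
        port: 8089
      initialDelaySeconds: 3
      periodSeconds: 3
```

Keeping the two endpoints separate means a failing dependency takes the pod out of rotation without triggering a restart.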
Work around this with blue-green deployments, because even when the pods are up, it may take time for kube-proxy to forward requests to the new pod IPs.
So set up a new deployment and, once all its pods are up, update the service selector to the new pod labels. See: https://kubernetes.io/blog/2018/04/30/zero-downtime-deployment-kubernetes-jenkins/
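As a sketch of that switch, assuming a hypothetical version label that distinguishes the two deployments (the label name and the blue/green values are not from the question):

```yaml
# Service from the question, with an extra selector key that pins it
# to one of two deployments. The "version" label is an assumption.
apiVersion: v1
kind: Service
metadata:
  name: wine-service
spec:
  ports:
    - port: 80
      targetPort: 8089
      protocol: TCP
  selector:
    app: wine-store
    version: blue   # change to "green" once the new deployment's pods are all ready
```

The pod template of each deployment would carry the matching label (version: blue or version: green) alongside app: wine-store. The switch itself can then be made by re-applying the Service with the new value, or with something like kubectl patch service wine-service -p '{"spec":{"selector":{"version":"green"}}}'.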