GKE Ingress is slow to pick up pod readiness/liveness

12/17/2019

I successfully created a GKE cluster that uses the GCE ingress. However, it takes a long time for the Ingress to detect that the service is ready (I have already set both livenessProbe and readinessProbe). My pods are set up as follows:

Containers:
...
  gateway:
    Liveness:   http-get http://:5100/api/v1/gateway/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:5100/api/v1/gateway/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
...

and the ingress:

...
Name:             main-ingress
  Host                              Path  Backends
  ----                              ----  --------
  <host>
                                    /api/v1/gateway/    gateway:5100 (<ip:5100>)
                                    /api/v1/gateway/*   gateway:5100 (<ip:5100>)
                                                        web:80 (<ip>)
Annotations:
  ingress.kubernetes.io/backends:               {"k8s-be-***":"HEALTHY","k8s-be-***":"HEALTHY","k8s-be-***":"HEALTHY"}
  kubernetes.io/ingress.allow-http:             false

What I notice is that if I kill all the services and redeploy, the backend stays UNHEALTHY for quite some time before the Ingress picks it up, even though Kubernetes itself already reports that the pods and services are all running.

I also noticed that when livenessProbe and readinessProbe are set, the backend health check generated by ingress-gce is the following:

Backend
Timeout: 30 seconds

Backend Health check
Interval: 70 seconds
Timeout: 1 second
Unhealthy threshold: 10 consecutive failures
Healthy threshold: 1 success

Whereas if I just deploy a simple nginx pod without specifying livenessProbe and readinessProbe, the backend generated is the following

Backend
Timeout: 30 seconds

Backend Health Check
Interval: 60 seconds
Timeout: 60 seconds
Unhealthy threshold: 10 consecutive failures
Healthy threshold: 1 success

Is the Backend health check the root cause of the slowness of picking things up? If so, any idea how to speed it up?


UPDATE: I wanted to clarify after reading @yyyyahir's answer below.

I understand that creating a new ingress takes much longer because the ingress controller needs to provision the new load balancer, the backend, and all the other related resources.

However, I also notice that when I release a new version of the service (through Helm; the deployment strategy is set to Recreate rather than RollingUpdate) OR when the pod dies (out of memory) and is restarted, it takes quite a while before the backend status is healthy again, even though the Pod is already in a running/healthy state (this is with an existing Ingress and load balancer in GCP). Is there a way to speed this up?

-- GantengX
google-kubernetes-engine
kubernetes
kubernetes-ingress

1 Answer

12/17/2019

When using GCE Ingress, you need to wait for the load balancer to be provisioned before the backend service is deemed healthy.

Consider that when you use this ingress class, you're relying on the GCE infrastructure that automatically has to provision an HTTP(S) load balancer and all of its components before sending requests into the cluster.

When you set up a deployment without readinessProbe, the default values are going to be applied to the load balancer health check:

Backend Health Check
Interval: 60 seconds
Timeout: 60 seconds
Unhealthy threshold: 10 consecutive failures
Healthy threshold: 1 success

However, when a readinessProbe is set, its periodSeconds value is added to the default health check interval. In your case, that is 10 seconds + the default 60 seconds = 70 seconds:

Backend Health check
Interval: 70 seconds
Timeout: 1 second
Unhealthy threshold: 10 consecutive failures
Healthy threshold: 1 success
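
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the behavior described in this answer, not the actual ingress-gce code: the class and function names are hypothetical, and the rule that the health check timeout is copied from the probe's timeout is an assumption inferred from the two outputs shown in the question.

```python
from dataclasses import dataclass
from typing import Optional

# Default interval quoted in this answer (seconds).
DEFAULT_INTERVAL_SECONDS = 60
# Default timeout shown for the nginx pod without probes (seconds).
DEFAULT_TIMEOUT_SECONDS = 60


@dataclass
class ReadinessProbe:
    """Hypothetical stand-in for the relevant readinessProbe fields."""
    period_seconds: int = 10
    timeout_seconds: int = 1


@dataclass
class LoadBalancerHealthCheck:
    """Hypothetical stand-in for the generated backend health check."""
    interval_seconds: int
    timeout_seconds: int
    unhealthy_threshold: int = 10  # value shown in the question
    healthy_threshold: int = 1     # value shown in the question


def derive_health_check(probe: Optional[ReadinessProbe] = None) -> LoadBalancerHealthCheck:
    """Mirror the rule described above: interval = 60 + periodSeconds."""
    if probe is None:
        # No readinessProbe: the defaults are applied unchanged.
        return LoadBalancerHealthCheck(DEFAULT_INTERVAL_SECONDS, DEFAULT_TIMEOUT_SECONDS)
    return LoadBalancerHealthCheck(
        interval_seconds=DEFAULT_INTERVAL_SECONDS + probe.period_seconds,
        # Assumption: the timeout is taken from the probe (1 second in the question).
        timeout_seconds=probe.timeout_seconds,
    )


if __name__ == "__main__":
    print(derive_health_check(ReadinessProbe(period_seconds=10, timeout_seconds=1)))
    print(derive_health_check())
```

With the probe from the question (period=10s, timeout=1s) this yields the 70-second interval and 1-second timeout, and without a probe it yields the 60/60 defaults.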

Note that GKE only uses the readinessProbe to set the health check in the load balancer. The livenessProbe is never used.

This means that the lowest possible interval is always that of the default load balancer health check, 60 seconds. Since these values are set automatically when the load balancer is created from GKE, there is no way to change them.

Wrapping up, you have to wait for the load balancer provisioning period (around 1-3 minutes) plus the periodSeconds value set in your readinessProbe.
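
As a rough back-of-the-envelope sketch of that estimate, assuming the 1-3 minute provisioning range quoted above (the function name and its parameters are hypothetical, and real timings will vary):

```python
from typing import Tuple


def estimate_wait_seconds(
    period_seconds: int,
    provisioning_range: Tuple[int, int] = (60, 180),  # "around 1-3 minutes", per this answer
) -> Tuple[int, int]:
    """Return a (low, high) estimate: provisioning time plus periodSeconds."""
    low, high = provisioning_range
    return low + period_seconds, high + period_seconds


if __name__ == "__main__":
    low, high = estimate_wait_seconds(period_seconds=10)
    print(f"Expect roughly {low}-{high} seconds before the backend is healthy")
```

For the readinessProbe in the question (periodSeconds of 10), this gives a range of roughly 70 to 190 seconds.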

-- yyyyahir
Source: StackOverflow