Given:
When calling the pod from within the cluster, we get a 200 response code
When calling the service from within the cluster, we get a 200 response code
The ingress shows the following annotation:
ingress.kubernetes.io/backends: '{"k8s-be-30606--559b9972f521fd4f":"UNHEALTHY"}'
To top things off, we have a different Kubernetes cluster with the exact same configuration (apart from the namespace, dev vs. qa, and the timestamps, assigned IPs and ports) where everything is working properly.
We've already tried removing the ingress, deleting pods, scaling pods up, and explicitly defining the readiness probe, all without any change in the result.
Judging from the above, it's the health check on the pod that's failing for some reason, even though performing it manually (a curl from within the cluster to a node's internal IP plus the node port from the service) returns 200, and in qa it works fine with the same container image.
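For reference, the manual check described above looks roughly like this; the node IP is a placeholder, the node port is the one from the backend annotation, and / stands for whatever path the readiness probe serves:
# Reproduce the health check by hand from a machine inside the cluster/VPC.
# 10.132.0.5 is a placeholder for a node's internal IP; 30606 is the NodePort
# taken from the backend name above.
curl -v http://10.132.0.5:30606/
# Expected: HTTP/1.1 200 OK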
Is there any log available in Stackdriver Logging (or elsewhere) where we can see exactly what request is being made by that health check and what the exact response code is (or whether it timed out for some reason)?
Is there any way to get more visibility into what's happening in the Google-managed processes?
We use the default GKE ingress controller.
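What can be inspected, at least, are the backend service and health check that the ingress controller creates on the GCP side; they don't log individual requests, but they do show the exact path, port, interval and timeout being probed, plus the per-instance health state. A sketch, assuming the usual GKE naming where the health check carries the same name as the backend from the annotation above (on older clusters it may be a legacy http-health-check instead):
# List the backend services and health checks created by the GKE ingress controller.
gcloud compute backend-services list --filter="name~k8s-be"
gcloud compute health-checks list --filter="name~k8s-be"
# Per-instance health state of the backend reported as UNHEALTHY in the annotation.
gcloud compute backend-services get-health k8s-be-30606--559b9972f521fd4f --global
# Exact request path, port, check interval and timeout used by the health check.
gcloud compute health-checks describe k8s-be-30606--559b9972f521fd4f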
Some additional info: When comparing with an entirely different application, I see tons of requests like these:
10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.8 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.12 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
10.129.128.10 - - [31/May/2018:11:06:51 +0000] "GET / HTTP/1.1" 200 1049 "-" "GoogleHC/1.0"
Which I assume are the health checks. I don't see any similar logs for the failing application, nor for the working version in qa. So I imagine the health checks are ending up somewhere else entirely, and in qa it's by chance something that also returns 200. So the question remains: where can I see the actual requests performed by a health check?
Also, for this particular application I see about 8 health checks per second for that single pod, which seems a bit much to me (the configured interval is 60 seconds). Is it possible that health checks for other applications are ending up in this one?
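A sanity check worth doing here (service name and label below are placeholders): confirm that the node port encoded in the backend name (30606) is really the NodePort of the service behind the ingress, and check which path and port the readiness probe uses, since the GKE ingress controller derives the load balancer health check from that probe:
# The NodePort of the service behind the ingress should match the port in the
# backend name (30606 here); "my-service" is a placeholder.
kubectl get svc my-service -n dev -o jsonpath='{.spec.ports[*].nodePort}'
# The readiness probe path/port is what the ingress controller turns into the
# load balancer health check; "app=my-app" is a placeholder label.
kubectl get pods -n dev -l app=my-app \
  -o jsonpath='{.items[0].spec.containers[0].readinessProbe}'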
Unfortunately there are no user-facing logs showing the status of health check requests (likely because of the volume of logs this would create).
As to the first question: GKE SHOULD be handling all the firewall rules automatically. If it is not doing so in your case, it is either because of an issue with the node version or a user-specific issue, in which case I suggest filing a bug with Google on the issue tracker.
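For completeness, a sketch of how to check that rule from the command line; the rule name is an assumption based on the usual k8s-fw-l7--<uid> naming, with the suffix taken from the backend annotation in the question:
# Find the firewall rule managed by the ingress controller; it should allow
# Google's health checker ranges 130.211.0.0/22 and 35.191.0.0/16 to reach
# the node ports used by the ingress backends.
gcloud compute firewall-rules list --filter="name~k8s-fw-l7"
# The node port of the unhealthy backend (30606 here) should appear in the
# rule's allowed TCP ports.
gcloud compute firewall-rules describe k8s-fw-l7--559b9972f521fd4f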
GKE is managing a firewall rule. For some reason, new (node) ports used by ingresses were no longer being added automatically to this rule. After adding the new ports manually to this rule in the console, the backend service became healthy.
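The same fix can be applied with gcloud instead of the console; a sketch, assuming the rule name from above, and keeping in mind that --allow replaces the entire allow list, so the ports already present (placeholders here) have to be repeated along with the missing one:
# See which ports the rule currently allows.
gcloud compute firewall-rules describe k8s-fw-l7--559b9972f521fd4f --format="value(allowed)"
# Re-specify the existing ports (31234/31543 are placeholders) plus the missing
# node port 30606; --allow replaces the whole allow list.
gcloud compute firewall-rules update k8s-fw-l7--559b9972f521fd4f \
  --allow tcp:31234,tcp:31543,tcp:30606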
Still need to find out why the ports stopped being added automatically (see the edit below).
In any case, I hope this can help someone else, since we wasted a huge amount of time figuring this out.
Edit:
The error turned out to be an invalid certificate used for TLS termination by an unrelated ingress (unrelated except that it's managed by the same controller). Once that was fixed, the rule was updated automatically again.
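If someone runs into the same symptom, the bad certificate and the controller's failure to sync should be visible without digging through GCP; a sketch with placeholder secret and namespace names:
# Sync errors from the GCE ingress controller show up as events on the ingresses;
# a broken TLS secret on one ingress can block updates for the others.
kubectl get events --all-namespaces --field-selector involvedObject.kind=Ingress
# Inspect the certificate in the TLS secret referenced by the suspect ingress
# ("my-tls-secret" and "dev" are placeholders).
kubectl get secret my-tls-secret -n dev -o jsonpath='{.data.tls\.crt}' \
  | base64 --decode | openssl x509 -noout -subject -dates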