I can't get Ingress to work on GKE, owing to health check failures. I've tried all the debugging steps I can think of, including verifying that healthz checks are passing in the Stackdriver logs. I'd love any advice about how to debug or fix this. Details below!
I have set up a service with type LoadBalancer on GKE. It works great via its external IP:
apiVersion: v1
kind: Service
metadata:
  name: echoserver
  namespace: es
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
  selector:
    app: echoserver
Then I try setting up an Ingress on top of this same service:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: echoserver-ingress
  namespace: es
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "echoserver-global-ip"
spec:
  backend:
    serviceName: echoserver
    servicePort: 80
The Ingress gets created, but it thinks the backend nodes are unhealthy:
$ kubectl --namespace es describe ingress echoserver-ingress | grep backends
backends: {"k8s-be-31102--<snipped>":"UNHEALTHY"}
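Beyond `kubectl describe`, GCP itself can report per-instance health for the backend service the ingress controller created. A sketch of how to query it (the backend service name below is hypothetical; substitute the `k8s-be-…` name from the describe output above):

```shell
# List the backend services the GCE ingress controller manages
gcloud compute backend-services list --filter="name~k8s-be"

# Ask GCP directly which instances are unhealthy and on what port
# (name is hypothetical; use the one from the listing above)
gcloud compute backend-services get-health k8s-be-31102--example --global
```

The `get-health` output shows each instance's health state, which helps distinguish "probe never arrives" (firewall) from "probe arrives but fails" (wrong path/port/timeout).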
Inspecting the state of the Ingress backend in the GKE web console, I see the same thing:
The health check details appear as expected:
... and from within a pod in my cluster I can call the service successfully:
# curl -vvv echoserver 2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK
# curl -vvv echoserver/healthz 2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK
And I can address the service by NodePort:
# curl -vvv 10.0.1.1:31102 2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK
(Which goes without saying, because the Load Balancer service I set up in step 1 resulted in a web site that's working just fine.)
I also see healthz checks passing in Stackdriver logs:
Regarding quotas, I checked and I'm only using 3 of 30 backend services:
$ gcloud compute project-info describe | grep -A 1 -B 1 BACKEND_SERVICES
- limit: 30.0
metric: BACKEND_SERVICES
usage: 3.0
You have configured the health check timeout to be 1 second. Perhaps increasing it to 5 seconds will solve the issue.
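If you want to try this, the health check the GCE ingress controller creates can be adjusted with gcloud. A hedged sketch (the health check name is hypothetical; list yours first, and note the ingress controller may later reconcile manual changes away):

```shell
# Find the legacy HTTP health check created for the ingress backend
gcloud compute http-health-checks list --filter="name~k8s-be"

# Raise the timeout to 5 seconds; the check interval must be >= the timeout
gcloud compute http-health-checks update k8s-be-31102--example \
    --timeout=5 --check-interval=10
```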
I had this issue, and eventually came across https://stackoverflow.com/a/50645953/9276, which made me look at my firewall settings. Sure enough, the last several NodePort services I had added were not enabled in the firewall rule, and therefore the health checks from the ingress pointing at them all failed. Manually adding the new host ports to the firewall rule fixed this issue for me.
However, unlike the linked answer, I do not have an invalid certificate in use. I'm guessing there are other errors or weird states that can cause this behavior, but I have not yet found the reason the rules stopped being automatically managed.
Possibly unrelated, I did not have this problem in our qa environment, just production, so there may be GCP project level settings at play.
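For anyone checking the same thing, this is roughly how I inspected and patched the rule. The rule name and port below are hypothetical; substitute the `k8s-fw-…` rule and the NodePort from your own ingress:

```shell
# List the GKE-managed firewall rules; health check probes originate from
# Google's ranges 130.211.0.0/22 and 35.191.0.0/16
gcloud compute firewall-rules list --filter="name~k8s"

# Add the missing NodePort to the rule's allowed list
# (re-specify any ports the rule already allows, since --allow replaces them)
gcloud compute firewall-rules update k8s-fw-example --allow=tcp:31102
```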
Had a similar issue a few weeks ago. What fixed it for me was to add a NodePort in the service description so that the Google Cloud Loadbalancer can probe this NodePort. The config that worked for me:
apiVersion: v1
kind: Service
metadata:
  name: some-service
spec:
  selector:
    name: some-app
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 32000
    protocol: TCP
It might take some time for the Ingress to pick this up. You can re-create the Ingress to speed things up.
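Related to this: the GCE ingress controller derives the load balancer's health check path from the backing pods' readinessProbe, so it is also worth confirming one is defined on the deployment. A minimal sketch of the relevant container fragment, assuming an app that serves /healthz on port 8080 as in the question:

```yaml
# Hypothetical container fragment: the ingress controller copies the
# readinessProbe's HTTP path into the LB health check, so this probe
# should return 200 on the same port the service's targetPort points at.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
```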