Clean Ambassador Edge Stack install on GKE fails with a DNS resolution error

1/4/2022

We are testing out the Ambassador Edge Stack and started with a brand new GKE private cluster in Autopilot mode.

We installed from scratch, following the quick start tour to get a feel for it, and ended up with the following error:

Error from server: error when creating "mapping-test.yaml": conversion webhook for getambassador.io/v3alpha1, Kind=Mapping failed: Post "https://emissary-apiext.emissary-system.svc:443/webhooks/crd-convert?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We did a few rounds of DNS testing and deployed test pods in several namespaces to validate that kube-dns is working properly. Everything looks good on that end, and resolv.conf looks good as well.

Ambassador is using the hostname emissary-apiext.emissary-system.svc:443 (without cluster.local), which should resolve fine. A lookup with the FQDN (with cluster.local) works fine, by the way.
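
For anyone wondering why the short name should resolve from inside a pod, here is a rough Python sketch of how a glibc-style resolver expands a name through the resolv.conf search list. The search domains and the ndots value below are assumptions (typical for a pod in the default namespace), not copied from our cluster, and candidate_names is just a helper name made up for the illustration.

```python
def candidate_names(name, search, ndots=5):
    """Return the names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, then try the name as given.
    return expanded + [name]


# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
names = candidate_names("emissary-apiext.emissary-system.svc", search)
for n in names:
    print(n)
```

With these assumed values, the third candidate is the cluster.local form of the name, which is the one that resolves in our lookups.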

Any clues?

Thanks a lot and take care.

-- Sebastian
ambassador
dns
google-kubernetes-engine
kubernetes
kubernetes-ingress

2 Answers

1/5/2022

That sounds like an issue related to the webhook limitations in GKE Autopilot.

Which version of GKE are you on?

There is also a limitation on which resources and namespaces we allow webhooks to intercept:

Additionally, webhooks which specify one or more of the following resources (and any of their sub-resources) in the rules will be rejected:

  • group: "" resource: nodes
  • group: "" resource: persistentvolumes
  • group: certificates.k8s.io resource: certificatesigningrequests
  • group: authentication.k8s.io resource: tokenreviews

You probably have to check the manifests of Ambassador Edge Stack to figure this out.

-- boredabdel
Source: StackOverflow

1/6/2022

I think I found the solution. Posting it here in case someone comes across this later on.

So I followed this to deploy Ambassador Edge Stack in an Autopilot private cluster. I was getting the same error when I tried to deploy the Mapping object (step 2.2).

The issue is that the control plane (API server) is trying to call emissary-apiext.emissary-system.svc:443, but the pods behind it are listening on port 8443 (I figured that out by describing the Service).

So I added a firewall rule to allow the GKE control plane to talk to the nodes on port 8443.

The firewall rule in question is called gke-gke-ap-xxxxx-master. The xxxxx part is the cluster hash and is different for each cluster. To make sure you are editing the proper rule, double-check that its source IP range matches the "Control plane address range" on the cluster details page, and that its name ends with master.

Just edit that rule and add 8443 to the TCP ports. It should work.
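
If you prefer the command line over the console, something like the sketch below could work. It only builds the gcloud command and does not run it. Keep in mind that --allow on gcloud compute firewall-rules update replaces the whole list, so the ports already on the rule have to be repeated. The existing ports used here (10250 and 443) are an assumption, so check your own rule first with gcloud compute firewall-rules describe. The rule name keeps the xxxxx placeholder, and build_update_command is a made-up helper name.

```python
def build_update_command(rule_name, existing_ports, extra_port):
    """Build a gcloud argv that keeps the existing TCP ports and adds one more."""
    ports = list(existing_ports)
    if extra_port not in ports:
        ports.append(extra_port)
    # --allow overrides the current list, so every port is restated.
    allow = ",".join(f"tcp:{p}" for p in ports)
    return ["gcloud", "compute", "firewall-rules", "update",
            rule_name, "--allow", allow]


# Placeholder rule name from above; the existing ports are an assumption.
cmd = build_update_command("gke-gke-ap-xxxxx-master", [10250, 443], 8443)
print(" ".join(cmd))
```

Replace the placeholder name and the port list with the values from your own cluster before running the printed command.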

-- boredabdel
Source: StackOverflow