Jetstack cert-manager and GKE private cluster (failed to verify ACME account)

11/1/2019

I have installed the Jetstack cert-manager within my private GKE cluster. That all went well, but I can't get a certificate issued. The error that I get is:

E1101 03:45:15.754642       1 sync.go:184] cert-manager/controller/challenges "msg"="propagation check failed" "error"="wrong status code '404', expected '200'" "dnsName"="[snip]" "resource_kind"="Challenge" "resource_name"="[snip]-certificate-2096248848-189663135-2951658629" "resource_namespace"="default" "type"="http-01" 
I1101 03:45:15.755017       1 controller.go:135] cert-manager/controller/challenges "level"=0 "msg"="finished processing work item" "key"="default/[snip]-certificate-2096248848-189663135-2951658629" 
I1101 03:45:25.755400       1 controller.go:129] cert-manager/controller/challenges "level"=0 "msg"="syncing item" "key"="default/[snip]-certificate-2096248848-189663135-2951658629" 
I1101 03:45:25.755810       1 pod.go:58] cert-manager/controller/challenges/http01/selfCheck/http01/ensurePod "level"=0 "msg"="found one existing HTTP01 solver pod" "dnsName"="[snip]" "related_resource_kind"="Pod" "related_resource_name"="cm-acme-http-solver-b6k59" "related_resource_namespace"="default" "resource_kind"="Challenge" "resource_name"="[snip]-certificate-2096248848-189663135-2951658629" "resource_namespace"="default" "type"="http-01" 
I1101 03:45:25.755897       1 service.go:43] cert-manager/controller/challenges/http01/selfCheck/http01/ensureService "level"=0 "msg"="found one existing HTTP01 solver Service for challenge resource" "dnsName"="[snip]" "related_resource_kind"="Service" "related_resource_name"="cm-acme-http-solver-qsvbv" "related_resource_namespace"="default" "resource_kind"="Challenge" "resource_name"="[snip]-certificate-2096248848-189663135-2951658629" "resource_namespace"="default" "type"="http-01" 
I1101 03:45:25.755960       1 ingress.go:91] cert-manager/controller/challenges/http01/selfCheck/http01/ensureIngress "level"=0 "msg"="found one existing HTTP01 solver ingress" "dnsName"="[snip]" "related_resource_kind"="Ingress" "related_resource_name"="cm-acme-http-solver-br7d2" "related_resource_namespace"="default" "resource_kind"="Challenge" "resource_name"="[snip]-certificate-2096248848-189663135-2951658629" "resource_namespace"="default" "type"="http-01" 

This corresponds with an error event in the ClusterIssuer that I deployed:

Warning ErrVerifyACMEAccount 27m (x4 over 28m) cert-manager Failed to verify ACME account: Get https://acme-v02.api.letsencrypt.org/directory: dial tcp: i/o timeout

Because of this my CertificateRequest and Certificate resources perpetually stay in a "pending" state.

This is happening during initial cluster creation. My configuration for the certificate manager & ingress is as follows:

apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: letsencrypt-uat
spec:
  acme:
    email: cert-manager+uat@[snip]
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-uat-private-key
    solvers:
    - http01:
        ingress:
          class: nginx
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: [snip]-uat-certificate
spec:
  secretName: [snip]-uat-tls-cert
  duration: 2160h
  renewBefore: 360h
  commonName: [snip]
  dnsNames:
  - [snip]
  issuerRef:
    name: letsencrypt-uat
    kind: ClusterIssuer
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: [snip]-uat-tls-ingress
  namespace: default
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: letsencrypt-uat
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/affinity: "cookie"
spec:
  rules:
  - host: [snip]
    http:
      paths:
      - backend:
          serviceName: [snip]-uat-webapp-service
          servicePort: 80
  tls:
  - hosts:
    - [snip]
    secretName: [snip]-uat-tls-cert

I am on a GKE private cluster and have therefore also been unable to run the webhook component. The documentation seems to imply that it's OK, but not recommended, to run this way.

I also note that the documentation mentions the need to add a firewall rule to allow the webhook to work, and I wonder whether that is relevant here. The error above seems to indicate some kind of networking (firewall?) issue.

Environment details: GKE (1.14.7-gke.10), Kubernetes (v1.16.2, I think), cert-manager (0.11.0)

Installed with kubectl

Do I need to configure a firewall rule, perhaps?

Many thanks, Ben

Edit 1:

The "dial tcp: i/o timeout" is a red herring. That error persists only for as long as DNS takes to initialise within my cluster. I am also coming closer to the conclusion that the propagation error is simply Let's Encrypt's DNS not (yet) seeing my domain associated with my IP address.

Is it correct that I use an A record here? I made the DNS update around an hour ago. Is there any way that I can see what Let's Encrypt's DNS sees?
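
An A record is the usual choice for pointing a hostname at an IPv4 address. As a rough first check of whether the record has propagated, the sketch below asks the local resolver for the hostname's IPv4 addresses and compares them with the expected ingress IP. This is only an approximation: it shows what the machine running it sees, which may be cached and may differ from what Let's Encrypt's own resolvers see. The hostname and IP address in the usage note are hypothetical placeholders.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def points_at(addresses, expected_ip):
    """True only if expected_ip is the single address that was returned."""
    return list(addresses) == [expected_ip]
```

For example, `points_at(resolve_ipv4("uat.example.com"), "203.0.113.10")` would return `True` once the local resolver sees exactly that one A record. A stale or extra record makes it return `False`.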

-- benjimix
cert-manager
google-kubernetes-engine

1 Answer

11/3/2019

OK, thanks both for your help. It turns out that this had nothing to do with cert-manager. I had two issues in play here:

  1. There was a GCP networking issue at the time I was doing this (this just caused confusion);
  2. My application was not responding correctly to the HTTP challenge.
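
For the second issue, a quick way to see what the outside world gets back is to request the challenge URL by hand. The sketch below loosely mirrors the kind of check behind the "wrong status code '404', expected '200'" log line: it fetches the standard ACME HTTP-01 path and expects a 200 response whose body is the key authorization. It is not cert-manager's actual implementation. The function names, the `opener` parameter, and the domain, token and key values are all hypothetical.

```python
import urllib.error
import urllib.request


def challenge_url(domain, token):
    """Build the standard ACME HTTP-01 challenge URL for a domain and token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def check_challenge(domain, token, expected_key,
                    opener=urllib.request.urlopen, timeout=10):
    """Fetch the challenge URL and return (ok, message).

    `opener` is injectable so the check can be exercised without a network.
    """
    url = challenge_url(domain, token)
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace").strip()
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as the 404 above.
        return False, f"wrong status code '{err.code}', expected '200'"
    except urllib.error.URLError as err:
        return False, f"request failed: {err.reason}"
    if status != 200:
        return False, f"wrong status code '{status}', expected '200'"
    if body != expected_key:
        return False, "response body did not match the expected key"
    return True, "ok"
```

If the application or a redirect rule handles `/.well-known/acme-challenge/` before the solver does, a check like this typically fails with a 404 or an unexpected body, even when DNS is correct.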

However, in the end, for other reasons, I decided to use the DNS solver. This worked just fine.

Thanks again!

-- benjimix
Source: StackOverflow