Converting from self-signed to commercial cert: TLS errors

5/7/2019

When I installed our cluster, I used a self-signed cert from our internal CA. Everything was fine until I started getting cert errors from applications I was deploying to the OKD cluster. Rather than trying to fix those errors one at a time for all time, we decided to simply purchase a commercial cert and install that. So we bought a SAN cert with wildcards (identical to the one we got from our internal CA originally) from GlobalSign, and I'm running into huge problems installing it.

Keep in mind, I have tried dozens of iterations here; I'm just documenting the last one in an attempt to figure out what the hell the problem is. This is on my test cluster, which is a VM server, and I revert to a snapshot after every attempt. The snapshot is the operational cluster using the internal CA certs.

So, my first step was to build the CA file to be passed in. I downloaded the root and intermediate certs for GlobalSign and put them in the ca-globalsign.crt file (PEM formatted).
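Concretely, the bundle is just the two PEM files concatenated into one trust file; something like this, with illustrative filenames for the downloaded GlobalSign root and intermediate:

cat globalsign-root-r1.pem globalsign-org-validation-sha256-g2.pem > ca-globalsign.crt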

When I run

openssl verify -CAfile ../ca-globalsign.crt labtest.mycompany.com.pem

I get:

labtest.mycompany.com.pem: OK

and, on my local machine,

openssl x509 -in labtest.mycompany.com.pem -text -noout

gives me (redacted):

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            (redacted)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=BE, O=GlobalSign nv-sa, CN=GlobalSign Organization Validation CA - SHA256 - G2
        Validity
            Not Before: Apr 29 16:11:07 2019 GMT
            Not After : Apr 29 16:11:07 2020 GMT
        Subject: C=US, ST=(redacted), L=(redacted), OU=Information Technology, O=(redacted), CN=labtest.mycompany.com
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    (redacted)
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            Authority Information Access:
                CA Issuers - URI:http://secure.globalsign.com/cacert/gsorganizationvalsha2g2r1.crt
                OCSP - URI:http://ocsp2.globalsign.com/gsorganizationvalsha2g2

            X509v3 Certificate Policies:
                Policy: 1.3.6.1.4.1.4146.1.20
                  CPS: https://www.globalsign.com/repository/
                Policy: 2.23.140.1.2.2

            X509v3 Basic Constraints:
                CA:FALSE
            X509v3 Subject Alternative Name:
                DNS:labtest.mycompany.com, DNS:*.labtest.mycompany.com, DNS:*.apps.labtest.mycompany.com
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Subject Key Identifier:
                (redacted)
            X509v3 Authority Key Identifier:
                (redacted)

            (redacted)

Everything I know about SSL says the cert is fine. These new files go into the project I use to hold the configs and such for my OKD install.
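A quick extra sanity check is that the private key's modulus matches the cert's (the .key filename here is an assumption; use whatever key the CSR was generated with):

openssl x509 -noout -modulus -in labtest.mycompany.com.pem | openssl md5
openssl rsa -noout -modulus -in labtest.mycompany.com.key | openssl md5

The two hashes should be identical.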

Then I updated the cert files in my ansible inventory project and ran the command

ansible-playbook -i ../okd_install/inventory/okd_labtest_inventory.yml playbooks/redeploy-certificates.yml

When I read the docs, everything tells me it should simply roll through its process and come up with the new certs. This doesn't happen. When I use openshift_master_overwrite_named_certificates: false in my inventory file, the install completes, but it only replaces the cert on the *.apps.labtest domain; console.labtest keeps the original cert. The cluster does come online, other than the fact that monitoring says "bad gateway" in the cluster console.
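An easy way to confirm which cert each endpoint is actually serving (8443 is the default master API/console port in OKD 3.x; adjust if yours differs):

openssl s_client -connect console.labtest.mycompany.com:8443 \
  -servername console.labtest.mycompany.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer

Running the same against a route on *.apps.labtest.mycompany.com:443 lets you compare the two.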

Now, if I try to run the command again using openshift_master_overwrite_named_certificates: true, my /var/log/containers/master-api*.log is flooded with errors like this:

{"log":"I0507 15:53:28.451851       1 logs.go:49] http: TLS handshake error from 10.128.0.56:46796: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.451894391Z"}
{"log":"I0507 15:53:28.455218       1 logs.go:49] http: TLS handshake error from 10.128.0.56:46798: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.455272658Z"}
{"log":"I0507 15:53:28.458742       1 logs.go:49] http: TLS handshake error from 10.128.0.56:46800: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.461070768Z"}
{"log":"I0507 15:53:28.462093       1 logs.go:49] http: TLS handshake error from 10.128.0.56:46802: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.463719816Z"}

and these:

{"log":"I0507 15:53:29.355463       1 logs.go:49] http: TLS handshake error from 10.70.25.131:44424: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.357218793Z"}
{"log":"I0507 15:53:29.357961       1 logs.go:49] http: TLS handshake error from 10.70.25.132:43128: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.358779155Z"}
{"log":"I0507 15:53:29.357993       1 logs.go:49] http: TLS handshake error from 10.70.25.132:43126: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.358790397Z"}
{"log":"I0507 15:53:29.405532       1 logs.go:49] http: TLS handshake error from 10.70.25.131:44428: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.406873158Z"}
{"log":"I0507 15:53:29.527221       1 logs.go:49] http: TLS handshake error from 10.70.25.132:43130: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53

and the install hangs on the ansible task TASK [Remove web console pods]. It will sit there for hours. When I go into the master's console and run oc get pods on the openshift-web-console project, the pod is stuck in Terminating. When I describe the replacement pod that is stuck in Pending, it comes back saying the hard disk is full; I'm assuming that's because it can't communicate with the storage system because of all those TLS errors above. It just stays there. I can bring the cluster back up if I force delete the terminating pod, reboot the master, delete the new pod that is attempting to start, and reboot a second time. Then the web console comes online, but all my log files are flooded with those TLS errors. More concerning, the install hangs at that spot, so I'm assuming there are additional steps after bringing the web console online that would cause me problems as well. The manual recovery sequence is sketched below.
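For reference, the manual recovery looks roughly like this (the pod name is a placeholder):

oc get pods -n openshift-web-console
oc delete pod webconsole-<id> -n openshift-web-console --grace-period=0 --force
# reboot the master, then delete the newly created Pending pod and reboot again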

So, I have also attempted to redeploy the server CA. That yielded problems because my new cert isn't a CA cert. When I instead ran just the redeploy-CA playbook, to have the cluster recreate the server CAs, it finished fine, but when I then ran redeploy-certificates.yml, I got the same results. The CA redeploy command is shown below.
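For the record, the CA redeploy run was along these lines (playbook path as laid out in the openshift-ansible 3.x repo; adjust to your checkout):

ansible-playbook -i ../okd_install/inventory/okd_labtest_inventory.yml playbooks/openshift-master/redeploy-openshift-ca.yml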

Here is my inventory file:

all:
  children:
    etcd:
      hosts:
        okdmastertest.labtest.mycompany.com:
    masters:
      hosts:
        okdmastertest.labtest.mycompany.com:
    nodes:
      hosts:
        okdmastertest.labtest.mycompany.com:
          openshift_node_group_name: node-config-master-infra
        okdnodetest1.labtest.mycompany.com:
          openshift_node_group_name: node-config-compute
          openshift_schedulable: True
    OSEv3:
      children:
        etcd:
        masters:
        nodes:
        # https://docs.okd.io/latest/install_config/persistent_storage/persistent_storage_glusterfs.html#overview-containerized-glusterfs
        # https://github.com/openshift/openshift-ansible/tree/master/playbooks/openshift-glusterfs
        # glusterfs:
      vars:
        openshift_deployment_type: origin
        ansible_user: root

        openshift_master_cluster_method: native
        openshift_master_default_subdomain: apps.labtest.mycompany.com
        openshift_install_examples: true

        openshift_master_cluster_hostname: console.labtest.mycompany.com
        openshift_master_cluster_public_hostname: console.labtest.mycompany.com
        openshift_hosted_registry_routehost: registry.apps.labtest.mycompany.com

        openshift_certificate_expiry_warning_days: 30
        openshift_certificate_expiry_fail_on_warn: false
        openshift_master_overwrite_named_certificates: true
        openshift_hosted_registry_routetermination: reencrypt

        openshift_master_named_certificates:
          - certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
            keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
            cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
            names:
              - "console.labtest.mycompany.com"
              # - "labtest.mycompany.com"
              # - "*.labtest.mycompany.com"
              # - "*.apps.labtest.mycompany.com"
        openshift_hosted_router_certificate:
          certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
          keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
          cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
        openshift_hosted_registry_routecertificates:
          certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
          keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
          cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"

        # LDAP auth
        openshift_master_identity_providers:
        - name: 'mycompany_ldap_provider'
          challenge: true
          login: true
          kind: LDAPPasswordIdentityProvider
          attributes:
            id:
            - dn
            email:
            - mail
            name:
            - cn
            preferredUsername:
            - sAMAccountName
          bindDN: 'ldapbind@int.mycompany.com'
          bindPassword: (redacted) 
          insecure: true
          url: 'ldap://dc-pa1.int.mycompany.com/ou=mycompany,dc=int,dc=mycompany,dc=com'

What am I missing here? I thought this redeploy-certificates.yml playbook was designed to update the certificates. Why can't I get it to switch to my new commercial cert? It's almost like it's replacing the certs on the router (kinda), but in the process it's screwing up the internal server cert. I'm really at my wits' end here; I don't know what else to try.

-- scphantm
kubernetes
openshift
openshift-origin

1 Answer

5/8/2019

You should configure openshift_master_cluster_hostname and openshift_master_cluster_public_hostname as two different hostnames. Both hostnames must also be resolvable in DNS. Your commercial certificates are used for the external access point.

The openshift_master_cluster_public_hostname and openshift_master_cluster_hostname parameters in the Ansible inventory file, by default /etc/ansible/hosts, must be different. 
If they are the same, the named certificates will fail and you will need to re-install them.

# Native HA with External LB VIPs
openshift_master_cluster_hostname=internal.paas.example.com
openshift_master_cluster_public_hostname=external.paas.example.com
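Adapted to your YAML inventory, that would look something like this (the internal hostname is illustrative; both names must resolve in DNS):

openshift_master_cluster_hostname: console-internal.labtest.mycompany.com
openshift_master_cluster_public_hostname: console.labtest.mycompany.com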

It is also better to configure certificates for each component step by step, verifying as you go. For example, first do Configuring a Custom Master Host Certificate, and verify it; then Configuring a Custom Wildcard Certificate for the Default Router, and verify that; and so on. Once every certificate-redeploy task succeeds on its own, you can finally run with the complete set of parameters for your commercial certificate maintenance. The per-component runs are sketched below.
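The corresponding per-component playbooks in openshift-ansible 3.x are roughly these (paths can vary between releases, so check your checkout):

ansible-playbook -i <inventory> playbooks/openshift-master/redeploy-certificates.yml
ansible-playbook -i <inventory> playbooks/openshift-hosted/redeploy-router-certificates.yml
ansible-playbook -i <inventory> playbooks/openshift-hosted/redeploy-registry-certificates.yml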

Refer to Configuring Custom Certificates for more details. I hope it helps.

-- Daein Park
Source: StackOverflow