When I installed our cluster, I used a self-signed cert from our internal CA. Everything was fine until I started getting cert errors from applications I was deploying to the OKD cluster. Rather than fix those errors one at a time forever, we decided to simply purchase a commercial cert and install that. So we bought a SAN cert with wildcards (identical to the one we originally got from our internal CA) from GlobalSign, and I'm having huge problems installing it.
Keep in mind, I have tried dozens of iterations here. I'm only documenting the last one I tried, in an attempt to figure out what the hell the problem is. This is on my test cluster, which runs on a VM server, and I revert to a snapshot after every attempt. The snapshot is the operational cluster using the internal CA certs.
So, my first step was to build the CA file to pass in. I downloaded the root and intermediate certs for GlobalSign and put them in the ca-globalsign.crt file (PEM formatted).
When I run
openssl verify -CAfile ../ca-globalsign.crt labtest.mycompany.com.pem
I get:
labtest.mycompany.com.pem: OK
and openssl x509 -in labtest.mycompany.com.pem -text -noout gives me (redacted):
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
(redacted)
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=BE, O=GlobalSign nv-sa, CN=GlobalSign Organization Validation CA - SHA256 - G2
Validity
Not Before: Apr 29 16:11:07 2019 GMT
Not After : Apr 29 16:11:07 2020 GMT
Subject: C=US, ST=(redacted), L=(redacted), OU=Information Technology, O=(redacted), CN=labtest.mycompany.com
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
(redacted)
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
Authority Information Access:
CA Issuers - URI:http://secure.globalsign.com/cacert/gsorganizationvalsha2g2r1.crt
OCSP - URI:http://ocsp2.globalsign.com/gsorganizationvalsha2g2
X509v3 Certificate Policies:
Policy: 1.3.6.1.4.1.4146.1.20
CPS: https://www.globalsign.com/repository/
Policy: 2.23.140.1.2.2
X509v3 Basic Constraints:
CA:FALSE
X509v3 Subject Alternative Name:
DNS:labtest.mycompany.com, DNS:*.labtest.mycompany.com, DNS:*.apps.labtest.mycompany.com
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Subject Key Identifier:
(redacted)
X509v3 Authority Key Identifier:
(redacted)
(redacted)
All of this was run on my local machine. Everything I know about SSL says the cert is fine. The new files go in the project I use to hold the configs and such for my OKD install.
Then I updated the cert files in my Ansible inventory project and ran the command
ansible-playbook -i ../okd_install/inventory/okd_labtest_inventory.yml playbooks/redeploy-certificates.yml
When I read the docs, everything tells me it should simply roll through its process and come up with the new certs. That doesn't happen. When I use openshift_master_overwrite_named_certificates: false in my inventory file, the install completes, but it only replaces the cert on the *.apps.labtest domain. The console.labtest domain keeps the original cert. It does come online, except that monitoring reports "bad gateway" in the cluster console.
Now, if I run the command again using openshift_master_overwrite_named_certificates: true, my /var/log/containers/master-api*.log is flooded with errors like this:
{"log":"I0507 15:53:28.451851 1 logs.go:49] http: TLS handshake error from 10.128.0.56:46796: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.451894391Z"}
{"log":"I0507 15:53:28.455218 1 logs.go:49] http: TLS handshake error from 10.128.0.56:46798: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.455272658Z"}
{"log":"I0507 15:53:28.458742 1 logs.go:49] http: TLS handshake error from 10.128.0.56:46800: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.461070768Z"}
{"log":"I0507 15:53:28.462093 1 logs.go:49] http: TLS handshake error from 10.128.0.56:46802: EOF\n","stream":"stderr","time":"2019-05-07T19:53:28.463719816Z"}
and these:
{"log":"I0507 15:53:29.355463 1 logs.go:49] http: TLS handshake error from 10.70.25.131:44424: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.357218793Z"}
{"log":"I0507 15:53:29.357961 1 logs.go:49] http: TLS handshake error from 10.70.25.132:43128: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.358779155Z"}
{"log":"I0507 15:53:29.357993 1 logs.go:49] http: TLS handshake error from 10.70.25.132:43126: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.358790397Z"}
{"log":"I0507 15:53:29.405532 1 logs.go:49] http: TLS handshake error from 10.70.25.131:44428: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53:29.406873158Z"}
{"log":"I0507 15:53:29.527221 1 logs.go:49] http: TLS handshake error from 10.70.25.132:43130: remote error: tls: bad certificate\n","stream":"stderr","time":"2019-05-07T19:53
and the install hangs on the Ansible task TASK [Remove web console pods]. It will sit there for hours. When I go into the master's console and run oc get pods against openshift-web-console, the pod is in the Terminating state. When I describe the pod that is trying to start (in the Pending state), it comes back saying the hard disk is full. I'm assuming that's because it can't communicate with the storage system because of all those TLS errors above. It just stays there. I can bring the cluster back up if I force delete the terminating pod, reboot the master, delete the new pod that is attempting to start, and then reboot a second time. Then the web console comes online, but all my log files are flooded with those TLS errors. The more concerning thing is that the install hangs at that spot, so I'm assuming there are additional steps after bringing the web console online that will cause me problems as well.
So, I have also attempted to redeploy the server CA. That caused problems because my new cert isn't a CA cert. Then, when I ran just the redeploy CA playbook to have the cluster recreate the server CAs, it finished fine, but when I then ran redeploy-certificates.yml, I got the same results.
Here is my inventory file:
all:
  children:
    etcd:
      hosts:
        okdmastertest.labtest.mycompany.com:
    masters:
      hosts:
        okdmastertest.labtest.mycompany.com:
    nodes:
      hosts:
        okdmastertest.labtest.mycompany.com:
          openshift_node_group_name: node-config-master-infra
        okdnodetest1.labtest.mycompany.com:
          openshift_node_group_name: node-config-compute
          openshift_schedulable: True
    OSEv3:
      children:
        etcd:
        masters:
        nodes:
        # https://docs.okd.io/latest/install_config/persistent_storage/persistent_storage_glusterfs.html#overview-containerized-glusterfs
        # https://github.com/openshift/openshift-ansible/tree/master/playbooks/openshift-glusterfs
        # glusterfs:
      vars:
        openshift_deployment_type: origin
        ansible_user: root
        openshift_master_cluster_method: native
        openshift_master_default_subdomain: apps.labtest.mycompany.com
        openshift_install_examples: true
        openshift_master_cluster_hostname: console.labtest.mycompany.com
        openshift_master_cluster_public_hostname: console.labtest.mycompany.com
        openshift_hosted_registry_routehost: registry.apps.labtest.mycompany.com
        openshift_certificate_expiry_warning_days: 30
        openshift_certificate_expiry_fail_on_warn: false
        openshift_master_overwrite_named_certificates: true
        openshift_hosted_registry_routetermination: reencrypt
        openshift_master_named_certificates:
          - certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
            keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
            cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
            names:
              - "console.labtest.mycompany.com"
              # - "labtest.mycompany.com"
              # - "*.labtest.mycompany.com"
              # - "*.apps.labtest.mycompany.com"
        openshift_hosted_router_certificate:
          certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
          keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
          cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
        openshift_hosted_registry_routecertificates:
          certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
          keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
          cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
        # LDAP auth
        openshift_master_identity_providers:
          - name: 'mycompany_ldap_provider'
            challenge: true
            login: true
            kind: LDAPPasswordIdentityProvider
            attributes:
              id:
                - dn
              email:
                - mail
              name:
                - cn
              preferredUsername:
                - sAMAccountName
            bindDN: 'ldapbind@int.mycompany.com'
            bindPassword: (redacted)
            insecure: true
            url: 'ldap://dc-pa1.int.mycompany.com/ou=mycompany,dc=int,dc=mycompany,dc=com'
What am I missing here? I thought the redeploy-certificates.yml playbook was designed to update the certificates. Why can't I get this to switch to my new commercial cert? It's almost like it's replacing the certs on the router (kind of), but screwing up the internal server cert in the process. I'm really at my wits' end here; I don't know what else to try.
You should configure openshift_master_cluster_hostname and openshift_master_cluster_public_hostname as two different hostnames. Both hostnames must also be resolvable through DNS. Your commercial certificates are used for the external access point.
The openshift_master_cluster_public_hostname and openshift_master_cluster_hostname parameters in the Ansible inventory file, by default /etc/ansible/hosts, must be different.
If they are the same, the named certificates will fail and you will need to re-install them.
# Native HA with External LB VIPs
openshift_master_cluster_hostname=internal.paas.example.com
openshift_master_cluster_public_hostname=external.paas.example.com
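Applied to the YAML inventory in the question, this might look like the sketch below. The internal name is an assumption: with a single master, the master's own hostname (okdmastertest.labtest.mycompany.com, taken from the hosts section of the question) is one plausible choice, but any DNS-resolvable name that differs from the public one should satisfy the rule quoted above.

```yaml
# Sketch only: these keys live under OSEv3 -> vars in the question's inventory.
# Internal API endpoint. ASSUMPTION: reuse the single master's own hostname.
openshift_master_cluster_hostname: okdmastertest.labtest.mycompany.com
# Public endpoint, the one that should present the commercial certificate.
openshift_master_cluster_public_hostname: console.labtest.mycompany.com
openshift_master_named_certificates:
  - certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
    keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
    cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
    names:
      # List only the public hostname here, never the internal one.
      - "console.labtest.mycompany.com"
```

With the two names separated, the commercial certificate is tied only to the public name, while cluster components that talk to the internal name should keep seeing a certificate from the cluster's own CA. That would be consistent with the "bad certificate" handshake errors in the question going away, though I have not verified it against this particular cluster.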
You would also do better to configure the certificates one component at a time for testing. For example, first follow Configuring a Custom Master Host Certificate and verify the result. Then follow Configuring a Custom Wildcard Certificate for the Default Router and verify again, and so on. Once all of the certificate redeployment tasks succeed individually, you can run with the complete set of parameters for ongoing maintenance of your commercial certificates.
Refer to Configuring Custom Certificates for more details. I hope this helps.
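As a rough sketch of a first stage, the inventory could carry only the master named certificate, with the router and registry certificate variables commented out until the master certificate has been verified. The file paths are the ones from the question. The way the stages are split here is an assumption for illustration, not something the docs prescribe.

```yaml
# Stage 1 sketch: only the master named certificate is active.
# These keys live under OSEv3 -> vars.
openshift_master_overwrite_named_certificates: true
openshift_master_named_certificates:
  - certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
    keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
    cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
    names:
      - "console.labtest.mycompany.com"

# Stage 2 (ASSUMPTION: uncomment only after stage 1 has been verified).
# openshift_hosted_router_certificate:
#   certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
#   keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
#   cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"

# Stage 3 (ASSUMPTION: same idea for the registry route).
# openshift_hosted_registry_routecertificates:
#   certfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.pem"
#   keyfile: "/Users/me/code/devops/okd_install/certs/labtest/commercial.04.29.2019.labtest.mycompany.com.key"
#   cafile: "/Users/me/code/devops/okd_install/certs/ca-globalsign.crt"
```

After each stage, one way to check which certificate is actually being served is openssl s_client -connect HOST:PORT -servername HOST, with the host and port of the endpoint you just changed, before moving on to the next component.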