Multi-master OKD 3.11 setup fails if the master-1 node is down

12/11/2020

I am trying to install a multi-master OpenShift 3.11 setup on OpenStack VMs, using the inventory file from the official documentation:

https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha

OKD Version
[centos@master1 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master1.167.254.204.74.nip.io:8443
openshift v3.11.0+ff2bdbd-531
kubernetes v1.11.0+d4cacc0
Steps To Reproduce

Bring up an OKD 3.11 multi-master setup as per the inventory file described here: https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha

Current Result

The setup completes successfully, but I am stuck with two issues:

1. The load balancer node is not listed in the output of the "oc get nodes" command:

[centos@master1 ~]$ oc get nodes
NAME                            STATUS    ROLES          AGE       VERSION
master1.167.254.204.74.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
master2.167.254.204.58.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
master3.167.254.204.59.nip.io   Ready     infra,master   6h        v1.11.0+d4cacc0
node1.167.254.204.82.nip.io     Ready     compute        6h        v1.11.0+d4cacc0
  2. The master nodes and the load balancer are entirely dependent on the master-1 node: if master-1 is down, the remaining master nodes and the load balancer are unable to run any oc commands:
[centos@master2 ~]$ oc get nodes
Unable to connect to the server: dial tcp 167.254.204.74:8443: connect: no route to host
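A quick way to confirm where this dependency comes from (a diagnostic sketch, not part of the original report) is to check which API server the kubeconfig on the other masters points at:

# Diagnostic sketch: the kubeconfig on each master points at the API URL
# baked in at install time; with this inventory that is master1's URL
# (https://master1.167.254.204.74.nip.io:8443), not the load balancer's,
# so every oc client depends on master1 being reachable.
[centos@master2 ~]$ grep server ~/.kube/config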

The OKD setup works fine if any master node other than master-1, or the load balancer, is down.

Expected Result

The OKD setup should remain up and running even if any one of the master nodes goes down.

Inventory file:

[OSEv3:children]
masters
nodes
etcd
lb

[masters]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io

[etcd]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io

[lb]
lb.167.254.204.111.nip.io

[nodes]
master1.167.254.204.74.nip.io openshift_ip=167.254.204.74 openshift_schedulable=true openshift_node_group_name='node-config-master'
master2.167.254.204.58.nip.io openshift_ip=167.254.204.58 openshift_schedulable=true openshift_node_group_name='node-config-master'
master3.167.254.204.59.nip.io openshift_ip=167.254.204.59 openshift_schedulable=true openshift_node_group_name='node-config-master'
node1.167.254.204.82.nip.io openshift_ip=167.254.204.82 openshift_schedulable=true openshift_node_group_name='node-config-compute'

[OSEv3:vars]
debug_level=4
ansible_ssh_user=centos
ansible_become=true
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
openshift_enable_service_catalog=true
ansible_service_broker_install=true

openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]

containerized=false
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

deployment_type=origin
openshift_deployment_type=origin

openshift_release=v3.11.0
openshift_pkg_version=-3.11.0
openshift_image_tag=v3.11.0
openshift_service_catalog_image_version=v3.11.0
template_service_broker_image_version=v3.11
osm_use_cockpit=true

# put the router on dedicated infra1 node
openshift_master_cluster_method=native
openshift_master_default_subdomain=sub.master1.167.254.204.74.nip.io
openshift_public_hostname=master1.167.254.204.74.nip.io
openshift_master_cluster_hostname=master1.167.254.204.74.nip.io

Please let me know why the entire setup depends on the master-1 node, and also any workaround to fix this.

-- Bhavani Prasad
docker
kubernetes
okd
openshift
openshift-origin

1 Answer

12/11/2020

You should set openshift_master_cluster_hostname and openshift_master_cluster_public_hostname to the LB hostname, not a master hostname. With your configuration, where they point at master1, master1 becomes the sole API entry point, so if master1 stops, the entire API service goes down.

Beforehand, you should configure your LB to load-balance across your master nodes, and register the LB IP (a.k.a. the VIP) in DNS, for example as ocp-cluster.example.com. This hostname will be the entry point for the OCP API; you can set it using both openshift_master_cluster_hostname and openshift_master_cluster_public_hostname.
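For reference, when an [lb] group is defined, the openshift-ansible installer sets up HAProxy on that host itself; the generated configuration is roughly equivalent to the following sketch (names are illustrative, filled in with the master IPs from the question):

frontend openshift-api
    bind *:8443
    mode tcp
    default_backend openshift-api

backend openshift-api
    mode tcp
    balance source
    # TCP pass-through to all three masters; health checks take a dead master out of rotation
    server master1 167.254.204.74:8443 check
    server master2 167.254.204.58:8443 check
    server master3 167.254.204.59:8443 check

With the LB as the API entry point, the relevant inventory variables are: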

openshift_master_cluster_method=native
openshift_master_cluster_hostname=ocp-cluster.example.com
openshift_master_cluster_public_hostname=ocp-cluster.example.com
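Applied to the inventory from the question, that means pointing both variables at the existing [lb] host instead of master1 (a sketch; any DNS name that resolves to the load balancer would work just as well):

openshift_master_cluster_method=native
openshift_master_cluster_hostname=lb.167.254.204.111.nip.io
openshift_master_cluster_public_hostname=lb.167.254.204.111.nip.io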
-- Daein Park
Source: StackOverflow