Kubernetes - Implementing Kubernetes Master HA solution in CentOS7

6/30/2017

I am in the process of implementing an HA solution for the Kubernetes Master nodes in a CentOS 7 environment.

My environment looks like:

K8S_Master1 : 172.16.16.5
K8S_Master2 : 172.16.16.51
HAProxy     : 172.16.16.100
K8S_Minion1 : 172.16.16.50


etcd Version: 3.1.7
Kubernetes v1.5.2
CentOS Linux release 7.3.1611 (Core)

My etcd cluster is set up properly and is in a working state.

[root@master1 ~]# etcdctl cluster-health
member 282a4a2998aa4eb0 is healthy: got healthy result from http://172.16.16.51:2379
member dd3979c28abe306f is healthy: got healthy result from http://172.16.16.5:2379
member df7b762ad1c40191 is healthy: got healthy result from http://172.16.16.50:2379

My K8S config for Master1 is :

[root@master1 ~]# cat /etc/kubernetes/apiserver 
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"

[root@master1 ~]# cat /etc/kubernetes/config 
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"

[root@master1 ~]# cat /etc/kubernetes/controller-manager 
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"

[root@master1 ~]# cat /etc/kubernetes/scheduler 
KUBE_SCHEDULER_ARGS="--leader-elect"

As for Master2 , I have configured it to be :

[root@master2 kubernetes]# cat apiserver 
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"

[root@master2 kubernetes]# cat config 
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"

[root@master2 kubernetes]# cat scheduler 
KUBE_SCHEDULER_ARGS=""

[root@master2 kubernetes]# cat controller-manager 
KUBE_CONTROLLER_MANAGER_ARGS=""

Note that --leader-elect is only configured on Master1 as I want Master1 to be the leader.

My HAProxy config is simple:

frontend K8S-Master
    bind 172.16.16.100:8080
    default_backend K8S-Master-Nodes

backend K8S-Master-Nodes
    mode        http
    balance     roundrobin
    server      master1 172.16.16.5:8080 check
    server      master2 172.16.16.51:8080 check
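
Side note: check with no further options only performs a TCP connect check. A minimal sketch, assuming the API server's insecure port serves /healthz (which it does by default), of an HTTP health check so HAProxy only routes to an apiserver that is actually answering:

backend K8S-Master-Nodes
    mode        http
    balance     roundrobin
    option      httpchk GET /healthz
    server      master1 172.16.16.5:8080 check
    server      master2 172.16.16.51:8080 check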

Now I have directed my minion to connect to the Load Balancer IP rather than directly to the Master IP.

Config on Minion is :

[root@minion kubernetes]# cat /etc/kubernetes/config 
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://172.16.16.100:8080"
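
Not shown above is the kubelet sysconfig, which also has to point at the load balancer rather than at a single master. A minimal sketch, assuming the stock CentOS /etc/kubernetes/kubelet layout (values here are illustrative, not taken from my actual node):

KUBELET_ADDRESS="--address=0.0.0.0"
KUBELET_HOSTNAME="--hostname-override=172.16.16.50"
KUBELET_API_SERVER="--api-servers=http://172.16.16.100:8080"
KUBELET_ARGS=""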

On both Master nodes, I see the minion/node status as Ready

[root@master1 ~]# kubectl get nodes
NAME           STATUS    AGE
172.16.16.50   Ready     2h

[root@master2 ~]# kubectl get nodes
NAME           STATUS    AGE
172.16.16.50   Ready     2h

I set up example nginx pods using the following ReplicationController:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

I created the Replication Controller on Master1 using :

[root@master1 ~]# kubectl create -f nginx.yaml

And on both Master nodes, I was able to see the pods created.

[root@master1 ~]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-jwpxd   1/1       Running   0          29m
nginx-q613j   1/1       Running   0          29m

[root@master2 ~]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-jwpxd   1/1       Running   0          29m
nginx-q613j   1/1       Running   0          29m

Now, thinking it through logically: if I take down the Master1 node and delete the pods from Master2, Master2 should recreate the pods. So this is what I did.

On Master1 :

[root@master1 ~]# systemctl stop kube-scheduler ; systemctl stop kube-apiserver ; systemctl stop kube-controller-manager

On Master2 :

[root@slave1 kubernetes]# kubectl delete po --all
pod "nginx-l7mvc" deleted
pod "nginx-r3m58" deleted

Now Master2 should recreate the pods, since the ReplicationController is still up. But the new pods are stuck in:

[root@master2 kubernetes]# kubectl get po
NAME          READY     STATUS        RESTARTS   AGE
nginx-l7mvc   1/1       Terminating   0          13m
nginx-qv6z9   0/1       Pending       0          13m
nginx-r3m58   1/1       Terminating   0          13m
nginx-rplcz   0/1       Pending       0          13m

I've waited a long time, but the pods remain stuck in this state.

But when I restart the services on Master1 :

[root@master1 ~]# systemctl start kube-scheduler ; systemctl start kube-apiserver ; systemctl start kube-controller-manager

Then I see progress on Master1 :

NAME          READY     STATUS              RESTARTS   AGE
nginx-qv6z9   0/1       ContainerCreating   0          14m
nginx-rplcz   0/1       ContainerCreating   0          14m

[root@slave1 kubernetes]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-qv6z9   1/1       Running   0          15m
nginx-rplcz   1/1       Running   0          15m

Why doesn't Master2 recreate the pods? This is the confusion I am trying to resolve. I've come a long way toward a fully functional HA setup, and it seems I'm almost there, if only I can figure out this puzzle.

--
haproxy
kubectl
kubernetes

1 Answer

7/4/2017

In my opinion, the error comes from the fact that Master2 does not have the --leader-elect flag enabled. Only one scheduler and one controller-manager process can be active at any given time; that is the purpose of --leader-elect. The flag makes the replicas "compete" to decide which scheduler and which controller-manager instance is active at a given moment. Since you did not set the flag on both master nodes, two scheduler and two controller-manager processes were active at the same time, which causes the conflicts you are experiencing. To fix the issue, I advise you to enable this flag on all of the master nodes.
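
For example, on Master2 that means mirroring what is already on Master1 (a sketch that reuses your existing sysconfig layout; the bare --leader-elect is equivalent to --leader-elect=true):

# /etc/kubernetes/scheduler on Master2
KUBE_SCHEDULER_ARGS="--leader-elect"

# /etc/kubernetes/controller-manager on Master2
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"

# restart both services so the flag takes effect
systemctl restart kube-scheduler kube-controller-manager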

What is more, according to the k8s documentation https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/#best-practices-for-replicating-masters-for-ha-clusters:

Do not use a cluster with two master replicas. Consensus on a two replica cluster requires both replicas running when changing persistent state. As a result, both replicas are needed and a failure of any replica turns cluster into majority failure state. A two-replica cluster is thus inferior, in terms of HA, to a single replica cluster.
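
Once --leader-elect is enabled on both masters, you can also verify which instance currently holds the lock. In Kubernetes 1.5 the scheduler and controller-manager record the current leader in a control-plane.alpha.kubernetes.io/leader annotation on Endpoints objects in the kube-system namespace, so as a sketch:

kubectl -n kube-system get endpoints kube-scheduler -o yaml
kubectl -n kube-system get endpoints kube-controller-manager -o yaml

The holderIdentity in that annotation names the instance that is active; when Master1 goes down, Master2 should take over the lock once the lease expires (about 15 seconds with the default settings).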

-- Javier Salmeron
Source: StackOverflow