I am in the process of implementing an HA solution for Kubernetes Master nodes in a CentOS 7 environment.
My environment looks like this:
K8S_Master1 : 172.16.16.5
K8S_Master2 : 172.16.16.51
HAProxy : 172.16.16.100
K8S_Minion1 : 172.16.16.50
etcd Version: 3.1.7
Kubernetes v1.5.2
CentOS Linux release 7.3.1611 (Core)
My etcd cluster is set up properly and is in a working state.
[root@master1 ~]# etcdctl cluster-health
member 282a4a2998aa4eb0 is healthy: got healthy result from http://172.16.16.51:2379
member dd3979c28abe306f is healthy: got healthy result from http://172.16.16.5:2379
member df7b762ad1c40191 is healthy: got healthy result from http://172.16.16.50:2379
My K8S config for Master1 is:
[root@master1 ~]# cat /etc/kubernetes/apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root@master1 ~]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root@master1 ~]# cat /etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"
[root@master1 ~]# cat /etc/kubernetes/scheduler
KUBE_SCHEDULER_ARGS="--leader-elect"
As for Master2, I have configured it as follows:
[root@master2 kubernetes]# cat apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root@master2 kubernetes]# cat config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root@master2 kubernetes]# cat scheduler
KUBE_SCHEDULER_ARGS=""
[root@master2 kubernetes]# cat controller-manager
KUBE_CONTROLLER_MANAGER_ARGS=""
Note that --leader-elect is only configured on Master1, as I want Master1 to be the leader.
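As a quick sanity check of which flags each daemon actually ends up running with, I look at the running command lines on both masters (just a generic process-list check, nothing Kubernetes-specific):
ps aux | grep -E 'kube-(scheduler|controller-manager)' | grep -v grep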
My HAProxy config is simple:
frontend K8S-Master
bind 172.16.16.100:8080
default_backend K8S-Master-Nodes
backend K8S-Master-Nodes
mode http
balance roundrobin
server master1 172.16.16.5:8080 check
server master2 172.16.16.51:8080 check
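As an aside, if I wanted HAProxy to probe the apiservers' actual health rather than just the TCP port, I believe the backend could be extended roughly like this (the httpchk against /healthz on the insecure port is my assumption, not something I currently have configured):
backend K8S-Master-Nodes
mode http
balance roundrobin
# assumption: the apiserver serves /healthz on the insecure 8080 port
option httpchk GET /healthz
http-check expect status 200
server master1 172.16.16.5:8080 check
server master2 172.16.16.51:8080 check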
Now I have directed my minion to connect to the Load Balancer IP rather than directly to the Master IP. The config on the minion is:
[root@minion kubernetes]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://172.16.16.100:8080"
On both Master nodes, I see the minion/node status as Ready:
[root@master1 ~]# kubectl get nodes
NAME STATUS AGE
172.16.16.50 Ready 2h
[root@master2 ~]# kubectl get nodes
NAME STATUS AGE
172.16.16.50 Ready 2h
I set up an example nginx pod using:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
I created the Replication Controller on Master1 using:
[root@master1 ~]# kubectl create -f nginx.yaml
And on both Master nodes, I was able to see the pods created:
[root@master1 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-jwpxd 1/1 Running 0 29m
nginx-q613j 1/1 Running 0 29m
[root@master2 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-jwpxd 1/1 Running 0 29m
nginx-q613j 1/1 Running 0 29m
Now, logically, if I were to take down the Master1 node and delete the pods via Master2, Master2 should recreate them. So this is what I do.
On Master1:
[root@master1 ~]# systemctl stop kube-scheduler ; systemctl stop kube-apiserver ; systemctl stop kube-controller-manager
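At this point the load balancer should only have Master2 left in its pool. A quick sanity check from the minion (assuming /healthz is served on the insecure 8080 port) would be:
# should still return "ok", now answered by Master2 via HAProxy
curl http://172.16.16.100:8080/healthz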
On Master2:
[root@slave1 kubernetes]# kubectl delete po --all
pod "nginx-l7mvc" deleted
pod "nginx-r3m58" deletedNow Master2 should create the pods since the Replication Controller is still up. But the new Pods are stuck in :
[root@master2 kubernetes]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-l7mvc 1/1 Terminating 0 13m
nginx-qv6z9 0/1 Pending 0 13m
nginx-r3m58 1/1 Terminating 0 13m
nginx-rplcz 0/1 Pending 0 13m
I've waited a long time, but the pods are stuck in this state.
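For reference, these are the kinds of checks I can run from Master2 while the pods are stuck, to see whether anything is acting on them at all (pod name taken from the listing above):
# the Events section shows whether a scheduler ever tried to place the pod
kubectl describe po nginx-qv6z9
# last lines of the local scheduler and controller-manager logs
journalctl -u kube-scheduler -u kube-controller-manager -n 50 --no-pager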
But when I restart the services on Master1:
[root@master1 ~]# systemctl start kube-scheduler ; systemctl start kube-apiserver ; systemctl start kube-controller-manager
Then I see progress on Master1:
NAME READY STATUS RESTARTS AGE
nginx-qv6z9 0/1 ContainerCreating 0 14m
nginx-rplcz 0/1 ContainerCreating 0 14m
[root@slave1 kubernetes]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-qv6z9 1/1 Running 0 15m
nginx-rplcz 1/1 Running 0 15m
Why doesn't Master2 recreate the pods? This is the confusion I am trying to clear up. I've come a long way towards a fully functional HA setup, but it seems I'm almost there if only I can figure out this puzzle.
In my opinion, the problem comes from the fact that Master2 does not have the --leader-elect flag enabled. Only one scheduler process and one controller-manager process can be active at a time; that is the purpose of --leader-elect. The flag makes the replicas "compete" for a lock so that only one scheduler and one controller-manager is active at any given moment. Since you did not set the flag on both master nodes, two scheduler and two controller-manager processes were active at once, which causes the conflicts you are experiencing. To fix the issue, I advise you to enable this flag on all of the master nodes, as shown below.
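Based on the files you already show for Master1, Master2's scheduler and controller-manager configs would simply become the following, followed by a restart of both services:
[root@master2 kubernetes]# cat /etc/kubernetes/scheduler
KUBE_SCHEDULER_ARGS="--leader-elect"
[root@master2 kubernetes]# cat /etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"
[root@master2 kubernetes]# systemctl restart kube-scheduler kube-controller-manager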
What is more, according to the k8s documentation (https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/#best-practices-for-replicating-masters-for-ha-clusters):
Do not use a cluster with two master replicas. Consensus on a two replica cluster requires both replicas running when changing persistent state. As a result, both replicas are needed and a failure of any replica turns cluster into majority failure state. A two-replica cluster is thus inferior, in terms of HA, to a single replica cluster.
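Once the flag is enabled on every master, you can also verify which instance currently holds each lock. In the Kubernetes versions of this era the scheduler and controller-manager record their leases as an annotation (control-plane.alpha.kubernetes.io/leader, if I recall the name correctly) on Endpoints objects in the kube-system namespace; the holderIdentity field inside that annotation names the active instance:
# look for the control-plane.alpha.kubernetes.io/leader annotation in the output
kubectl -n kube-system get endpoints kube-scheduler -o yaml
kubectl -n kube-system get endpoints kube-controller-manager -o yaml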