I am in the process of implementing an HA solution for the Kubernetes master nodes in a CentOS 7 environment.
My environment looks like this:
K8S_Master1 : 172.16.16.5
K8S_Master2 : 172.16.16.51
HAProxy : 172.16.16.100
K8S_Minion1 : 172.16.16.50
etcd Version: 3.1.7
Kubernetes v1.5.2
CentOS Linux release 7.3.1611 (Core)
My etcd cluster is set up properly and is in a working state:
[root@master1 ~]# etcdctl cluster-health
member 282a4a2998aa4eb0 is healthy: got healthy result from http://172.16.16.51:2379
member dd3979c28abe306f is healthy: got healthy result from http://172.16.16.5:2379
member df7b762ad1c40191 is healthy: got healthy result from http://172.16.16.50:2379
My K8S config for Master1 is:
[root@master1 ~]# cat /etc/kubernetes/apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root@master1 ~]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root@master1 ~]# cat /etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"
[root@master1 ~]# cat /etc/kubernetes/scheduler
KUBE_SCHEDULER_ARGS="--leader-elect"
As for Master2, I have configured it as follows:
[root@master2 kubernetes]# cat apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root@master2 kubernetes]# cat config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root@master2 kubernetes]# cat scheduler
KUBE_SCHEDULER_ARGS=""
[root@master2 kubernetes]# cat controller-manager
KUBE_CONTROLLER_MANAGER_ARGS=""
Note that --leader-elect is only configured on Master1, as I want Master1 to be the leader.
My HAProxy config is simple:
frontend K8S-Master
    bind 172.16.16.100:8080
    default_backend K8S-Master-Nodes

backend K8S-Master-Nodes
    mode http
    balance roundrobin
    server master1 172.16.16.5:8080 check
    server master2 172.16.16.51:8080 check
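As an aside, the check above is only a TCP connect. A sketch of the same backend with an HTTP health check against the apiserver's /healthz endpoint would look roughly like the following (the httpchk and http-check lines are an assumption on my part, not part of my current setup):

backend K8S-Master-Nodes
    mode http
    balance roundrobin
    # probe the apiserver health endpoint so a dead master is pulled from rotation
    option httpchk GET /healthz
    http-check expect string ok
    server master1 172.16.16.5:8080 check
    server master2 172.16.16.51:8080 check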
Now I have directed my minion to connect to the Load Balancer IP rather than directly to the Master IP. The config on the Minion is:
[root@minion kubernetes]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://172.16.16.100:8080"
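To sanity-check the path through the load balancer, the apiserver can be queried via the HAProxy address on the insecure port (8080 serves /healthz without authentication, so these should simply return ok and the API root if forwarding works):

# hit the apiserver through HAProxy rather than a master directly
curl http://172.16.16.100:8080/healthz
curl http://172.16.16.100:8080/api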
On both Master nodes, I see the minion/node status as Ready
[root@master1 ~]# kubectl get nodes
NAME           STATUS    AGE
172.16.16.50   Ready     2h
[root@master2 ~]# kubectl get nodes
NAME           STATUS    AGE
172.16.16.50   Ready     2h
I set up an example nginx pod using:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
I created the Replication Controller on Master1 using:
[root@master1 ~]# kubectl create -f nginx.yaml
And on both Master nodes, I was able to see the pods created.
[root@master1 ~]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-jwpxd   1/1       Running   0          29m
nginx-q613j   1/1       Running   0          29m
[root@master2 ~]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-jwpxd   1/1       Running   0          29m
nginx-q613j   1/1       Running   0          29m
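For completeness, the replication controller itself can also be inspected from either master (nginx is the RC name from the manifest above):

# desired vs. current replica counts, plus recent events for the controller
kubectl get rc nginx
kubectl describe rc nginx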
Now, logically thinking, if I were to take down the Master1 node and delete the pods on Master2, Master2 should recreate the pods. So this is what I do.
On Master1:
[root@master1 ~]# systemctl stop kube-scheduler ; systemctl stop kube-apiserver ; systemctl stop kube-controller-manager
On Master2:
[root@slave1 kubernetes]# kubectl delete po --all
pod "nginx-l7mvc" deleted
pod "nginx-r3m58" deleted
Now Master2 should create the pods, since the Replication Controller is still up. But the new pods are stuck in:
[root@master2 kubernetes]# kubectl get po
NAME          READY     STATUS        RESTARTS   AGE
nginx-l7mvc   1/1       Terminating   0          13m
nginx-qv6z9   0/1       Pending       0          13m
nginx-r3m58   1/1       Terminating   0          13m
nginx-rplcz   0/1       Pending       0          13m
I've waited a long time, but the pods are stuck in this state.
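One way to see why the new pods never leave Pending while Master1 is down is to look at their events; if no scheduler is active, no scheduling events are recorded at all (nginx-qv6z9 is one of the Pending pods from the output above):

# events on a stuck pod, and recent cluster events in the default namespace
kubectl describe po nginx-qv6z9
kubectl get events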
But when I restart the services on Master1:
[root@master1 ~]# systemctl start kube-scheduler ; systemctl start kube-apiserver ; systemctl start kube-controller-manager
Then I see progress on Master1:
NAME          READY     STATUS              RESTARTS   AGE
nginx-qv6z9   0/1       ContainerCreating   0          14m
nginx-rplcz   0/1       ContainerCreating   0          14m
[root@slave1 kubernetes]# kubectl get po
NAME          READY     STATUS    RESTARTS   AGE
nginx-qv6z9   1/1       Running   0          15m
nginx-rplcz   1/1       Running   0          15m
Why doesn't Master2 recreate the pods? This is the confusion I am trying to resolve. I've come a long way toward a fully functional HA setup, and it seems I'm almost there if I can just figure out this puzzle.
In my opinion, the error comes from the fact that Master2 does not have the --leader-elect flag enabled. Only one scheduler and one controller-manager process can be active at a time; that is the reason for --leader-elect. The aim of this flag is to have the scheduler and controller-manager processes "compete" to determine which instance is active at a given moment. Since you did not set the flag on both master nodes, two scheduler and controller-manager processes ended up active, and hence the conflicts you are experiencing. To fix the issue, I advise you to enable this flag on all of the master nodes.
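For example, mirroring Master1, the drop-in files on Master2 would become the following, followed by a restart of both services (a sketch of the suggested change, using the same file layout shown above):

[root@master2 kubernetes]# cat scheduler
KUBE_SCHEDULER_ARGS="--leader-elect=true"
[root@master2 kubernetes]# cat controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect=true"
[root@master2 kubernetes]# systemctl restart kube-scheduler kube-controller-manager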
What is more, according to the Kubernetes documentation on best practices for replicating masters (https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/#best-practices-for-replicating-masters-for-ha-clusters):
Do not use a cluster with two master replicas. Consensus on a two replica cluster requires both replicas running when changing persistent state. As a result, both replicas are needed and a failure of any replica turns cluster into majority failure state. A two-replica cluster is thus inferior, in terms of HA, to a single replica cluster.
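As a rough way to verify the election after enabling the flag everywhere, this Kubernetes version records the current lease holder as an annotation on Endpoints objects in kube-system (assuming the default lock names kube-scheduler and kube-controller-manager):

# the control-plane.alpha.kubernetes.io/leader annotation names the current holder
kubectl -n kube-system get endpoints kube-scheduler -o yaml
kubectl -n kube-system get endpoints kube-controller-manager -o yaml

If the annotation points at Master1 and fails over to Master2 when Master1's services are stopped, leader election is working as intended.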