I have deployed a Kubernetes HA cluster using the following config.yaml:
etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease
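(For reference, this snippet sits inside a kubeadm MasterConfiguration; a minimal sketch of such a file, assuming the v1alpha1 API that ships with kubeadm 1.10 and an advertiseAddress set to each master's own IP, would be:)

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
  advertiseAddress: 172.16.8.236   # this master's IP; differs on master2/master3
etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease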
When I check kubectl get nodes:
NAME STATUS ROLES AGE VERSION
master1 Ready master 22m v1.10.4
master2 NotReady master 17m v1.10.4
master3 NotReady master 16m v1.10.4
If I check the pods, I can see that many of them are failing:
[ikerlan@master1 ~]$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-etcd-5jftb 0/1 NodeLost 0 16m
calico-etcd-kl7hb 1/1 Running 0 16m
calico-etcd-z7sps 0/1 NodeLost 0 16m
calico-kube-controllers-79dccdc4cc-vt5t7 1/1 Running 0 16m
calico-node-dbjl2 2/2 Running 0 16m
calico-node-gkkth 0/2 NodeLost 0 16m
calico-node-rqzzl 0/2 NodeLost 0 16m
kube-apiserver-master1 1/1 Running 0 21m
kube-controller-manager-master1 1/1 Running 0 22m
kube-dns-86f4d74b45-rwchm 1/3 CrashLoopBackOff 17 22m
kube-proxy-226xd 1/1 Running 0 22m
kube-proxy-jr2jq 0/1 ContainerCreating 0 18m
kube-proxy-zmjdm 0/1 ContainerCreating 0 17m
kube-scheduler-master1 1/1 Running 0 21m
If I run kubectl describe node master2:
[ikerlan@master1 ~]$ kubectl describe node master2
Name: master2
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=master2
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Mon, 11 Jun 2018 12:06:03 +0200
Taints: node-role.kubernetes.io/master:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure False Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:00 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready Unknown Mon, 11 Jun 2018 12:06:15 +0200 Mon, 11 Jun 2018 12:06:56 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 172.16.8.237
Hostname: master2
Capacity:
cpu: 2
ephemeral-storage: 37300436Ki
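(The "Kubelet stopped posting node status" conditions mean the node controller is no longer receiving heartbeats from master2, i.e. the kubelet there is down or cannot reach the API server. Assuming SSH access to master2, that kubelet can be inspected directly, for example:)

# on master2: is the kubelet running, and what is it logging?
systemctl status kubelet
journalctl -u kubelet --no-pager -n 100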
Then, if I check one of those pods with kubectl describe pod -n kube-system calico-etcd-5jftb:
[ikerlan@master1 ~]$ kubectl describe pod -n kube-system calico-etcd-5jftb
Name: calico-etcd-5jftb
Namespace: kube-system
Node: master2/
Labels: controller-revision-hash=4283683065
k8s-app=calico-etcd
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Terminating (lasts 20h)
Termination Grace Period: 30s
Reason: NodeLost
Message: Node master2 which was running pod calico-etcd-5jftb is unresponsive
IP:
Controlled By: DaemonSet/calico-etcd
Containers:
calico-etcd:
Image: quay.io/coreos/etcd:v3.1.10
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/etcd
Args:
--name=calico
--data-dir=/var/etcd/calico-data
--advertise-client-urls=http://$CALICO_ETCD_IP:6666
--listen-client-urls=http://0.0.0.0:6666
--listen-peer-urls=http://0.0.0.0:6667
--auto-compaction-retention=1
Environment:
CALICO_ETCD_IP: (v1:status.podIP)
Mounts:
/var/etcd from var-etcd (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-tj6d7 (ro)
Volumes:
var-etcd:
Type: HostPath (bare host directory volume)
Path: /var/etcd
HostPathType:
default-token-tj6d7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-tj6d7
Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events: <none>
I have tried upgrading the etcd cluster to version 3.3, and now I can see the following logs (and some more timeouts):
2018-06-12 09:17:51.305960 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices/v1beta1.authentication.k8s.io\" " took too long (190.475363ms) to execute
2018-06-12 09:18:06.788558 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (109.543763ms) to execute
2018-06-12 09:18:34.875823 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (136.649505ms) to execute
2018-06-12 09:18:41.634057 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " took too long (106.00073ms) to execute
2018-06-12 09:18:42.345564 W | etcdserver: request "header:<ID:4449666326481959890 > lease_revoke:<ID:4449666326481959752 > " took too long (142.771179ms) to execute
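(To rule out etcd itself, the health of the external cluster can be checked against the endpoints from the config above; a sketch using the v3 etcdctl that comes with etcd 3.3:)

ETCDCTL_API=3 etcdctl \
  --endpoints=http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379 \
  endpoint health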
I have also checked kubectl get events:
22m 22m 1 master2.15375fdf087fc69f Node Normal Starting kube-proxy, master2 Starting kube-proxy.
22m 22m 1 master3.15375fe744055758 Node Normal Starting kubelet, master3 Starting kubelet.
22m 22m 5 master3.15375fe74d47afa2 Node Normal NodeHasSufficientDisk kubelet, master3 Node master3 status is now: NodeHasSufficientDisk
22m 22m 5 master3.15375fe74d47f80f Node Normal NodeHasSufficientMemory kubelet, master3 Node master3 status is now: NodeHasSufficientMemory
22m 22m 5 master3.15375fe74d48066e Node Normal NodeHasNoDiskPressure kubelet, master3 Node master3 status is now: NodeHasNoDiskPressure
22m 22m 5 master3.15375fe74d481368 Node Normal NodeHasSufficientPID kubelet, master3 Node master3 status is now: NodeHasSufficientPID
I have solved it by:
1. Adding all the master IPs and the LB IP to apiServerCertSANs in the kubeadm config (sketched below).
2. Copying the Kubernetes certificates from the first master to the other masters.
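A sketch of that change in the kubeadm MasterConfiguration (the load-balancer IP is a placeholder, since it is not shown in the question):

apiServerCertSANs:
- "172.16.8.236"
- "172.16.8.237"
- "172.16.8.238"
- "<load-balancer-IP>"

and, from master1, copying the shared certificates over (kubeadm default paths; ca.*, sa.* and front-proxy-ca.* are the usual minimum, and the target user needs write access to /etc/kubernetes/pki):

scp /etc/kubernetes/pki/{ca.crt,ca.key,sa.key,sa.pub,front-proxy-ca.crt,front-proxy-ca.key} \
    master2:/etc/kubernetes/pki/
# repeat for master3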
I see multiple calico-etcd pods attempting to run. If you have used a calico.yaml that deploys etcd for you, that will not work in a multi-master environment: that manifest is not intended for production, and the etcd it deploys is not configured to form a cluster.
You could still use that manifest, but you would need to remove the etcd pods it deploys and set etcd_endpoints to an etcd cluster you have deployed (example below).
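For example, in that manifest the endpoints live in the calico-config ConfigMap; a rough sketch pointing it at an etcd cluster you already run (here reusing the endpoints from the kubeadm config, although a dedicated etcd for Calico is also common):

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # comma-separated endpoints of the etcd cluster you deployed yourself
  etcd_endpoints: "http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379"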