Kubernetes HA cluster master nodes not ready

6/11/2018

I have deployed a Kubernetes HA cluster using the following config.yaml:

etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease
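
For reference, this kind of kubeadm configuration is consumed during cluster bootstrap; assuming the file is saved as config.yaml as the question suggests, the invocation on each master would look roughly like the following (an assumed sketch, since the exact command is not shown above):

kubeadm init --config config.yaml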

When I check kubectl get nodes:

NAME      STATUS     ROLES     AGE       VERSION
master1   Ready      master    22m       v1.10.4
master2   NotReady   master    17m       v1.10.4
master3   NotReady   master    16m       v1.10.4

If I check the pods, I can see that many of them are failing:

[ikerlan@master1 ~]$  kubectl get pods -n kube-system
NAME                                       READY     STATUS              RESTARTS   AGE
calico-etcd-5jftb                          0/1       NodeLost            0          16m
calico-etcd-kl7hb                          1/1       Running             0          16m
calico-etcd-z7sps                          0/1       NodeLost            0          16m
calico-kube-controllers-79dccdc4cc-vt5t7   1/1       Running             0          16m
calico-node-dbjl2                          2/2       Running             0          16m
calico-node-gkkth                          0/2       NodeLost            0          16m
calico-node-rqzzl                          0/2       NodeLost            0          16m
kube-apiserver-master1                     1/1       Running             0          21m
kube-controller-manager-master1            1/1       Running             0          22m
kube-dns-86f4d74b45-rwchm                  1/3       CrashLoopBackOff    17         22m
kube-proxy-226xd                           1/1       Running             0          22m
kube-proxy-jr2jq                           0/1       ContainerCreating   0          18m
kube-proxy-zmjdm                           0/1       ContainerCreating   0          17m
kube-scheduler-master1                     1/1       Running             0          21m

If I run kubectl describe node master2:

[ikerlan@master1 ~]$ kubectl describe node master2
Name:               master2
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=master2
                    node-role.kubernetes.io/master=
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Mon, 11 Jun 2018 12:06:03 +0200
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:00 +0200   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
Addresses:
  InternalIP:  172.16.8.237
  Hostname:    master2
Capacity:
 cpu:                2
 ephemeral-storage:  37300436Ki

Then, if I describe one of the failing pods with kubectl describe pod -n kube-system calico-etcd-5jftb:

[ikerlan@master1 ~]$ kubectl describe pod -n kube-system  calico-etcd-5jftb
Name:                      calico-etcd-5jftb
Namespace:                 kube-system
Node:                      master2/
Labels:                    controller-revision-hash=4283683065
                           k8s-app=calico-etcd
                           pod-template-generation=1
Annotations:               scheduler.alpha.kubernetes.io/critical-pod=
Status:                    Terminating (lasts 20h)
Termination Grace Period:  30s
Reason:                    NodeLost
Message:                   Node master2 which was running pod calico-etcd-5jftb is unresponsive
IP:                        
Controlled By:             DaemonSet/calico-etcd
Containers:
  calico-etcd:
    Image:      quay.io/coreos/etcd:v3.1.10
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/etcd
    Args:
      --name=calico
      --data-dir=/var/etcd/calico-data
      --advertise-client-urls=http://$CALICO_ETCD_IP:6666
      --listen-client-urls=http://0.0.0.0:6666
      --listen-peer-urls=http://0.0.0.0:6667
      --auto-compaction-retention=1
    Environment:
      CALICO_ETCD_IP:   (v1:status.podIP)
    Mounts:
      /var/etcd from var-etcd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tj6d7 (ro)
Volumes:
  var-etcd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/etcd
    HostPathType:  
  default-token-tj6d7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-tj6d7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>

I have tried updating the etcd cluster to version 3.3, and now I can see the following logs (and some more timeouts):

2018-06-12 09:17:51.305960 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices/v1beta1.authentication.k8s.io\" " took too long (190.475363ms) to execute
2018-06-12 09:18:06.788558 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (109.543763ms) to execute
2018-06-12 09:18:34.875823 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (136.649505ms) to execute
2018-06-12 09:18:41.634057 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " took too long (106.00073ms) to execute
2018-06-12 09:18:42.345564 W | etcdserver: request "header:<ID:4449666326481959890 > lease_revoke:<ID:4449666326481959752 > " took too long (142.771179ms) to execute
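
These "took too long" warnings come from etcd itself and usually indicate slow disk or network I/O on the etcd members rather than a Kubernetes problem. A quick way to check the external cluster directly, assuming etcdctl v3 is available on one of the etcd hosts, is roughly:

# report health and status of the external etcd members
ETCDCTL_API=3 etcdctl --endpoints=http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379 endpoint status --write-out=table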

I have also checked kubectl get events:

22m         22m          1         master2.15375fdf087fc69f   Node                  Normal    Starting                  kube-proxy, master2   Starting kube-proxy.
22m         22m          1         master3.15375fe744055758   Node                  Normal    Starting                  kubelet, master3      Starting kubelet.
22m         22m          5         master3.15375fe74d47afa2   Node                  Normal    NodeHasSufficientDisk     kubelet, master3      Node master3 status is now: NodeHasSufficientDisk
22m         22m          5         master3.15375fe74d47f80f   Node                  Normal    NodeHasSufficientMemory   kubelet, master3      Node master3 status is now: NodeHasSufficientMemory
22m         22m          5         master3.15375fe74d48066e   Node                  Normal    NodeHasNoDiskPressure     kubelet, master3      Node master3 status is now: NodeHasNoDiskPressure
22m         22m          5         master3.15375fe74d481368   Node                  Normal    NodeHasSufficientPID      kubelet, master3      Node master3 status is now: NodeHasSufficientPID
-- Asier Gomez
etcd
kubernetes
project-calico

2 Answers

6/12/2018

I have solved it by doing the following (a sketch of both steps is shown below):

  1. Adding all the master IPs and the load balancer IP to apiServerCertSANs.

  2. Copying the Kubernetes certificates from the first master to the other masters.
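
A rough sketch of what those two steps could look like, assuming the cluster was bootstrapped with a kubeadm MasterConfiguration similar to the one in the question (the load balancer address below is only an illustrative placeholder):

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease
apiServerCertSANs:
# all three master IPs plus the load balancer IP (placeholder address)
- "172.16.8.236"
- "172.16.8.237"
- "172.16.8.238"
- "172.16.8.10"

and then copying the PKI material generated on master1 to the other masters, for example (default kubeadm paths, run with privileges to read and write /etc/kubernetes):

# copy the certificates created on the first master to master2 and master3
scp -r /etc/kubernetes/pki master2:/etc/kubernetes/
scp -r /etc/kubernetes/pki master3:/etc/kubernetes/

After the certificates are in place, re-running kubeadm init --config config.yaml on master2 and master3 should reuse the shared CA instead of generating per-node certificates.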

-- Asier Gomez
Source: StackOverflow

6/11/2018

I see multiple calico-etcd pods attempting to run. If you have used a calico.yaml that deploys etcd for you, that will not work in a multi-master environment.

That manifest is not intended for production deployment and will not work in a multi-master environment because the etcd it deploys is not configured to attempt to form a cluster.

You could still use that manifest, but you would need to remove the etcd pods it deploys and set etcd_endpoints to an etcd cluster you have deployed yourself.
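
A minimal sketch of that change, assuming the standard calico.yaml layout where the endpoints live in the calico-config ConfigMap (the endpoints below simply reuse the ones from the question for illustration):

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  # point Calico at an externally managed etcd cluster instead of the
  # self-hosted calico-etcd pods deployed by the manifest
  etcd_endpoints: "http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379"

With that in place, the calico-etcd DaemonSet (and its pods) from the manifest can be removed, since Calico no longer needs it.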

-- Erik Stidham
Source: StackOverflow