kube-scheduler Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused

5/29/2019

I have an unhealthy, only partially working cluster in the datacenter. This is probably the 10th time I have rebuilt it from the instructions at: https://kubernetes.io/docs/setup/independent/high-availability/

I can apply pods to this cluster and they seem to run, but eventually the cluster starts slowing down and crashing, as you can see below. Here is the scheduler manifest:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: k8s.gcr.io/kube-scheduler:v1.14.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
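
For reference, the liveness probe in that manifest is just an HTTP GET against the scheduler's healthz port on loopback, so it can be reproduced by hand. This is a rough sketch of the check, assuming you are logged in directly on the control-plane node (kube-apiserver-1):

$ curl -v http://127.0.0.1:10251/healthz    # what the kubelet's liveness probe does
$ ss -tlnp | grep 10251                     # is the scheduler actually listening on the probe port?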

$ kubectl -n kube-system get pods

NAME                                       READY   STATUS             RESTARTS   AGE
coredns-fb8b8dccf-42psn                    1/1     Running            9          88m
coredns-fb8b8dccf-x9mlt                    1/1     Running            11         88m
docker-registry-dqvzb                      1/1     Running            1          2d6h
kube-apiserver-kube-apiserver-1            1/1     Running            44         2d8h
kube-apiserver-kube-apiserver-2            1/1     Running            34         2d7h
kube-controller-manager-kube-apiserver-1   1/1     Running            198        2d2h
kube-controller-manager-kube-apiserver-2   0/1     CrashLoopBackOff   170        2d7h
kube-flannel-ds-amd64-4mbfk                1/1     Running            1          2d7h
kube-flannel-ds-amd64-55hc7                1/1     Running            1          2d8h
kube-flannel-ds-amd64-fvwmf                1/1     Running            1          2d7h
kube-flannel-ds-amd64-ht5wm                1/1     Running            3          2d7h
kube-flannel-ds-amd64-rjt9l                1/1     Running            4          2d8h
kube-flannel-ds-amd64-wpmkj                1/1     Running            1          2d7h
kube-proxy-2n64d                           1/1     Running            3          2d7h
kube-proxy-2pq2g                           1/1     Running            1          2d7h
kube-proxy-5fbms                           1/1     Running            2          2d8h
kube-proxy-g8gmn                           1/1     Running            1          2d7h
kube-proxy-wrdrj                           1/1     Running            1          2d8h
kube-proxy-wz6gv                           1/1     Running            1          2d7h
kube-scheduler-kube-apiserver-1            0/1     CrashLoopBackOff   198        2d2h
kube-scheduler-kube-apiserver-2            1/1     Running            5          18m
nginx-ingress-controller-dz8fm             1/1     Running            3          2d4h
nginx-ingress-controller-sdsgg             1/1     Running            3          2d4h
nginx-ingress-controller-sfrgb             1/1     Running            1          2d4h

$ kubectl -n kube-system describe pod kube-scheduler-kube-apiserver-1

Containers:
  kube-scheduler:
    Container ID:  docker://c04f3c9061cafef8749b2018cd66e6865d102f67c4d13bdd250d0b4656d5f220
    Image:         k8s.gcr.io/kube-scheduler:v1.14.2
    Image ID:      docker-pullable://k8s.gcr.io/kube-scheduler@sha256:052e0322b8a2b22819ab0385089f202555c4099493d1bd33205a34753494d2c2
    Port:          <none>
    Host Port:     <none>
    Command:
      kube-scheduler
      --bind-address=127.0.0.1
      --kubeconfig=/etc/kubernetes/scheduler.conf
      --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
      --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
      --leader-elect=true
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 28 May 2019 23:16:50 -0400
      Finished:     Tue, 28 May 2019 23:19:56 -0400
    Ready:          False
    Restart Count:  195
    Requests:
      cpu:        100m
    Liveness:     http-get http://127.0.0.1:10251/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
    Environment:  <none>
    Mounts:
      /etc/kubernetes/scheduler.conf from kubeconfig (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kubeconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/scheduler.conf
    HostPathType:  FileOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute
Events:
  Type     Reason          Age                    From                       Message
  ----     ------          ----                   ----                       -------
  Normal   Created         4h56m (x104 over 37h)  kubelet, kube-apiserver-1  Created container kube-scheduler
  Normal   Started         4h56m (x104 over 37h)  kubelet, kube-apiserver-1  Started container kube-scheduler
  Warning  Unhealthy       137m (x71 over 34h)    kubelet, kube-apiserver-1  Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
  Normal   Pulled          132m (x129 over 37h)   kubelet, kube-apiserver-1  Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
  Warning  BackOff         128m (x1129 over 34h)  kubelet, kube-apiserver-1  Back-off restarting failed container
  Normal   SandboxChanged  80m                    kubelet, kube-apiserver-1  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          76m                    kubelet, kube-apiserver-1  Error: context deadline exceeded
  Normal   Pulled          36m (x7 over 78m)      kubelet, kube-apiserver-1  Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
  Normal   Started         36m (x6 over 74m)      kubelet, kube-apiserver-1  Started container kube-scheduler
  Normal   Created         32m (x7 over 74m)      kubelet, kube-apiserver-1  Created container kube-scheduler
  Warning  Unhealthy       20m (x9 over 40m)      kubelet, kube-apiserver-1  Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
  Warning  BackOff         2m56s (x85 over 69m)   kubelet, kube-apiserver-1  Back-off restarting failed container
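
For anyone suggesting fixes, these are the kinds of checks I can run on the node (a sketch; the pod name matches the output above, and the journalctl filter is just an example):

$ kubectl -n kube-system logs kube-scheduler-kube-apiserver-1 --previous   # output from the last crashed container
$ journalctl -u kubelet --since "1 hour ago" | grep -i kube-scheduler      # kubelet's side of the probe failures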

I feel like I am overlooking a simple option or configuration setting, but I can't find it. After days of dealing with this problem and reading documentation, I am at my wits' end.

The load balancer is a TCP load balancer and seems to be working as expected, since I can query the cluster through it from my desktop.
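
By "query the cluster" I mean roughly the following, where LOAD_BALANCER_ADDRESS is a placeholder rather than our real endpoint and 6443 is the API server port from the kubeadm HA guide:

$ curl -k https://LOAD_BALANCER_ADDRESS:6443/healthz             # API server health through the TCP load balancer
$ kubectl --server=https://LOAD_BALANCER_ADDRESS:6443 get nodes  # same endpoint via kubectl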

Any suggestions or troubleshooting tips are definitely welcome at this time.

Thank you.

-- Daniel Maldonado
kubeadm
kubernetes

1 Answer

5/30/2019

The problem with our configuration was that a well-intentioned technician removed one of the firewall rules on the Kubernetes master, which left the master unable to loop back to the ports it needed to probe. This caused all kinds of strange, easily misdiagnosed symptoms and sent the troubleshooting in the wrong direction. After we allowed all ports on the servers, Kubernetes returned to its normal behavior.
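
For anyone hitting the same symptoms: the exact rules depend on how your firewall is managed, but on a firewalld-based master, restoring the kubeadm control-plane ports looks roughly like this (a sketch, not the exact rules we changed):

$ firewall-cmd --permanent --add-port=6443/tcp          # API server
$ firewall-cmd --permanent --add-port=2379-2380/tcp     # etcd
$ firewall-cmd --permanent --add-port=10250-10252/tcp   # kubelet, scheduler, controller-manager health ports
$ firewall-cmd --reload

If you manage iptables directly instead, the equivalent is making sure the INPUT chain still accepts loopback traffic (e.g. iptables -A INPUT -i lo -j ACCEPT) so the kubelet can reach 127.0.0.1:10251 and 127.0.0.1:10252 for its probes.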

-- Daniel Maldonado
Source: StackOverflow