So I have an unhealthy, only partially working cluster in the datacenter. This is probably the 10th time I have rebuilt it from the instructions at: https://kubernetes.io/docs/setup/independent/high-availability/
I can apply pods to this cluster and it seems to work at first, but eventually things start slowing down and crashing, as you can see below. Here is the scheduler manifest:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    image: k8s.gcr.io/kube-scheduler:v1.14.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
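One thing worth ruling out: the kubeconfig is mounted as a hostPath with type: FileOrCreate, which would silently create an empty file if it were ever missing on the node. A quick sanity check on the master, using the paths from the manifest above, would be:

# run on the affected master node; assumes SSH access to it
ls -l /etc/kubernetes/scheduler.conf
# talk to the API server with the same credentials the scheduler uses
kubectl --kubeconfig /etc/kubernetes/scheduler.conf get --raw /healthz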
$ kubectl -n kube-system get pods
NAME                                       READY   STATUS             RESTARTS   AGE
coredns-fb8b8dccf-42psn                    1/1     Running            9          88m
coredns-fb8b8dccf-x9mlt                    1/1     Running            11         88m
docker-registry-dqvzb                      1/1     Running            1          2d6h
kube-apiserver-kube-apiserver-1            1/1     Running            44         2d8h
kube-apiserver-kube-apiserver-2            1/1     Running            34         2d7h
kube-controller-manager-kube-apiserver-1   1/1     Running            198        2d2h
kube-controller-manager-kube-apiserver-2   0/1     CrashLoopBackOff   170        2d7h
kube-flannel-ds-amd64-4mbfk                1/1     Running            1          2d7h
kube-flannel-ds-amd64-55hc7                1/1     Running            1          2d8h
kube-flannel-ds-amd64-fvwmf                1/1     Running            1          2d7h
kube-flannel-ds-amd64-ht5wm                1/1     Running            3          2d7h
kube-flannel-ds-amd64-rjt9l                1/1     Running            4          2d8h
kube-flannel-ds-amd64-wpmkj                1/1     Running            1          2d7h
kube-proxy-2n64d                           1/1     Running            3          2d7h
kube-proxy-2pq2g                           1/1     Running            1          2d7h
kube-proxy-5fbms                           1/1     Running            2          2d8h
kube-proxy-g8gmn                           1/1     Running            1          2d7h
kube-proxy-wrdrj                           1/1     Running            1          2d8h
kube-proxy-wz6gv                           1/1     Running            1          2d7h
kube-scheduler-kube-apiserver-1            0/1     CrashLoopBackOff   198        2d2h
kube-scheduler-kube-apiserver-2            1/1     Running            5          18m
nginx-ingress-controller-dz8fm             1/1     Running            3          2d4h
nginx-ingress-controller-sdsgg             1/1     Running            3          2d4h
nginx-ingress-controller-sfrgb             1/1     Running            1          2d4h
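For reference, the crashed containers' output can be pulled like this; the --previous flag grabs the logs of the last terminated instance rather than the one currently restarting:

# logs from the last failed run of the scheduler on the first master
kubectl -n kube-system logs kube-scheduler-kube-apiserver-1 --previous
# same thing for the controller-manager that is also crash-looping
kubectl -n kube-system logs kube-controller-manager-kube-apiserver-2 --previous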
$ kubectl -n kube-system describe pod kube-scheduler-kube-apiserver-1
Containers:
  kube-scheduler:
    Container ID:  docker://c04f3c9061cafef8749b2018cd66e6865d102f67c4d13bdd250d0b4656d5f220
    Image:         k8s.gcr.io/kube-scheduler:v1.14.2
    Image ID:      docker-pullable://k8s.gcr.io/kube-scheduler@sha256:052e0322b8a2b22819ab0385089f202555c4099493d1bd33205a34753494d2c2
    Port:          <none>
    Host Port:     <none>
    Command:
      kube-scheduler
      --bind-address=127.0.0.1
      --kubeconfig=/etc/kubernetes/scheduler.conf
      --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
      --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
      --leader-elect=true
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 28 May 2019 23:16:50 -0400
      Finished:     Tue, 28 May 2019 23:19:56 -0400
    Ready:          False
    Restart Count:  195
    Requests:
      cpu:          100m
    Liveness:       http-get http://127.0.0.1:10251/healthz delay=15s timeout=15s period=10s #success=1 #failure=8
    Environment:    <none>
    Mounts:
      /etc/kubernetes/scheduler.conf from kubeconfig (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubeconfig:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/scheduler.conf
    HostPathType:  FileOrCreate
QoS Class:         Burstable
Node-Selectors:    <none>
Tolerations:       :NoExecute
Events:
  Type     Reason          Age                    From                       Message
  ----     ------          ----                   ----                       -------
  Normal   Created         4h56m (x104 over 37h)  kubelet, kube-apiserver-1  Created container kube-scheduler
  Normal   Started         4h56m (x104 over 37h)  kubelet, kube-apiserver-1  Started container kube-scheduler
  Warning  Unhealthy       137m (x71 over 34h)    kubelet, kube-apiserver-1  Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
  Normal   Pulled          132m (x129 over 37h)   kubelet, kube-apiserver-1  Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
  Warning  BackOff         128m (x1129 over 34h)  kubelet, kube-apiserver-1  Back-off restarting failed container
  Normal   SandboxChanged  80m                    kubelet, kube-apiserver-1  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          76m                    kubelet, kube-apiserver-1  Error: context deadline exceeded
  Normal   Pulled          36m (x7 over 78m)      kubelet, kube-apiserver-1  Container image "k8s.gcr.io/kube-scheduler:v1.14.2" already present on machine
  Normal   Started         36m (x6 over 74m)      kubelet, kube-apiserver-1  Started container kube-scheduler
  Normal   Created         32m (x7 over 74m)      kubelet, kube-apiserver-1  Created container kube-scheduler
  Warning  Unhealthy       20m (x9 over 40m)      kubelet, kube-apiserver-1  Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
  Warning  BackOff         2m56s (x85 over 69m)   kubelet, kube-apiserver-1  Back-off restarting failed container
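Because the liveness probe just hits http://127.0.0.1:10251/healthz from the node itself (the pod runs with hostNetwork), the same check can be reproduced by hand on kube-apiserver-1; nothing exotic, just curl plus ss (assumed to be installed) to see whether anything is actually listening on the scheduler's port:

# run on kube-apiserver-1 itself; this is exactly what the probe does
curl -v http://127.0.0.1:10251/healthz
# check whether anything is listening on the scheduler's health port
ss -tlnp | grep 10251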
I feel like I am overlooking a simple option or configuration, but I can't find it, and after days of dealing with this problem and reading documentation I am at my wits' end.
The load balancer is a plain TCP load balancer and seems to be working as expected, since I can query the cluster through it from my desktop.
Any suggestions or troubleshooting tips are definitely welcome at this time.
Thank you.
For anyone who runs into the same thing: the problem with our configuration was that a well-intentioned technician had removed one of the firewall rules on the Kubernetes masters, which prevented each master from looping back to the ports it needed to probe. This caused all kinds of strange symptoms and misdiagnosed problems and sent us in entirely the wrong direction. After we allowed all the required ports on the servers, Kubernetes went back to its normal behavior.
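The ports involved are the standard control-plane ports from the kubeadm documentation: 6443 for the API server, 2379-2380 for etcd, and 10250-10252 for the kubelet, scheduler, and controller-manager. As a rough sketch, re-opening them on a firewalld-based host looks something like the following; adjust for iptables or whatever is actually filtering traffic in your environment (blanket-allowing everything, as we ended up doing, is heavier-handed than necessary):

# run on each master; assumes firewalld is what is filtering the traffic
firewall-cmd --permanent --add-port=6443/tcp        # Kubernetes API server
firewall-cmd --permanent --add-port=2379-2380/tcp   # etcd server client API
firewall-cmd --permanent --add-port=10250-10252/tcp # kubelet, scheduler, controller-manager
firewall-cmd --reload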