I have two Kubernetes clusters in IBM Cloud, one with 2 nodes and the other with 4.
The one with 4 nodes is working properly, but on the other one I had to temporarily remove the worker nodes for monetary reasons (they shouldn't be paid for while sitting idle).
When I reactivated the two nodes, everything seemed to start up fine, and as long as I don't try to interact with Pods it still looks fine on the surface: no messages about unavailability or critical health status. OK, I had to delete two obsolete Namespaces that were stuck in the Terminating state, but I could resolve that issue by restarting a cluster node (I don't remember exactly which one it was).
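(For the record: had the node restart not helped, the usual workaround for a Namespace stuck in Terminating is to clear its finalizers through the finalize subresource. A sketch, assuming jq is installed and <my-namespace> stands in for the stuck Namespace:)

# Dump the Namespace, drop its finalizers, and push the result back
# through the /finalize subresource
kubectl get namespace <my-namespace> -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/<my-namespace>/finalize" -f -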
When everything looked OK, I tried to access the Kubernetes dashboard (everything done before was at the IBM management level or on the command line), but surprisingly I found it unreachable, with an error page in the browser stating:
503: Service Unavailable
There was a small JSON message at the bottom of that page, which said:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": { },
"status": "Failure",
"message": "error trying to reach service: read tcp 172.18.190.60:39946-\u003e172.19.151.38:8090: read: connection reset by peer",
"reason": "ServiceUnavailable",
"code": 503
}
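(That message comes from the API server failing to reach the dashboard Pod through its service proxy, so presumably the same 503 can be reproduced from the command line; a sketch, assuming the default Service name kubernetes-dashboard:)

# Call the dashboard through the apiserver's service proxy; a healthy
# cluster returns the dashboard HTML, a broken tunnel returns this 503
kubectl get --raw "/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/"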
I sent a kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system, where the Pod was shown as running, but instead of the logs I got this message:
Error from server: Get "https://10.215.17.75:10250/containerLogs/kube-system/kubernetes-dashboard-54674bdd65-nf6w7/kubernetes-dashboard":
read tcp 172.18.135.195:56882->172.19.151.38:8090:
read: connection reset by peer
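(The log request travels from the API server to the kubelet on port 10250 over the master-to-node tunnel, so that path can be probed directly; a sketch, assuming node names are the private IPs, as the error above suggests:)

# 'ok' means the apiserver can reach the kubelet on that node;
# a connection reset here points at the master-to-node tunnel
kubectl get --raw "/api/v1/nodes/10.215.17.75/proxy/healthz"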
Then I found out I'm neither able to get the logs of any Pod running in that cluster, nor am I able to deploy any new custom Kubernetes object that requires scheduling (I actually could apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment or similar).
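(A minimal smoke test for the scheduling path looks like this; the Pod name and image are arbitrary:)

# Create a throwaway Pod and watch whether it ever gets scheduled
kubectl run sched-test --image=nginx --restart=Never
kubectl get pod sched-test -w
# Clean up afterwards
kubectl delete pod sched-test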
I already tried restarting the cluster nodes and the kubernetes-dashboard Deployment.
Unfortunately, none of these actions changed the accessibility of the Pods.
There's another thing that might be related (though I'm not quite sure it actually is):
In the other cluster, the one that runs fine, there are three calico Pods and all three are up, while in the problematic cluster only two of the three calico Pods are up and running; the third one stays in the Pending state, and a kubectl describe pod calico-blablabla-blabla reveals the reason in an Event:
Warning FailedScheduling 13s default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.
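(To see which Pods already claim host ports on the two nodes, a custom-columns query is one way to surface the hostPort fields:)

# List the host ports claimed by each kube-system Pod, per node
kubectl get pods -n kube-system \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,HOSTPORTS:.spec.containers[*].ports[*].hostPort'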
Does anyone have a clue about what's going on in that cluster and can point me to possible solutions? I don't really want to delete the cluster and spawn a new one, but I cannot use the user interfaces (dashboard or CLI).
The result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:
Name: kubernetes-dashboard-54674bdd65-4m2ch
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: 10.215.17.82/10.215.17.82
Start Time: Mon, 15 Nov 2021 09:01:30 +0100
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=54674bdd65
Annotations: cni.projectcalico.org/containerID: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
cni.projectcalico.org/podIP: 172.30.184.2/32
cni.projectcalico.org/podIPs: 172.30.184.2/32
container.seccomp.security.alpha.kubernetes.io/kubernetes-dashboard: runtime/default
kubectl.kubernetes.io/restartedAt: 2021-11-10T15:47:14+01:00
kubernetes.io/psp: ibm-privileged-psp
Status: Running
IP: 172.30.184.2
IPs:
IP: 172.30.184.2
Controlled By: ReplicaSet/kubernetes-dashboard-54674bdd65
Containers:
kubernetes-dashboard:
Container ID: containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
Image: registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard:v2.3.1
Image ID: registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard@sha256:f14f581d36b83fc9c1cfa3b0609e7788017ecada1f3106fab1c9db35295fe523
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
--namespace=kube-system
State: Running
Started: Mon, 15 Nov 2021 09:01:37 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 50m
memory: 100Mi
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc9kw (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-sc9kw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 600s
node.kubernetes.io/unreachable:NoExecute op=Exists for 600s
Events: <none>
The cause of the problem was an update of the cluster to Kubernetes version 1.21 while my cluster met the following conditions:
- the private service endpoint was enabled
- VRF (virtual routing and forwarding) was disabled
In Kubernetes version 1.21, Konnectivity replaces OpenVPN as the network proxy that secures the communication between the Kubernetes API server master and the worker nodes in the cluster.
When Konnectivity is used and all of the above-mentioned conditions are met, master-to-node communication breaks, which explains both the unreachable dashboard and the failing log requests.
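(One way to tell which proxy a cluster is on, assuming IBM keeps the agent Pods in kube-system under recognizable names:)

# Konnectivity clusters run konnectivity-agent Pods, OpenVPN clusters run vpn Pods
kubectl get pods -n kube-system | grep -iE 'konnectivity|vpn'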
The following steps resolved the problem:

1. Disable the private service endpoint:
ibmcloud ks cluster master private-service-endpoint disable --cluster <CLUSTER_NAME>
(this command is provider-specific; if you are experiencing the same problem with a different provider or on a local installation, find out how to disable that private service endpoint)

2. Refresh the cluster master:
ibmcloud ks cluster master refresh --cluster <CLUSTER_NAME>

After that, the Pods were accessible and schedulable again.

BEFORE you update any cluster to Kubernetes 1.21, check whether you have the private service endpoint enabled. If you have, either disable it or delay the update until you can, or enable VRF (virtual routing and forwarding), which I couldn't do but was told would likely resolve the issue.
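(To verify the fix took effect, assuming the ibmcloud CLI is configured for the account, check the master state after the refresh and then re-test something that was broken before:)

# Master state and health should be back to normal
ibmcloud ks cluster get --cluster <CLUSTER_NAME>
# Fetching Pod logs failed before the fix, so it makes a good re-test
kubectl logs kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system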