No Pods reachable or schedulable on kubernetes cluster

11/12/2021

I have 2 Kubernetes clusters in the IBM Cloud: one has 2 nodes, the other has 4.

The one that has 4 nodes is working properly, but on the other one I had to temporarily remove the worker nodes for cost reasons (they shouldn't be paid for while idle).

When I reactivated the two nodes, everything seemed to start up fine, and as long as I don't try to interact with Pods it still looks fine on the surface: no messages about unavailability or critical health status. I did delete two obsolete Namespaces, which got stuck in the Terminating state, but I could resolve that by restarting a cluster node (I don't remember exactly which one).

When everything looked OK, I tried to access the Kubernetes dashboard (everything before that was done at the IBM management level or on the command line), but surprisingly I found it unreachable, with an error page in the browser stating:

503: Service Unavailable

There was a small JSON message at the bottom of that page, which said:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "error trying to reach service: read tcp 172.18.190.60:39946-\u003e172.19.151.38:8090: read: connection reset by peer",
  "reason": "ServiceUnavailable",
  "code": 503
}

I ran kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system, where the Pod was shown as running, but instead of logs I got this message:

Error from server: Get "https://10.215.17.75:10250/containerLogs/kube-system/kubernetes-dashboard-54674bdd65-nf6w7/kubernetes-dashboard":
read tcp 172.18.135.195:56882->172.19.151.38:8090:
read: connection reset by peer

Then I found out that I can neither get the logs of any Pod running in that cluster nor deploy any new Kubernetes object that requires scheduling (I could apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment or similar).
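
The split between what works and what fails can be made visible with a few commands. The following is a minimal sketch, not a verified diagnosis: it assumes that plain reads such as kubectl get are answered by the API server alone, while kubectl logs is forwarded by the API server to the kubelet on port 10250 (the port that appears in the error above). The run helper and the DRY_RUN, POD and NS variables are hypothetical names introduced for illustration; by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run by default: set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"
POD="${POD:-kubernetes-dashboard-54674bdd65-nf6w7}"  # pod name from the question
NS="${NS:-kube-system}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Reads that only need the API server (assumed to keep working).
run kubectl get nodes -o wide
run kubectl get pods --namespace="$NS" -o wide

# Request that the API server has to forward to the kubelet (port 10250).
run kubectl logs "$POD" --namespace="$NS" --tail=5
```

If the first two commands succeed and only the last one fails with a connection reset, the problem is likely on the path from the master to the worker nodes rather than in the Pod itself.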

I already tried to

  • reload the worker nodes in the workerpool
  • restart the worker nodes in the workerpool
  • restart the kubernetes-dashboard Deployment

Unfortunately, none of the above actions changed the accessibility of the Pods.

There's another thing that might be related (though I'm not quite sure it actually is):

In the other cluster that runs fine, there are three calico Pods and all three are up, while in the problematic cluster only two of the three calico Pods are up and running. The third one stays in the Pending state, and kubectl describe pod calico-blablabla-blabla reveals the reason in an Event:

Warning  FailedScheduling  13s   default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.
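
The scheduler message says that the Pod requests a port on the node itself and that this port is already taken on both nodes. A hedged way to check this is to print the Pod's hostNetwork setting and its hostPort values: if the Pod asks for a host port and each node already runs one such Pod, a third replica cannot fit on a two-node cluster. In the sketch below the kube-system namespace is an assumption, the Pod name is the placeholder used above, and run, DRY_RUN, POD and NS are hypothetical names; by default the commands are only printed.

```shell
#!/usr/bin/env bash
# Dry-run by default: set DRY_RUN=0 to actually execute the commands.
DRY_RUN="${DRY_RUN:-1}"
POD="${POD:-calico-blablabla-blabla}"  # placeholder name, replace with the Pending Pod
NS="${NS:-kube-system}"                # assumption: namespace of the calico Pods

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Show which Pods already run on which node.
run kubectl get pods --namespace="$NS" -o wide

# Print the host networking flag and the requested host ports of the Pending Pod.
run kubectl get pod "$POD" --namespace="$NS" \
  -o jsonpath='{.spec.hostNetwork}{"\n"}{.spec.containers[*].ports[*].hostPort}{"\n"}'
```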

Does anyone have a clue about what's going on in that cluster and can point me to possible solutions? I don't really want to delete the cluster and spawn a new one, but I cannot use the user interfaces (dashboard or CLI).

Edit

The result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:

Name:                 kubernetes-dashboard-54674bdd65-4m2ch
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 10.215.17.82/10.215.17.82
Start Time:           Mon, 15 Nov 2021 09:01:30 +0100
Labels:               k8s-app=kubernetes-dashboard
                      pod-template-hash=54674bdd65
Annotations:          cni.projectcalico.org/containerID: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
                      cni.projectcalico.org/podIP: 172.30.184.2/32
                      cni.projectcalico.org/podIPs: 172.30.184.2/32
                      container.seccomp.security.alpha.kubernetes.io/kubernetes-dashboard: runtime/default
                      kubectl.kubernetes.io/restartedAt: 2021-11-10T15:47:14+01:00
                      kubernetes.io/psp: ibm-privileged-psp
Status:               Running
IP:                   172.30.184.2
IPs:
  IP:           172.30.184.2
Controlled By:  ReplicaSet/kubernetes-dashboard-54674bdd65
Containers:
  kubernetes-dashboard:
    Container ID:  containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
    Image:         registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard:v2.3.1
    Image ID:      registry.eu-de.bluemix.net/armada-master/kubernetesui-dashboard@sha256:f14f581d36b83fc9c1cfa3b0609e7788017ecada1f3106fab1c9db35295fe523
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --auto-generate-certificates
      --namespace=kube-system
    State:          Running
      Started:      Mon, 15 Nov 2021 09:01:37 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-sc9kw (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-sc9kw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 600s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 600s
Events:                      <none>
-- deHaar

1 Answer

11/19/2021

Problem resolved…

The cause of the problem was an update of the cluster to Kubernetes version 1.21 while the cluster met the following conditions:

  • private and public service endpoint enabled
  • VRF disabled

Root cause:

In Kubernetes version 1.21, Konnectivity replaces OpenVPN as the network proxy that secures the communication between the Kubernetes API server on the master and the worker nodes in the cluster.
With Konnectivity, communication from the master to the cluster nodes is broken when all of the above-mentioned conditions are met.

Solution steps:

  • disabled the private service endpoint (the public one does not seem to be a problem) with
    ibmcloud ks cluster master private-service-endpoint disable --cluster <CLUSTER_NAME>
    (this command is provider-specific; if you experience the same problem with a different provider or on a local installation, find out how to disable the private service endpoint there)
  • refreshed the cluster master with ibmcloud ks cluster master refresh --cluster <CLUSTER_NAME>
  • reloaded all the worker nodes (I did it in the web console, but it should be possible through a command as well)
  • waited for about 30 minutes, after which:
    • Dashboard available / reachable again
    • Pods accessible and schedulable again
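
The steps above can be collected into a small script. This is a sketch under assumptions, not a tested procedure: the first two ibmcloud commands are the ones quoted above, while ibmcloud ks worker reload is the CLI counterpart I would expect for the web console reload, and it may ask for confirmation. The run and recover functions, the CLUSTER and DRY_RUN variables, and passing the worker IDs as arguments are hypothetical choices made for illustration; by default nothing is executed and the commands are only printed.

```shell
#!/usr/bin/env bash
# Usage (hypothetical): DRY_RUN=0 CLUSTER=<CLUSTER_NAME> ./recover.sh <worker-id>...
CLUSTER="${CLUSTER:-<CLUSTER_NAME>}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

recover() {
  # 1. Disable the private service endpoint (command from the answer).
  run ibmcloud ks cluster master private-service-endpoint disable --cluster "$CLUSTER"
  # 2. Refresh the cluster master (command from the answer).
  run ibmcloud ks cluster master refresh --cluster "$CLUSTER"
  # 3. Reload every worker node whose ID was passed as an argument.
  for worker in "$@"; do
    run ibmcloud ks worker reload --cluster "$CLUSTER" --worker "$worker"
  done
}

recover "$@"
```

After that, allow some time (about 30 minutes in my case) before checking the dashboard and Pod logs again.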

General recommendation:

BEFORE you update any cluster to Kubernetes 1.21, check whether the private service endpoint is enabled. If it is, either disable it, delay the update until you can, or enable VRF (virtual routing and forwarding). I couldn't enable VRF myself, but I was told it would likely resolve the issue.
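
A possible way to do that check from the command line is sketched below. It assumes that ibmcloud ks cluster get lists the cluster version and the service endpoint URLs, and that ibmcloud account show reports whether VRF is enabled for the account; verify the exact output fields against your CLI version. The run helper and the CLUSTER and DRY_RUN variables are hypothetical names; by default the commands are only printed.

```shell
#!/usr/bin/env bash
# Dry-run by default: set DRY_RUN=0 to actually execute the commands.
CLUSTER="${CLUSTER:-<CLUSTER_NAME>}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Cluster details: look for the version and for a private service endpoint URL.
run ibmcloud ks cluster get --cluster "$CLUSTER"

# Account details: look for the VRF status.
run ibmcloud account show
```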

-- deHaar
Source: StackOverflow