k8s pod stuck in status "pending"

6/25/2021

All newly created pods are stuck in status "Pending". It does not seem to be a resource issue, since total cluster utilization is about 10% CPU and 30% memory.

How do I get more insights into the issue?
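For reference, the usual starting points for a Pending pod are the pod's describe output (shown further down) and the recent events in the namespace; a quick sketch, assuming the current context already points at the dev namespace:

$ kubectl get pods --field-selector=status.phase=Pending     # list only the Pending pods
$ kubectl get events --sort-by=.lastTimestamp                # recent scheduler / kubelet events
$ kubectl describe pod <pod-name>                            # per-pod events and conditions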

$ kubectl get pod
NAME                                                    READY   STATUS        RESTARTS   AGE
cq-iam-boarding-77fd94dc94-8pc6f                        1/1     Running       0          30h
cq-iam-demo-cloud-6b99f6544d-9v7j7                      1/1     Running       0          30h
cq-iam-mpm-dev-8c6cc58fd-fczlw                          1/1     Running       0          30h
cq-iam-proxy-86854cc78d-49gfw                           0/1     Terminating   0          7h42m
cq-iam-proxy-86854cc78d-dqlz8                           0/1     Terminating   0          7h36m
cq-iam-proxy-86854cc78d-m7zs2                           0/1     Pending       0          5h22m
cq-launchpad-app-7b57c478b9-gqcxj                       1/1     Running       0          13h
cq-management-api-7c689c7846-q9fz2                      1/1     Running       0          29h
cq-opa-api-8458db697c-75rzd                             1/1     Running       0          30h
cq-settings-app-6874885794-mspj9                        1/1     Running       0          29h
node-debugger-aks-nodepool1-31127038-vmss000000-czt8s   0/1     Pending       0          8h
$ kubectl top pods
NAME                                 CPU(cores)   MEMORY(bytes)
cq-iam-boarding-77fd94dc94-8pc6f     2m           482Mi
cq-iam-demo-cloud-6b99f6544d-9v7j7   2m           507Mi
cq-iam-mpm-dev-8c6cc58fd-fczlw       2m           443Mi
cq-launchpad-app-7b57c478b9-gqcxj    0m           2Mi
cq-management-api-7c689c7846-q9fz2   1m           88Mi
cq-opa-api-8458db697c-75rzd          1m           17Mi
cq-settings-app-6874885794-mspj9     1m           2Mi
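As a side note, the scheduler places pods based on resource requests rather than live usage, so besides kubectl top it can be worth checking each node's allocated requests; a sketch against the node the Pending pod is bound to:

$ kubectl top nodes                                                          # live CPU/memory per node
$ kubectl describe node aks-nodepool1-31127038-vmss000000 | grep -A 8 'Allocated resources'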
$ kubectl describe pod cq-iam-proxy-86854cc78d-m7zs2
Name:           cq-iam-proxy-86854cc78d-m7zs2
Namespace:      dev
Priority:       0
Node:           aks-nodepool1-31127038-vmss000000/
Labels:         app=cq-iam-proxy
                pod-template-hash=86854cc78d
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
  cq-iam-proxy:
    Image:      xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
    Port:       80/TCP
    Host Port:  0/TCP
    Environment:
      CQ_HOSTNAME:  dev.hvt.zone
      key1:         TODO
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  default-token-pl6p4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pl6p4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
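Events are only retained for a limited time (about an hour by default), so they can be missing from describe output; querying the events API directly for this pod is another option, e.g.:

$ kubectl get events --field-selector involvedObject.name=cq-iam-proxy-86854cc78d-m7zs2 --sort-by=.lastTimestamp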


Check the status of nodepool1:

  • the node pool is healthy and running
  • all three nodes report green for memory, disk, and readiness (one way to verify this from kubectl is sketched below)
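For completeness, the same check from kubectl: node conditions (MemoryPressure, DiskPressure, PIDPressure, Ready) appear in the describe output, e.g.:

$ kubectl get nodes
$ kubectl describe node aks-nodepool1-31127038-vmss000000 | grep -A 10 'Conditions:'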

Can you show the logs of the pod?

This is what I get when I print the pod logs:

$ kubectl logs cq-iam-proxy-86854cc78d-m7zs2
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-m7zs2)
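That NotFound error is consistent with a pod whose container never started: there is no container yet, so there is no log to fetch. One way to double-check (a sketch) is to look at the pod's container statuses, which will likely be empty for this pod:

$ kubectl get pod cq-iam-proxy-86854cc78d-m7zs2 -o jsonpath='{.status.containerStatuses}'   # likely empty: no container was ever created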

Please include the events of pods in Terminating status. There may be a clue there:

$ kubectl describe pod cq-iam-proxy-86854cc78d-49gfw
Name:                      cq-iam-proxy-86854cc78d-49gfw
Namespace:                 dev
Priority:                  0
Node:                      aks-nodepool1-31127038-vmss000000/
Labels:                    app=cq-iam-proxy
                           pod-template-hash=86854cc78d
Annotations:               <none>
Status:                    Terminating (lasts 2d18h)
Termination Grace Period:  30s
IP:
IPs:                       <none>
Controlled By:             ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
  cq-iam-proxy:
    Image:      xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
    Port:       80/TCP
    Host Port:  0/TCP
    Environment:
      CQ_HOSTNAME:  dev.hvt.zone
      key1:         TODO
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  default-token-pl6p4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pl6p4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

There are no events there? Is there anything in the logs of those two pods?

$ kubectl logs cq-iam-proxy-86854cc78d-dqlz8
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-dqlz8)
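As a side note, pods that stay in Terminating for days (like the two above) can usually be cleared by force-deleting them; this only removes the API object, so use with care:

$ kubectl delete pod cq-iam-proxy-86854cc78d-49gfw --grace-period=0 --force
$ kubectl delete pod cq-iam-proxy-86854cc78d-dqlz8 --grace-period=0 --force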

This seems like a problem with the application itself.

It does not seem to be a problem with the application itself. I ran these two commands:

$ kubectl run --image=busybox myapp -- false
$ kubectl run --image=busybox myapp2 -- false
myapp      0/1     CrashLoopBackOff   5          11m
myapp2     0/1     Pending            0          9m26s

  • myapp was able to start: busybox runs false, which exits immediately, so the CrashLoopBackOff above shows the container did start and then exited, as expected
  • myapp2 is stuck in Pending (same as the other applications)
$ kubectl describe pod myapp
...
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  11m                 default-scheduler  Successfully assigned dev/myapp to aks-nodepool1-31127038-vmss000001
  Normal   Created    11m (x4 over 11m)   kubelet            Created container myapp
  Normal   Started    11m (x4 over 11m)   kubelet            Started container myapp
  Normal   Pulling    10m (x5 over 11m)   kubelet            Pulling image "busybox"
  Normal   Pulled     10m (x5 over 11m)   kubelet            Successfully pulled image "busybox"
  Warning  BackOff    95s (x47 over 11m)  kubelet            Back-off restarting failed container
$ kubectl describe pod myapp2
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned dev/myapp2 to aks-nodepool1-31127038-vmss000000

The only difference between myapp and myapp2 is that they were scheduled onto different nodes; myapp2 got its Scheduled event, but the kubelet on its node never followed up with Pulling, Created, or Started events (one way to confirm the pod-to-node mapping is sketched after this list):

  • myapp was successfully started on node aks-nodepool1-31127038-vmss000001
  • myapp2 does not start on node aks-nodepool1-31127038-vmss000000
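A sketch for confirming which node each pod landed on, and for listing everything currently bound to the suspect node:

$ kubectl get pods -o wide                                                        # NODE column shows placement
$ kubectl get pods -o wide --field-selector spec.nodeName=aks-nodepool1-31127038-vmss000000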
-- Florian Boehmak
azure-aks
kubernetes

1 Answer

7/8/2021

After two weeks the cluster healed itself.

The node aks-nodepool1-31127038-vmss000000 was problematic and would get stuck when starting containers.

Next time I encounter this problem, I will try these commands to recover the node:

kubectl cordon my-node            # Mark my-node as unschedulable
kubectl drain my-node             # Drain my-node in preparation for maintenance
kubectl uncordon my-node          # Mark my-node as schedulable
kubectl top node my-node          # Show metrics for a given node
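A usage sketch against the node that was misbehaving (the drain flags assume a reasonably recent kubectl; on AKS the underlying VMSS instance can also be restarted or reimaged from the Azure side before uncordoning):

$ kubectl cordon aks-nodepool1-31127038-vmss000000
$ kubectl drain aks-nodepool1-31127038-vmss000000 --ignore-daemonsets --delete-emptydir-data
$ kubectl uncordon aks-nodepool1-31127038-vmss000000   # after the node has been restarted/reimaged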
-- Florian Boehmak
Source: StackOverflow