Pods are getting stuck in shutdown status

8/10/2021

I can see the message "Node is shutting, evicting pods" in the pod description. This happens only for pods with a specific toleration and node selector, running on a preemptible node pool.

We have added tolerations to the pods and created separate node pools with different taints (preemptible and non-preemptible) to segregate preemptible and non-preemptible workloads in the cluster.

The cluster without taints is working fine.

On the cluster with taints, pods are getting stuck in Shutdown status (only the pods that were deployed on the preemptible node pool).

Here is the pod description:

Namespace:      XXXXXX
Priority:       0
Node:           gke-cluster-reliable-preemptible-node-XXXXXX
Start Time:     Tue, 10 Aug 2021 16:44:30 +0530
Labels:         app=XXXX
                pod-template-hash=XXXX
                release=XXXX
                repo=XXX
Annotations:    randVersion: a200a
Status:         Failed
Reason:         Shutdown
Message:        Node is shutting, evicting pods
IP:
IPs:            <none>
Controlled By:  ReplicaSet/career-assessor-be-8467d6c885
Containers:
  career-assessor-be:
    Image:      XXXXXX
    Port:       8001/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      -c
    Args:
      XXXXX

    Limits:
      cpu:     3200m
      memory:  2400Mi
    Requests:
      cpu:     1600m
      memory:  1800Mi
    Environment Variables from:
      careerassessor-config  ConfigMap  Optional: false
    Environment:
      LOG_TO_CONSOLE:     1
      INACTIVITY_PERIOD:
      USER_EMAIL:         jyostna@springboard.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xpdwd (ro)
  cloudsql-proxy:
    Image:      gcr.io/cloudsql-docker/gce-proxy:1.17
    Port:       <none>
    Host Port:  <none>
    Command:
      /cloud_sql_proxy
      -instances=$(CLOUD_SQL_CONNECTION_NAME)=tcp:0.0.0.0:3306
      -credential_file=/secrets/cloudsql/cloudsql-instance-credentials.json
      -term_timeout=$(CLOUD_SQL_CONNECTION_TIMEOUT)s
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     20m
      memory:  20Mi
    Environment:
      CLOUD_SQL_CONNECTION_NAME:     <set to the key 'CLOUD_SQL_CONNECTION_NAME' of config map 'careerassessor-config'>     Optional: false
      CLOUD_SQL_CONNECTION_TIMEOUT:  <set to the key 'CLOUD_SQL_CONNECTION_TIMEOUT' of config map 'careerassessor-config'>  Optional: false
    Mounts:
      /secrets/cloudsql from careerassessor-cloudsql-instance-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xpdwd (ro)
Volumes:
  careerassessor-cloudsql-instance-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  XXXXX
    Optional:    false
  default-token-xpdwd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  XXX
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  non-preemptible=false
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 non-preemptible=false:NoSchedule
Events:          <none>

Here is the pod YAML:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    randVersion: a200a
  creationTimestamp: "2021-08-10T10:59:29Z"
  generateName: xxx
  labels:
    app: xxx
    pod-template-hash: 8467d6c885
    release: xxxx
    repo: xxx
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:randVersion: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:pod-template-hash: {}
          f:release: {}
          f:repo: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"674b9e8e-420e-44e7-9601-871be01a9fcb"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"career-assessor-be"}:
            .: {}
            f:args: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"INACTIVITY_PERIOD"}:
                .: {}
                f:name: {}
              k:{"name":"LOG_TO_CONSOLE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"USER_EMAIL"}:
                .: {}
                f:name: {}
                f:value: {}
            f:envFrom: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":8001,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:name: {}
                f:protocol: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
          k:{"name":"cloudsql-proxy"}:
            .: {}
            f:command: {}
            f:env:
              .: {}
              k:{"name":"CLOUD_SQL_CONNECTION_NAME"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:configMapKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
              k:{"name":"CLOUD_SQL_CONNECTION_TIMEOUT"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:configMapKeyRef:
                    .: {}
                    f:key: {}
                    f:name: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/secrets/cloudsql"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector:
          .: {}
          f:non-preemptible: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
        f:tolerations: {}
        f:volumes:
          .: {}
          k:{"name":"careerassessor-cloudsql-instance-credentials"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:secretName: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-08-10T10:59:29Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"PodScheduled"}:
            f:message: {}
            f:reason: {}
    manager: kube-scheduler
    operation: Update
    time: "2021-08-10T10:59:29Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:message: {}
        f:phase: {}
        f:reason: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2021-08-10T11:51:28Z"
  name: career-assessor-be-8467d6c885-h27sh
  namespace: jyostna1
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: career-assessor-be-8467d6c885
    uid: 674b9e8e-420e-44e7-9601-871be01a9fcb
  resourceVersion: "48899168"
  uid: 8837f88d-7e3e-444f-a804-32a7a6e98c71
spec:
  containers:
  - args:
    - |
      xxxx
    command:
    - /bin/sh
    - -c
    env:
    - name: LOG_TO_CONSOLE
      value: "1"
    - name: INACTIVITY_PERIOD
    - name: USER_EMAIL
      value: jyostna@springboard.com
    envFrom:
    - configMapRef:
        name: careerassessor-config
    image: us.gcr.io/springboard-production/career_assessor:IP-405-implement-explored-strategy-for-r
    imagePullPolicy: Always
    name: career-assessor-be
    ports:
    - containerPort: 8001
      name: be-port
      protocol: TCP
    resources:
      limits:
        cpu: 3200m
        memory: 2400Mi
      requests:
        cpu: 1600m
        memory: 1800Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xpdwd
      readOnly: true
  - command:
    - /cloud_sql_proxy
    - -instances=$(CLOUD_SQL_CONNECTION_NAME)=tcp:0.0.0.0:3306
    - -credential_file=/secrets/cloudsql/cloudsql-instance-credentials.json
    - -term_timeout=$(CLOUD_SQL_CONNECTION_TIMEOUT)s
    env:
    - name: CLOUD_SQL_CONNECTION_NAME
      valueFrom:
        configMapKeyRef:
          key: CLOUD_SQL_CONNECTION_NAME
          name: careerassessor-config
    - name: CLOUD_SQL_CONNECTION_TIMEOUT
      valueFrom:
        configMapKeyRef:
          key: CLOUD_SQL_CONNECTION_TIMEOUT
          name: careerassessor-config
    image: gcr.io/cloudsql-docker/gce-proxy:1.17
    imagePullPolicy: IfNotPresent
    name: cloudsql-proxy
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 20m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /secrets/cloudsql
      name: careerassessor-cloudsql-instance-credentials
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xpdwd
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-cluster-reliable-preemptible-node-4b42c9be-x9qs
  nodeSelector:
    non-preemptible: "false"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: non-preemptible
    operator: Equal
    value: "false"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: careerassessor-cloudsql-instance-credentials
    secret:
      defaultMode: 420
      secretName: careerassessor-cloudsql-instance-credentials
  - name: default-token-xpdwd
    secret:
      defaultMode: 420
      secretName: default-token-xpdwd
status:
  message: Node is shutting, evicting pods
  phase: Failed
  reason: Shutdown
  startTime: "2021-08-10T11:14:30Z"

Description of the node:

Name:               gke-cluster-reliable-preemptible-node-xxxxx
Roles:              <none>
Labels:             beta.kubernetes.io/arch=xxx
                    beta.kubernetes.io/instance-type=xxx
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-boot-disk=pd-standard
                    cloud.google.com/gke-container-runtime=containerd
                    cloud.google.com/gke-nodepool=preemptible-nodepool
                    cloud.google.com/gke-os-distribution=cos
                    cloud.google.com/gke-preemptible=true
                    cloud.google.com/machine-family=n1
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-cluster-reliable-preemptible-node-xxxx
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=n1-standard-4
                    non-preemptible=false
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        container.googleapis.com/instance_id: 7488269578212988511
                    csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/playground-206205/zones/us-central1-a/instances/gke-cluster-reliable-preemptible-node-4b42c9be-x9qs"}
                    node.alpha.kubernetes.io/ttl: 0
                    node.gke.io/last-applied-node-labels:
                      cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-nodepool=preemptible-nod...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 10 Aug 2021 17:24:03 +0530
Taints:             non-preemptible=false:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  gke-cluster-reliable-preemptible-node-4b42c9be-x9qs
  AcquireTime:     <unset>
  RenewTime:       Tue, 10 Aug 2021 20:27:03 +0530
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  FrequentDockerRestart         False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   NoFrequentContainerdRestart     containerd is functioning properly
  FrequentUnregisterNetDevice   False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   NoFrequentUnregisterNetDevice   node is functioning properly
  CorruptDockerOverlay2         False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  KernelDeadlock                False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   FilesystemIsNotReadOnly         Filesystem is not read-only
  FrequentKubeletRestart        False   Tue, 10 Aug 2021 20:24:28 +0530   Tue, 10 Aug 2021 17:24:08 +0530   NoFrequentKubeletRestart        kubelet is functioning properly
  NetworkUnavailable            False   Tue, 10 Aug 2021 17:24:03 +0530   Tue, 10 Aug 2021 17:24:03 +0530   RouteCreated                    NodeController create implicit route
  MemoryPressure                False   Tue, 10 Aug 2021 20:26:16 +0530   Tue, 10 Aug 2021 17:24:00 +0530   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Tue, 10 Aug 2021 20:26:16 +0530   Tue, 10 Aug 2021 17:24:00 +0530   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Tue, 10 Aug 2021 20:26:16 +0530   Tue, 10 Aug 2021 17:24:00 +0530   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Tue, 10 Aug 2021 20:26:16 +0530   Tue, 10 Aug 2021 17:24:03 +0530   KubeletReady                    kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.128.0.100
  ExternalIP:   34.133.49.148
  InternalDNS:  gke-cluster-reliable-preemptible-node-4b42c9be-x9qs.c.playground-206205.internal
  Hostname:     gke-cluster-reliable-preemptible-node-4b42c9be-x9qs.c.playground-206205.internal
Capacity:
  attachable-volumes-gce-pd:  127
  cpu:                        4
-- jyostna lalitha
google-cloud-platform
kubernetes

1 Answer

8/10/2021

Thanks for all the info. According to the documentation, and assuming your GKE cluster is on version 1.20:

On preemptible GKE nodes running versions 1.20 or later, the kubelet graceful node shutdown feature is enabled by default. As a result, kubelet detects preemption and gracefully terminates Pods.

For Pods on preemptible nodes, do not specify more than 25 seconds for terminationGracePeriodSeconds because those Pods will only receive 25 seconds during preemption.
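Concretely, that means capping the grace period in the pod spec. A minimal sketch of the relevant Deployment fragment (the field is standard Kubernetes; the 25-second value comes from the GKE guidance above):

    # Pod template fragment: keep the grace period within the ~25s
    # window a preemptible node gets before it is reclaimed.
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 25  # do not exceed 25s on preemptible nodes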

The best way to use taints and tolerations is to rely on the default label created on preemptible VMs. Tainting a node for preemptible VMs:

kubectl taint nodes node-name cloud.google.com/gke-preemptible="true":NoSchedule

Add the matching toleration to a Pod:

tolerations:
- key: cloud.google.com/gke-preemptible
  operator: Equal
  value: "true"
  effect: NoSchedule
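To pair that toleration with scheduling onto the tainted nodes, you can select on the same built-in label instead of a custom one like non-preemptible=false. A sketch, assuming you want these pods to land only on preemptible nodes:

    # Pod spec fragment (sketch): toleration plus a nodeSelector on the
    # default cloud.google.com/gke-preemptible label GKE sets on such nodes.
    tolerations:
    - key: cloud.google.com/gke-preemptible
      operator: Equal
      value: "true"
      effect: NoSchedule
    nodeSelector:
      cloud.google.com/gke-preemptible: "true"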

Also:

When the kubelet terminates Pods during preemptible node shutdown, it assigns a Failed status and a Shutdown reason to the Pods. These Pods are cleaned up during the next garbage collection. You can also delete shutdown Pods manually using the following command:

kubectl get pods --all-namespaces | grep -i shutdown | awk '{print $1, $2}' | xargs -n 2 kubectl delete pod -n
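If you would rather not run that by hand, one option is a small CronJob that runs the same cleanup on a schedule. This is a sketch, not from the linked docs; the name, schedule, image, and the pod-cleaner ServiceAccount (which needs RBAC permission to list and delete pods cluster-wide) are all assumptions:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: shutdown-pod-cleaner          # hypothetical name
    spec:
      schedule: "*/30 * * * *"            # every 30 minutes (assumption)
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: pod-cleaner   # needs RBAC to list/delete pods (assumption)
              restartPolicy: Never
              containers:
              - name: cleaner
                image: bitnami/kubectl:latest   # any image with kubectl works (assumption)
                command:
                - /bin/sh
                - -c
                # Delete pods whose status line contains "Shutdown";
                # xargs -n 2 passes one namespace/pod pair per invocation.
                - kubectl get pods --all-namespaces | grep -i shutdown | awk '{print $1, $2}' | xargs -n 2 kubectl delete pod -n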

Please review the full documentation which explains all the details: https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms

-- CaioT
Source: StackOverflow