Why does Kubernetes not terminate pods after a node crash?

4/13/2020

I am running a self-managed Kubernetes cluster 1.18.1. I have deployed some pods with persistent volumes (based on the Longhorn project). Now, after doing some testing, I observe the following behavior:

If I simulate a hard shutdown of one node, after a while (about 5 minutes) Kubernetes recognizes the loss and starts rescheduling pods from the dead node to another one.

Because my pods use persistent volumes, the new pod never starts. The reason is that the old pod (on the dead node) is now stuck permanently in the status Terminating.
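
To illustrate, this is roughly how I check which pods are still bound to the dead node while the replacement hangs; the node name is just a placeholder:

# List all pods scheduled on the crashed node together with their current status
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<dead-node-name>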

The fact that pods residing on a crashed node do not terminate seems to be a well-known Kubernetes limitation. See also the problem description here.

My question is: why does Kubernetes not provide a function to automatically terminate old pods and release resources like persistent volumes? Why do I have to intervene manually as an administrator? To me, this behavior seems illogical given the promises that Kubernetes makes.

This is, for example, how my YAML file looks:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
  labels: 
    app: db
spec:
  replicas: 1
  selector: 
    matchLabels:
      app: db
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - env:
        - name: POSTGRES_DB
          value: office
        image: postgres:9.6.1
        name: db

        livenessProbe:
          tcpSocket:
            port: 5432
          initialDelaySeconds: 30
          periodSeconds: 10

        ports:
          - containerPort: 5432        
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: dbdata
          subPath: postgres
      restartPolicy: Always
      volumes:
      - name: dbdata
        persistentVolumeClaim:
          claimName: office-demo-dbdata-pvc


# Storage
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: office-demo-dbdata-pv
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  claimRef:
    namespace: default
    name: office-demo-dbdata-pvc
  csi:
    driver: io.rancher.longhorn 
    fsType: ext4
    volumeHandle: office-demo-dbdata
  storageClassName: longhorn-durable
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: office-demo-dbdata-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-durable
  resources:
    requests:
      storage: 2Gi
  volumeName: "office-demo-dbdata-pv"

As explained, the volume is created on Longhorn. But the attachment is not released even after Kubernetes starts to reschedule the pod to another node.
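
One way to find the name of the stuck attachment is to list the volumeattachment objects; the output shows which node each volume is still attached to:

# Show all CSI volume attachments and the node they are currently bound to
kubectl get volumeattachment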

The pod hanging in the Terminating status can be released if I manually delete the 'volumeattachment':

$ kubectl delete volumeattachment csi-08d9842e.......

But in any case, this is a manual action.

-- Ralph
kubernetes

2 Answers

4/17/2020

According to Longhorn Documentation on Node Failure:

When a Kubernetes node fails with CSI driver installed (all the following are based on Kubernetes v1.12 with default setup):

  1. After one minute, kubectl get nodes will report NotReady for the failure node.
  2. After about five minutes, the states of all the pods on the NotReady node will change to either Unknown or NodeLost.
  3. If you’re deploying using a StatefulSet or Deployment, you need to decide whether it’s safe to force delete the pod of the workload running on the lost node. See here.
    1. StatefulSet has stable identity, so Kubernetes won’t force delete the Pod for the user.
    2. Deployment doesn’t have a stable identity, but Longhorn is a Read-Write-Once type of storage, which means it can only be attached to one Pod. So the new Pod created by Kubernetes won’t be able to start, because the Longhorn volume is still attached to the old Pod on the lost Node.
    3. In both cases, Kubernetes will automatically evict the pod (set a deletion timestamp for the pod) on the lost node, then try to recreate a new one with the old volumes. Because the evicted pod gets stuck in the Terminating state and the attached Longhorn volumes cannot be released/reused, the new pod will get stuck in the ContainerCreating state. That’s why users need to decide whether it’s safe to force delete the pod (see the example after this list).
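
As a rough example of that manual decision (pod name and namespace are placeholders), force deleting the stuck pod lets Kubernetes finish the rescheduling; only do this when you are sure the node is really gone:

# Force delete the pod stuck in Terminating on the lost node
kubectl delete pod <POD_NAME> -n <NAMESPACE> --grace-period=0 --force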

This is the current state of Longhorn (which is still in beta).

There is an open issue on GitHub, Improve node failure handling #1105, to address exactly that, but for now, as stated in the documentation, the admin has to intervene manually.

There are more issues in the Kubernetes GitHub like this one, which I believe lies on the edge between Kubernetes and the CSI; it's a mutual standoff: the CSI storage is read-write-once, so it keeps the volume attached and locked. On the Kubernetes side, it sees that the pods have finalizers (as in the issue above) and won't delete them until the finalizers are removed.

Unfortunately, either way requires manual intervention as of today.

  • From the Longhorn perspective, you should run the volumeattachment deletion.
  • From the Kubernetes side, you can remove the pod finalizers, allowing it to remove the pods; here is an example of how to change the finalizers.

Edit:

Example from Kubernetes GitHub issue #69697 for removing finalizers:

kubectl patch pvc <PVC_NAME> -p '{"metadata":{"finalizers":null}}'
kubectl patch pod <POD_NAME> -p '{"metadata":{"finalizers":null}}'

You can create a script to remove the finalizers so you don't have to do it manually, as suggested in another open Kubernetes issue, #77258:

Here's a one-liner to remove finalizers from all pv in the system:

kubectl get pv | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch pv {} -p '{"metadata":{"finalizers": null}}'

The big issue here is that the finalizers are added by Longhorn, so in my understanding you can't create the pods without them, because they are added afterwards by Longhorn.

I added the documentation and the open issues from GitHub to show you that this is a current problem that has yet to be resolved by the developers of both Longhorn and Kubernetes.

-- willrof
Source: StackOverflow

4/13/2020

5 minutes is the default eviction time set by the Kubernetes control plane components. If you want to customize that, you can use taint-based evictions and add the following to the deployment YAML:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60

Note that Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/not-ready. Likewise, it adds a toleration for node.kubernetes.io/unreachable with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/unreachable.
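
To verify which tolerations actually ended up on a running pod (the pod name below is a placeholder), you can inspect the pod spec, for example:

# Print the tolerations on the pod, including the ones added automatically
kubectl get pod <POD_NAME> -o jsonpath='{.spec.tolerations}'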

-- Arghya Sadhu
Source: StackOverflow