I am running a self-managed Kubernetes 1.18.1 cluster. I have deployed some pods with persistent volumes (based on the Longhorn project). Now, after doing some testing, I observe the following behavior:
If I simulate a hard shutdown of one node, after a while (about 5 minutes) Kubernetes recognizes the loss and starts rescheduling pods from the dead node to another one.
Because my pods have persistent volumes, the new pod will never start. The reason is that the old pod (on the dead node) is now permanently stuck in the Terminating status.
The fact that pods residing on a crashed node do not terminate seems to be a well-known Kubernetes limitation. See also the problem description here.
My question is: why does Kubernetes not provide a function to automatically terminate old pods and release resources like persistent volumes? Why do I have to intervene manually as an administrator? To me, this behavior does not seem logical given the promises that Kubernetes makes.
This is, for example, what my YAML file looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
  labels:
    app: db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: db
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - env:
        - name: POSTGRES_DB
          value: office
        image: postgres:9.6.1
        name: db
        livenessProbe:
          tcpSocket:
            port: 5432
          initialDelaySeconds: 30
          periodSeconds: 10
        ports:
        - containerPort: 5432
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: dbdata
          subPath: postgres
      restartPolicy: Always
      volumes:
      - name: dbdata
        persistentVolumeClaim:
          claimName: office-demo-dbdata-pvc
# Storage
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: office-demo-dbdata-pv
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  claimRef:
    namespace: default
    name: office-demo-dbdata-pvc
  csi:
    driver: io.rancher.longhorn
    fsType: ext4
    volumeHandle: office-demo-dbdata
  storageClassName: longhorn-durable
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: office-demo-dbdata-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn-durable
  resources:
    requests:
      storage: 2Gi
  volumeName: "office-demo-dbdata-pv"
As explained, the volume is created on Longhorn. But the attachment is not released even after Kubernetes starts to reschedule the pod to another node.
The pod hanging in the Terminating status can be released if I manually delete the volumeattachment:
$ kubectl delete volumeattachment csi-08d9842e.......
But in any case, this is a manual action.
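For completeness, this is roughly how I locate the stuck attachment; the node name is a placeholder and the attachment name has to be taken from the output:
# List all VolumeAttachments and look for the ones still bound to the dead node
kubectl get volumeattachments | grep <dead-node-name>
# Then delete the stuck attachment so Longhorn can reattach the volume elsewhere
kubectl delete volumeattachment <volumeattachment-name>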
According to Longhorn Documentation on Node Failure:
When a Kubernetes node fails with CSI driver installed (all the following are based on Kubernetes v1.12 with default setup):
- After one minute, kubectl get nodes will report NotReady for the failure node.
- After about five minutes, the states of all the pods on the NotReady node will change to either Unknown or NodeLost.
- If you're deploying using StatefulSet or Deployment, you need to decide if it's safe to force delete the pod of the workload running on the lost node. See here.
- StatefulSet has stable identity, so Kubernetes won't force delete the Pod for the user.
- Deployment doesn't have stable identity, but Longhorn is a Read-Write-Once type of storage, which means it can only be attached to one Pod. So the new Pod created by Kubernetes won't be able to start due to the Longhorn volume still being attached to the old Pod on the lost Node.
- In both cases, Kubernetes will automatically evict the pod (set the deletion timestamp for the pod) on the lost node, then try to recreate a new one with the old volumes. Because the evicted pod gets stuck in Terminating state and the attached Longhorn volumes cannot be released/reused, the new pod will get stuck in ContainerCreating state. That's why users need to decide if it's safe to force delete the pod.
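For reference, the force deletion the documentation refers to would look roughly like this; the pod name and namespace are placeholders:
# Force delete the pod stuck in Terminating so the Longhorn volume can be released
kubectl delete pod <POD_NAME> -n <NAMESPACE> --grace-period=0 --force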
This is the current state of Longhorn (which is still in beta).
There is an open issue on GitHub, Improve node failure handling #1105, to address exactly that, but for now, as stated in the documentation, the admin has to intervene manually.
There are more issues in the Kubernetes GitHub, like this one, which I believe lies at the boundary between Kubernetes and the CSI; it's a mutual block: the CSI storage is Read-Write-Once, so it stays attached and locked to the old pod, while on the Kubernetes side the pod still has finalizers (as in the issue above), so Kubernetes won't delete it until they are cleared.
Unfortunately, either way, as of today, manual intervention is required, e.g. delete volumeattachment as shown above.
Edit:
Kubernetes GitHub issue #69697 gives an example for removing finalizers:
kubectl patch pvc <PVC_NAME> -p '{"metadata":{"finalizers":null}}'
kubectl patch pod <POD_NAME> -p '{"metadata":{"finalizers":null}}'
You can create a script to remove the finalizers so you don't have to do it manually, as suggested in another open Kubernetes issue, #77258:
Here's a one-liner to remove finalizers from all pv in the system:
kubectl get pv | tail -n+2 | awk '{print $1}' | xargs -I{} kubectl patch pv {} -p '{"metadata":{"finalizers": null}}'
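As a sketch, the same pattern could be applied to the pods stuck in Terminating; the namespace is a placeholder, and you should review the list before patching, since removing finalizers bypasses normal cleanup:
kubectl get pods -n <NAMESPACE> | grep Terminating | awk '{print $1}' | xargs -I{} kubectl patch pod {} -n <NAMESPACE> -p '{"metadata":{"finalizers":null}}'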
The big issue here is that the finalizers are added by Longhorn, so in my understanding you can't create the pods without them, because they are added afterwards by Longhorn.
I added the documentation and the open issues from GitHub to show you that this is a current issue that has yet to be resolved by the developers of both Longhorn and Kubernetes.
5 minutes is the default eviction time set by the Kubernetes control plane components. If you want to customize that, you can use taint-based evictions and add the following to the deployment YAML:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60
Note that Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/not-ready. Likewise, it adds a toleration for node.kubernetes.io/unreachable with tolerationSeconds=300 unless the pod configuration provided by the user already has a toleration for node.kubernetes.io/unreachable.
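As a sketch, overriding both defaults in the pod template of the Deployment above could look like this (60 seconds is just an example value, not a recommendation):
spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 60
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 60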