I have a simple k8s installation with a few nodes and Ceph (kubernetes.io/rbd) as the storage class. I have a deployment with a single pod that uses a persistent volume, obtained through a PersistentVolumeClaim (ReadWriteOnce) from this storage class.
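For reference, a minimal sketch of the setup (the names `ceph-rbd`, `my-app`, and `my-app-data` are placeholders, not my real manifests):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce          # single-node attach; the root of the problem below
  storageClassName: ceph-rbd # StorageClass backed by kubernetes.io/rbd
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-app-data
```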
The node running this pod has failed: it has shown NotReady in kubectl get nodes output for a long time, and it is physically dead.
K8s could not create a replacement pod for my deployment because of the error: 'Multi-Attach error for volume "pvc-..." Volume is already exclusively attached to one node and can't be attached to another'.
I see that the PV is still bound to the failed node: "Status: Bound".
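This is roughly how I checked the state (the pod name is a placeholder):

```shell
kubectl get nodes                    # the failed node shows STATUS NotReady
kubectl describe pod my-app-<hash>   # Events section shows the Multi-Attach error
kubectl get pv                       # the PV still shows STATUS Bound
```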
How can I force Kubernetes to forget about the old pod so that a new pod can bind to the persistent volume?
It is a complex problem.
The kubelet daemon, which manages the mounts of volumes, should report the new status of a volume so that the scheduler can spawn the pod on another node.
But your node is in the NotReady status, which means Kubernetes cannot communicate with its kubelet to check the current state of the volumes. For Kubernetes, the status of the volume is the last one it received: "Bound." It is not possible to reset that status without changing the state of the node.
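You can observe that stale state directly: the attach/detach controller records attached volumes on the Node object, and the dead node keeps listing the volume there because its kubelet can never confirm a detach. A sketch (the node name is a placeholder):

```shell
# The attach/detach controller writes attachments into node status;
# on the dead node this list is never cleaned up:
kubectl get node failed-node-1 -o jsonpath='{.status.volumesAttached}'
```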
I see only two workarounds here:

1. Use the ReadWriteMany access mode instead of ReadWriteOnce. CephFS can work in that mode, but RBD can't. That mode allows Kubernetes to attach the same volume on several nodes at the same time (see the sketch below).
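A minimal sketch of such a claim, assuming your cluster also has a CephFS-backed StorageClass (the name `cephfs` here is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteMany          # can be attached on several nodes at once
  storageClassName: cephfs   # assumed CephFS-backed class; RBD cannot do RWX
  resources:
    requests:
      storage: 10Gi
```

Note that you cannot change the access mode of an existing PVC in place; you would need to create a new claim and migrate the data.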