Moving Pod to another node automatically

7/8/2021

Is it possible for a pod/deployment/statefulset to be moved to another node, or recreated on another node automatically, if the first node fails? The pod in question is set to 1 replica. In other words, is it possible to configure some sort of failover for Kubernetes pods? I've tried pod anti-affinity settings, but nothing has been moved automatically; it has been around 10 minutes now.

The YAML for the pod in question is below:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-rbd-sc-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ceph-rbd-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-pod-pvc-sc
  labels:
    app: ceph-rbd-pod-pvc-sc
spec:
  containers:
  - name:  ceph-rbd-pod-pvc-sc
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /mnt/ceph_rbd
      name: volume
  nodeSelector:
    etiket: worker
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: ceph-rbd-sc-pvc
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: ceph-rbd-pod-pvc-sc
        topologyKey: "kubernetes.io/hostname"

Edit:

I managed to get it to work. But now I have another problem: the newly created pod on the other node is stuck in "ContainerCreating" and the old pod is stuck in "Terminating". I also get a Multi-Attach error for the volume, stating that the PV is still in use by the old pod. The situation is the same for any deployment/statefulset with a PV attached; the problem resolves only when the failed node comes back online. Is there a solution for this?

-- Nyquillus
kubernetes
kubernetes-deployment
kubernetes-pod
kubernetes-statefulset

2 Answers

7/9/2021

The answer from coderanger remains valid regarding Pods. Answering your last edit:

Your issue is with CSI:

  • when your Pod uses a PersistentVolume whose accessModes is RWO,

  • and when the Node hosting your Pod becomes unreachable, prompting Kubernetes to terminate the current Pod and schedule a new one on another Node,

then your PersistentVolume cannot be attached to the new Node.

The reason for this is that CSI introduced a kind of "lease", marking a volume as bound to a Node.

With previous CSI specs and implementations, this lock was not visible in terms of Kubernetes API objects. If your ceph-csi deployment is recent enough, you should find a corresponding VolumeAttachment object that can be deleted to fix your issue:

# kubectl get volumeattachments -n ci
NAME                                                                   ATTACHER           PV                                         NODE                ATTACHED   AGE
csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794   rbd.csi.ceph.com   pvc-902c3925-11e2-4f7f-aac0-59b1edc5acf4   melpomene.xxx.com   true       14d
csi-24847171efa99218448afac58918b6e0bb7b111d4d4497166ff2c4e37f18f047   rbd.csi.ceph.com   pvc-b37722f7-0176-412f-b6dc-08900e4b210d   clio.xxx.com        true       90d
....
kubectl delete -n ci volumeattachment csi-xxxyyyzzz

Those VolumeAttachments are created by your CSI provisioner, before the device mapper attaches a volume.

They are deleted only once the corresponding PV has been released from a given Node according to its device mapper. That requires the device mapper to be running, the kubelet to be up, and the Node to be marked Ready according to the API. Until then, other Nodes can't map the volume. There is no timeout: should a Node become unreachable due to network issues or an abrupt shutdown/force-off/reset, its RWO PVs are stuck.
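As a sketch, before deleting anything you can inspect a VolumeAttachment to confirm which Node it is still pinned to (the name below is copied from the listing above; substitute your own):

kubectl get volumeattachment csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794 -o yaml
# .spec.nodeName is the Node the volume is still attached to,
# .spec.source.persistentVolumeName is the PV, and .status.attached its state.
# Only delete the object once you are sure that Node is really gone:
kubectl delete volumeattachment csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794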

See: https://github.com/ceph/ceph-csi/issues/740

One workaround for this would be not to use CSI, and instead stick with legacy (in-tree) StorageClasses, which in your case means installing the rbd client on your nodes.
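A minimal sketch of such a legacy StorageClass, assuming the in-tree kubernetes.io/rbd provisioner and placeholder monitor addresses, pool and Secret names:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-legacy                    # hypothetical name
provisioner: kubernetes.io/rbd             # legacy in-tree provisioner; needs the rbd binary on every node
parameters:
  monitors: 10.0.0.1:6789,10.0.0.2:6789    # placeholder monitor addresses
  pool: rbd                                # placeholder pool
  adminId: admin
  adminSecretName: ceph-admin-secret       # placeholder Secret holding the Ceph admin key
  adminSecretNamespace: kube-system
  userId: kube
  userSecretName: ceph-user-secret         # placeholder Secret in the PVC's namespace
  fsType: ext4
  imageFormat: "2"
  imageFeatures: layering
reclaimPolicy: Delete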

Though the last time I checked (k8s 1.19.x) I couldn't get that working, and I can't recall what was wrong. CSI tends to be "the way" to do it nowadays. Sadly, it is not really suitable for production use in this respect, unless you are running in an IaaS with auto-scale groups that delete failed Nodes from the Kubernetes API (eventually evicting the corresponding VolumeAttachments), or you use some kind of MachineHealthCheck like OpenShift 4 implements.

-- SYN
Source: StackOverflow

7/8/2021

A bare Pod is a single immutable object; it doesn't have any of these nice things. Related: never ever use bare Pods for anything. If you try this with a Deployment, you should see it spawn a new Pod to get back to the requested number of replicas. If the new Pod is unschedulable, you should see events emitted explaining why: for example, only node 1 matches the nodeSelector you specified, or another Pod is already running on the other node, which triggers the anti-affinity.
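A minimal sketch of such a Deployment, reusing the container spec and labels from the question (the Deployment name here is made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ceph-rbd-deploy-pvc-sc             # hypothetical name
spec:
  replicas: 1                              # same single replica as in the question
  selector:
    matchLabels:
      app: ceph-rbd-pod-pvc-sc
  template:
    metadata:
      labels:
        app: ceph-rbd-pod-pvc-sc
    spec:
      nodeSelector:
        etiket: worker
      containers:
      - name: ceph-rbd-pod-pvc-sc
        image: busybox
        command: ["sleep", "infinity"]
        volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: ceph-rbd-sc-pvc

The single replica is still subject to the same nodeSelector and any anti-affinity rule, so scheduling can still fail for the reasons listed above.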

-- coderanger
Source: StackOverflow