I am unable to configure my stateful application to be resilient to a Kubernetes worker node failure (the node where my application pod runs).
$ kk get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-openebs-97767f45f-xbwp6 1/1 Running 0 6m21s 192.168.207.233 new-kube-worker1 <none> <none>
Once I take the worker down, Kubernetes notices that the pod is not responding and schedules a replacement on a different worker.
marek649@new-kube-master:~$ kk get pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-openebs-97767f45f-gct5b 0/1 ContainerCreating 0 22s <none> new-kube-worker2 <none> <none>
example-openebs-97767f45f-xbwp6 1/1 Terminating 0 13m 192.168.207.233 new-kube-worker1 <none> <none>
This is great, but the new pod is not able to start, since it tries to attach the same PVC that the old pod was using, and Kubernetes does not release the binding from the old (unresponsive) node.
$ kk describe pod example-openebs-97767f45f-gct5b
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/example-openebs-97767f45f
Containers:
example-openebs:
Container ID:
Image: nginx
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/usr/share/nginx/html from demo-claim (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-4xmvf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
demo-claim:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: example-pvc
ReadOnly: false
default-token-4xmvf:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4xmvf
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m9s default-scheduler Successfully assigned default/example-openebs-97767f45f-gct5b to new-kube-worker2
Warning FailedAttachVolume 2m9s attachdetach-controller Multi-Attach error for volume "pvc-911f94a9-b43a-4cac-be94-838b0e7376e8" Volume is already used by pod(s) example-openebs-97767f45f-xbwp6
Warning FailedMount 6s kubelet, new-kube-worker2 Unable to attach or mount volumes: unmounted volumes=[demo-claim], unattached volumes=[demo-claim default-token-4xmvf]: timed out waiting for the condition
I am able to resolve this situation by manually force-deleting the pods, unbinding the PV, and recreating the pods, but this is far from the high availability that I am expecting.
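For reference, the manual recovery above looks roughly like the following commands. The pod name is the one from my cluster, and the VolumeAttachment step assumes a CSI-managed volume; `<attachment-name>` is a placeholder you have to look up yourself:

```shell
# Force-delete the pod stuck in Terminating on the unreachable node
kubectl delete pod example-openebs-97767f45f-xbwp6 --grace-period=0 --force

# List VolumeAttachment objects to find the one still bound to the dead node...
kubectl get volumeattachment

# ...and delete it so the attach/detach controller can re-attach the volume elsewhere
kubectl delete volumeattachment <attachment-name>
```

After this, the replacement pod on the other worker mounts the volume and starts normally.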
I am using OpenEBS Jiva volumes, and after the manual intervention I am able to restore the pod with the correct data on the PV, which means the data gets replicated to the other nodes correctly.
Can someone please explain what I am doing wrong and how to achieve fault tolerance for Kubernetes applications with volumes attached?
I found this related issue, but I don't see any suggestion there for how to overcome the problem: https://github.com/openebs/openebs/issues/2536
To deploy stateful applications, Kubernetes has the StatefulSet object, which might help you in this case.
StatefulSets are valuable for applications that require one or more of the following: stable, unique network identifiers; stable, persistent storage; ordered, graceful deployment and scaling; and ordered, automated rolling updates.
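A minimal sketch of what that could look like for the workload in the question, using a `volumeClaimTemplate` so each replica gets its own PVC. The StorageClass name `openebs-jiva-default` is an assumption; substitute the one you actually use:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-openebs
spec:
  serviceName: example-openebs
  replicas: 1
  selector:
    matchLabels:
      app: example-openebs
  template:
    metadata:
      labels:
        app: example-openebs
    spec:
      containers:
      - name: example-openebs
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - name: demo-claim
          mountPath: /usr/share/nginx/html
  # One PVC is created per replica and re-bound to the replacement pod
  volumeClaimTemplates:
  - metadata:
      name: demo-claim
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: openebs-jiva-default   # assumed StorageClass name
      resources:
        requests:
          storage: 1Gi
```

Note that a StatefulSet pod on a lost node is not rescheduled until the old pod is confirmed gone, which is the safe behaviour for RWO volumes.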
For unmanaged Kubernetes Clusters, this is a hard problem that applies to all types of RWO volumes.
There have been several discussions around this in the Kubernetes community, which are summarized in these issues:
The current thinking is to use node tolerations to come up with a solution and to implement that solution via the CSI driver.
At OpenEBS, when we looked at how the cloud providers handle this case, we found that when a node is shut down, its corresponding Node object is deleted from the cluster. There is no harm in this operation, since when the node comes back online the Node object is recreated.
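On an unmanaged cluster you can mimic that behaviour by hand, once you are certain the node is really down (the node name is the one from the question):

```shell
# Deleting the Node object releases the pods and volume attachments bound to it;
# the kubelet re-registers the node automatically when it comes back online
kubectl delete node new-kube-worker1
```

After the delete, the attach/detach controller considers the volume free and the replacement pod can attach it.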
Kubernetes will eventually release the volume; usually the limiting factor is the network storage system being slow to detect that the volume is unmounted. But you are correct that it is a limitation. The usual fix is to use a volume type capable of multiple attachments instead, such as NFS or CephFS.
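With such a backend, the claim can request the `ReadWriteMany` access mode, so both the old and the new pod may hold the attachment during failover. A sketch, assuming an NFS-backed StorageClass named `nfs-client` (that name is an assumption):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc-rwx
spec:
  accessModes:
  - ReadWriteMany              # multiple nodes may mount the volume simultaneously
  storageClassName: nfs-client # assumed NFS-backed StorageClass
  resources:
    requests:
      storage: 1Gi
```

This avoids the Multi-Attach error entirely, at the cost of the weaker consistency guarantees of a shared filesystem.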