K8s Volume doesn't detach from host

2/6/2019

We're running Kubernetes on-premise on VMware. So far we have been able to provision volumes for the apps we deploy. The problem comes when a pod - for whatever reason - is rescheduled to a different worker node. When that happens, the disk fails to mount on the second worker because it is still attached to the first worker where the pod was originally running. See below:

As it stands, we have no app disk attached to either worker01 or worker02:

[root@worker01 ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                   2:0    1     4K  0 disk
sda                   8:0    0   200G  0 disk
├─sda1                8:1    0   500M  0 part /boot
└─sda2                8:2    0 199.5G  0 part
  ├─vg_root-lv_root 253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap 253:1    0     2G  0 lvm
  ├─vg_root-lv_var  253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s  253:3    0    20G  0 lvm  /mnt/disks
sr0                  11:0    1  1024M  0 rom


[root@worker02 ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                   2:0    1     4K  0 disk
sda                   8:0    0   200G  0 disk
├─sda1                8:1    0   500M  0 part /boot
└─sda2                8:2    0 199.5G  0 part
  ├─vg_root-lv_root 253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap 253:1    0     2G  0 lvm
  ├─vg_root-lv_var  253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s  253:3    0    20G  0 lvm  /mnt/disks
sr0                  11:0    1   4.5G  0 rom

Next we create our PVC with the following:

[root@master01 ~]$ cat app-pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: thin-disk
  namespace: tools
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

[root@master01 ~]$ kubectl create -f app-pvc.yaml
persistentvolumeclaim "app-pvc" created

This works fine as the disk is created and bound:

[root@master01 ~]$ kubectl get pvc -n tools
NAME        STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
app-pvc   Bound     pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7   10Gi       RWO            thin-disk      12s


[root@master01 ~]$ kubectl get pv -n tools
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM             STORAGECLASS   REASON    AGE
pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7   10Gi       RWO            Delete           Bound     tools/app-pvc   thin-disk                12s
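
For context, thin-disk is a StorageClass backed by the in-tree vSphere provisioner. A minimal sketch of what it looks like is below; the diskformat value (and any optional datastore parameter) is assumed rather than copied from our cluster:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin-disk
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin   # thin-provisioned VMDK; zeroedthick/eagerzeroedthick are the other options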

Now we can deploy our application, which creates the pod and sorts out the storage:

[centos@master01 ~]$ cat app.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
  namespace: tools
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - image: sonatype/app3:latest
        imagePullPolicy: IfNotPresent
        name: app
        ports:
        - containerPort: 8081
        - containerPort: 5000
        volumeMounts:
          - mountPath: /app-data
            name: app-data-volume
      securityContext:
        fsGroup: 2000
      volumes:
        - name: app-data-volume
          persistentVolumeClaim:
            claimName: app-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: tools
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8081
    protocol: TCP
    name: http
  - port: 5000
    targetPort: 5000
    protocol: TCP
    name: docker
  selector:
    app: app

[centos@master01 ~]$ kubectl create -f app.yaml
deployment.extensions "app" created
service "app-service" created

This deploys fine:

[centos@master01 ~]$ kubectl get pods -n tools
NAME                     READY     STATUS              RESTARTS   AGE
app-6588cf4b87-wvwg2   0/1       ContainerCreating   0          6s

[centos@neb-k8s02-master01 ~]$ kubectl describe pod app-6588cf4b87-wvwg2 -n tools

Events:
  Type    Reason                  Age   From                         Message
  ----    ------                  ----  ----                         -------
  Normal  Scheduled               18s   default-scheduler           Successfully assigned nexus-6588cf4b87-wvwg2 to neb-k8s02-worker01
  Normal  SuccessfulMountVolume   18s   kubelet, worker01           MountVolume.SetUp succeeded for volume "default-token-7cv62"
  Normal  SuccessfulAttachVolume  15s   attachdetach-controller     AttachVolume.Attach succeeded for volume "pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7"
  Normal  SuccessfulMountVolume   7s    kubelet, worker01           MountVolume.SetUp succeeded for volume "pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7"
  Normal  Pulled                  7s    kubelet, worker01           Container image "acme/app:latest" already present on machine
  Normal  Created                 7s    kubelet, worker01           Created container
  Normal  Started                 6s    kubelet, worker01           Started container
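
From the Kubernetes side we can confirm the attachment via the node status as well; the volumesAttached and volumesInUse fields should now list the vSphere volume for worker01 (output omitted here):

[centos@master01 ~]$ kubectl get node worker01 -o jsonpath='{.status.volumesAttached}'
[centos@master01 ~]$ kubectl get node worker01 -o jsonpath='{.status.volumesInUse}'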

We can also see in VMware that the disk has been created and attached to Worker01 and not to Worker02:

[root@worker01 ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                   2:0    1     4K  0 disk
sda                   8:0    0   200G  0 disk
├─sda1                8:1    0   500M  0 part /boot
└─sda2                8:2    0 199.5G  0 part
  ├─vg_root-lv_root 253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap 253:1    0     2G  0 lvm
  ├─vg_root-lv_var  253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s  253:3    0    20G  0 lvm  /mnt/disks
sdb                   8:16   0    10G  0 disk /var/lib/kubelet/pods/1e55ad6a-294f-11e9-9175-005056a47f18/volumes/kubernetes.io~vsphere-volume/pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7
sr0                  11:0    1  1024M  0 rom



[root@worker02 ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                   2:0    1     4K  0 disk
sda                   8:0    0   200G  0 disk
├─sda1                8:1    0   500M  0 part /boot
└─sda2                8:2    0 199.5G  0 part
  ├─vg_root-lv_root 253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap 253:1    0     2G  0 lvm
  ├─vg_root-lv_var  253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s  253:3    0    20G  0 lvm  /mnt/disks
sr0                  11:0    1   4.5G  0 rom

If Worker01 falls over, Worker02 kicks in and we can see the disk being attached to the other node:

[root@worker02 ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                   2:0    1     4K  0 disk
sda                   8:0    0   200G  0 disk
├─sda1                8:1    0   500M  0 part /boot
└─sda2                8:2    0 199.5G  0 part
  ├─vg_root-lv_root 253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap 253:1    0     2G  0 lvm
  ├─vg_root-lv_var  253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s  253:3    0    20G  0 lvm  /mnt/disks
sdb                   8:16   0    10G  0 disk /var/lib/kubelet/pods/a0695030-2950-11e9-9175-005056a47f18/volumes/kubernetes.io~vsphere-volume/pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7
sr0                  11:0    1   4.5G  0 rom

However, because the disk is now attached to both Worker01 and Worker02, Worker01 will no longer start, citing the following error in vCenter:

Cannot open the disk '/vmfs/volumes/5ba35d3b-21568577-efd4-469e3c301eaa/kubevols/kubernetes-dynamic-pvc-e55ad6a-294f-11e9-9175-005056a47f18.vmdk' or one of the snapshot disks it depends on.

This error occurs (I assume) because Worker02 now has access to the disk and is reading from and writing to it. Shouldn't Kubernetes detach the disk from nodes that no longer need it once it has been attached to another node? How can we go about fixing this issue? As it stands, if a pod moves to another host due to node failure, we have to detach the disk and start the failed worker by hand.
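
For what it's worth, we can at least see whether the attach/detach controller ever attempted a detach by grepping the controller-manager logs for the volume in question (the pod name below assumes a kubeadm-style control plane; adjust to however the controller-manager runs in your cluster):

[centos@master01 ~]$ kubectl -n kube-system logs kube-controller-manager-master01 | grep -iE 'detach|pvc-d4bf77cc'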

Any and all help appreciated.

-- automation1002
kubernetes
vmware

1 Answer

2/6/2019

First, I'll assume you're running in-tree vSphere disks.
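
You can confirm that from the PV itself; an in-tree vSphere volume carries a vsphereVolume source pointing at the backing VMDK, so something like this should print a path under your kubevols folder:

kubectl get pv pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7 -o jsonpath='{.spec.vsphereVolume.volumePath}'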

Second, in this case (and even more so with CSI) Kubernetes doesn't control every volume operation itself. The VMware functionality for attaching and detaching a disk is implemented in the volume plugin you are using; Kubernetes doesn't implement attach/detach semantics as a single generic function.

To see the in-tree implementation details, check out:

https://kubernetes.io/docs/concepts/storage/volumes/#vspherevolume

Overall, I think the way you are doing failover means that when the pod on worker1 dies, a replacement can be scheduled on worker2. At that point, worker1 should not be able to grab the same PVC, and a pod should not be scheduled back onto it until the worker2 pod dies.
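
One way to sanity-check that is to look at the namespace events while the pod is moving; attach and detach activity (including any multi-attach complaints) is reported there:

kubectl get events -n tools --sort-by=.lastTimestamp | grep -iE 'attach|volume'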

However, if worker1 is scheduling, it means that vSphere is trying to (erroneously) let worker1 start, and the kubelet is failing.

There is a chance that this is a bug in the VMware driver, in that it will bind a persistent volume even though it is not ready to.

To elaborate further, details about how the workload on worker2 is being launched would help. Is it a separate replication controller, or is it running outside of Kubernetes? If the latter, the volumes won't be managed the same way, and you can't use the same PVC as the locking mechanism.
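
If it is in Kubernetes, checking who owns the replacement pod will confirm that a single Deployment/ReplicaSet is managing both placements (the pod name here is just a placeholder):

kubectl describe pod app-6588cf4b87-xxxxx -n tools | grep "Controlled By"
kubectl get deploy,rs -n tools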

-- jayunit100
Source: StackOverflow