Provisioning persistent disks for horizontally scaled pods

12/13/2017

In our cluster we have a horizontally scaling deployment of an application that uses a lot of local disk space, which has been causing major cluster stability problems (Docker crashes, nodes recreating, etc.).

We are trying to have each pod provision its own gcePersistentDisk so that its disk usage is isolated from the rest of the cluster. We created a storage class and a persistent volume claim that uses that class, and specified a volume mount for that claim in our deployment's pod template spec.

However, when we set the autoscaler to use multiple replicas, they apparently try to use the same volume, and we get this error:

Multi-Attach error for volume 
Volume is already exclusively attached to one node and can't be attached to another

Here are the relevant parts of our manifests. Storage Class:

{
  "apiVersion": "storage.k8s.io/v1",
  "kind": "StorageClass",
  "metadata": {
    "annotations": {},
    "name": "some-storage",
    "namespace": ""
  },
  "parameters": {
    "type": "pd-standard"
  },
  "provisioner": "kubernetes.io/gce-pd"
}

PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: some-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: some-storage

Deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: some-deployment
spec:
  template:
    spec:
      volumes:
        - name: some-storage
          persistentVolumeClaim:
            claimName: some-pvc
      containers:
          [omitted]
          volumeMounts:
            - name: some-storage
              mountPath: /var/path

With those applied, we update the deployment's autoscaler to a minimum of 2 replicas and get the above error.

  1. Is this not how persistent volume claims should work?
  2. We definitely don't care about volume sharing, and we don't really care about persistence; we just want storage that is isolated from the rest of the cluster -- is this the right tool for the job?
-- Tyler Gould
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

12/13/2017

A Deployment is meant to be stateless. There is no way for the deployment controller to determine which disk belongs to which pod once a pod gets rescheduled, which would lead to corrupted state. That is why a Deployment can only reference a single disk that is shared across all of its pods.

Concerning the error you are seeing:

Multi-Attach error for volume Volume is already exclusively attached to one node and can't be attached to another

You are getting this because you have pods on multiple nodes but only one volume (because a Deployment can only have one), and multiple nodes are trying to attach that volume for your Deployment's pods. The volume does not appear to be something like NFS, which can be mounted on multiple nodes at the same time. If you do not care about state at all and still want to use a Deployment, you must use storage that supports mounts from multiple nodes simultaneously, such as NFS. You would also need to change the PVC's accessModes to ReadWriteMany, since multiple pods would be writing to the same physical volume.
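For illustration, a shared-access claim might look something like the following (a minimal sketch; it assumes an NFS-backed PersistentVolume or an NFS provisioner already exists in the cluster, since the gce-pd provisioner does not support ReadWriteMany, and the nfs-storage class name is hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: some-shared-pvc
spec:
  accessModes:
  - ReadWriteMany                  # shared read/write across nodes
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-storage    # hypothetical NFS-backed storage class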

If you need a dedicated disk for each pod, then you might want to use a StatefulSet instead. As the name suggests, its pods are meant to keep state, so you can define a volumeClaimTemplates section in it, which creates a dedicated PersistentVolumeClaim (and disk) for each pod, as described in the documentation.
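As a rough sketch of that approach, adapted to the manifests in the question (the apps/v1beta2 API version, the some-statefulset and some-service names, and the container details are placeholders; the headless Service it references would need to exist, and some-storage is the StorageClass from the question):

apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: some-statefulset
spec:
  serviceName: some-service        # headless Service; placeholder name
  replicas: 2
  selector:
    matchLabels:
      app: some-app
  template:
    metadata:
      labels:
        app: some-app
    spec:
      containers:
        - name: some-container     # placeholder container
          image: some-image
          volumeMounts:
            - name: some-storage
              mountPath: /var/path
  volumeClaimTemplates:
    - metadata:
        name: some-storage
      spec:
        accessModes:
        - ReadWriteOnce
        storageClassName: some-storage
        resources:
          requests:
            storage: 20Gi

Each replica then gets its own claim (some-storage-some-statefulset-0, some-storage-some-statefulset-1, and so on), each bound to its own gcePersistentDisk, so the pods' disk usage stays off the node-local disks.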

-- fishi0x01
Source: StackOverflow