Failed to attach volume ... already being used by

6/29/2017

I am running Kubernetes in a GKE cluster on version 1.6.6 and in another cluster on 1.6.4. Both are experiencing issues failing over GCE persistent disks.

I have been simulating failures by running kill 1 inside the container or by killing the GCE node directly. Sometimes I get lucky and the pod is recreated on the same node, but most of the time it isn't.

Looking at the event log, it shows the mount attempt failing 3 times and then nothing more. Without human intervention it never corrects itself; I am forced to kill the pod repeatedly until it lands somewhere that works. During maintenance windows this is a giant pain.

How do I get Kubernetes to fail over with volumes properly? Is there a way to tell the Deployment to try a new node on failure? Is there a way to remove the 3-attempt limit?
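For reference, this is how I have been inspecting the failures (the pod suffix is a placeholder for whatever ID the ReplicaSet generates):

kubectl --namespace jolene describe pod dev-postgres-<pod-id>
kubectl --namespace jolene get events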

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: dev-postgres
  namespace: jolene
spec:
  revisionHistoryLimit: 0
  template:
    metadata:
      labels:
        app: dev-postgres
        namespace: jolene
    spec:
      containers:
      - image: postgres:9.6-alpine
        imagePullPolicy: IfNotPresent
        name: dev-postgres
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgres-data
        env:
          [ ** Removed, irrelevant environment variables ** ]
        ports:
          - containerPort: 5432
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - exec pg_isready
          initialDelaySeconds: 30
          timeoutSeconds: 5
          failureThreshold: 6
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - exec pg_isready --host $POD_IP
          initialDelaySeconds: 5
          timeoutSeconds: 3
          periodSeconds: 5
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: dev-jolene-postgres

I have tried this with and without PersistentVolume / PersistentVolumeClaim.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: dev-jolene-postgres
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  claimRef:
    namespace: jolene
    name: dev-jolene-postgres
  gcePersistentDisk:
    fsType: ext4
    pdName: dev-jolene-postgres

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dev-jolene-postgres
  namespace: jolene
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
-- Joped
google-compute-engine
kubernetes

1 Answer

7/16/2017

By default, every node is schedulable, so there is no need to mention it explicitly in the Deployment. The feature for configuring mount/attach retry limits is still in progress; it can be tracked here: https://github.com/kubernetes/kubernetes/issues/16652
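Until that lands, the workaround is manual: delete the stuck pod so the ReplicaSet schedules a replacement, and if the disk is still attached to the old node, detach it by hand. A rough sketch, where NODE_NAME and ZONE are placeholders for your actual instance and zone:

kubectl --namespace jolene delete pod -l app=dev-postgres

gcloud compute instances detach-disk NODE_NAME \
    --disk dev-jolene-postgres --zone ZONE

Because the disk is ReadWriteOnce, it must be fully detached from the old node before the new pod's attach can succeed.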

-- Suraj Narwade
Source: StackOverflow