Does GKE Autopilot sometimes kill Pods and is there a way to prevent it for Critical Services?

8/21/2021

I've been debugging a 10-minute downtime of our service for some hours now, and I seem to have found the cause, but not the reason for it. Our redis deployment in kubernetes was down for quite a while, so neither django nor celery could reach it. This caused a bunch of jobs to be lost.

There are no events for the redis deployment, but here are the first logs before and after the reboot:

before: (screenshot of the logs immediately before the restart, not reproduced here)

after: (screenshot of the logs after the restart, not reproduced here)

I'm also attaching the complete redis yml at the bottom. We're using GKE Autopilot, so I guess something caused the Pod to restart? Resource usage is a lot lower than requested, at about 1% for both CPU and memory. I'm not sure what's going on here, and I also couldn't find an annotation to tell Autopilot to leave a specific deployment alone.

redis.yml:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gce-ssd
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
    - port: 6379
      name: redis
  clusterIP: None
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      volumes:
        - name: redis-volume
          persistentVolumeClaim:
            claimName: redis-disk
            readOnly: false
      terminationGracePeriodSeconds: 5
      containers:
        - name: redis
          image: redis:6-alpine
          command: ["sh"]
          args: ["-c", 'exec redis-server --requirepass "$REDIS_PASSWORD"']
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
              ephemeral-storage: "1Gi"
          envFrom:
            - secretRef:
                name: env-secrets
          volumeMounts:
            - name: redis-volume
              mountPath: /data
              subPath: data
-- yspreen
celery
google-cloud-platform
google-kubernetes-engine
kubernetes
redis

1 Answer

8/22/2021

A PersistentVolumeClaim is a Kubernetes object that decouples a storage resource request from the actual provisioning, which is handled by the PersistentVolume it gets bound to.

Given your redis-disk PersistentVolumeClaim, Kubernetes will try to dynamically provision a persistent disk suited to the underlying infrastructure, in your case a Google Compute Engine Persistent Disk, based on the requested storage class (gce-ssd).

The claim then results in an SSD-backed Persistent Disk being automatically provisioned for you, and once the claim itself is deleted, the volume and the data on it are destroyed as well.
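
You can check which reclaim policy your gce-ssd StorageClass currently applies to the disks it provisions (the class name is the one referenced by your claim):

<!-- language: lang-sh -->
# The RECLAIMPOLICY column shows whether provisioned disks are deleted or kept
kubectl get storageclass gce-ssd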

To overcome this issue and avoid losing precious data, you have two alternatives:

At the PersistentVolume level

The reclaim policy is a property of the PersistentVolume bound to your claim, not of the PersistentVolumeClaim itself (a PVC has no persistentVolumeReclaimPolicy field). For a dynamically provisioned volume you can switch it from the default Delete policy to Retain by patching the PV:

<!-- language: lang-sh -->
kubectl patch pv <your-pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

With Retain, deleting the claim only moves the persistent volume to the Released state; the underlying disk is kept and its data can be manually backed up or reused.
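
If you are not sure which PersistentVolume is bound to the claim, you can look it up first and then confirm the policy change afterwards (the claim name comes from your manifest; the PV name is whatever the lookup returns):

<!-- language: lang-sh -->
# Name of the PV currently bound to the redis-disk claim
kubectl get pvc redis-disk -o jsonpath='{.spec.volumeName}'

# After patching, the RECLAIM POLICY column should read Retain
kubectl get pv <your-pv-name>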

At the StorageClass level

As a general recommendation, you should set the reclaimPolicy field to Retain (the default is Delete) on the StorageClass you use:

<!-- language: lang-yaml -->
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Two additional settings are recommended:

  • replication-type: set to regional-pd so the provisioned disk is replicated across two zones of the region
  • volumeBindingMode: set to WaitForFirstConsumer so that binding is delayed until a Pod actually uses the claim, letting that Pod's scheduling dictate the zonal topology

You can read more about all of the above StorageClass parameters in the Kubernetes documentation.

For static provisioning, a PersistentVolume with the same storage class name can then be declared:

<!-- language: lang-yaml -->
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ssd-volume
spec:
  storageClassName: "ssd"
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 2Gi  # must be at least as large as the claim's 2Gi request
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: redis-disk

The PersistentVolumeClaim then only needs to declare the requested StorageClass name:

<!-- language: lang-yaml -->
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ssd-volume-claim
spec:
  storageClassName: "ssd"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"

Finally, the Deployment mounts the volume through this claim:

<!-- language: lang-yaml -->
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      volumes:
        - name: redis-volume
          persistentVolumeClaim:
            claimName: ssd-volume-claim
            readOnly: false
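
After applying the manifests, you can verify that the claim is bound and that the volume will outlive it (names as declared above):

<!-- language: lang-sh -->
# STATUS should be Bound and RECLAIM POLICY should be Retain
kubectl get pv,pvc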

Declaring the objects this way prevents failures or scale-down operations from destroying the underlying PersistentVolume, whether it was created manually by a cluster administrator or dynamically through dynamic provisioning.

-- tmarwen
Source: StackOverflow