I've been debugging a 10-minute downtime of our service for a few hours now, and I seem to have found the cause, but not the reason for it. Our Redis deployment in Kubernetes was down for quite a while, leaving our Django app unable to reach it, and a bunch of jobs were lost as a result.
There are no events for the redis deployment, but here are the first logs before and after the reboot:
I'm also attaching the complete redis.yml at the bottom. We're using GKE Autopilot, so I guess something caused the pod to reboot? Resource usage is far lower than requested, at about 1% for both CPU and memory. Not sure what's going on here. I also couldn't find an annotation to tell Autopilot to leave a specific deployment alone.
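For reference, these are the kinds of commands I've been poking at to look for evidence of a restart (the pod name is a placeholder):

kubectl get events --sort-by=.lastTimestamp
kubectl describe deployment redis
kubectl get pods -l app=redis
kubectl logs <redis-pod-name> --previous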
redis.yml:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-disk
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gce-ssd
  resources:
    requests:
      storage: "2Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
    - port: 6379
      name: redis
  clusterIP: None
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      volumes:
        - name: redis-volume
          persistentVolumeClaim:
            claimName: redis-disk
            readOnly: false
      terminationGracePeriodSeconds: 5
      containers:
        - name: redis
          image: redis:6-alpine
          command: ["sh"]
          args: ["-c", 'exec redis-server --requirepass "$REDIS_PASSWORD"']
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
              ephemeral-storage: "1Gi"
          envFrom:
            - secretRef:
                name: env-secrets
          volumeMounts:
            - name: redis-volume
              mountPath: /data
              subPath: data
A PersistentVolumeClaim is a Kubernetes object that lets you decouple a storage request from the actual provisioning, which is handled by its associated PersistentVolume.
Since no PersistentVolume object is declared, Kubernetes will try to dynamically provision a persistent disk suitable for the underlying infrastructure, in your case a Google Compute Engine Persistent Disk, based on the requested storage class (gce-ssd).
The claim therefore results in an SSD Persistent Disk being provisioned for you automatically, and once the claim is deleted (e.g. when the requesting workload and its claim are removed during a downscale), the volume is destroyed.
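For illustration, you can see the dynamically provisioned volume bound to the claim, and its reclaim policy, like this (a sketch; the PV gets a generated pvc-<uid> name):

kubectl get pvc redis-disk
kubectl get pv          # the RECLAIM POLICY column shows Delete by default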
To overcome this issue and avoid losing precious data, you have two alternatives:
The first is to make sure the reclaim policy of the volume is Retain. Note that persistentVolumeReclaimPolicy is a field of the PersistentVolume, not of the claim, so for a dynamically provisioned volume you patch the bound PV rather than adding the field to the PVC manifest (a patch sketch follows below). With Retain, deleting the claim moves the PersistentVolume to the Released state instead of destroying it, and the underlying data can be manually backed up.
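A minimal sketch of that patch, where <pv-name> stands for the generated name of the volume currently bound to redis-disk:

kubectl get pv     # find the volume bound to the redis-disk claim
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'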
The second, more general, recommendation is to set the reclaimPolicy parameter to Retain (the default is Delete) on the StorageClass you use:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
Additional parameters are recommended:
- replication-type: should be set to regional-pd to enable zonal replication
- volumeBindingMode: set to WaitForFirstConsumer so that the first consumer dictates the zonal replication topology

You can read more on all of the above StorageClass parameters in the Kubernetes documentation.
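To double-check after applying the class (a sketch; note that the reclaim policy set on a StorageClass only affects volumes provisioned after the change):

kubectl get storageclass ssd    # shows the provisioner, reclaim policy and volume binding mode
kubectl get pv                  # newly provisioned volumes should report Retain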
A PersistentVolume with the same storage class name is then declared:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ssd-volume
spec:
  storageClassName: "ssd"
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: redis-disk
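This assumes a Compute Engine disk named redis-disk already exists. A hypothetical way to create it as a zonal SSD disk (the size and zone here are assumptions, and a regional disk would need different flags):

gcloud compute disks create redis-disk --type=pd-ssd --size=2GB --zone=europe-west1-b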
And the PersistentVolumeClaim would only declare the requested StorageClass name:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ssd-volume-claim
spec:
  storageClassName: "ssd"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
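With volumeBindingMode: WaitForFirstConsumer it is expected for this claim to sit in Pending until a pod actually uses it; a quick check (sketch):

kubectl get pvc ssd-volume-claim   # STATUS stays Pending until the redis pod is scheduled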
Finally, the Deployment references the new claim:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      volumes:
        - name: redis-volume
          persistentVolumeClaim:
            claimName: ssd-volume-claim
            readOnly: false
These declarations prevent failures or scale-down operations from destroying the PV, whether it was created manually by a cluster administrator or dynamically through dynamic provisioning.
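And since the data on a Released volume has to be backed up by hand, one hypothetical option is a Compute Engine disk snapshot (the disk name and zone are assumptions):

gcloud compute disks snapshot redis-disk --snapshot-names=redis-backup --zone=europe-west1-b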