First of all: I have read other posts like this one.
My staging cluster runs on AWS using spot instances.
I have more than 50 pods (running different services/products) and 6 StatefulSets.
I created the StatefulSets this way:
Note: I did not create the PVs and PVCs manually; they are created by the StatefulSet.
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:alpine
        imagePullPolicy: Always
        ports:
        - containerPort: 6379
          name: client
        volumeMounts:
        - name: data
          mountPath: /data
          readOnly: false
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        name: redis-gp2
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
  - port: 6379
    name: redis
    targetPort: 6379
  selector:
    app: redis
  type: NodePort
I do have node and pod autoscalers configured.
Last week, after I deployed some extra microservices during the "usage peak", the node autoscaler triggered.
During the scale-down, some pods (belonging to StatefulSets) failed with the error: node(s) had volume node affinity conflict.
My first reaction was to delete and "recreate" the PVs/PVCs with high priority. That "fixed" the pending pods at the time.
Today I forced another scale-up, so I was able to check what was happening.
The problem occurs during the scale-up, and it takes a long time (+/- 30 min) to go back to normal, even after scaling down.
Describe Pod:
Name: redis-0
Namespace: ***-staging
Priority: 1000
Priority Class Name: prioridade-muito-alta
Node: ip-***-***-***-***.sa-east-1.compute.internal/***.***.*.***
Start Time: Mon, 03 Jan 2022 09:24:13 -0300
Labels: app=redis
controller-revision-hash=redis-6fd5f59c5c
statefulset.kubernetes.io/pod-name=redis-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: ***.***.***.***
IPs:
IP: ***.***.***.***
Controlled By: StatefulSet/redis
Containers:
redis:
Container ID: docker://4928f38ed12c206dc5915c863415d3eba98b9592f2ab5c332a900aa2fa2cef64
Image: redis:alpine
Image ID: docker-pullable://redis@sha256:4bed291aa5efb9f0d77b76ff7d4ab71eee410962965d052552db1fb80576431d
Port: 6379/TCP
Host Port: 0/TCP
State: Running
Started: Mon, 03 Jan 2022 09:24:36 -0300
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ngc7q (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-redis-0
ReadOnly: false
default-token-***:
Type: Secret (a volume populated by a Secret)
SecretName: *****
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 59m (x4 over 61m) default-scheduler 0/7 nodes are available: 1 Too many pods, 1 node(s) were unschedulable, 5 node(s) had volume node affinity conflict.
Warning FailedScheduling 58m default-scheduler 0/7 nodes are available: 1 Too many pods, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 4 node(s) had volume node affinity conflict.
Warning FailedScheduling 58m default-scheduler 0/7 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 57m (x2 over 58m) default-scheduler 0/7 nodes are available: 2 Too many pods, 2 node(s) were unschedulable, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 50m (x9 over 57m) default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 48m (x2 over 49m) default-scheduler 0/5 nodes are available: 2 Too many pods, 3 node(s) had volume node affinity conflict.
Warning FailedScheduling 35m (x10 over 48m) default-scheduler 0/5 nodes are available: 1 Too many pods, 4 node(s) had volume node affinity conflict.
Normal NotTriggerScaleUp 30m (x163 over 58m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict
Warning FailedScheduling 30m (x3 over 33m) default-scheduler 0/5 nodes are available: 5 node(s) had volume node affinity conflict.
Normal SuccessfulAttachVolume 29m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-23168a78-2286-40b7-aa71-194ca58e0005"
Normal Pulling 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Pulling image "redis:alpine"
Normal Pulled 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Successfully pulled image "redis:alpine" in 3.843908086s
Normal Created 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Created container redis
Normal Started 28m kubelet, ip-***-***-***-***.sa-east-1.compute.internal Started container redis
PVC:
Name: data-redis-0
Namespace: ***-staging
StorageClass: gp2
Status: Bound
Volume: pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels: app=redis
name=redis-gp2
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
volume.kubernetes.io/selected-node: ip-***-***-***-***.sa-east-1.compute.internal
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWO
VolumeMode: Filesystem
Mounted By: redis-0
Events: <none>
PV:
Name: pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels: failure-domain.beta.kubernetes.io/region=sa-east-1
failure-domain.beta.kubernetes.io/zone=sa-east-1b
Annotations: kubernetes.io/createdby: aws-ebs-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers: [kubernetes.io/pv-protection]
StorageClass: gp2
Status: Bound
Claim: ***-staging/data-redis-0
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 1Gi
Node Affinity:
Required Terms:
Term 0: failure-domain.beta.kubernetes.io/zone in [sa-east-1b]
failure-domain.beta.kubernetes.io/region in [sa-east-1]
Message:
Source:
Type: AWSElasticBlockStore (a Persistent Disk resource in AWS)
VolumeID: aws://sa-east-1b/vol-061fd23a65185d42c
FSType: ext4
Partition: 0
ReadOnly: false
Events: <none>
This happened in 4 of my 6 StatefulSets.
Question:
If I create the PVs and PVCs manually, setting:
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1
will scaling up/down stop interfering with the StatefulSets?
If not, what can I do to avoid this problem?
First of all, it's better to move the allowedTopologies stanza to the StorageClass. It's more flexible because you can create multiple zone-specific storage classes.
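A minimal sketch of such a zone-specific StorageClass, assuming the in-tree kubernetes.io/aws-ebs provisioner and the zone label that appears on the PV in the question. The name gp2-sa-east-1b is hypothetical, and newer clusters or the EBS CSI driver may use a different topology key, so check the labels on your nodes before copying it.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-sa-east-1b            # hypothetical name, one class per zone
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
# Delay provisioning until a pod using the claim is scheduled,
# so the volume is created in the zone of the chosen node.
volumeBindingMode: WaitForFirstConsumer
# Restrict provisioning to a single zone (note: a zone, not a region).
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1b
```

The StatefulSet would then select this class with storageClassName under volumeClaimTemplates. Keep in mind that volumeClaimTemplates cannot be edited on an existing StatefulSet, and existing PVCs keep the class they were created with, so this only affects newly created claims.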
And yes, this should solve your current problem, but it creates another one: with all volumes (and the pods that use them) pinned to a single zone, you are basically trading high availability for cost and convenience. It's totally up to you; there is no one-size-fits-all recommendation here, but I want to make sure you know the options.
You can still leave volumes unrestricted to a particular zone, as long as you always have enough node capacity in every AZ. This can be achieved using cluster-autoscaler: generally, you create a separate node group for each AZ and the autoscaler will do the rest.
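One way to express per-AZ node groups, sketched as an eksctl ClusterConfig. This assumes the cluster is managed with eksctl, which the question does not state; the cluster name, instance type, sizes, and the zones other than sa-east-1b are placeholders.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: staging-cluster          # hypothetical cluster name
  region: sa-east-1
nodeGroups:
  # One node group per AZ, so the autoscaler can add a node in the
  # zone where a pending pod's EBS volume lives.
  - name: ng-sa-east-1a
    availabilityZones: ["sa-east-1a"]
    instanceType: m5.large       # placeholder instance type
    minSize: 1                   # keep at least one node in every AZ
    maxSize: 5
  - name: ng-sa-east-1b
    availabilityZones: ["sa-east-1b"]
    instanceType: m5.large
    minSize: 1
    maxSize: 5
  - name: ng-sa-east-1c
    availabilityZones: ["sa-east-1c"]
    instanceType: m5.large
    minSize: 1
    maxSize: 5
```

With a single node group spanning several AZs, the autoscaler cannot control which zone a new node lands in, which is why the groups are split. If you want the groups to stay roughly the same size, the cluster-autoscaler flag --balance-similar-node-groups is worth a look.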
Another option is to build distributed storage, such as Ceph or Portworx, that allows mounting volumes from another AZ. That will greatly increase your cross-AZ traffic costs and needs to be maintained properly, but I know companies that do that.
You can also avoid this problem by separating your Kubernetes workload with nodepool segregation and affinity options as mentioned in this external article.
If only a portion of your workload requires PVs/PVCs, I would suggest using a dedicated node pool for your StatefulSets.
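A rough sketch of what pinning the redis StatefulSet to a dedicated node pool could look like. The label workload=stateful and the matching NoSchedule taint are hypothetical; this assumes the nodes of the dedicated pool already carry both and that the pool lives in the same AZ as the existing EBS volumes (sa-east-1b for the PV shown in the question).

```yaml
# Fragment to merge into the StatefulSet from the question.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  template:
    spec:
      # Schedule only onto nodes of the dedicated pool
      # (hypothetical label, set on the node group).
      nodeSelector:
        workload: stateful
      # Tolerate the taint that keeps other workloads off those nodes
      # (hypothetical taint: workload=stateful:NoSchedule).
      tolerations:
      - key: workload
        operator: Equal
        value: stateful
        effect: NoSchedule
```

The taint keeps the other 50+ pods from filling the dedicated nodes, which also addresses the "Too many pods" entries in the scheduling events, while the nodeSelector keeps the stateful pods in the pool that sits next to their volumes.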