K8s StatefulSets "pending" after node scale

1/3/2022

First of all: I have read other posts like this one.

My staging cluster is allocated on AWS using spot instances.

I have around 50+ pods (running different services / products) and 6 StatefulSets.

I created the StatefulSets this way:

Note: I do not create the PVs and PVCs manually; they are created from the StatefulSet.

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:alpine
        imagePullPolicy: Always
        ports:
        - containerPort: 6379
          name: client
        volumeMounts:
          - name: data
            mountPath: /data
            readOnly: false
  volumeClaimTemplates:
    - metadata:
        name: data
        labels:
          name: redis-gp2
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
  - port: 6379
    name: redis
    targetPort: 6379
  selector:
    app: redis
  type: NodePort

I do have node and pod autoscalers configured.

In the past week, after I deployed some extra microservices during the "usage peak", the node autoscaler triggered.

During the scale-down, some pods (from StatefulSets) failed to schedule with the error "node(s) had volume node affinity conflict".

My first reaction was to delete and "recreate" the PVs/PVCs with high priority. That "fixed" the pending pods at the time.

Today I forced another scale-up, so I was able to check what was happening.

The problem occurs during the scale-up and takes a long time to go back to normal (+/- 30 min), even after scaling down.

Describe Pod:

Name:                 redis-0
Namespace:            ***-staging
Priority:             1000
Priority Class Name:  prioridade-muito-alta
Node:                 ip-***-***-***-***.sa-east-1.compute.internal/***.***.*.***
Start Time:           Mon, 03 Jan 2022 09:24:13 -0300
Labels:               app=redis
                      controller-revision-hash=redis-6fd5f59c5c
                      statefulset.kubernetes.io/pod-name=redis-0
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   ***.***.***.***
IPs:
  IP:           ***.***.***.***
Controlled By:  StatefulSet/redis
Containers:
  redis:
    Container ID:   docker://4928f38ed12c206dc5915c863415d3eba98b9592f2ab5c332a900aa2fa2cef64
    Image:          redis:alpine
    Image ID:       docker-pullable://redis@sha256:4bed291aa5efb9f0d77b76ff7d4ab71eee410962965d052552db1fb80576431d
    Port:           6379/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 03 Jan 2022 09:24:36 -0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ngc7q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-redis-0
    ReadOnly:   false
  default-token-***:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  *****
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                  From                                                  Message
  ----     ------                  ----                 ----                                                  -------
  Warning  FailedScheduling        59m (x4 over 61m)    default-scheduler                                     0/7 nodes are available: 1 Too many pods, 1 node(s) were unschedulable, 5 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        58m                  default-scheduler                                     0/7 nodes are available: 1 Too many pods, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 4 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        58m                  default-scheduler                                     0/7 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        57m (x2 over 58m)    default-scheduler                                     0/7 nodes are available: 2 Too many pods, 2 node(s) were unschedulable, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        50m (x9 over 57m)    default-scheduler                                     0/6 nodes are available: 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        48m (x2 over 49m)    default-scheduler                                     0/5 nodes are available: 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        35m (x10 over 48m)   default-scheduler                                     0/5 nodes are available: 1 Too many pods, 4 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp       30m (x163 over 58m)  cluster-autoscaler                                    pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict
  Warning  FailedScheduling        30m (x3 over 33m)    default-scheduler                                     0/5 nodes are available: 5 node(s) had volume node affinity conflict.
  Normal   SuccessfulAttachVolume  29m                  attachdetach-controller                               AttachVolume.Attach succeeded for volume "pvc-23168a78-2286-40b7-aa71-194ca58e0005"
  Normal   Pulling                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Pulling image "redis:alpine"
  Normal   Pulled                  28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Successfully pulled image "redis:alpine" in 3.843908086s
  Normal   Created                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Created container redis
  Normal   Started                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Started container redis

PVC:

Name:          data-redis-0
Namespace:     ***-staging
StorageClass:  gp2
Status:        Bound
Volume:        pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels:        app=redis
               name=redis-gp2
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
               volume.kubernetes.io/selected-node: ip-***-***-***-***.sa-east-1.compute.internal
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    redis-0
Events:        <none>

PV:

Name:              pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels:            failure-domain.beta.kubernetes.io/region=sa-east-1
                   failure-domain.beta.kubernetes.io/zone=sa-east-1b
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             ***-staging/data-redis-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [sa-east-1b]
                   failure-domain.beta.kubernetes.io/region in [sa-east-1]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://sa-east-1b/vol-061fd23a65185d42c
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

This happened in 4 of my 6 StatefulSets.

Question:

If I create the PVs and PVCs manually, setting:

volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1

will scaling up/down stop messing with the StatefulSets?

If not, what can I do to avoid this problem?

-- Felipe Colussi-oliva
amazon-eks
amazon-web-services
kubernetes
kubernetes-pvc
kubernetes-statefulset

2 Answers

1/3/2022

First of all, it's better to move the allowedTopologies stanza to a StorageClass (both allowedTopologies and volumeBindingMode are StorageClass fields, not PV/PVC fields). That is more flexible because you can create multiple zone-specific storage classes.
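
A minimal sketch of such a zone-specific StorageClass, assuming the in-tree kubernetes.io/aws-ebs provisioner and the failure-domain.beta.kubernetes.io/zone key that appear in the PVC and PV output in the question. The class name gp2-sa-east-1b is hypothetical, and newer clusters label nodes with topology.kubernetes.io/zone instead, so the key may need adjusting:

```yaml
# Hypothetical zone-pinned StorageClass; the name is an assumption for illustration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-sa-east-1b
provisioner: kubernetes.io/aws-ebs   # provisioner shown in the question's PVC annotations
parameters:
  type: gp2
  fsType: ext4
reclaimPolicy: Delete
# Delay provisioning until a pod that uses the claim has been scheduled.
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1b   # must be a zone; sa-east-1 alone is the region
```

The StatefulSet would then reference the class with storageClassName: gp2-sa-east-1b inside the spec of its volumeClaimTemplates entry. Claims that already exist keep their current class and volume; only newly created claims would use it.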

And yes, this should solve your one problem, but it creates another: pinning volumes to a single zone sacrifices high availability for cost/convenience. It's totally up to you. There is no one-size-fits-all recommendation here, but I want to make sure you know the options.

You can still leave volumes unrestricted to a specific zone if you always have enough node capacity in every AZ. This can be achieved using cluster-autoscaler: generally, you create a separate node group for each AZ and the autoscaler will do the rest.
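
One way to sketch the per-AZ node groups, assuming the cluster is managed with eksctl (the question does not say how the nodes are provisioned). The cluster name, node group names, instance type, and sizes below are all hypothetical:

```yaml
# Sketch only: cluster name, group names, instance type and sizes are assumptions.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: staging
  region: sa-east-1
nodeGroups:
  - name: workers-sa-east-1a
    availabilityZones: ["sa-east-1a"]   # exactly one zone per group
    instanceType: m5.large
    minSize: 1
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
    tags:
      # Tags commonly used for cluster-autoscaler auto-discovery
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/staging: "owned"
  - name: workers-sa-east-1b
    availabilityZones: ["sa-east-1b"]   # the zone of the redis volume in the question
    instanceType: m5.large
    minSize: 1
    maxSize: 5
    iam:
      withAddonPolicies:
        autoScaler: true
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/staging: "owned"
```

With single-zone groups, the autoscaler can work out which group would produce a node in the zone a pending pod's volume requires. A group that spans several zones gives it no such guarantee, which is consistent with the "pod didn't trigger scale-up" event in the question.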

Another option is to build distributed storage with something like Ceph or Portworx, which allows mounting volumes from another AZ. That will greatly increase your cross-AZ traffic costs and it needs to be maintained properly, but I know companies that do that.

-- Vasili Angapov
Source: StackOverflow

1/3/2022

You can also avoid this problem by separating your Kubernetes workload with nodepool segregation and affinity options as mentioned in this external article.

In a case where only a portion of your workload requires PVs/PVCs, I would suggest using a dedicated nodepool for your StatefulSets.
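
A rough sketch of how the redis StatefulSet from the question could be pinned to such a pool. The node label workload=stateful and the taint dedicated=stateful:NoSchedule are assumptions made up for illustration; the nodes in the dedicated pool would have to carry both:

```yaml
# Fragment to merge into the StatefulSet's pod template.
# The label and taint names are hypothetical.
spec:
  template:
    spec:
      # Schedule only onto nodes labelled as the stateful pool.
      nodeSelector:
        workload: stateful
      # Tolerate the taint that keeps other workloads off that pool.
      tolerations:
      - key: dedicated
        operator: Equal
        value: stateful
        effect: NoSchedule
```

The dedicated pool still has to provide nodes in the zone where each volume lives (sa-east-1b for the redis volume shown above), otherwise the same affinity conflict would come back.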

-- Piotr Malec
Source: StackOverflow