How to delay Deployment Pod restart

12/7/2021

I'm using a simple pattern where each Node runs exactly one Pod, and that Pod is controlled by a Deployment with replicas set to one.

The Deployment is there to ensure the Pod is recreated when it gets evicted by disk-pressure eviction. The problem I'm facing is that the Deployment retries recreating the Pod too quickly. Since the Pod is pinned to a specific Node that hasn't resolved its DiskPressure condition yet, the recreated Pods fail one after another before the Node is ready to accept a new Pod:

NAME                                 READY   STATUS              RESTARTS   AGE
deployment-adid-7bb998fccc-4v9dx     0/1     Evicted             0          6m17s
deployment-adid-7bb998fccc-59kvv     0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-59zzl     0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-dmm9k     0/1     Evicted             0          6m16s
deployment-adid-7bb998fccc-gn59z     0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-j4v25     0/1     Evicted             0          6m18s
deployment-adid-7bb998fccc-mw4ps     0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-n7krq     0/1     Evicted             0          18h
deployment-adid-7bb998fccc-rm4tr     0/1     Evicted             0          6m18s
deployment-adid-7bb998fccc-vn44q     0/1     ContainerCreating   0          6m15s

Here, 8 Pods are created and evicted within about 5 seconds before the 9th is accepted by the designated Node.

While the last Pod does eventually become Running, I don't like producing garbage Pods. It would be nice if the Pod could wait for the Node to become ready; if that's impossible, I suppose the recreation could simply be delayed, for example by specifying a wait time before a new Pod is created, or a restart interval that defines how often the Deployment should try to recreate the Pod.

So how can I set up this kind of control in the Deployment's spec?

ADDITION:

Excluding irrelevant labels, the Deployment spec looks something like this:

deployment_template = {
    'apiVersion': 'apps/v1',
    'kind': 'Deployment',
    'metadata': {
        'name': 'first',
    },
    'spec': {
        'replicas': 1,
        'selector': {
            'matchLabels': {
                "podName" : "first"
            }
        },
        'template': {
            'metadata': {
                'labels': {
                    "podName" : "first"
                }
            },
            'spec': {
                'nodeSelector': {
                    "node": "1"
                },
                'restartPolicy': 'Always',
                'hostNetwork': True,
                'dnsPolicy': 'ClusterFirstWithHostNet',
                'containers': [
                    {
                        'name': 'containername',
                        'image': "somecontainerimage",
                        'imagePullPolicy': 'Always',
                    }
                ]
            }
        }
    }
}
-- 김기영
kubernetes
kubernetes-deployment
kubernetes-pod

2 Answers

12/7/2021

...should Node reject Pods when it is resolving DiskPressure?

When a node is under disk pressure, you should see it automatically tainted with node.kubernetes.io/disk-pressure. Unless you purposely tolerate that taint in your Deployment spec, the scheduler will not place your pod on such a node.
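
If you really wanted the Pod to ignore that taint, the toleration would go into the Deployment's pod template. A minimal sketch in plain manifest YAML (only the relevant fields; in the Python template above it would be a 'tolerations' list in the pod template's 'spec'). Note that tolerating disk-pressure is usually a bad idea here, because the Pod would then be scheduled onto a node that is still out of disk:

spec:
  template:
    spec:
      tolerations:
        # tolerate the automatically added disk-pressure taint (not recommended in this scenario)
        - key: "node.kubernetes.io/disk-pressure"
          operator: "Exists"
          effect: "NoSchedule"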

Note that the worker node will not perform disk clean-up by itself.

-- gohm'c
Source: StackOverflow

12/15/2021

First, I'd suggest updating to the newest supported Kubernetes version. Maintenance support for version 1.17, which you are using, ended 11 months ago. The current version (as of today, 15.12.2021) is v1.23, and since Kubernetes v1.18 the TaintBasedEvictions feature has been stable.

Secondly, trying to delay the Deployment is a workaround rather than best practice; it is better to fix the root issue, which is the disk-pressure eviction you are experiencing. You should consider changing the behaviour of your application, or at least try to avoid disk pressure on the node by increasing its storage size.
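
One related mechanism, not mentioned above but worth sketching: if the disk usage comes from the workload itself, you can declare ephemeral-storage requests and limits on the container. The scheduler then accounts for the disk the Pod needs, and the kubelet evicts just that Pod when it exceeds its own limit instead of letting it push the whole node into DiskPressure. The sizes below are placeholders:

spec:
  template:
    spec:
      containers:
        - name: containername
          image: somecontainerimage
          resources:
            requests:
              ephemeral-storage: "2Gi"   # scheduler reserves this much node-local disk for the Pod
            limits:
              ephemeral-storage: "4Gi"   # kubelet evicts the Pod if it uses more than this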

Anyway, if you want to keep it that way, you can try to set some additional parameters. You can't delay the Deployment itself, but you can change the behaviour of the kubelet agent on your node.


The example below is for Kubernetes version 1.23. Keep in mind that it may differ for version 1.17.

I created a cluster with one master node and one worker node, where pods are scheduled only on the worker node. I am filling up the worker's storage to trigger node.kubernetes.io/disk-pressure. By default the behaviour is similar to yours: many pods are created in the Evicted state, which, worth noting, is totally normal and expected. They keep being created until the node gets the disk-pressure taint, which happens after ~10 seconds by default:

nodeStatusUpdateFrequency is the frequency that kubelet computes node status. ... Default: "10s"

After that time, as you can observe, no more pods are created in the Evicted state. The taint is removed (i.e. in your case, once the disk usage on the node is back to a proper value) after ~5 min; this is defined by the evictionPressureTransitionPeriod parameter:

evictionPressureTransitionPeriod is the duration for which the kubelet has to wait before transitioning out of an eviction pressure condition. ... Default: "5m"

Okay, let's change some configuration by editing the kubelet config file on the worker node; for clusters set up with kubeadm it is located at /var/lib/kubelet/config.yaml.

I will change three parameters:

The file /var/lib/kubelet/config.yaml - only the changed / added fields:

evictionPressureTransitionPeriod: 120s
evictionSoftGracePeriod: 
  nodefs.available: 60s
evictionSoft:
  nodefs.available: 15Gi 
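
Keep in mind that the changes only take effect after the kubelet on the worker node is restarted; assuming a systemd-managed kubelet (which is what kubeadm sets up), that would be sudo systemctl restart kubelet.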

To sum up: after my node's available storage drops below 15 GB, the pod will stay in the Running state for 60 seconds. After that, if the available storage is still below 15 GB, the pods will go into the Evicted / Completed state and the newly created replacement pods will stay in the Pending state:

NAME                                   READY   STATUS      RESTARTS   AGE
my-nginx-deployment-6cf77b6d6b-2hr2s   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-8f8wv   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-9kpc9   0/1     Pending     0          108s
my-nginx-deployment-6cf77b6d6b-jbx5g   0/1     Pending     0          107s

Once the available storage is back above 15 GB, it will take 2 minutes to remove the taint and create new pods.

If during these 60 seconds the available storage rises back above 15 GB, no action is taken and the pods remain in the Running state.

If you have any garbage pods left over, run this command to delete them (it deletes pods in the current namespace whose status is ContainerStatusUnknown, Evicted, Completed or Error):

kubectl get pods | grep -e "ContainerStatusUnknown" -e "Evicted" -e "Completed" -e "Error" | awk '{print $1}' | xargs kubectl delete pod

Keep in mind that pod eviction may behave differently for different QoS classes and priority classes; check the article Node-pressure Eviction - Pod selection for kubelet eviction for more information.

You should try to monitor how exactly the disk pressure arises on your node and adjust the kubelet configuration accordingly.
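
As a simple starting point for that monitoring (the node name below is just a placeholder), describing the worker node shows its Conditions, including DiskPressure, together with its Taints and recent eviction-related Events:

kubectl describe node <worker-node-name>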

-- Mikolaj S.
Source: StackOverflow