I'm using a simple pattern where each Node runs one Pod, and that Pod is controlled by a Deployment with replicas set to one.
The Deployment is there to ensure the Pod restarts when it gets evicted by DiskPressureEviction.
The problem I'm facing is caused by the Deployment trying to restart the Pod too quickly. Because the Pod is pinned to a specific Node that hasn't cleared DiskPressure yet, the restart fails repeatedly before the Node is ready to accept a new Pod:
NAME                               READY   STATUS              RESTARTS   AGE
deployment-adid-7bb998fccc-4v9dx   0/1     Evicted             0          6m17s
deployment-adid-7bb998fccc-59kvv   0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-59zzl   0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-dmm9k   0/1     Evicted             0          6m16s
deployment-adid-7bb998fccc-gn59z   0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-j4v25   0/1     Evicted             0          6m18s
deployment-adid-7bb998fccc-mw4ps   0/1     Evicted             0          6m20s
deployment-adid-7bb998fccc-n7krq   0/1     Evicted             0          18h
deployment-adid-7bb998fccc-rm4tr   0/1     Evicted             0          6m18s
deployment-adid-7bb998fccc-vn44q   0/1     ContainerCreating   0          6m15s
Here, 8 Pods were created and evicted within about 5 seconds before the 9th was accepted by the designated Node.
While the last Pod finally becomes Running, I don't like making garbage Pods. It would be nice if the Pod could wait for the Node to become ready; if that's impossible, I suppose the restart could simply be delayed, presumably by specifying something like a waitTime before the Pod is recreated, or a restartInterval that says at which interval the Deployment should try to restart the Pod.
So how can I set this kind of control in the Deployment's spec?
ADDITION:
Excluding irrelevant labels, the Deployment spec looks something like this:
deployment_template = {
    'apiVersion': 'apps/v1',
    'kind': 'Deployment',
    'metadata': {
        'name': 'first',
    },
    'spec': {
        'replicas': 1,  # must be an integer, not the string '1'
        'selector': {
            'matchLabels': {
                "podName": "first"
            }
        },
        'template': {
            'metadata': {
                'labels': {
                    "podName": "first"
                }
            },
            'spec': {
                # Pin the Pod to the Node labelled node=1
                'nodeSelector': {
                    "node": "1"
                },
                'restartPolicy': 'Always',
                'hostNetwork': True,
                'dnsPolicy': 'ClusterFirstWithHostNet',
                'containers': [
                    {
                        'name': 'containername',
                        'image': "somecontainerimage",
                        'imagePullPolicy': 'Always',
                    }
                ]
            }
        }
    }
}
...should Node reject Pods when it is resolving DiskPressure?
When a node is under disk pressure, you should see it tainted with node.kubernetes.io/disk-pressure automatically. Unless you deliberately tolerate this taint in your deployment spec, the scheduler will not place your pod on that node.
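To make the taint/toleration rule above concrete, here is a minimal sketch in Python of how a toleration is matched against a taint. It is a simplified model for illustration, not the scheduler's actual code; the class and function names (Taint, Toleration, tolerates, is_schedulable) are made up for this example, and the NoSchedule effect on the disk-pressure taint is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key is only meaningful with operator "Exists"
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if not toleration.key:
        # An empty key tolerates every taint, but only with "Exists".
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def is_schedulable(taints, tolerations) -> bool:
    """A pod can be placed only if every blocking taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# Assumed shape of the taint that appears while the node is under disk pressure.
DISK_PRESSURE = Taint(key="node.kubernetes.io/disk-pressure", effect="NoSchedule")
```

With no tolerations, is_schedulable([DISK_PRESSURE], []) is False, which corresponds to new pods staying Pending rather than being placed on the node and evicted again.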
Note the worker node will not perform disk clean-up by itself.
First, I'd suggest updating to the newest supported Kubernetes version. Maintenance support for version 1.17, which you are using, ended 11 months ago. The current version (as of today, 15.12.2021) is v1.23. Since Kubernetes v1.18, the TaintBasedEvictions feature has been stable.
Another thing is that trying to delay the deployment is a workaround rather than best practice; it is better to fix the main issue, which is the disk pressure eviction you are experiencing. You should consider changing the behaviour of your application, or at least try to avoid disk pressure on the node by increasing its storage size.
Anyway, if you want to keep it that way, you can try setting some additional parameters. You can't delay the deployment itself, but you can change the behaviour of the kubelet agent on your node.
The example below is for Kubernetes version 1.23. Keep in mind that it may differ for version 1.17.
I created a cluster with one master node and one worker node; pods are scheduled only on the worker node. I fill up the worker's storage to trigger node.kubernetes.io/disk-pressure. By default the behaviour is similar to yours: many pods are created in the Evicted state, which, it's worth noting, is completely normal and expected. They keep being created until the node gets the disk-pressure taint, which happens after ~10 seconds by default:
nodeStatusUpdateFrequency is the frequency that kubelet computes node status. ... Default: "10s"
After that time, as you can observe, no more pods are created in the Evicted state. The taint is removed (i.e. in your case, once the node's disk storage is back to a proper value) after ~5 min, as defined by the evictionPressureTransitionPeriod parameter:
evictionPressureTransitionPeriod is the duration for which the kubelet has to wait before transitioning out of an eviction pressure condition. ... Default: "5m"
Okay, let's change some configuration by editing the kubelet config file on the worker node. For kubeadm it is located at /var/lib/kubelet/config.yaml.
I will change three parameters:
- evictionPressureTransitionPeriod set to 120s, so the taint will be removed faster
- evictionSoft to define a soft eviction; in my case it will occur when the worker node has less than 15GB of storage available
- evictionSoftGracePeriod to define the period after which a pod enters the eviction state once the evictionSoft threshold is met; in my case it's 60 seconds

The file /var/lib/kubelet/config.yaml - only the changed / added fields:
evictionPressureTransitionPeriod: 120s
evictionSoftGracePeriod:
  nodefs.available: 60s
evictionSoft:
  nodefs.available: 15Gi
To sum up: once my node's available storage drops below 15 GB, the pod keeps running for 60 seconds. After that, if available storage is still below 15 GB, the pods enter the Evicted / Completed state and the new pods appear in the Pending state:
NAME                                   READY   STATUS      RESTARTS   AGE
my-nginx-deployment-6cf77b6d6b-2hr2s   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-8f8wv   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-9kpc9   0/1     Pending     0          108s
my-nginx-deployment-6cf77b6d6b-jbx5g   0/1     Pending     0          107s
Once the available storage is above 15 GB again, it takes 2 minutes to remove the taint and create new pods.
If the available storage rises back above 15 GB during those 60 seconds, no action is taken and the pods stay in the Running state.
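The timeline described above can be sketched as a small simulation. This is a toy model of the behaviour as described in this answer (soft threshold 15Gi, grace period 60s, transition period 120s), not the kubelet's real algorithm, which has more detail, for example in how and when the node condition is reported. The function name simulate and the event labels are made up for illustration.

```python
GIB = 1024 ** 3


def simulate(samples, soft_threshold=15 * GIB, grace_period=60, transition_period=120):
    """Replay (time_seconds, available_bytes) samples, sorted by time.

    Returns a list of (time, event) tuples, where event is either
    "evict" or "taint removed".
    """
    events = []
    below_since = None  # when available storage first dropped below the threshold
    above_since = None  # when storage first recovered while under pressure
    pressure = False

    for t, available in samples:
        if available < soft_threshold:
            above_since = None
            if below_since is None:
                below_since = t
            # Evict only if the threshold has been exceeded for the whole grace period.
            if not pressure and t - below_since >= grace_period:
                pressure = True
                events.append((t, "evict"))
        else:
            below_since = None
            if pressure:
                if above_since is None:
                    above_since = t
                # Leave the pressure state only after the transition period has passed.
                if t - above_since >= transition_period:
                    pressure = False
                    above_since = None
                    events.append((t, "taint removed"))
    return events


# A 30-second dip below the threshold: shorter than the grace period, so nothing happens.
short_dip = [(0, 20 * GIB), (10, 10 * GIB), (20, 10 * GIB), (30, 10 * GIB), (40, 20 * GIB)]

# Below the threshold for 100 seconds, then recovered from t=110 onward.
long_dip = [(t, 10 * GIB) for t in range(0, 101, 10)] + \
           [(t, 20 * GIB) for t in range(110, 301, 10)]
```

In this model, simulate(short_dip) returns no events, while simulate(long_dip) evicts at t=60 and removes the taint at t=230, i.e. 120 seconds after storage recovered.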
If you have any garbage pods left over, run this command to delete them:
kubectl get pods | grep -e "ContainerStatusUnknown" -e "Evicted" -e "Completed" -e "Error" | awk '{print $1}' | xargs kubectl delete pod
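If you prefer to review the list before deleting anything, the same selection can be sketched in Python. This is only an illustration that parses the text printed by kubectl get pods; the function name garbage_pods is made up. Unlike the grep above, which matches the pattern anywhere on the line, it checks only the STATUS column.

```python
GARBAGE_STATUSES = {"ContainerStatusUnknown", "Evicted", "Completed", "Error"}


def garbage_pods(kubectl_output: str) -> list:
    """Return the names of pods whose STATUS column marks them as leftovers."""
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Columns: NAME READY STATUS RESTARTS AGE (the header row never matches).
        if len(fields) >= 3 and fields[2] in GARBAGE_STATUSES:
            names.append(fields[0])
    return names


sample_output = """\
NAME                                   READY   STATUS      RESTARTS   AGE
my-nginx-deployment-6cf77b6d6b-2hr2s   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-8f8wv   0/1     Completed   0          115m
my-nginx-deployment-6cf77b6d6b-9kpc9   0/1     Pending     0          108s
my-nginx-deployment-6cf77b6d6b-jbx5g   0/1     Pending     0          107s
"""

print(garbage_pods(sample_output))
```

The returned names could then be passed to kubectl delete pod once you are happy with the selection.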
Keep in mind that pod eviction may behave differently for different QoS classes and priority classes. See the article Node-pressure Eviction - Pod selection for kubelet eviction for more information.
You should monitor how exactly the disk pressure arises on your node so you can adjust the kubelet configuration accordingly. Also check these articles: