I am trying to find an elegant way to resolve the following scenario.
We have an AWS Kubernetes cluster with 6 nodes of 16G RAM each. The cluster runs various pods with different resource requirements, from 1G to 6G of minimum requested memory.
There is a scenario where we get a pod stuck in Pending due to insufficient memory. It happens when we need to upgrade a few pods with different memory requirements: the pod requesting 6G stays Pending because no single node has 6G available.
What I would expect from Kubernetes is to rearrange pods between nodes in order to free 6G on a specific node, rather than keep 5G free on two different nodes (10G in total) and report insufficient memory.
Is there a way to instruct Kubernetes to utilise the memory better and handle this automatically?
I was thinking about the pod priorities capability: the smaller the memory request, the lower the priority. I am wondering whether, based on this setting, Kubernetes would be able to restart the less important (smaller) pods once the bigger one is deployed, and in this way rearrange them between nodes.
Any ideas would be appreciated.
By default, the Kubernetes scheduler will never kill running containers to accommodate newer ones. If it did, running containers could be forced to reschedule onto other nodes, which is undesirable. Kubernetes respects the current state of the cluster and tries to keep the environment stable.
What you can do about this is: when you deploy the 6G app, delete the pods with the 1G requests afterwards, so the scheduler can place the bigger pod on a freed node first and the recreated smaller pods then fit into the remaining space on the other nodes. Placing the bigger pieces first generally makes it easier to pack the smaller ones.
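If you do want Kubernetes to make room automatically, the pod priority idea from the question is the built-in way to get it: when a pending pod has a higher priority than running pods, the scheduler can preempt (evict) lower-priority pods to free the requested resources. A minimal sketch, with illustrative names and an illustrative priority value:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: large-memory            # illustrative name
value: 1000000                  # higher value = higher priority; pods without a class default to 0
globalDefault: false
description: "For the large (6G) workloads so they may preempt smaller, lower-priority pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: mybigpod                # illustrative name
spec:
  priorityClassName: large-memory
  containers:
  - name: myheavyapp            # illustrative container
    image: myapp:latest
    resources:
      requests:
        memory: "6Gi"

Note that preemption evicts pods rather than live-migrating them, so the evicted pods still need to reschedule elsewhere (or be recreated by their Deployment/ReplicaSet).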
There is no silver-bullet solution, but there are things you can combine, for example Pod Affinity/Anti-Affinity, Node Affinity, and Pod Topology Spread Constraints. It also depends on your workload priorities.
If you have 6 nodes, you can label them something like this:
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,type=heavy
node2   Ready    <none>   3m58s   v1.16.0   node=node2,type=heavy
node3   Ready    <none>   3m17s   v1.16.0   node=node3,type=heavy
node4   Ready    <none>   2m43s   v1.16.0   node=node4,type=light
node5   Ready    <none>   3m17s   v1.16.0   node=node5,type=light
node6   Ready    <none>   2m43s   v1.16.0   node=node6,type=light
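Assuming the nodes do not carry these labels yet, you can add them with kubectl (node names taken from the listing above):

kubectl label nodes node1 node2 node3 type=heavy
kubectl label nodes node4 node5 node6 type=light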
Then, in your 6G Pod spec, you can combine a topology spread constraint (a maxSkew of 3 over the type topology key) with PodAffinity on the workload: heavy label, so the heavy pods are grouped together on the type=heavy nodes:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    workload: heavy
spec:
  topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        workload: heavy
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - heavy
        topologyKey: type
  containers:
  - name: myheavyapp
    image: myapp:latest
...
Then you can use NodeAffinity just to schedule your light 1G pods on the light nodes only.
kind: Pod
apiVersion: v1
metadata:
  name: mylightpod
  labels:
    workload: light
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - light
...
This is just an example; you can change the labels and skews to fit your use case.
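Once the pods are deployed, you can check how they actually landed on the nodes, for example:

kubectl get pods -o wide --sort-by=.spec.nodeName

The NODE column shows whether the heavy and light pods ended up on the nodes you expected.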
Additionally, to prevent downtime during these rearrangements you can configure a PodDisruptionBudget, so that voluntary evictions never take down all replicas of a workload at once.
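A minimal sketch of such a budget, assuming the light pods carry the workload: light label from the examples above (the name is illustrative; on clusters older than v1.21, such as the v1.16 nodes shown above, the apiVersion would be policy/v1beta1):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: light-workload-pdb      # illustrative name
spec:
  minAvailable: 1               # keep at least one matching pod running during voluntary disruptions
  selector:
    matchLabels:
      workload: light

This limits how many of the matching pods can be evicted voluntarily at the same time, so rearranging or draining nodes does not take the whole workload down.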