Kubernetes pod rescheduling on insufficient memory

6/26/2020

I am trying to find an elegant way to resolve the following scenario.

We have an AWS Kubernetes cluster with 6 nodes of 16G RAM each. The cluster runs various pods with different resource requirements, requesting between 1G and 6G of memory.

There is a scenario where a pod stays pending due to insufficient memory. It happens when we need to upgrade a few pods with different memory requirements: the pod requesting 6G stays pending because no single node has 6G available.

What I would expect from Kubernetes is to rearrange pods between nodes in order to free 6G on one specific node, rather than keeping 5G free on two different nodes (10G in total) and reporting insufficient memory.

Is there a way to instruct Kubernetes to utilise memory better and handle this automatically?

I was thinking about the pod priority capability: the less memory a pod requests, the lower its priority. I am wondering whether, based on this setting, Kubernetes would be able to evict the less important (smaller) pods once the bigger one is deployed, and in this way rearrange them between nodes.
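
A minimal sketch of that idea, assuming two PriorityClasses (the names here are hypothetical), might look like the following; the 6G pod would then reference the higher class via priorityClassName:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: heavy-workload          # hypothetical name
value: 1000000                  # higher value = higher priority
globalDefault: false
description: "For large-memory pods; may preempt lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: light-workload          # hypothetical name
value: 1000
globalDefault: false
description: "For small-memory pods; candidates for preemption."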

Any ideas would be appreciated.

-- Denis Voloshin
amazon-eks
kubernetes
kubernetes-pod

2 Answers

6/26/2020

By default, the Kubernetes scheduler will never kill running containers to accommodate newer ones. If it did, running containers might be forced to reschedule onto other nodes, which is undesirable. Kubernetes respects the current state of the cluster and tries to keep the environment stable.

What you can do about this issue is: when you roll out the 6G RAM app, deploy it first and then delete the pods requesting 1G RAM, so the scheduler can place the bigger app on an available node before spreading the smaller pods across the remaining nodes. This also matches the scheduler's usual approach: it tries to place the bigger pieces first so the smaller ones can fill in around them.
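
For reference, the size the scheduler works with is the pod's memory request; a minimal sketch of the 6G app (the name and image are placeholders) could declare it like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: big-app                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: big-app
  template:
    metadata:
      labels:
        app: big-app
    spec:
      containers:
      - name: big-app
        image: myapp:latest     # placeholder image
        resources:
          requests:
            memory: "6Gi"       # the scheduler needs a node with 6Gi unreserved
          limits:
            memory: "6Gi"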

-- Akin Ozer
Source: StackOverflow

6/27/2020

There is no silver-bullet solution, but there are a few things you can combine, such as Pod Affinity/Anti-Affinity, Node Affinity, and Pod Topology Spread Constraints. It also depends on your workload priorities.

If you have 6 nodes you can have something like this:

NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,type=heavy
node2   Ready    <none>   3m58s   v1.16.0   node=node2,type=heavy
node3   Ready    <none>   3m17s   v1.16.0   node=node3,type=heavy
node4   Ready    <none>   2m43s   v1.16.0   node=node4,type=light
node5   Ready    <none>   3m17s   v1.16.0   node=node5,type=light
node6   Ready    <none>   2m43s   v1.16.0   node=node6,type=light

Then, in your 6G Pod spec, you can add a topology spread constraint with a maximum skew of 3 over the type topology, plus a PodAffinity rule so the heavy pods are kept together on the type=heavy nodes:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    workload: heavy
spec:
  topologySpreadConstraints:
  - maxSkew: 3
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        workload: heavy
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: workload
            operator: In
            values:
            - heavy
        topologyKey: type
  containers:
  - name: myheavyapp
    image: myapp:latest
  ...

Then you can use NodeAffinity just to schedule your light 1G pods on the light nodes only.

kind: Pod
apiVersion: v1
metadata:
  name: mylightpod
  labels:
    workload: light
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: type
            operator: In
            values:
            - light
    ...

This is just an example; you can change the labels and skews to fit whatever your use case is.

Additionally, to prevent downtime you can configure a PodDisruptionBudget.
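
As a rough sketch (assuming the heavy pods carry the workload: heavy label from the example above), a PodDisruptionBudget could look like this:

apiVersion: policy/v1beta1      # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: heavy-pdb               # hypothetical name
spec:
  minAvailable: 1               # keep at least one heavy pod up during voluntary disruptions
  selector:
    matchLabels:
      workload: heavy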

-- Rico
Source: StackOverflow