Here is what I am working with. I have 3 nodepools on GKE
I have pods that will require any of the following memory requests. Assume limits are very close to requests.
1GB, 2GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB
How best can I associate a pod to a nodepool for max efficiency?
So far I have 3 strategies.
For each pod config, determine the "rightful nodepool": the smallest nodepool that can accommodate the pod config in an ideal world. So for a 2GB pod that's n1s1, but for a 4GB pod it'd be n1s2.
Which of these or any other strategies will minimize wasting resources?
=======
Why would you have 3 pools like that in the first place? You generally want to use the largest instance type you can that gets you under 110 pods per node (which is the default hard cap). The job of the scheduler is to optimize the packing for you, and it's pretty good at that with the default settings.
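For context, the scheduler bin-packs against the memory requests on each pod. A minimal sketch of one of the pod shapes from the question (the name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-4gb                  # placeholder name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    resources:
      requests:
        memory: "4Gi"               # what the scheduler bin-packs against
      limits:
        memory: "4Gi"               # limits kept close to requests, as in the question
```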
I would use a mix of taints and tolerations together with node affinity.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
You can set a taint on a node with `kubectl taint nodes node1 key=value:NoSchedule`.
The taint has key `key`, value `value`, and taint effect `NoSchedule`. This means that no pod will be able to schedule onto `node1` unless it has a matching toleration.
When writing the pod YAML, you add a toleration to the PodSpec that matches the taint created on `node1`. A pod with either of the following tolerations will be allowed to schedule onto `node1`:
```yaml
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```

or

```yaml
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"
```
Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn't be running. A few of the use cases are:

- Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, `kubectl taint nodes nodename dedicated=groupName:NoSchedule`) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. `dedicated=groupName`), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with `dedicated=groupName` (see the sketch after this list).
- Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don't need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. `kubectl taint nodes nodename special=true:NoSchedule` or `kubectl taint nodes nodename special=true:PreferNoSchedule`) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name, and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the `ExtendedResourceToleration` admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated to pods requesting such hardware, and you don't have to manually add tolerations to your pods.
- Taint based Evictions: a per-pod-configurable eviction behavior when there are node problems.
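As a rough sketch of the dedicated-nodes pattern above, assuming a placeholder group name `groupName` and node name `node1`: taint and label the nodes, then give the group's pods both a toleration and a node selector:

```yaml
# Taint and label the dedicated nodes first (node and group names are placeholders):
#   kubectl taint nodes node1 dedicated=groupName:NoSchedule
#   kubectl label nodes node1 dedicated=groupName
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-workload   # placeholder name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
  tolerations:               # lets the pod onto the tainted nodes
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  nodeSelector:              # keeps the pod off every other node
    dedicated: groupName
```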
As for node affinity: it is conceptually similar to `nodeSelector` – it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.

There are currently two types of node affinity, called `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution`. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like `nodeSelector` but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar to how `nodeSelector` works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod will still continue to run on the node. In the future we plan to offer `requiredDuringSchedulingRequiredDuringExecution`, which will be just like `requiredDuringSchedulingIgnoredDuringExecution` except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.

Thus an example of `requiredDuringSchedulingIgnoredDuringExecution` would be "only run the pod on nodes with Intel CPUs", and an example of `preferredDuringSchedulingIgnoredDuringExecution` would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".

Node affinity is specified as field `nodeAffinity` of field `affinity` in the PodSpec. ...
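For illustration, a sketch of what that looks like in a pod manifest (the pod name, image, and label keys/values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity   # placeholder name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:    # "hard" rule
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-central1-a
      preferredDuringSchedulingIgnoredDuringExecution:   # "soft" preference
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key                  # placeholder label
            operator: In
            values:
            - another-node-label-value
```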
The new node affinity syntax supports the following operators: `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, `Lt`. You can use `NotIn` and `DoesNotExist` to achieve node anti-affinity behavior, or use node taints to repel pods from specific nodes.

If you specify both `nodeSelector` and `nodeAffinity`, both must be satisfied for the pod to be scheduled onto a candidate node.

If you specify multiple `nodeSelectorTerms` associated with `nodeAffinity` types, then the pod can be scheduled onto a node if one of the `nodeSelectorTerms` can be satisfied (the terms are ORed).

If you specify multiple `matchExpressions` within a single `nodeSelectorTerm`, then the pod can be scheduled onto a node only if all of the `matchExpressions` are satisfied (the expressions are ANDed).
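To make the OR/AND semantics concrete, a sketch showing only the `affinity` field (the label keys are well-known node labels; the values are placeholders). The two `nodeSelectorTerms` below are ORed, while the two `matchExpressions` inside the first term are ANDed:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:          # term 1: both expressions must match (ANDed)
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-central1-a"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["n1-standard-2"]
      - matchExpressions:          # term 2: ORed with term 1
        - key: dedicated
          operator: Exists
```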