How to control the scheduling of pods to specific nodes without changing the dc?

4/25/2018

I have some OpenShift nodes; some of them have Tesla P40s and should be dedicated to ML workloads through the NVIDIA device plugin. But I don't want to make users add tolerations or node affinity to their original DeploymentConfigs, which could get messy. How can I achieve this implicitly?

What I want to achieve:

  1. Only ML pods can be scheduled onto the nodes that have GPUs.
  2. ML users don't need to change anything in their dc except adding the "nvidia.com/gpu" resource limit.

If a custom scheduler is the only way, how would I write one? Thanks.

-- 白栋天
kubernetes
openshift

1 Answer

4/25/2018

My understanding of your problem is that you're trying to do two things:

  1. Schedule GPU workloads correctly onto nodes that have GPUs available
  2. Make sure pods that don't need GPUs are not scheduled onto nodes that have GPUs.

You're doing (1) with the NVIDIA device plugin, which is the correct approach, since it uses the concept of Extended Resources.
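For reference, with the device plugin running, a pod asks for the GPU purely through the extended resource limit; nothing else in the spec has to mention the hardware. A minimal sketch (the pod name, container name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # hypothetical name
spec:
  containers:
  - name: cuda-container          # placeholder container name and image
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1         # the extended resource exposed by the device plugin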

To do (2), Taints and Tolerations are indeed the recommended way. The docs even talk explicitly about the GPU use case. Quoting the documentation:

Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don’t need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don’t have to manually add tolerations to your pods.
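Whichever way the tolerations eventually get applied, the tainting side is done once per GPU node. A sketch of the resulting taint on the Node object, using the extended resource name as the taint key as the docs suggest for the ExtendedResourceToleration plugin (equivalent to kubectl taint nodes <node-name> nvidia.com/gpu=present:NoSchedule; the value "present" is arbitrary):

spec:
  taints:
  - key: nvidia.com/gpu        # matches the extended resource name
    value: present
    effect: NoSchedule

Note that the ExtendedResourceToleration admission plugin has to be enabled on the API server for the automatic-toleration behaviour described above.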

Only those users who explicitly need the GPUs have to add a toleration in their pod spec, and it's fairly straightforward to do so. It looks like this (ref: Advanced Scheduling in Kubernetes):

tolerations:
- key: "gpu"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

So usually this is an acceptable trade-off.

However, if you absolutely do not want users to have to add that toleration themselves, what you need is an Admission Controller.

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized.

In particular, you want the special admission controller known as MutatingAdmissionWebhook.
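Registering the webhook is itself just another API object. A rough sketch of what the registration could look like, assuming a (hypothetical) webhook server exposed through a Service named gpu-toleration-injector in kube-system; the apiVersion may differ depending on your cluster version:

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: gpu-toleration-injector            # hypothetical name
webhooks:
- name: gpu-toleration-injector.example.com
  failurePolicy: Ignore                    # don't block pod creation if the webhook is unavailable
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: kube-system
      name: gpu-toleration-injector        # hypothetical Service in front of your webhook server
      path: /mutate
    caBundle: <base64-encoded CA certificate that signed the webhook's serving cert>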

Your custom MutatingAdmissionWebhook can inspect the pod spec, look for:

resources:
  limits:
    nvidia.com/gpu: 2

and then automatically add the required toleration to the pod spec, all without the user knowing. You still end up using Taints and Tolerations; the users just don't see them anymore. You don't need to write a new scheduler for this.
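To make that concrete: for a pod whose containers request nvidia.com/gpu, the webhook would patch the object so that what actually gets persisted looks roughly like this (a sketch reusing the toleration from the example above; it has to match whatever taint you put on your GPU nodes):

spec:
  containers:
  - name: ml-worker                # the user's original container, unchanged (name is a placeholder)
    resources:
      limits:
        nvidia.com/gpu: 2
  tolerations:                     # injected by the webhook; the user never writes this
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"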

An example of how to write an admission controller webhook is available in the official Kubernetes repository as part of the e2e tests.

-- ffledgling
Source: StackOverflow