Kubernetes cpu requests/limits in heterogeneous cluster

8/14/2018

Kubernetes allows you to specify a CPU limit and/or request for a Pod.

Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to:

  • 1 AWS vCPU
  • 1 GCP Core
  • 1 Azure vCore
  • 1 IBM vCPU
  • 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
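
For reference, these units go in a container's resources block; a minimal Pod spec (names are illustrative) requesting half a CPU and capped at one full CPU would look like this:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-example
spec:
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
    resources:
      requests:
        cpu: "500m"   # 0.5 cpu units
      limits:
        cpu: "1"      # 1 cpu unit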

Unfortunately, in a heterogeneous cluster (for instance, one with different processor models), the appropriate CPU limit/request depends on the node to which the Pod is assigned; this matters especially for real-time applications.

If we assume that:

  • we can compute a fine-tuned CPU limit for the Pod for each kind of hardware in the cluster
  • we want to let the Kubernetes scheduler choose a matching node in the whole cluster

Is there a way to launch the Pod so that its CPU limit/request depends on the node chosen by the Kubernetes scheduler (or on a Node label)?

The obtained behavior should be (or equivalent to):

  • Before assigning the Pod, the scheduler chooses the node by checking different CPU requests depending on the Node (or a Node Label)
  • At runtime, the Kubelet applies a specific CPU limit depending on the Node (or a Node Label)
-- Noxis_Style
cluster-computing
cpu
heterogeneous
kubernetes
request

2 Answers

5/6/2020

Before assigning the Pod, the scheduler chooses the node by checking different CPU requests depending on the Node (or a Node Label)

Not with the default scheduler; the closest option is using node affinity, as Marcin suggested, so that you can pick a node based on a node label. For example:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: podname
    image: k8s.gcr.io/pause:2.0

In this case, you would tag the Nodes with labels that identify their type or purpose, e.g., db, cache, web, and so on. Then you set the affinity to match the node types you expect.
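
For example, a node could be labeled with a key describing its hardware (the node name and label key/value below are just placeholders), and the matchExpressions above would then match on that key instead of the e2e-az labels:

kubectl label nodes node-1 hardware-type=xeon-e5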

requiredDuringSchedulingIgnoredDuringExecution means the pod won't be scheduled onto a node if the conditions are not met.

preferredDuringSchedulingIgnoredDuringExecution means the scheduler will try to find a node that also matches that condition, but will schedule the pod wherever possible if no node fits the condition specified.

Your other alternative is writing your own custom scheduler.

apiVersion: v1
kind: Pod
metadata:
  name: annotation-default-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: default-scheduler
  containers:
  - name: pod-with-default-annotation-container
    image: k8s.gcr.io/pause:2.0

Kubernetes ships with a default scheduler, which is described here. If the default scheduler does not suit your needs, you can implement your own scheduler. This way you can write complex scheduling logic to decide where each Pod should go, but it is only recommended for requirements that cannot be met with the default scheduler.

Keep in mind that the scheduler is one of the most important components in Kubernetes; the default scheduler is battle-tested and flexible enough to handle most applications. Writing your own scheduler means losing the features provided by the default one, such as load balancing, policies, and filtering. To learn more about the features the default scheduler provides, check the docs here.

If you are willing to take the risks and want to write a custom scheduler, take a look at the docs here.
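
As a sketch, a Pod meant for your custom scheduler simply references it by name; the my-scheduler value below is a placeholder and must match whatever name your scheduler registers under:

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers:
  - name: pod-with-second-annotation-container
    image: k8s.gcr.io/pause:2.0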

At runtime, the Kubelet applies a specific CPU limit depending on the Node (or a Node Label)

The scheduler checks for resource availability on the nodes before assigning the pod to one of them. Each node has its own kubelet, which watches for pods assigned to that node; the kubelet only starts those pods, it does not decide which node a pod should go to.

The kubelet also checks for resources before initializing a Pod; if it cannot initialize the pod, the pod will simply fail and the scheduler will take action to schedule the pod elsewhere.

-- Diego Mendes
Source: StackOverflow

8/14/2018

No, you can't have different requests per node type. What you can do is create a pod manifest with a node affinity for a specific kind of node, and requests which make sense for that node type. For a second kind of node, you will need a second pod manifest which makes sense for that node type. These pod manifests will differ only in their affinity spec and resource requests, so it would be handy to parameterize them. You could do this with Helm, or write a simple script to do it.
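
As a rough sketch (the label key, values, and CPU figures below are placeholders, and a plain nodeSelector is used instead of a full affinity spec for brevity), a Helm template could parameterize both the node selection and the CPU figures:

# templates/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: {{ .Release.Name }}-worker
spec:
  nodeSelector:
    hardware-type: {{ .Values.hardwareType }}
  containers:
  - name: worker
    image: k8s.gcr.io/pause:2.0
    resources:
      requests:
        cpu: {{ .Values.cpuRequest | quote }}
      limits:
        cpu: {{ .Values.cpuLimit | quote }}

You would then keep one values file per hardware type (for example hardwareType: xeon-e5 with cpuRequest: 500m and cpuLimit: "1") and install the chart once per node type.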

This approach would let you launch a pod within a subset of your nodes with resource requests which make sense on those nodes, but there's no way to globally adjust its requests/limits based on where it ends up.

-- Marcin Romaszewicz
Source: StackOverflow