Kubernetes node autoscaling and fine control over pods per node

8/26/2018

I am trying to replicate the Azure Batch API within Kubernetes. I have a web API that runs as a service and that in turn uses the Kubernetes API to create batch jobs dynamically.

So far so good.

Where I am coming unstuck is that each task in these jobs is typically some pretty hard-hitting TensorFlow deep learning, so ideally I would want Kubernetes to schedule only a single pod per node and then, in combination with a node autoscaler, scale up my cluster as required.

In Azure Batch you can specify tasks per VM on a per-job basis, analogous to pods per node in Kubernetes. It seems that there is no support for this in the Kubernetes API; it is only available via the kubelet max pods configuration, which is not ideal as that's more hard-coded than I would like.
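For reference, the setting I mean is the kubelet's node-wide pod cap, set via its --max-pods flag or a kubelet config file. A minimal sketch of the latter (note that system pods such as kube-proxy also count against this limit, so a very low value is impractical):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Node-wide cap on pods, applied to every pod on the node
# including system pods; it cannot be set per job.
maxPods: 10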

So my question is: is there a way, using some sort of constraint on a job spec, to force Kubernetes to limit pod instances per node? Ideally this would be a proactive decision by the scheduler, in that it doesn't schedule a pod only to realise later that it is getting no resources.

-- user1371314
kubernetes

1 Answer

8/27/2018

You can use pod anti-affinity rules to ensure that once a pod of a specific application is scheduled on a node, no other pod of the same application is scheduled on that node.

Copying the example Redis deployment from the Kubernetes docs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule this pod onto a node that already
          # runs a pod matching the label selector below.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            # Spread per distinct node hostname, i.e. one pod per node.
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

This ensures that only one instance of the Redis cache runs on any single node. Some key things to note (a Job-based variant of this spec follows the list):

  1. The label app=store is what identifies the application's pods and what the anti-affinity selector matches on.

  2. topologyKey: "kubernetes.io/hostname" sets the domain the rule applies to: the node's hostname, so the rule is enforced per individual node.

  3. The expression requiredDuringSchedulingIgnoredDuringExecution makes this a hard scheduling constraint: no pod will be scheduled if the criteria are not met.
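Since the question is about batch Jobs rather than Deployments, the same anti-affinity block can be placed in a Job's pod template. A minimal sketch, assuming a hypothetical trainer image and the placeholder label app=tf-task (neither is from the original question):

apiVersion: batch/v1
kind: Job
metadata:
  name: tf-train
spec:
  parallelism: 3
  template:
    metadata:
      labels:
        app: tf-task
    spec:
      affinity:
        podAntiAffinity:
          # Refuse to co-locate two tf-task pods on one node; pods that
          # cannot be placed stay Pending, which is what a node
          # autoscaler reacts to by adding nodes.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - tf-task
            topologyKey: "kubernetes.io/hostname"
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.com/tf-trainer:latest   # placeholder image

Once running, kubectl get pods -l app=tf-task -o wide should show each pod on a distinct node.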

Do check out the various scheduling options in the Kubernetes documentation for more details.

-- Vishal Biyani
Source: StackOverflow