I am trying to replicate the Azure Batch API within Kubernetes. I have a web API that runs as a service and in turn uses the Kubernetes API to create batch jobs dynamically.
So far so good.
Where I am coming unstuck is that each task in these jobs is typically some pretty hard-hitting TensorFlow deep learning, so ideally I would want Kubernetes to schedule only a single pod per node and then, in combination with a node autoscaler, scale up my cluster as required.
In Azure Batch you can specify tasks per VM on a per-job basis, which is analogous to pods per node in Kubernetes. It seems there is no support for this in the Kubernetes API; it is only available via the kubelet's max-pods configuration, which is not ideal as that is more hard-coded than I would like.
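For reference, the kubelet setting I mean is maxPods in the KubeletConfiguration, which is a per-node knob applied by the cluster operator rather than anything a job spec can influence (a minimal sketch):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # caps the total number of pods the node will run (default is 110);
    # note that DaemonSet/system pods also count toward this limit
    maxPods: 10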
So my question is: is there a way, using some sort of metric on a job spec, to force Kubernetes to limit pod instances per node? Ideally this would be a proactive decision by the scheduler, in that it doesn't schedule a pod only to realise later that it is getting no resources.
You can use pod anti-affinity rules to ensure that once a pod of a specific application is scheduled on a node, no other pod of the same application is scheduled on that node.
Copying the example Redis Deployment from the docs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
This will ensure that only one instance of the Redis cache is running on any single node. Some key things to note:
- The label app=store is important in identifying the application.
- Using that label, the hostname of the node is matched to decide scheduling: topologyKey: "kubernetes.io/hostname"
- The expression requiredDuringSchedulingIgnoredDuringExecution ensures that this is a hard requirement during scheduling, and no pod will be scheduled if the criteria are not met.
Do check out various options for scheduling here for more details.
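If you are creating these workloads as Jobs rather than Deployments, the same anti-affinity block goes in the Job's pod template. A minimal sketch, assuming a placeholder tf-train name, image and command for your TensorFlow task:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tf-train                   # placeholder name
    spec:
      parallelism: 3                   # three pods, each forced onto its own node
      completions: 3
      template:
        metadata:
          labels:
            app: tf-train              # label the anti-affinity rule matches on
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - tf-train
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: tf-worker
            image: tensorflow/tensorflow:latest-gpu   # placeholder image
            command: ["python", "/job/train.py"]      # placeholder command
          restartPolicy: Never

Because the rule is a hard requirement, any pod that cannot get a node of its own stays Pending, which should give the cluster autoscaler exactly the signal it needs to add nodes.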