I have two node pools, one with GPUs and one CPU-only, and I run two types of jobs, each of which should trigger a scale-up of a node from its relevant node pool:
apiVersion: batch/v1
kind: Job
metadata:
  generateName: cpu-job-
spec:
  template:
    spec:
      containers:
      - name: cpujob
        image: gcr.io/asd
        imagePullPolicy: Always
        command: ["/bin/sh"]
        args: ["-c", REDACTED]
        resources:
          requests:
            memory: "16000Mi"
            cpu: "8000m"
          limits:
            memory: "32000Mi"
            cpu: "16000m"
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: cpujobs
  backoffLimit: 4
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: gpu-job-
spec:
  template:
    spec:
      containers:
      - name: gpu-job
        image: gcr.io/fliermapper/agisoft-image:latest
        imagePullPolicy: Always
        command: ["/bin/sh"]
        args: ["-c", REDACTED]
        resources:
          requests:
            memory: "16000Mi"
            cpu: "8000m"
          limits:
            memory: "32000Mi"
            cpu: "16000m"
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: gpujobs
      tolerations:
      - key: nvidia.com/gpu
        value: present
        operator: Equal
  backoffLimit: 4
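For context on the toleration above: GKE automatically applies an `nvidia.com/gpu` taint to nodes in GPU node pools, which is what the toleration matches. The effective taint on the GPU nodes looks roughly like this (shown only to illustrate the pairing; the CPU pool has no such taint):

```
taints:
- key: nvidia.com/gpu
  value: present
  effect: NoSchedule
```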
It works fine for the gpujobs pool, but for the cpujobs pool I get the following error:
Warning  FailedScheduling  4m40s (x1619 over 29h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
Normal NotTriggerScaleUp 3m53s (x8788 over 29h) cluster-autoscaler pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 in backoff after failed scale-up
I have the nodeSelector defined for the CPU pool, so why doesn't the scheduler recognise the correct node pool and scale it up? The message says the Pod's node affinity/selector didn't match, yet I have created both node pools and they are available. Do I need to define taints or tolerations to make this work?
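For reference, this is a sketch of how the node labels and taints could be verified, in case that helps diagnose the mismatch (the label key and pool names are the ones from the manifests above; I haven't confirmed the actual label values on my nodes yet):

```shell
# Show every node with the value of the GKE node-pool label as a column,
# to confirm the CPU nodes really carry cloud.google.com/gke-nodepool=cpujobs.
kubectl get nodes -L cloud.google.com/gke-nodepool

# Show any taints on the CPU pool's nodes; an unexpected taint would also
# block scheduling, although the event above points at the selector instead.
kubectl describe nodes -l cloud.google.com/gke-nodepool=cpujobs | grep -i -A2 taints
```

If the first command shows no node with the `cpujobs` label value, the selector can never match and the autoscaler has nothing to scale.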