GKE Cluster with 2 node pools doesn't scale up for one of the node pools

1/15/2022

I have two node pools, one with GPUs and one CPU-only, and I run two types of jobs, each of which should spawn a node from its own node pool:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: cpu-job-
spec:
  template:
    spec:
      containers:
        - name: cpujob
          image: gcr.io/asd
          imagePullPolicy: Always
          command: ["/bin/sh"]
          args: ["-c", REDACTED]
          resources:
            requests:
              memory: "16000Mi"
              cpu: "8000m"
            limits:
              memory: "32000Mi"
              cpu: "16000m"
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: cpujobs
  backoffLimit: 4
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: gpu-job-
spec:
  template:
    spec:
      containers:
        - name: gpu-job
          image: gcr.io/fliermapper/agisoft-image:latest
          imagePullPolicy: Always
          command: ["/bin/sh"]
          args: ["-c", REDACTED]
          resources:
            requests:
              memory: "16000Mi"
              cpu: "8000m"
            limits:
              memory: "32000Mi"
              cpu: "16000m"
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-nodepool: gpujobs
      tolerations:
        - key: nvidia.com/gpu
          value: present
          operator: Equal
  backoffLimit: 4

It works fine for the gpujobs pool, but for the cpujobs pool I get the following events:

  Warning  FailedScheduling   4m40s (x1619 over 29h)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  3m53s (x8788 over 29h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 in backoff after failed scale-up

I have the nodeSelector defined for the CPU pool, so why doesn't the autoscaler recognise the correct node pool and scale it up? Instead it reports that the Pod's node affinity/selector didn't match. Both node pools have been created and are available. Do I need to define taints or tolerations to make this work?
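
For reference, the "didn't match Pod's node affinity/selector" part of the message refers to a simple rule: a nodeSelector matches a node only if every key/value pair in the selector appears among that node's labels. Below is a minimal Python sketch of that rule, not the scheduler's actual code. The function name `selector_matches` and the node label sets are hypothetical; in particular, it is only an assumption that the single node counted in "0/1 nodes" belongs to the gpujobs pool.

```python
def selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """Return True if every key/value pair in node_selector is present in node_labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical labels for the one node currently in the cluster
# (assumed here to come from the gpujobs pool).
existing_node_labels = {"cloud.google.com/gke-nodepool": "gpujobs"}

# Selectors taken from the two Job manifests above.
cpu_selector = {"cloud.google.com/gke-nodepool": "cpujobs"}
gpu_selector = {"cloud.google.com/gke-nodepool": "gpujobs"}

cpu_fits = selector_matches(cpu_selector, existing_node_labels)
gpu_fits = selector_matches(gpu_selector, existing_node_labels)

print("cpu job fits existing node:", cpu_fits)
print("gpu job fits existing node:", gpu_fits)
```

Under that assumption, the CPU job cannot land on the existing node, so it depends entirely on the autoscaler adding a cpujobs node. The "1 in backoff after failed scale-up" part of the event is a separate message from the cluster autoscaler and is not modelled by this sketch.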

-- Walter Morawa
google-kubernetes-engine
kubernetes

0 Answers