I created a small cluster with GPU nodes on GKE like so:
# create cluster and CPU nodes
gcloud container clusters create clic-cluster \
    --zone us-west1-b \
    --machine-type n1-standard-1 \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 3 \
    --num-nodes 2

# add GPU nodes
gcloud container node-pools create gpu-pool \
    --zone us-west1-b \
    --machine-type n1-standard-2 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --cluster clic-cluster \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 2 \
    --num-nodes 1
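(For completeness: the NVIDIA device drivers were already installed on the GPU nodes, otherwise the first job would not have run at all; on GKE this is typically done with Google's driver-installer DaemonSet. A sketch of that step, assuming COS node images and the manifest path from the GKE documentation:)

# install the NVIDIA drivers on GPU nodes (COS image); manifest path per the GKE docs
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml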
When I submit a GPU job it successfully ends up running on the GPU node. However, when I submit a second job I get an UnexpectedAdmissionError from Kubernetes:
Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.
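As a sanity check, the existing GPU node's capacity and allocatable resources can be inspected with kubectl; the node name below is a placeholder:

# show the GPU node's capacity and allocatable resources (nvidia.com/gpu should be listed)
kubectl describe node <gpu_node_name> | grep -A 6 -E "Capacity|Allocatable"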
I would have expected the cluster to start the second GPU node and place the job there. Any idea why this didn't happen? My job spec looks roughly like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
        [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        command: [...]
The resource constraint needs to be added to the containers spec as well, not just to the initContainers:
apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
        [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
I only required a GPU in one of the initContainers, but this seems to confuse the scheduler. With the limit declared on the main container as well, autoscaling and scheduling work as expected.
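To confirm the fix, you can watch the node pool scale up and the job's pod move from Pending to Running; a quick check, reusing the <job_name> placeholder from the spec above (the job-name label is added to the pod automatically by the Job controller):

# in one terminal: watch for the second GPU node to be added by the autoscaler
kubectl get nodes -w
# in another terminal: watch the job's pod go from Pending to Running on the new node
kubectl get pods -l job-name=<job_name> -o wide -w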