I am trying to understand Kubernetes GPU practices better, and I am implementing a small K8s GPU cluster that is supposed to work as described below.
This is going to be a somewhat long explanation, but I hope that having several questions in one place will help others understand GPU practices in Kubernetes better.
If no message is present in the queue and none of the GPU-based pods is executing a program (i.e. not using the GPU), then the GPU node pool should scale down to 0.
Create a GPU node pool. Each node contains N GPUs, where N >= 1. Assign one model-trainer pod to each GPU, i.e. a 1:1 mapping of pods to GPUs. I tried assigning 2 pods to a 2-GPU machine, where each pod is supposed to run an MNIST program.
What I noticed is:
One pod got allocated, executed the program, and later went into a crash loop. Maybe I am making a mistake here, since my Docker image is supposed to run the program only once; I was just doing a feasibility test of running 2 pods simultaneously on the 2 GPUs of the same node. Below are the pod events:
Message                                      Reason    First Seen                Last Seen                 Count
Back-off restarting failed container         BackOff   Jun 21, 2018, 3:18:15 PM  Jun 21, 2018, 4:16:42 PM  143
pulling image "nkumar15/mnist"               Pulling   Jun 21, 2018, 3:11:33 PM  Jun 21, 2018, 3:24:52 PM  5
Successfully pulled image "nkumar15/mnist"   Pulled    Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5
Created container                            Created   Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5
Started container                            Started   Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5
The other pod didn't get assigned to a GPU at all. Below is the message from the pod events:
0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
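For context, each trainer pod requests a single GPU so that the scheduler enforces the 1:1 pod-to-GPU mapping. A minimal sketch of the kind of spec I am using (the pod name is just a placeholder; the image is the MNIST image from the events above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mnist-trainer-1        # placeholder name
spec:
  containers:
  - name: trainer
    image: nkumar15/mnist      # same image as in the events above
    resources:
      limits:
        nvidia.com/gpu: 1      # exactly one GPU per pod -> 1:1 pod-to-GPU mapping
```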
Have multiple GPU machines in the GPU node pool, with each node having only 1 GPU.
K8s will assign each pod to an available GPU on a node, and hopefully there won't be any issue. I have yet to try this.
Again, apologies for such a long question, but I hope it will help others as well.
For question 2, I created a new cluster to reproduce the same issue I had faced multiple times before. I am not sure what changed this time, but the 2nd pod was successfully allocated a GPU. I think with this result I can confirm that a 1-GPU-to-1-pod mapping is allowed on a single multi-GPU node. However, restricting memory per GPU process is not feasible as of 1.10.
Both designs are supported in 1.10. I view design 2 as a special case of design 1: you don't necessarily need to have 1 GPU per node. If your pod needs more GPUs and memory, you have to have multiple GPUs per node, as you mentioned in question (4). I'd go with design 1 unless there's a reason not to.
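For example, a pod is always scheduled onto a single node, so if one pod asks for 2 GPUs, the node it lands on must have at least 2 GPUs. A minimal sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-trainer       # placeholder name
spec:
  containers:
  - name: trainer
    image: your-training-image  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2       # a pod cannot span nodes, so the node needs >= 2 GPUs
```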
I think the best practice would be to create a new cluster with no GPUs (a cluster has a default node pool), and then create a GPU node pool and attach it to the cluster. Your non-GPU workload can run in the default pool, and the GPU workload can run in the GPU pool. To support scaling down to 0 GPU nodes, you need to set --num-nodes and --min-nodes to 0 when creating the GPU node pool.
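A rough sketch of creating such a node pool with gcloud (the pool name, cluster name, zone, accelerator type, and maximum node count below are placeholders; see the docs linked below for details):

```bash
# Sketch: create an autoscaling GPU node pool that can scale down to 0 nodes.
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --num-nodes 0 \
  --enable-autoscaling --min-nodes 0 --max-nodes 3
```

With autoscaling enabled and a minimum of 0, the autoscaler can remove all GPU nodes when no pending pods request nvidia.com/gpu, which covers the scale-to-zero requirement above.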
Docs:
Create a cluster with no GPUs: https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster#creating_a_cluster
Create a GPU node pool for an existing cluster: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#gpu_pool