Running Deep Learning model using Kubernetes

2/27/2019

I am in stage of deploying a deep learning model using Kubernetes. My questions are as follows:

1: Does kubernetes supports parallel processing?. After preprocessing the data, need to run deep learning model with different set of hyper parameters. Is it possible to run it parallely on different pods and whats the python code for it.?

  1. If a particular pod fails or breaks while running, will it enable another pod (copy of the original pod) to run automatically?

  2. Also, if a particular pod reaches certain percentage of GPU (threshold), will it make another pod to run automatically?.

I need your help on this. I am finding much tutorial on this. Also, looking for python code to perform all these actions.

Thanks

-- Sarvan
deployment
kubernetes

1 Answer

2/28/2019

This seems like no prior research has been put into this question, and you are not a new member - so for the future please try to ask concrete questions on what problems did you meet as it shows that you have put an effort before asking question. I will try to answer from Kubernetes perspective as I had no opportunity to use deep learning on Kubernetes, yet.

  1. Kubernetes does support parallel processing. The cluster is a set of "independent" nodes each with its own memory and CPU but they are connected through a network and can be used all together to solve a common task. You can have multiple pods/jobs running what you need. More about it in context of ML here and an example of Deep Learning on Kubernetes here.

  2. Pod as a basic building block in Kubernetes is also a representation of a running process on your cluster. They are what we could call cattle. We consider them ephemeral entities that can be replaced or discarded at will. Common practice is to not create Pod itself but for example Deployments which will make sure you always have a specified number of Pods running (when one dies, another is created on it's place to keep the specified number). You can find more about controllers here:

They can be responsible for replication and roll-out and providing self-healing capabilities at cluster scope.

  1. Not sure about GPU as I did not use it too much, but Kubernetes definitely can scale based on CPU and Memory, you can also set Resources and Limits to control this. I believe Cluster Autoscaler can scale based on GPU as there is a specific metric for GPU:

Minimum and maximum number of different GPUs in cluster, in the format ::. Cluster autoscaler will not scale the cluster beyond these numbers. Can be passed multiple times. CURRENTLY THIS FLAG ONLY WORKS ON GKE.

-- aurelius
Source: StackOverflow