Distributed training with Tensorflow 2.x on a K8S cluster

4/18/2020

I am trying to figure out how to perform distributed training for a TensorFlow (2.x) script. Googling around, I only found rather old repositories based on TensorFlow 1.x, and the official documentation (https://www.tensorflow.org/guide/distributed_training) seems mostly focused on having multiple GPU cards on the same machine, with the multi-worker strategies (e.g. MultiWorkerMirroredStrategy, ParameterServerStrategy) still being experimental.
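
For context, this is roughly the kind of training script I have in mind, a minimal sketch using MultiWorkerMirroredStrategy (the dataset and model are just placeholders for illustration, and each worker is assumed to have TF_CONFIG set in its environment):

```python
import json
import os

import tensorflow as tf

# Each worker is expected to have TF_CONFIG set, e.g.:
# {"cluster": {"worker": ["worker-0:2222", "worker-1:2222"]},
#  "task": {"type": "worker", "index": 0}}
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
num_workers = len(tf_config.get("cluster", {}).get("worker", [])) or 1
per_worker_batch = 64
global_batch_size = per_worker_batch * num_workers


def make_dataset(batch_size):
    # Placeholder dataset; in practice this would read from shared storage.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., tf.newaxis].astype("float32") / 255.0
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(60000)
        .batch(batch_size)
    )


with strategy.scope():
    # Model creation happens inside the strategy scope so variables are
    # replicated and gradients aggregated across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(make_dataset(global_batch_size), epochs=3)
```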

Does anyone have a better suggestion? Is there any provider-specific solution for this?

My ideal would be to build an image that runs as multiple autoscaled pods on a k8s cluster, i.e. something similar to what one can find in https://github.com/learnk8s/distributed-tensorflow-on-k8s, but more up to date (see the sketch below for what I picture on the Kubernetes side). I would like to avoid digging into that approach only to find out later that there was a better way.
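
What I picture on the Kubernetes side is each pod deriving its own TF_CONFIG from its hostname before launching the training script, assuming the workers run as a StatefulSet behind a headless service (the names "trainer" and NUM_WORKERS below are hypothetical, just to illustrate the idea):

```python
import json
import os


def build_tf_config(num_workers, port=2222):
    """Derive TF_CONFIG from the pod's hostname, assuming a StatefulSet
    named 'trainer' behind a headless service 'trainer', so pods are
    reachable as trainer-0.trainer, trainer-1.trainer, ... (hypothetical names)."""
    hostname = os.environ.get("HOSTNAME", "trainer-0")  # e.g. "trainer-2"
    index = int(hostname.rsplit("-", 1)[-1])
    workers = [f"trainer-{i}.trainer:{port}" for i in range(num_workers)]
    return {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    }


if __name__ == "__main__":
    num_workers = int(os.environ.get("NUM_WORKERS", "2"))
    os.environ["TF_CONFIG"] = json.dumps(build_tf_config(num_workers))
    # ...then run the training script sketched above in this process.
```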

-- Neo
distributed-computing
docker
kubernetes
tensorflow
tensorflow2.0

0 Answers