I am trying to figure out how to perform distributed training for a TensorFlow (2.x) script. Googling around, I only found rather old repositories based on TensorFlow 1.x, and the official documentation (https://www.tensorflow.org/guide/distributed_training) seems mostly focused on having multiple GPU cards on the same machine, with the multi-worker variants (e.g. MultiWorkerMirroredStrategy, ParameterServerStrategy) still marked as experimental.
Does anyone have a better suggestion? Is there any provider-specific solution for this?
My ideal would be to create an image to run as multiple autoscaled pods on a k8s cluster, i.e. something similar to what one can find in https://github.com/learnk8s/distributed-tensorflow-on-k8s, but more up to date. I would like to avoid digging into it only to find out later that there was a better way.
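For context, my current understanding from the guide is that each pod would need a `TF_CONFIG` environment variable describing the cluster, which MultiWorkerMirroredStrategy then reads at startup. A minimal sketch of what I imagine each pod setting (the service/pod names and ports are made up, not from any real deployment):

```python
import json
import os

# Hypothetical cluster of two worker pods. On k8s these addresses would
# typically come from a headless Service giving each pod a stable DNS name.
tf_config = {
    "cluster": {
        "worker": ["trainer-0.trainer:2222", "trainer-1.trainer:2222"],
    },
    # Each pod declares its own role and index; here we pretend to be
    # the first worker. On k8s this index would be derived per pod,
    # e.g. from the StatefulSet ordinal.
    "task": {"type": "worker", "index": 0},
}

os.environ["TF_CONFIG"] = json.dumps(tf_config)
print(os.environ["TF_CONFIG"])
```

With `TF_CONFIG` set, my understanding is that `tf.distribute.MultiWorkerMirroredStrategy()` picks up the cluster spec automatically, and the model is then built and compiled inside `strategy.scope()` as in the single-machine case. What I cannot tell is whether this is the recommended approach today or whether there is a more maintained, k8s-native way (operators, autoscaling) to manage it.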