How to use large volumes of data in Kubeflow?

4/12/2019

I have 1 TB of images stored in GCS (the data is split into 3 classes). I want to train a custom TensorFlow model on this data in Kubeflow. Currently, I have pipeline components for training and persisting the model, but I don't know how to correctly feed this data into the classifier.

It seems to me that downloading this data from GCS (with gsutil cp or something similar) every time I run the pipeline (possibly with failures) is not the proper way to do this.

How can I use large volumes of data in Kubeflow pipelines without downloading them every time? How can I express access to this data using the Kubeflow DSL?

-- Marcin Zablocki
google-cloud-platform
kubeflow
kubernetes

2 Answers

4/12/2019

Can you mount the volume on the host machine?

If yes, mount the volume on the host and then expose that directory to the containers as a hostPath volume. The images are then already present on the node, and whenever a new container comes up it can mount the volume and start processing, avoiding the data transfer on each container startup.
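A minimal sketch of how this could be attached to a pipeline step, assuming the kfp v1 Python DSL and the kubernetes client; the image name, volume name, and /mnt/images host path are placeholders, not anything from the question:

```python
import kfp.dsl as dsl
from kubernetes import client as k8s_client


@dsl.pipeline(name="train-from-hostpath")
def train_pipeline():
    train_op = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/trainer:latest",      # placeholder trainer image
        arguments=["--data-dir", "/data/images"],
    )
    # Attach the host directory (already populated with the images) to the step.
    train_op.add_volume(
        k8s_client.V1Volume(
            name="images",
            host_path=k8s_client.V1HostPathVolumeSource(path="/mnt/images"),  # placeholder
        )
    )
    # Mount it inside the training container.
    train_op.add_volume_mount(
        k8s_client.V1VolumeMount(name="images", mount_path="/data/images")
    )
```

Note that a hostPath volume ties the step to whatever node holds the data, so the images have to be present on (or mounted into) every node the pod may be scheduled on.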

-- Akash Sharma
Source: StackOverflow

4/12/2019

Additionally, if your data is in GCS, TensorFlow can read from (and write to) GCS directly. The tf.data API lets you set up a performant data input pipeline.
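A minimal sketch of such an input pipeline, assuming TensorFlow 2.x and a hypothetical layout of gs://my-bucket/images/&lt;class_name&gt;/*.jpg; the bucket path, class names, and image size are placeholders:

```python
import tensorflow as tf

CLASS_NAMES = tf.constant(["class_a", "class_b", "class_c"])  # placeholder class folders

def load_example(path):
    # tf.io.read_file streams the bytes straight from GCS -- no local copy needed.
    image = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    # Derive the integer label from the parent directory name.
    class_name = tf.strings.split(path, "/")[-2]
    label = tf.argmax(tf.cast(tf.equal(class_name, CLASS_NAMES), tf.int32))
    return image, label

dataset = (
    tf.data.Dataset.list_files("gs://my-bucket/images/*/*.jpg")  # placeholder bucket
    .map(load_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# model.fit(dataset, epochs=...)  # can be fed directly into a Keras model
```

With this approach the training component only needs GCS read permissions; the data never has to be copied into the pipeline's containers up front.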

-- Amy U.
Source: StackOverflow