Integrate TensorBoard in Kubeflow pipeline using viewers

5/16/2019

I'm using Kubeflow Pipelines to train Keras models with TensorFlow, and I'm starting from a very simple one.

The model trains fine and the pipeline runs properly, but I'm not able to get the TensorBoard output viewer to work. From the documentation it seems that just adding a proper JSON file at the root path of the training container (/mlpipeline-ui-metadata.json) should be enough, but even when I do so, nothing appears in the Artifacts section of my experiment run (while the Keras logs can be seen correctly).

Here's how I configured it:

mlpipeline-ui-metadata.json (added from the Dockerfile directly)

{
    "version": 1,
    "outputs": [
        {
            "type": "tensorboard",
            "source": "/tf-logs"
        }
    ]
}

("/tf-logs" is just a placeholder at the moment.)

pipeline

import kfp
from kfp import dsl

from kubernetes.client.models import V1EnvVar

def train_op(epochs, batch_size, dropout, first_layer_size, second_layer_size):
    # Return the op so the pipeline function can reference the task
    return dsl.ContainerOp(
        image='MY-IMAGE',
        name='my-train',
        container_kwargs={"image_pull_policy": "Always", 'env': [
            V1EnvVar('TRAIN_EPOCHS', epochs),
            V1EnvVar('TRAIN_BATCH_SIZE', batch_size),
            V1EnvVar('TRAIN_DROPOUT', dropout),
            V1EnvVar('TRAIN_FIRST_LAYER_SIZE', first_layer_size),
            V1EnvVar('TRAIN_SECOND_LAYER_SIZE', second_layer_size),
        ]},
        command=['sh', '-c', '/src/init_script.sh'],
    ).set_memory_request('2G').set_cpu_request('2')

@dsl.pipeline(
    name='My model pipeline',
    description='Pipeline for model training'
)
def my_model_pipeline(epochs, batch_size, dropout, first_layer_size, second_layer_size):

    train_task = train_op(epochs, batch_size, dropout, first_layer_size, second_layer_size)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(my_model_pipeline, 'my_model.zip')

I've already tried accessing the running pod (kubectl exec ...) and verified that the file is actually in the right spot.

By the way, I'm using Kubeflow v0.5.

-- luke035
kubeflow
kubernetes
tensorboard

1 Answer

10/6/2019

TL;DR: The source section should point to a location on shared storage, not a path on the pod's local file system.

The source section in mlpipeline-ui-metadata.json should point to a location that the pipelines-ui pod can later reach, i.e. it should be on shared storage: S3 (if on AWS) or a mounted Kubernetes volume (if on-prem).
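For example, the metadata might then look like this (s3://my-bucket/tf-logs is just a placeholder bucket path; use whatever shared-storage location your pipelines-ui pod can actually reach):

{
    "version": 1,
    "outputs": [
        {
            "type": "tensorboard",
            "source": "s3://my-bucket/tf-logs"
        }
    ]
}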

The way Kubeflow works is that, at the end of the run, it just zips mlpipeline-ui-metadata.json and stores it in MinIO storage. When you click on the Artifacts section, the UI looks for this source section in the zipped JSON and tries to read the TF event files. If the TF event files are not moved from the pod to shared storage, they won't be read, since they exist only on the ephemeral pod's file system.
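As a minimal sketch (s3://my-bucket/tf-logs is a hypothetical bucket path, and it assumes your TensorFlow build can write to that location directly; otherwise copy the event files to shared storage after training), the training code could point the Keras TensorBoard callback at the shared location and write the metadata file at run time:

import json
import numpy as np
import tensorflow as tf

# Hypothetical shared-storage location -- replace with your own bucket or mounted volume
LOG_DIR = 's3://my-bucket/tf-logs/my-run'

# Toy model and data, just to show the wiring
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')
x, y = np.random.rand(32, 4), np.random.rand(32, 1)

# Event files go straight to the shared location, not the pod's local disk
model.fit(x, y, epochs=2, callbacks=[tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR)])

# Write the viewer metadata at run time so "source" matches where the events actually live
with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump({'outputs': [{'type': 'tensorboard', 'source': LOG_DIR}]}, f)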

-- santiago92
Source: StackOverflow