Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob

7/2/2019

i am rather new to both kubernetes and tensorflow, trying to run basic kubeflow distributed-tensorflow example from this link (https://github.com/learnk8s/distributed-tensorflow-on-k8s). I am currently running local bare-metal kubernetes cluster with 2-nodes (1-master & 1-worker). Everything works fine when i run it in minikube (following the documentation), both training and serving run successfully. But running the job on local cluster is giving me this error!

Any help would be appreciated.

For this setup, i created a pod for nfs-storage that would be used by the jobs. Because local cluster doesn't have dynamic provisioning enabled, i created persistent volume manually (the files used are attached).

Nfs pod-storage file:

kind: Service
apiVersion: v1
metadata:
  name: nfs-service
spec:
  selector:
    role: nfs-service
  ports:
    # Open the ports required by the NFS server
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
---

kind: Pod
apiVersion: v1
metadata:
  name: nfs-server-pod
  labels:
    role: nfs-service
spec:
  containers:
    - name: nfs-server-container
      image: cpuguy83/nfs-server
      securityContext:
        privileged: true
      args:
        # Pass the paths to share to the Docker image
        - /exports

Persistent Volume & PVC file:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  storageClassName: "standard"
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.96.72.11
    path: "/"

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "standard"
  resources:
    requests:
      storage: 10Gi

TFJob File:

apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure

When i run the job, it give me this error

error: unable to recognize "kube/tfjob.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1alpha1"

After searching a little, someone pointed "v1alpha1" could be out-dated so you should use "v1beta1" (strangely this "v1alpha1" was working with my minikube setup so i am very confused!). But with that although the tfjob gets created, i do not see any new containers starting as opposed to the minikube run, where new pods start and finish successfully. When i describe the Tfjob, i see this error

 Type     Reason            Age   From         Message
  ----     ------            ----  ----         -------
  Warning  InvalidTFJobSpec  22s   tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob"

Since the only difference is the nfs-storage, i think there might be something wrong with my manual setup. Please let me know if i messed up somewhere because i do not have enough background!

-- Ali Tariq
kubeflow
kubernetes
python
tensorflow

1 Answer

7/17/2019

I found the issue that was causing specific error. First, the api-version changed so i had to move from v1alpha1 to v1beta2. Second, the tutorial i followed was using kubeflow v0.1.2 (rather old) and the syntax for defining tfjob in the yaml file has changed ever since (not exactly sure in which version the change happened!). So by looking at the latest example in the git i was able to update the job spec. Here are the files for someone interested!

Tutorial version:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure

updated version:

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: tfjob1
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
-- Ali Tariq
Source: StackOverflow