I am rather new to both Kubernetes and TensorFlow and am trying to run the basic Kubeflow distributed TensorFlow example from this link (https://github.com/learnk8s/distributed-tensorflow-on-k8s). I am currently running a local bare-metal Kubernetes cluster with two nodes (1 master and 1 worker). Everything works fine when I run the example in minikube (following the documentation): both training and serving complete successfully. But running the job on the local cluster gives me the error shown below.
Any help would be appreciated.
For this setup, I created a pod that acts as NFS storage for the jobs. Because the local cluster doesn't have dynamic provisioning enabled, I created the PersistentVolume manually (the files used are attached).
NFS Service and server Pod file:
kind: Service
apiVersion: v1
metadata:
  name: nfs-service
spec:
  selector:
    role: nfs-service
  ports:
    # Open the ports required by the NFS server
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
---
kind: Pod
apiVersion: v1
metadata:
  name: nfs-server-pod
  labels:
    role: nfs-service
spec:
  containers:
    - name: nfs-server-container
      image: cpuguy83/nfs-server
      securityContext:
        privileged: true
      args:
        # Pass the paths to share to the Docker image
        - /exports
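The PersistentVolume below points at the Service's ClusterIP, so it is worth checking what IP the Service actually got (mine happened to be 10.96.72.11). Plain kubectl is enough for that:

# Confirm the NFS pod is running and note the Service's ClusterIP
kubectl get pod nfs-server-pod
kubectl get svc nfs-service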
Persistent Volume & PVC file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  storageClassName: "standard"
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.96.72.11
    path: "/"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "standard"
  resources:
    requests:
      storage: 10Gi
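For reference, I create these with kubectl and check that the claim binds before moving on (nfs-pv.yaml is just a placeholder for whatever the file above is saved as):

kubectl apply -f nfs-pv.yaml
# The claim should show STATUS "Bound" against the nfs volume
kubectl get pv,pvc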
TFJob File:
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              restartPolicy: OnFailure
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
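For completeness, this is how I submit the job (kube/tfjob.yaml is simply where the file above lives in my checkout, matching the path in the error below):

kubectl apply -f kube/tfjob.yaml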
When I run the job, it gives me this error:
error: unable to recognize "kube/tfjob.yaml": no matches for kind "TFJob" in version "kubeflow.org/v1alpha1"
After searching a little, someone pointed out that "v1alpha1" could be outdated and that I should use "v1beta1" instead (strangely, "v1alpha1" was working with my minikube setup, so I am very confused!). With that change the TFJob does get created, but I do not see any new containers starting, unlike the minikube run where new pods start and finish successfully. When I describe the TFJob, I see this error:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning InvalidTFJobSpec 22s tf-operator Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob"
Since the only difference is the NFS storage, I think there might be something wrong with my manual setup. Please let me know if I messed up somewhere, because I do not have enough background!
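In case it helps with debugging, one quick way to see which TFJob versions a cluster actually serves (to compare the minikube and bare-metal setups) is plain kubectl, nothing tutorial-specific:

# List the API groups/versions the cluster knows about
kubectl api-versions | grep kubeflow.org
# Inspect the TFJob CRD registered by the tf-operator
kubectl get crd tfjobs.kubeflow.org -o yaml | grep -i version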
I found the issue that was causing this specific error. First, the API version changed, so I had to move from v1alpha1 to v1beta2. Second, the tutorial I followed was using Kubeflow v0.1.2 (rather old), and the syntax for defining a TFJob in the YAML file has changed since then (I'm not exactly sure in which version the change happened!). By looking at the latest example in the Git repo I was able to update the job spec. Here are the files for anyone interested, along with a couple of plain kubectl commands at the end for checking your own run.
Tutorial version:
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: tfjob1
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
Updated version:
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: tfjob1
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              args:
                - --model_dir
                - ./out/vars
                - --export_dir
                - ./out/models
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
    PS:
      replicas: 1
      template:
        spec:
          volumes:
            - name: nfs-volume
              persistentVolumeClaim:
                claimName: nfs
          containers:
            - name: tensorflow
              image: learnk8s/mnist:1.0.0
              imagePullPolicy: IfNotPresent
              volumeMounts:
                - mountPath: /app/out
                  name: nfs-volume
          restartPolicy: OnFailure
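For anyone verifying their own run after applying the updated spec, the job and the pods it creates can be watched with plain kubectl:

kubectl get tfjob tfjob1
kubectl describe tfjob tfjob1
# The chief, worker and ps pods show up with the job name in their names
kubectl get pods | grep tfjob1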