I'm using Kubernetes 1.7 and Python client 2.0. I have a hello-world machine learning program (MNIST in TensorFlow) running in a Kubernetes cluster, with one worker and one parameter server, each deployed as kind: Job in the manifest. My custom scheduler, written in Python, watches for pending pods using list_namespaced_pod and schedules them based on resource availability. Since the pods arrive as a stream of events, how can I make sure that either all pending pods under one job get scheduled or none do? In other words, I don't want to schedule a job partially: either all pods of a pending job are scheduled, or none are.
Also, is there a way in Kubernetes to catch/find/watch all events of the same job (i.e. everything deployed under one manifest file) at once? I also tried list_namespaced_event, but it likewise reports events one after another. As a result, it can happen that one pod of a job gets scheduled while a later one can't be. A small version of the custom scheduler is available here.
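For context, the core watch loop looks roughly like this (a simplified sketch; the namespace and the print are placeholders for the real scheduling logic):

from kubernetes import client, config, watch

config.load_kube_config()  # load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# Stream pod events and react to pods that still need a scheduling decision.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="my-namespace"):
    pod = event["object"]
    if pod.status.phase == "Pending" and pod.spec.node_name is None:
        # ...check resource availability and bind the pod to a node...
        print("pending pod:", pod.metadata.name)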
my-mnist.yml file (a smaller version)
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-ps
  labels:
    name: my-ps
    jobName: my-ps-mnist_dist
  namespace: my-namespace
spec:
  template:
    metadata:
      labels:
        name: my-ps
        jobName: my-ps-mnist_dist
        jobId: 5b2a6cd25b02821468e41571
        manifestFile: my-mnist.yml
        jobTrainingType: distributed
        jobTaskName: "my-ps"
        jobTaskIndex: "0"
        jobWorkerInstances: "1"
      namespace: my-namespace
    spec:
      nodeSelector:
        gpu: "no"
        dlts: "yes"
      containers:
      - name: my-ps
        image: "123.456.789.10:1234/myimg/5b2a6cd25b02821468e41571"
        imagePullPolicy: Always
        tty: true
        stdin: true
        env:
        - name: JOB_TASK_NAME
          value: "ps"
        - name: JOB_ID
          value: "5b2a6cd25b02821468e41571"
        - name: JOB_LD_LIBRARY_PATH
          value: "/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia:/usr/local/cuda-9.0/targets/x86_64-linux/lib"
        - name: JOB_PYTHON_VERSION
          value: "3"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: my-wkr
  labels:
    name: my-wkr
    jobName: wkr0-mnist_dist
  namespace: my-namespace
spec:
  template:
    metadata:
      labels:
        name: my-wkr
        jobName: wkr0-mnist_dist
        jobId: 5b2a6cd25b02821468e41571
        manifestFile: my-mnist.yml
        jobTrainingType: distributed
        jobTaskName: "worker"
        jobTaskIndex: "0"
        jobWorkerInstances: "1"
      namespace: my-namespace
    spec:
      nodeSelector:
        gpu: "yes"
        dlts: "yes"
      containers:
      - name: my-wkr
        image: "123.456.789.10:1234/myimg/5b2a6cd25b02821468e41571"
        imagePullPolicy: Always
        tty: true
        stdin: true
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 2
        env:
        - name: JOB_TASK_NAME
          value: "worker"
        - name: JOB_TASK_INDEX
          value: "0"
        - name: JOB_ID
          value: "5b2a6cd25b02821468e41571"
        - name: JOB_LD_LIBRARY_PATH
          value: "/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia:/usr/local/cuda-9.0/targets/x86_64-linux/lib"
        - name: JOB_PYTHON_VERSION
          value: "3"
Also, is there a way in Kubernetes to catch/find/watch all events of the same job (i.e. everything deployed under one manifest file) at once?
The short answer is no: pod events always arrive one after another.
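You can at least narrow the stream to a single job with a label selector (here using the jobId label from the manifest above), though events still arrive one by one; a minimal sketch:

from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# Still one event at a time, but restricted to pods of a single job.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace="my-namespace",
                      label_selector="jobId=5b2a6cd25b02821468e41571"):
    print(event["type"], event["object"].metadata.name)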
There is one approach that comes to mind:
Because pods that require the custom scheduler can't be scheduled by any other scheduler, your custom scheduler can collect the list of pods related to the same job and schedule them one after another, then move on to the list for the next job, as in the sketch below. This way you can ensure that resources intended for the pods of the first job are not allocated to a pod of another job before all pods of the first job have been scheduled to nodes.
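A minimal sketch of that all-or-nothing placement, assuming a hypothetical pick_node() helper that returns a node with enough free capacity (and accounts for capacity already reserved for earlier pods of the same job), or None:

from kubernetes import client

def schedule_job(v1, pods, pick_node):
    # First pass: find a node for every pod of the job, binding nothing yet.
    placements = []
    for pod in pods:
        node = pick_node(pod)        # hypothetical resource-aware helper
        if node is None:
            return False             # whole job can't be placed; leave it pending
        placements.append((pod, node))

    # Second pass: every pod fits, so bind them all.
    for pod, node in placements:
        binding = client.V1Binding(
            metadata=client.V1ObjectMeta(name=pod.metadata.name),
            target=client.V1ObjectReference(kind="Node",
                                            api_version="v1",
                                            name=node))
        v1.create_namespaced_binding(pod.metadata.namespace, binding)
    return True

Returning False without binding anything is what keeps the job from being scheduled partially; the pods simply stay pending until a later pass finds room for all of them.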
The events the scheduler receives carry annotations and labels. I didn't check the results of list_namespaced_pod or list_namespaced_event, but annotations and labels should be there as well. You can put the job's configuration in the annotations, such as the number of pods in the job, and give each pod labels identifying its place in the job (e.g. labels: {job_ID: 100, role: master, uid: xxx}, annotations: {job_ID: none, master: none, worker1: none, worker2: none}). When the scheduler sees the first pod with annotations for a job it doesn't know yet, it creates a new list of pods for that job ({job_ID: 100, master: xxx, worker1: none, worker2: none}). As further events arrive, the scheduler fills in this list from the pod labels and schedules only lists that are completely filled ({job_ID: 100, master: uid1, worker1: uid2, worker2: uid3}), as sketched below.
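A hedged sketch of that bookkeeping, assuming a hypothetical job_members annotation holding a JSON list of the roles that make up the job (the label and annotation keys here are illustrative, not a Kubernetes convention):

import json

jobs = {}  # job_ID -> {role: pod uid or None}

def track_pod(pod):
    """Record a pending pod; return the job's complete pod set once
    every role has been seen, else None."""
    labels = pod.metadata.labels or {}
    annotations = pod.metadata.annotations or {}
    job_id = labels.get("job_ID")
    if job_id is None:
        return None

    # First pod of a job: build the slot list from the annotation
    # describing the job's composition, e.g. ["master", "worker1", "worker2"].
    if job_id not in jobs:
        roles = json.loads(annotations["job_members"])  # hypothetical key
        jobs[job_id] = {role: None for role in roles}

    jobs[job_id][labels["role"]] = pod.metadata.uid

    # Only a completely filled list is handed over for scheduling.
    if all(uid is not None for uid in jobs[job_id].values()):
        return jobs.pop(job_id)
    return None

Once track_pod() returns a complete set, the all-or-nothing placement sketched earlier can bind the whole job in one go.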