Kubernetes Python Client - Find pending jobs and schedule all their pods at once, or how to schedule a pending job

6/27/2018

I'm using Kubernetes 1.7 and Python Client 2.0. I have a hello-world machine learning program (MNIST in TensorFlow) running in a K8s cluster, with one Worker and one Parameter Server. It is deployed as kind: Job (in the manifest). The custom scheduler, written in Python, watches for pending pods using list_namespaced_pod and schedules them based on the availability of resources. Since the events come in as a stream, how can I make sure that all pending pods under one job get scheduled together? In other words, I don't want to schedule a job partially: either all pods of a pending job get scheduled or none of them do.

Also, is there a way in Kubernetes to catch/find/watch all events of the same job (i.e. deployed under one manifest file) at a time? I also tried list_namespaced_event, but it too reports events one after another. As a result, it can happen that one pod of the job gets scheduled while a later one cannot be. A small version of the custom scheduler is available here.
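For reference, a minimal sketch of what such a watch loop can look like with the Kubernetes Python client, assuming the pods carry the jobId label from the manifests below and request the custom scheduler via spec.schedulerName (the scheduler name and the grouping are illustrative assumptions, not the actual scheduler code):

from collections import defaultdict
from kubernetes import client, config, watch

config.load_kube_config()                  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "my-namespace"
SCHEDULER_NAME = "my-custom-scheduler"     # assumed value of spec.schedulerName

pending_by_job = defaultdict(list)         # jobId label -> names of pending pods

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace=NAMESPACE,
                      field_selector="status.phase=Pending"):
    pod = event["object"]
    # Skip pods that are handled by another scheduler
    if pod.spec.scheduler_name != SCHEDULER_NAME:
        continue
    job_id = (pod.metadata.labels or {}).get("jobId")
    if job_id:
        pending_by_job[job_id].append(pod.metadata.name)
        # Scheduling decisions can then be made per job_id instead of per pod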

my-mnist.yml file (a smaller version)

---

apiVersion: batch/v1
kind: Job
metadata:
  name: my-ps 
  labels:
    name: my-ps 
    jobName: my-ps-mnist_dist
  namespace: my-namespace
spec:
  template:
    metadata:
      labels:
        name: my-ps 
        jobName: my-ps-mnist_dist
        jobId: 5b2a6cd25b02821468e41571
        manifestFile: my-mnist.yml
        jobTrainingType: distributed
        jobTaskName: "my-ps"
        jobTaskIndex: "0"
        jobWorkerInstances: "1"
      namespace: my-namespace
    spec:
      nodeSelector:
        gpu: "no"
        dlts: "yes"
      containers:
        - name: my-ps
          image: "123.456.789.10:1234/myimg/5b2a6cd25b02821468e41571"
          imagePullPolicy: Always
          tty: true
          stdin: true
          env:
            - name: JOB_TASK_NAME
              value: "ps"
            - name: JOB_ID
              value: "5b2a6cd25b02821468e41571"
            - name: JOB_LD_LIBRARY_PATH
              value: "/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia:/usr/local/cuda-9.0/targets/x86_64-linux/lib"
            - name: JOB_PYTHON_VERSION
              value: "3"


---

apiVersion: batch/v1
kind: Job 
metadata:
  name: my-wkr
  labels:
    name: my-wkr
    jobName: wkr0-mnist_dist
  namespace: my-namespace
spec:
  template:
    metadata:
      labels:
        name: my-wkr
        jobName: wkr0-mnist_dist
        jobId: 5b2a6cd25b02821468e41571
        manifestFile: my-mnist.yml
        jobTrainingType: distributed
        jobTaskName: "worker"
        jobTaskIndex: "0"
        jobWorkerInstances: "1" 
      namespace: my-namespace
    spec:
      nodeSelector:
        gpu: "yes"
        dlts: "yes"
      containers:
        - name: my-wkr
          image: "123.456.789.10:1234/myimg/5b2a6cd25b02821468e41571" 
          imagePullPolicy: Always
          tty: true
          stdin: true
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 2
          env:
            - name: JOB_TASK_NAME
              value: "worker"
            - name: JOB_TASK_INDEX
              value: "0"
            - name: JOB_ID
              value: "5b2a6cd25b02821468e41571"
            - name: JOB_LD_LIBRARY_PATH
              value: "/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia:/usr/local/cuda-9.0/targets/x86_64-linux/lib"
            - name: JOB_PYTHON_VERSION
              value: "3"
-- Abu Shoeb
kubernetes
python
yaml

1 Answer

6/28/2018

Also, is there a way in Kubernetes to catch/find/watch all events of the same job (i.e. deployed under one manifest file) at a time?

The short answer is no. Pod events, in any case, arrive one after another.

One option comes to mind:
Because pods that require the custom scheduler can't be scheduled by any other scheduler, your custom scheduler can collect the list of pods related to the same job and schedule them one after another, then move on to the list for the next job. This way you can ensure that resources intended for the pods of the first job will not be allocated to a pod of another job before all pods related to the first job are scheduled to nodes (a rough sketch of this follows below).
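A rough sketch of that ordering, assuming the pods of one job have already been collected (for example into the pending_by_job dict from the question's sketch) and that pick_node() stands in for whatever resource check the scheduler already performs:

def bind(pod_name, node_name, namespace="my-namespace"):
    # Binding a pod to a node is what "scheduling" amounts to at the API level
    target = client.V1ObjectReference(kind="Node", api_version="v1", name=node_name)
    meta = client.V1ObjectMeta(name=pod_name)
    body = client.V1Binding(metadata=meta, target=target)
    # Some client versions raise a harmless deserialization error on this call;
    # _preload_content=False is a common workaround
    v1.create_namespaced_binding(namespace, body, _preload_content=False)

def schedule_job(job_id, pod_names):
    # Schedule every pod of one job before touching the next job's list
    nodes = [pick_node(p) for p in pod_names]   # pick_node() is a placeholder
    if any(node is None for node in nodes):
        return False                            # not enough resources: schedule nothing
    for pod_name, node_name in zip(pod_names, nodes):
        bind(pod_name, node_name)
    return True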

There are annotations and labels in the event the scheduler receives. I didn't check the results of list_namespaced_pod or list_namespaced_event, but I think the annotations and labels should be there as well. It is possible to put the configuration of the job into the annotations, such as the number of pods in the job or the labels of each pod in the job (e.g. labels: {job_ID:100, role:master, uid:xxx}, annotations: {job_ID:none, master:none, worker1:none, worker2:none}). When the scheduler sees the first pod with annotations for a job it doesn't know yet, it creates a new list of pods for that job ({job_ID:100, master:xxx, worker1:none, worker2:none}). As further events appear, the scheduler fills this list using the pod labels and schedules only lists that are completely filled ({job_ID:100, master:uid1, worker1:uid2, worker2:uid3}). A sketch of this bookkeeping follows below.
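Continuing the sketches above, a minimal version of that bookkeeping might look like this; the expectedPods annotation and the role key built from the jobTaskName/jobTaskIndex labels are illustrative assumptions, not something the manifests above actually set:

expected_pods = {}              # job_id -> number of pods the job should consist of
collected = defaultdict(dict)   # job_id -> {pod role: pod name}

def on_pending_pod(pod):
    labels = pod.metadata.labels or {}
    annotations = pod.metadata.annotations or {}
    job_id = labels.get("jobId")
    if not job_id:
        return
    # "expectedPods" is a hypothetical annotation carrying the job's total pod count
    if job_id not in expected_pods and "expectedPods" in annotations:
        expected_pods[job_id] = int(annotations["expectedPods"])
    role = labels.get("jobTaskName", "") + labels.get("jobTaskIndex", "")
    collected[job_id][role] = pod.metadata.name
    # Schedule only once every pod of the job has shown up, otherwise keep waiting
    if job_id in expected_pods and len(collected[job_id]) == expected_pods[job_id]:
        if schedule_job(job_id, list(collected[job_id].values())):
            del collected[job_id]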

-- VAS
Source: StackOverflow