We have a Kubernetes cluster (version 1.18.x) running on Ubuntu 18.04, and we mainly use it to run AI jobs.
We want the cluster to schedule jobs using a bin-packing policy (with NVIDIA GPU resources given the highest weight), and I set this up as described in this article. But after completing all the steps, pods can no longer be scheduled; they always get stuck in Pending.
Our command to run the scheduler is as below:
/opt/kube/bin/kube-scheduler --address=127.0.0.1 --kubeconfig=/etc/kubernetes/kube-scheduler.kubeconfig --leader-elect=true --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-min-version=VersionTLS12 --v=2 --config=/path/to/my_policy_file.yaml
Commands to restart kube-scheduler:
systemctl daemon-reload
systemctl stop kube-scheduler
systemctl start kube-scheduler
My Policy file:
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
profiles:
- schedulerName: kube-scheduler
  plugins:
    score:
      enabled:
      - name: RequestedToCapacityRatio
        weight: 100
  pluginConfig:
  - name: RequestedToCapacityRatio
    args:
      shape:
      - utilization: 0
        score: 0
      - utilization: 100
        score: 10
      resources:
      - name: cpu
        weight: 1
      - name: nvidia.com/gpu
        weight: 100
But after I apply this file to the default scheduler, it can no longer schedule pods; they always stay stuck in Pending. Here is the YAML file I use to test:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  # bash -c takes the whole command as a single string
  - command: ["/bin/bash", "-c", "sleep 3600"]
    image: ubuntu:18.04
    name: test
So how do I correctly turn on the bin-packing feature? Why can't the pod be scheduled?
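I solved this issue: I had typed the wrong scheduler name. The name of the Kubernetes default scheduler is default-scheduler, not kube-scheduler. A pod that does not set spec.schedulerName is assigned to default-scheduler, so a scheduler whose only profile is named kube-scheduler never picks it up and the pod stays Pending.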
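For anyone hitting the same problem, here is a sketch of the corrected configuration. It assumes everything else in the file from the question stays as it was; the only change is the schedulerName value in the profile.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
clientConnection:
  kubeconfig: /etc/kubernetes/kube-scheduler.kubeconfig
profiles:
# Changed from kube-scheduler: pods without spec.schedulerName
# are handled by the profile named default-scheduler.
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: RequestedToCapacityRatio
        weight: 100
  pluginConfig:
  - name: RequestedToCapacityRatio
    args:
      # Higher utilization gets a higher score, which packs pods together.
      shape:
      - utilization: 0
        score: 0
      - utilization: 100
        score: 10
      resources:
      - name: cpu
        weight: 1
      - name: nvidia.com/gpu
        weight: 100
```

After restarting kube-scheduler with this file, the test pod from the question should be picked up without any change. The alternative would be to keep the profile name kube-scheduler and set spec.schedulerName: kube-scheduler on every pod that should use it, but then pods that omit the field would still stay Pending.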