I am trying to have Kubernetes create new pods on the most requested nodes instead of spreading the load across the available nodes. The rationale is that this simplifies scale-down scenarios and avoids relaunching applications when pods get moved because a node is killed during autoscaling.
The preferred strategy for scaling down is:
1) Never kill a node that has any running pod.
2) New pods are created preferentially on the most requested nodes.
3) Pods self-destruct after job completion.
This should, over time, leave nodes free once their tasks are completed, so scaling down will be safe and I don't need to worry about the resilience of the running jobs.
For this, is there any way I can specify the NodeAffinity in the pod spec, something like:
spec:
  affinity:
    nodeAffinity:
      RequiredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        nodeAffinityTerm: {MostRequestedPriority}
The above code has no effect. The documentation for NodeAffinity doesn't say whether MostRequestedPriority can be used in this context. MostRequestedPriority is an option in the Kubernetes scheduler's policy configuration, but I am trying to see if I can put it directly in the pod spec instead of creating a new custom Kubernetes scheduler.
Unfortunately there is no option to pass MostRequestedPriority to the nodeAffinity field. However, you can create a simple second scheduler to manage pod scheduling. The following configuration is enough.
First, you have to create a ServiceAccount and a ClusterRoleBinding for this scheduler:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: own-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: own-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: own-scheduler
  namespace: kube-system
Then create a ConfigMap with the desired policy, including MostRequestedPriority in its priorities. Each entry in predicates can be adjusted to suit your needs; the predicates filter the nodes to find where a pod can be placed. For example, the PodFitsResources predicate checks whether a node has enough available resources to meet a Pod's specific resource requests:
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    k8s-addon: scheduler.addons.k8s.io
  name: own-scheduler
  namespace: kube-system
data:
  policy.cfg: |-
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "PodFitsHostPorts"},
        {"name" : "PodFitsResources"},
        {"name" : "NoDiskConflict"},
        {"name" : "PodMatchNodeSelector"},
        {"name" : "PodFitsHost"}
      ],
      "priorities" : [
        {"name" : "MostRequestedPriority", "weight" : 1},
        {"name" : "EqualPriorityMap", "weight" : 1}
      ]
    }
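MostRequestedPriority scores nodes by how much of their allocatable CPU and memory is already requested, so new pods get packed onto the busiest nodes, which is the bin-packing behaviour you are after. For that scoring (and for the PodFitsResources predicate) to be meaningful, your pods should declare resource requests. A minimal sketch, where the pod name, image, and request values are purely illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: requests-demo            # illustrative name
spec:
  containers:
  - name: worker                 # illustrative name
    image: busybox               # illustrative image
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "500m"              # PodFitsResources filters out nodes that cannot satisfy these requests;
        memory: "256Mi"          # MostRequestedPriority then prefers nodes with the highest requested/capacity ratio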
Then wrap the scheduler itself in a Deployment that consumes this policy ConfigMap:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: own-scheduler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: scheduler
      tier: control-plane
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      serviceAccountName: own-scheduler
      containers:
      - command:
        - /usr/local/bin/kube-scheduler
        - --address=0.0.0.0
        - --leader-elect=false
        - --scheduler-name=own-scheduler
        - --policy-configmap=own-scheduler
        image: k8s.gcr.io/kube-scheduler:v1.15.4
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10251
          initialDelaySeconds: 15
        name: kube-second-scheduler
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10251
        resources:
          requests:
            cpu: '0.1'
        securityContext:
          privileged: false
        volumeMounts: []
      hostNetwork: false
      hostPID: false
      volumes: []
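Once this scheduler is running, a pod opts in to it by setting schedulerName in its spec; pods that don't set it keep using the default scheduler. A minimal sketch, with an illustrative pod name, image, and request values:
apiVersion: v1
kind: Pod
metadata:
  name: packed-job               # illustrative name
spec:
  schedulerName: own-scheduler   # must match the --scheduler-name flag above
  containers:
  - name: worker                 # illustrative name
    image: busybox               # illustrative image
    command: ["sh", "-c", "echo working; sleep 30"]
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"
If the pod stays Pending or lands on an unexpected node, kubectl describe pod shows which scheduler handled it in the Scheduled event.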