K8s pod priority & OutOfPods

1/8/2020

We had a situation where the k8s cluster was running out of pods after an update (Kubernetes, or more specifically: ICP), resulting in "OutOfPods" error messages. The reason was a too-low "podsPerCore" setting, which we corrected afterwards. Until then, there were pods with a provided priorityClass (1000000) that could not be scheduled, while others without a priorityClass (0) were scheduled. I had assumed a different behaviour: I thought the K8s scheduler would kill pods with no priority so that a pod with priority can be scheduled. Was I wrong?
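For reference, podsPerCore is part of the kubelet configuration; a minimal sketch of the relevant fields (the values here are illustrative, not the ones from our cluster):

```yaml
# Fragment of a KubeletConfiguration (kubelet.config.k8s.io/v1beta1).
# podsPerCore caps the number of pods per CPU core; 0 disables the cap.
# The effective pod limit is the lower of podsPerCore * cores and maxPods.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podsPerCore: 10
maxPods: 110
```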

That's just a question for understanding, because I want to guarantee that the priority pods are running, no matter what.

Thanks


Pod with Prio:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: ibm-anyuid-hostpath-psp
  creationTimestamp: "2019-12-16T13:39:21Z"
  generateName: dms-config-server-555dfc56-
  labels:
    app: config-server
    pod-template-hash: 555dfc56
    release: dms-config-server
  name: dms-config-server-555dfc56-2ssxb
  namespace: dms
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: dms-config-server-555dfc56
    uid: c29c40e1-1da7-11ea-b646-005056a72568
  resourceVersion: "65065735"
  selfLink: /api/v1/namespaces/dms/pods/dms-config-server-555dfc56-2ssxb
  uid: 7758e138-2009-11ea-9ff4-005056a72568
spec:
  containers:
  - env:
    - name: CONFIG_SERVER_GIT_USERNAME
      valueFrom:
        secretKeyRef:
          key: username
          name: dms-config-server-git
    - name: CONFIG_SERVER_GIT_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: dms-config-server-git
    envFrom:
    - configMapRef:
        name: dms-config-server-app-env
    - configMapRef:
        name: dms-config-server-git
    image: docker.repository..../infra/config-server:2.0.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 90
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: config-server
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: 250m
        memory: 600Mi
      requests:
        cpu: 10m
        memory: 300Mi
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-v7tpv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kub-test-worker-02
  priority: 1000000
  priorityClassName: infrastructure
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-v7tpv
    secret:
      defaultMode: 420
      secretName: default-token-v7tpv

Pod without Prio (just an example within the same namespace):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: ibm-anyuid-hostpath-psp
  creationTimestamp: "2019-09-10T09:09:28Z"
  generateName: produkt-service-57d448979d-
  labels:
    app: produkt-service
    pod-template-hash: 57d448979d
    release: dms-produkt-service
  name: produkt-service-57d448979d-4x5qs
  namespace: dms
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: produkt-service-57d448979d
    uid: 4096ab97-5cee-11e9-97a2-005056a72568
  resourceVersion: "65065755"
  selfLink: /api/v1/namespaces/dms/pods/produkt-service-57d448979d-4x5qs
  uid: b112c5f7-d3aa-11e9-9b1b-005056a72568
spec:
  containers:
  - image: docker-snapshot.repository..../dms/produkt-service:0b6e0ecc88a28d2a91ffb1db61f8ca99c09a9d92
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: produkt-service
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /actuator/health
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    securityContext:
      capabilities:
        drop:
        - MKNOD
      procMount: Default
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-v7tpv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kub-test-worker-02
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-v7tpv
    secret:
      defaultMode: 420
      secretName: default-token-v7tpv
-- Fdot
kubernetes

1 Answer

1/10/2020

There are a lot of circumstances that can alter the work of the scheduler. There is documentation covering this: Pod priority and preemption.

Be aware of the fact that this feature was deemed stable in version 1.14.0.

From the IBM perspective, please keep in mind that version 1.13.9 will only be supported until the 19th of February 2020!

You are correct that pods with lower priority can be preempted (evicted) to make room for higher-priority pods.

Let me elaborate on that with an example:

Example

Let's assume a Kubernetes cluster with 3 nodes (1 master and 2 workers):

  • By default you cannot schedule normal pods on the master node.
  • The only worker node that can schedule pods has 8GB of RAM.
  • The 2nd worker node has a taint that disables scheduling.

This example is based on RAM usage, but the same reasoning applies to CPU time.

Priority Class

There are 2 priority classes:

  • zero-priority (0)
  • high-priority (1 000 000)

YAML definition of the zero-priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: zero-priority
value: 0
globalDefault: false
description: "This is priority class for hello pod"

globalDefault: false means this class is not applied automatically. If it were set to true, pods that do not have an assigned priority class would receive this class by default.
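For contrast, a class that would act as the cluster-wide default could look like this (the name and value are made up for illustration):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: hypothetical-default    # illustrative name, not part of the example above
value: 1000
globalDefault: true             # pods without a priorityClassName get value 1000
description: "Hypothetical cluster-wide default priority class"
```

Note that at most one PriorityClass in the cluster may have globalDefault set to true.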

YAML definition of high priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This is priority class for goodbye pod"

To apply these priority classes you will need to invoke: $ kubectl apply -f FILE.yaml

Deployments

With the above objects you can create two deployments:

  • Hello - deployment with low priority
  • Goodbye - deployment with high priority

YAML definition of hello deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  selector:
    matchLabels:
      app: hello
      version: 1.0.0
  replicas: 10
  template:
    metadata:
      labels:
        app: hello
        version: 1.0.0
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:1.0"
        env:
        - name: "PORT"
          value: "50001"
        resources:
          requests:
            memory: "128Mi"
      priorityClassName: zero-priority

Please take a closer look at this fragment:

        resources:
          requests:
            memory: "128Mi"
      priorityClassName: zero-priority

It limits the number of pods that fit on the node via the requested resources, and it assigns the low priority class to this deployment.
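As a rough sanity check of how far 128Mi requests go on an 8GB node (a sketch; real allocatable memory is lower because of system reservations):

```python
# Back-of-the-envelope: how many 128Mi pods fit into 8Gi of memory?
# Real clusters reserve memory for the system, so the true number is lower.
allocatable_mi = 8 * 1024      # 8 GiB expressed in MiB
hello_request_mi = 128         # per-pod memory request of the hello deployment

max_pods = allocatable_mi // hello_request_mi
print(max_pods)                # 64 -> 10 replicas (1280Mi total) fit comfortably
```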

YAML definition of goodbye deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: goodbye
spec:
  selector:
    matchLabels:
      app: goodbye
      version: 2.0.0
  replicas: 3
  template:
    metadata:
      labels:
        app: goodbye
        version: 2.0.0
    spec:
      containers:
      - name: goodbye
        image: "gcr.io/google-samples/hello-app:2.0"
        env:
        - name: "PORT"
          value: "50001"
        resources:
          requests:
            memory: "6144Mi"
      priorityClassName: high-priority

Again, please take a closer look at this fragment:

        resources:
          requests:
            memory: "6144Mi"
      priorityClassName: high-priority

These pods have a much higher RAM request and a high priority.

Testing and troubleshooting

There is not enough information in the question to troubleshoot issues like this properly; that would require extensive logs from many components, from the kubelet to the pods, nodes, and deployments themselves.

Apply the hello deployment and see what happens: $ kubectl apply -f hello.yaml

Get basic information about the deployment with command:

$ kubectl get deployments hello

After a while the output should look like this:

NAME    READY   UP-TO-DATE   AVAILABLE   AGE
hello   10/10   10           10          9s

As you can see, all of the pods are ready and available, and the requested resources were assigned to them.

To get more details for troubleshooting purposes you can invoke:

  • $ kubectl describe deployment hello
  • $ kubectl describe node NAME_OF_THE_NODE

Example information about allocated resources from the above command:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                250m (12%)    0 (0%)
  memory             1280Mi (17%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)

Apply the goodbye deployment and see what happens: $ kubectl apply -f goodbye.yaml

Get basic information about the deployments with command: $ kubectl get deployments

NAME      READY   UP-TO-DATE   AVAILABLE   AGE
goodbye   1/3     3            1           25s
hello     9/10    10           9           11m

As you can see, the goodbye deployment is there, but only 1 of its pods is available. And despite the fact that goodbye has a much higher priority, the hello pods are still running.

Why is that?

$ kubectl describe node NAME_OF_THE_NODE

Non-terminated Pods:          (13 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  default                     goodbye-575968c8d6-bnrjc                    0 (0%)        0 (0%)      6Gi (83%)        0 (0%)         15m
  default                     hello-fdfb55c96-6hkwp                       0 (0%)        0 (0%)      128Mi (1%)       0 (0%)         27m
  default                     hello-fdfb55c96-djrwf                       0 (0%)        0 (0%)      128Mi (1%)       0 (0%)         27m

Take a look at the requested memory for the goodbye pod: it is 6Gi, as described above.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                250m (12%)    0 (0%)
  memory             7296Mi (98%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)
Events:              <none>

The memory usage is near 100%.

Getting information about a specific goodbye pod that is in the Pending state will yield some more information: $ kubectl describe pod NAME_OF_THE_POD_IN_PENDING_STATE

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  38s (x3 over 53s)  default-scheduler  0/3 nodes are available: 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.

The second goodbye pod was not scheduled because there were not enough free resources to satisfy its request, while the resources that remained were still sufficient for the hello pods.
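The numbers also explain why no preemption happened: the scheduler only evicts lower-priority pods when doing so would actually make the pending pod schedulable, and here it would not (a sketch using the request values from this example):

```python
# Preemption only fires if evicting lower-priority pods would make the
# pending pod schedulable. With these requests it would not:
hello_request_mi = 128
goodbye_request_mi = 6144

# Memory that evicting all nine remaining hello pods could free at most:
freed_mi = 9 * hello_request_mi
print(freed_mi)                        # 1152 -> far less than the 6144Mi needed
print(freed_mi < goodbye_request_mi)   # True: preemption cannot help here
```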

It is, however, possible to create a situation in which lower-priority pods are killed so that the higher-priority pods can be scheduled.

Change the requested memory for the goodbye pods to 2304Mi. This will allow the scheduler to place all of the required pods (3):

        resources:
          requests:
            memory: "2304Mi"

Delete the previous deployment and apply the new one with the changed memory parameter.

Invoke command: $ kubectl get deployments

NAME      READY   UP-TO-DATE   AVAILABLE   AGE
goodbye   3/3     3            3           5m59s
hello     3/10    10           3           48m

As you can see, all of the goodbye pods are available.

The hello pods were reduced to make room for the pods with higher priority (goodbye).
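The arithmetic behind the new state (a sketch, assuming the node's usable memory is around the 7296Mi / 98% figure observed earlier):

```python
# After lowering the goodbye request to 2304Mi, all three replicas fit,
# and preemption trims the hello pods down to whatever still fits.
goodbye_request_mi = 2304
hello_request_mi = 128
usable_mi = 7296               # assumption: the ~98% allocation observed above

goodbye_total = 3 * goodbye_request_mi
print(goodbye_total)                                     # 6912
print((usable_mi - goodbye_total) // hello_request_mi)   # 3 hello pods remain
```

This matches the `kubectl get deployments` output above: goodbye 3/3, hello 3/10.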

-- Dawid Kruk
Source: StackOverflow