We had a situation where the k8s cluster (Kubernetes, or more specifically: ICP) was running out of pods after an update, resulting in "OutOfPods" error messages. The reason was a too-low "podsPerCore" setting, which we corrected afterwards. Until then, there were pods with an assigned priorityClass (1000000) that could not be scheduled, while others without a priorityClass (0) were scheduled. I expected a different behaviour: I thought the K8s scheduler would evict pods with no priority so that a pod with priority can be scheduled. Was I wrong?
That's just a question for my understanding, because I want to guarantee that the priority pods are running, no matter what.
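For reference, the "podsPerCore" setting mentioned above is a kubelet configuration option; a minimal sketch with purely illustrative values (not our actual settings):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# A node admits at most podsPerCore * (number of CPU cores) pods,
# additionally capped by maxPods; too low a value causes "OutOfPods".
podsPerCore: 10
maxPods: 110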
Thanks
Pod with Prio:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-12-16T13:39:21Z"
generateName: dms-config-server-555dfc56-
labels:
app: config-server
pod-template-hash: 555dfc56
release: dms-config-server
name: dms-config-server-555dfc56-2ssxb
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: dms-config-server-555dfc56
uid: c29c40e1-1da7-11ea-b646-005056a72568
resourceVersion: "65065735"
selfLink: /api/v1/namespaces/dms/pods/dms-config-server-555dfc56-2ssxb
uid: 7758e138-2009-11ea-9ff4-005056a72568
spec:
containers:
- env:
- name: CONFIG_SERVER_GIT_USERNAME
valueFrom:
secretKeyRef:
key: username
name: dms-config-server-git
- name: CONFIG_SERVER_GIT_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: dms-config-server-git
envFrom:
- configMapRef:
name: dms-config-server-app-env
- configMapRef:
name: dms-config-server-git
image: docker.repository..../infra/config-server:2.0.8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 90
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: config-server
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: 250m
memory: 600Mi
requests:
cpu: 10m
memory: 300Mi
securityContext:
capabilities:
drop:
- MKNOD
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 1000000
priorityClassName: infrastructure
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
Pod without Prio (just an example within the same namespace):
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: ibm-anyuid-hostpath-psp
creationTimestamp: "2019-09-10T09:09:28Z"
generateName: produkt-service-57d448979d-
labels:
app: produkt-service
pod-template-hash: 57d448979d
release: dms-produkt-service
name: produkt-service-57d448979d-4x5qs
namespace: dms
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: produkt-service-57d448979d
uid: 4096ab97-5cee-11e9-97a2-005056a72568
resourceVersion: "65065755"
selfLink: /api/v1/namespaces/dms/pods/produkt-service-57d448979d-4x5qs
uid: b112c5f7-d3aa-11e9-9b1b-005056a72568
spec:
containers:
- image: docker-snapshot.repository..../dms/produkt-service:0b6e0ecc88a28d2a91ffb1db61f8ca99c09a9d92
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
name: produkt-service
ports:
- containerPort: 8080
name: http
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /actuator/health
port: 8080
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
securityContext:
capabilities:
drop:
- MKNOD
procMount: Default
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-v7tpv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: kub-test-worker-02
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-v7tpv
secret:
defaultMode: 420
secretName: default-token-v7tpv
There could be many circumstances that alter how the scheduler works. There is documentation about it: Pod Priority and Preemption.
Be aware that this feature was deemed stable as of version 1.14.0.
From the IBM perspective, please keep in mind that version 1.13.9 will be supported until 19 February 2020.
You are correct that lower-priority pods can be preempted (evicted) so that higher-priority pods can be scheduled.
Let me elaborate on that with an example:
Let's assume a Kubernetes cluster with 3 nodes (1 master and 2 worker nodes).
This example is based on RAM usage, but the same applies to CPU time.
There are 2 priority classes:
YAML definition of zero priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: zero-priority
value: 0
globalDefault: false
description: "This is priority class for hello pod"
The globalDefault field controls whether a priority class is assigned to objects that have no priority class of their own; it is set to false here, so pods without a priorityClassName keep the default priority of 0.
YAML definition of high priority class:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This is priority class for goodbye pod"
To apply these priority classes you will need to invoke: $ kubectl apply -f FILE.yaml
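To double-check that both classes were created (illustrative output; built-in system classes may also appear and AGE values will differ):
$ kubectl get priorityclass
NAME            VALUE     GLOBAL-DEFAULT   AGE
high-priority   1000000   false            10s
zero-priority   0         false            12s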
With the above objects in place you can create deployments:
YAML definition of hello deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: hello
spec:
selector:
matchLabels:
app: hello
version: 1.0.0
replicas: 10
template:
metadata:
labels:
app: hello
version: 1.0.0
spec:
containers:
- name: hello
image: "gcr.io/google-samples/hello-app:1.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
Please take a close look at this fragment:
resources:
requests:
memory: "128Mi"
priorityClassName: zero-priority
The memory request limits how many of these pods fit on a node, and the priorityClassName assigns a low priority to this deployment.
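If you want to confirm which priority the running pods actually received, a query along these lines should work (the column layout is my own):
$ kubectl get pods -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority,CLASS:.spec.priorityClassName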
YAML definition of goodbye deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: goodbye
spec:
selector:
matchLabels:
app: goodbye
version: 2.0.0
replicas: 3
template:
metadata:
labels:
app: goodbye
version: 2.0.0
spec:
containers:
- name: goodbye
image: "gcr.io/google-samples/hello-app:2.0"
env:
- name: "PORT"
value: "50001"
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
Again, please take a close look at this fragment:
resources:
requests:
memory: "6144Mi"
priorityClassName: high-priority
These pods will have a much higher RAM request and a high priority.
There is not enough information here to properly troubleshoot issues like this; that requires extensive logs from many components, from the kubelet down to the pods, nodes and deployments themselves.
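Such information could be gathered, for example, with commands along these lines (all names are placeholders):
$ kubectl describe pod POD_NAME -n NAMESPACE            # scheduling events and conditions
$ kubectl describe node NODE_NAME                       # allocatable resources, taints, allocated requests
$ kubectl describe deployment DEPLOYMENT_NAME -n NAMESPACE
$ journalctl -u kubelet --since "1 hour ago"            # kubelet logs (on systemd-based nodes)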
Apply the hello deployment and see what happens: $ kubectl apply -f hello.yaml
Get basic information about the deployment with command:
$ kubectl get deployments hello
The output should look like this after a while:
NAME READY UP-TO-DATE AVAILABLE AGE
hello 10/10 10 10 9s
As you can see, all of the pods are ready and available, and the requested resources were assigned to them (10 pods x 128Mi = 1280Mi requested in total).
To get more details for troubleshooting purposes you can invoke:
$ kubectl describe deployment hello
$ kubectl describe node NAME_OF_THE_NODE
Example information about allocated resources from the above command:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 1280Mi (17%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Apply the goodbye deployment and see what happens: $ kubectl apply -f goodbye.yaml
Get basic information about the deployments with command: $ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 1/3 3 1 25s
hello 9/10 10 9 11m
As you can see, the goodbye deployment is there but only 1 of its pods is available, and despite goodbye having a much higher priority, most of the hello pods are still running (note that hello dropped from 10/10 to 9/10, which suggests one hello pod was already preempted to fit the goodbye pod that did get scheduled).
Why is that?
$ kubectl describe node NAME_OF_THE_NODE
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default goodbye-575968c8d6-bnrjc 0 (0%) 0 (0%) 6Gi (83%) 0 (0%) 15m
default hello-fdfb55c96-6hkwp 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
default hello-fdfb55c96-djrwf 0 (0%) 0 (0%) 128Mi (1%) 0 (0%) 27m
Take a look at the requested memory for the goodbye pod: it is 6Gi, as specified in the deployment.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (12%) 0 (0%)
memory 7296Mi (98%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
The requested memory is near 100% of the node's allocatable memory (6144Mi for the goodbye pod plus 9 x 128Mi for the hello pods = 7296Mi).
Describing the goodbye pod that is stuck in the Pending state yields some more information: $ kubectl describe pod NAME_OF_THE_POD_IN_PENDING_STATE
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 38s (x3 over 53s) default-scheduler 0/3 nodes are available: 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
The pending goodbye pod was not scheduled because its request could not be satisfied, yet there were still enough resources left for the hello pods. Preemption could not help here: even evicting every hello pod on that node would only free about 9 x 128Mi = 1152Mi, far short of the 6Gi the pending pod needs, and the remaining nodes are ruled out by taints.
There is, however, a situation in which lower-priority pods are killed so that higher-priority pods can be scheduled.
Change the requested memory for the goodbye pod to 2304Mi. This allows the scheduler to place all of the required pods (3 x 2304Mi = 6912Mi in total), but only by preempting hello pods to free up memory:
resources:
requests:
memory: "2304Mi"
You can delete the previous deployment and apply a new one with the memory request changed.
Then invoke: $ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
goodbye 3/3 3 3 5m59s
hello 3/10 10 3 48m
As you can see, all of the goodbye pods are now available.
The hello pods were reduced (preempted) to make space for the higher-priority goodbye pods.
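If you want to verify that preemption (and not just spare capacity) made room, you can check whether a pending high-priority pod was nominated to a node once victims were selected; the scheduler records this in the pod status:
$ kubectl get pod GOODBYE_POD_NAME -o jsonpath='{.status.nominatedNodeName}'
Depending on your version, kubectl get events in the namespace should also show the termination of the preempted hello pods.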