1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate

12/17/2021

My Kubernetes K3s cluster gives this error:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  17m   default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.
  Warning  FailedScheduling  17m   default-scheduler  0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.

In order to list the taints in the cluster I executed:

kubectl get nodes -o json | jq '.items[].spec'

which outputs:

{
  "podCIDR": "10.42.0.0/24",
  "podCIDRs": [
    "10.42.0.0/24"
  ],
  "providerID": "k3s://antonis-dell",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "node.kubernetes.io/disk-pressure",
      "timeAdded": "2021-12-17T10:54:31Z"
    }
  ]
}
{
  "podCIDR": "10.42.1.0/24",
  "podCIDRs": [
    "10.42.1.0/24"
  ],
  "providerID": "k3s://knodea"
}

When I use kubectl describe node antonis-dell I get:

Name:               antonis-dell
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=antonis-dell
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        csi.volume.kubernetes.io/nodeid: {"ch.ctrox.csi.s3-driver":"antonis-dell"}
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"f2:d5:6c:6a:85:0a"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.1.XX
                    k3s.io/hostname: antonis-dell
                    k3s.io/internal-ip: 192.168.1.XX
                    k3s.io/node-args: ["server"]
                    k3s.io/node-config-hash: YANNMDBIL7QEFSZANHGVW3PXY743NWWRVFKBKZ4FXLV5DM4C74WQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/e61cd97f31a54dbcd9893f8325b7133cfdfd0229ff3bfae5a4f845780a93e84c","K3S_KUBECONFIG_MODE":"644"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 17 Dec 2021 12:11:39 +0200
Taints:             node.kubernetes.io/disk-pressure:NoSchedule

where it seems that the node has a disk-pressure taint.

Removing the taint with kubectl taint node antonis-dell node.kubernetes.io/disk-pressure:NoSchedule- doesn't work, and it seems to me that even if it did, it would not be a good solution, because the taint is assigned by the control plane.

Furthermore, at the end of the output of kubectl describe node antonis-dell I observed this:

Events:
  Type     Reason               Age                  From     Message
  ----     ------               ----                 ----     -------
  Warning  FreeDiskSpaceFailed  57m                  kubelet  failed to garbage collect required amount of images. Wanted to free 32967806976 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  52m                  kubelet  failed to garbage collect required amount of images. Wanted to free 32500092928 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  47m                  kubelet  failed to garbage collect required amount of images. Wanted to free 32190205952 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  42m                  kubelet  failed to garbage collect required amount of images. Wanted to free 32196628480 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  37m                  kubelet  failed to garbage collect required amount of images. Wanted to free 32190926848 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed  2m21s (x7 over 32m)  kubelet  (combined from similar events): failed to garbage collect required amount of images. Wanted to free 30909374464 bytes, but freed 0 bytes

Could the disk pressure be related to this? How can I delete the unwanted images?

-- e7lT2P
k3s
kubectl
kubernetes
kubernetes-pod

1 Answer

12/27/2021

Posting the answer as a community wiki; feel free to edit and expand.


The node.kubernetes.io/disk-pressure:NoSchedule taint indicates, as its name suggests, that the node is under disk pressure.
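
You can confirm the condition behind the taint from the node status; a minimal check, using the node name from the question:

kubectl get node antonis-dell -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
kubectl describe node antonis-dell | grep DiskPressure

While the DiskPressure condition stays True, the control plane keeps re-adding the taint, so removing it manually does not help, as you already suspected.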

The kubelet detects disk pressure based on imagefs.available, imagefs.inodesFree, nodefs.available, and nodefs.inodesFree (Linux only) observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine whether the Node condition and taint should be added or removed.
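
For reference, on K3s these kubelet thresholds can be passed through --kubelet-arg; the values below only illustrate the syntax and are not recommended settings:

# Illustrative only: set hard eviction and image GC thresholds on a K3s server
k3s server \
  --kubelet-arg='eviction-hard=imagefs.available<15%,nodefs.available<10%,nodefs.inodesFree<5%' \
  --kubelet-arg=image-gc-high-threshold=85 \
  --kubelet-arg=image-gc-low-threshold=80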

More details on disk pressure are available in Efficient Node Out-of-Resource Management in Kubernetes, under the How Does Kubelet Decide that Resources Are Low? section:

memory.available — A signal that describes the state of cluster memory. The default eviction threshold for the memory is 100 Mi. In other words, the kubelet starts evicting Pods when the memory goes down to 100 Mi.

nodefs.available — The nodefs is a filesystem used by the kubelet for volumes, daemon logs, etc. By default, the kubelet starts reclaiming node resources if the nodefs.available < 10%.

nodefs.inodesFree — A signal that describes the state of the nodefs inode memory. By default, the kubelet starts evicting workloads if the nodefs.inodesFree < 5%.

imagefs.available — The imagefs filesystem is an optional filesystem used by a container runtime to store container images and container-writable layers. By default, the kubelet starts evicting workloads if the imagefs.available < 15%.

imagefs.inodesFree — The state of the imagefs inode memory. It has no default eviction threshold.
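
To see which of these signals is actually low, check disk and inode usage on the affected node; the paths below assume a default K3s installation (containerd data under /var/lib/rancher/k3s, kubelet data under /var/lib/kubelet):

# disk space used by the filesystems backing images/containers and volumes/logs
df -h /var/lib/rancher /var/lib/kubelet
# inode usage on the same filesystems
df -i /var/lib/rancher /var/lib/kubelet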


What to check

There are different things that can help, such as:

  • prune unused objects such as images (with the Docker CRI) - prune images.

    The docker image prune command allows you to clean up unused images. By default, docker image prune only cleans up dangling images, that is, images that are not tagged and are not referenced by any container (example commands are shown after this list).

  • check whether files/logs on the node take up a lot of space (the du command in the example below can help locate them).

  • any other reason why disk space is being consumed.
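
A sketch of the first two points, assuming either the Docker runtime or the containerd runtime that K3s ships by default (review the output before deleting anything):

# Docker CRI: remove dangling images; add -a to remove every image not used by a container
docker image prune
# K3s default (containerd): list images, then prune the ones no container references
k3s crictl images
k3s crictl rmi --prune
# Find large directories on the node's root filesystem (logs, caches, etc.)
sudo du -xh -d1 / | sort -h | tail -n 20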
-- moonkotte
Source: StackOverflow