I have a cluster running on GCP that currently consists entirely of preemptible nodes. We're experiencing issues where kube-dns becomes unavailable (presumably because a node has been preempted). We'd like to improve the resilience of DNS by moving the kube-dns pods to more stable nodes.
Is it possible to schedule system-cluster-critical pods like kube-dns (or all pods in the kube-system namespace) on a node pool of only non-preemptible nodes? I'm wary of using affinity, anti-affinity, or taints, since these pods are auto-created at cluster bootstrapping and any changes I make could be clobbered by a Kubernetes version upgrade. Is there a way to do this that will persist across upgrades?
The solution was to use taints and tolerations in conjunction with node affinity. We created a second, non-preemptible node pool and added a taint to the preemptible pool.
Terraform config:
resource "google_container_node_pool" "preemptible_worker_pool" {
  node_config {
    ...
    preemptible = true

    # Label the nodes so workloads can target this pool via node affinity.
    labels = {
      preemptible = "true"
      dedicated   = "preemptible-worker-pool"
    }

    # Taint the nodes so only pods with a matching toleration can be scheduled here.
    taint {
      key    = "dedicated"
      value  = "preemptible-worker-pool"
      effect = "NO_SCHEDULE"
    }
  }
}
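For reference, the second (non-preemptible) pool is just an ordinary node pool with no taint. Below is a minimal sketch, assuming a cluster resource named google_container_cluster.primary; the pool name, node count, and machine type are placeholders rather than part of our actual config:

  resource "google_container_node_pool" "stable_worker_pool" {
    name       = "stable-worker-pool"                     # placeholder name
    cluster    = google_container_cluster.primary.name    # assumed cluster resource
    node_count = 3

    node_config {
      machine_type = "n1-standard-2"
      # preemptible defaults to false, so these nodes are not preempted.
      # No taint is added, so kube-dns and other kube-system pods can schedule here.
    }
  }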
We then used a toleration and nodeAffinity to allow our existing workloads to run on the tainted node pool, effectively forcing the cluster-critical pods (which don't carry the toleration) onto the untainted, non-preemptible node pool. Because none of the auto-created kube-system manifests are modified, this setup persists across cluster version upgrades.
Kubernetes config:
spec:
  template:
    spec:
      # The affinity + tolerations sections together allow and enforce that the workers are
      # run on dedicated nodes tainted with "dedicated=preemptible-worker-pool:NoSchedule".
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: dedicated
                    operator: In
                    values:
                      - preemptible-worker-pool
      tolerations:
        - key: dedicated
          operator: "Equal"
          value: preemptible-worker-pool
          effect: "NoSchedule"
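You can check where the kube-dns pods end up with kubectl get pods -n kube-system -o wide. Note that a NoSchedule taint does not evict pods that are already running, so existing kube-dns pods only move to the non-preemptible pool once they are rescheduled.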