I have a multi-regional testing setup on GKE k8s 1.9.4. Every cluster has:

- `kubemci`
- `system` (1 vCPU / 2 GB RAM)
- `frontend` (2 vCPU / 2 GB RAM)
- `backend` (1 vCPU / 600 MB RAM)

So things like `prometheus-operator`, `prometheus-server`, `custom-metrics-api-server` and `kube-state-metrics` are attached to a node with the `system` label. Frontend and backend pods are attached to nodes with the `frontend` and `backend` labels respectively (a single pod per node), via `podAntiAffinity`.
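For context, the one-pod-per-node spreading described above can be expressed with a `podAntiAffinity` rule roughly like this (a sketch only; the `app: frontend` label is an assumption for illustration):

```yaml
# Hypothetical sketch: repel frontend pods from nodes that already run one,
# so each node hosts at most a single frontend pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - frontend
        topologyKey: kubernetes.io/hostname
```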
After autoscaling scales the `backend` or `frontend` pods down, the nodes remain, because pods from the `kube-system` namespace (e.g. `heapster`) end up on them. This leads to a situation where a node with the `frontend` / `backend` label stays alive after downscaling even though there is no backend or frontend pod left on it.
The question is: how can I avoid creating `kube-system` pods on the nodes that serve my application (if this is really sane and possible)?

I guess I should use taints and tolerations for the `backend` and `frontend` nodes, but how can that be combined with HPA and the in-cluster node autoscaler?
The first thing I would recommend checking is that the amount of resources requested in your PodSpec is enough to carry the load, and that there are enough resources on the system nodes to schedule all system pods.
You may try to prevent scheduling system pods onto the frontend or backend autoscaled nodes using either the simpler `nodeSelector` or the more flexible Node Affinity. You can find a good explanation and examples in the document "Assigning Pods to Nodes".
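A minimal sketch of the simpler `nodeSelector` approach, assuming the `role=system` node label from the question, could look like this in a system pod's spec:

```yaml
# Hypothetical: pin a system pod to nodes carrying the role=system label.
spec:
  nodeSelector:
    role: system
```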
The Taints and Tolerations feature is similar to Node Affinity, but approaches the problem from the node's perspective: it allows a node to repel a set of pods. Check the document "Taints and Tolerations" if you choose this way.
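For an already-existing node, a taint can be added manually with `kubectl` (the node name below is hypothetical):

```
kubectl taint nodes gke-node-1 app=frontend:NoSchedule
```

After this, only pods that tolerate `app=frontend:NoSchedule` can be scheduled onto that node.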
When you create a node pool for autoscaling, you can add labels and taints, so they will be applied to nodes when the Cluster Autoscaler (CA) scales the pool up.
In addition to restricting `system` pods from scheduling on `frontend`/`backend` nodes, it would be a good idea to configure a `PodDisruptionBudget` and the autoscaler's `safe-to-evict` option for pods that could otherwise prevent CA from removing a node during downscale.
According to the Cluster Autoscaler FAQ, there are several types of pods that may prevent CA from downscaling your cluster:

> *Unless the pod has the following annotation (supported in CA 1.0.3 or later):
>
>     "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
>
> Prior to version 0.6, Cluster Autoscaler was not touching nodes that were running important kube-system pods like DNS, Heapster, Dashboard etc. If these pods landed on different nodes, CA could not scale the cluster down and the user could end up with a completely empty 3 node cluster. In 0.6, an option was added to tell CA that some system pods can be moved around. If the user configures a PodDisruptionBudget for a kube-system pod, then the default strategy of not touching the node running this pod is overridden with the PDB settings. So, to enable kube-system pods migration, one should set minAvailable to 0 (or <= N if there are N+1 pod replicas).
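As a sketch of such a PDB for a single-replica kube-system pod like heapster (the name and `k8s-app` selector label are assumptions; `policy/v1beta1` was the current API group on k8s 1.9):

```yaml
# Hypothetical PDB that allows CA to evict the heapster pod during downscale.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: heapster-pdb
  namespace: kube-system
spec:
  minAvailable: 0          # permit eviction even of the only replica
  selector:
    matchLabels:
      k8s-app: heapster    # assumed label; check your pod's actual labels
```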
See also: I have a couple of nodes with low utilization, but they are not scaled down. Why?

The Cluster Autoscaler FAQ can help you choose the correct version for your cluster. To get a better understanding of what lies under the hood of the Cluster Autoscaler, check the official documentation.
Seems like taints and tolerations did the trick.

Create a cluster with a default node pool (for monitoring and `kube-system`):

```
gcloud container --project "my-project-id" clusters create "app-europe" \
  --zone "europe-west1-b" --username="admin" --cluster-version "1.9.4-gke.1" --machine-type "custom-2-4096" \
  --image-type "COS" --disk-size "10" --num-nodes "1" --network "default" --enable-cloud-logging --enable-cloud-monitoring \
  --maintenance-window "01:00" --node-labels=region=europe-west1,role=system
```
Create a node pool for your application:

```
gcloud container --project "my-project-id" node-pools create "frontend" \
  --cluster "app-europe" --zone "europe-west1-b" --machine-type "custom-2-2048" --image-type "COS" \
  --disk-size "10" --node-labels=region=europe-west1,role=frontend \
  --node-taints app=frontend:NoSchedule \
  --enable-autoscaling --num-nodes "1" --min-nodes="1" --max-nodes="3"
```
Then add `nodeAffinity` and `tolerations` sections to the pod template spec in your deployment manifest:

```yaml
tolerations:
  - key: "app"
    operator: "Equal"
    value: "frontend"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: beta.kubernetes.io/instance-type
              operator: In
              values:
                - custom-2-2048
        - matchExpressions:
            - key: role
              operator: In
              values:
                - frontend
```