I am running a cluster on GKE with a single node pool. It has 3 nodes and can scale from 1 to 99 nodes. The cluster uses the nginx-ingress controller.
On this cluster, I want to deploy apps. An app is scoped to a namespace and consists of 3 deployments and one ingress (defining the paths through which the application is reached from the internet). Each deployment runs a single replica of a container.
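For illustration, the ingress for one app looks roughly like this (the service names are inferred from the pod names shown below; the paths and ports are placeholders):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app-ingress
  namespace: bcd
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - http:
      paths:
      - path: /actions
        backend:
          serviceName: actions    # assumed service name
          servicePort: 80         # assumed port
      - path: /core
        backend:
          serviceName: core
          servicePort: 80
      - path: /nlu
        backend:
          serviceName: nlu
          servicePort: 80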
Deploying a couple of apps works fine, but deploying many apps (which requires the node pool to scale up) breaks everything:

All pods start showing warnings, including those that were deployed successfully earlier:
kubectl get pods --namespace bcd
NAME                       READY     STATUS    RESTARTS   AGE
actions-664b7d79f5-7qdkw   1/1       Unknown   1          35m
actions-664b7d79f5-v8s2m   1/1       Running   1          18m
core-85cb74f89b-ns49z      1/1       Unknown   1          35m
core-85cb74f89b-qqzfp      1/1       Running   1          18m
nlu-77899ddbf-8pd7k        1/1       Running   1          27m
All nodes become NotReady:
kubectl get nodes
NAME                                              STATUS     ROLES     AGE   VERSION
gke-clients-projects-default-pool-f9af73d4-gzwr   NotReady   <none>    42m   v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-p5l2   NotReady   <none>    21m   v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-wnxc   NotReady   <none>    37m   v1.9.7-gke.6
Deleting the namespace to remove all resources from the cluster also seems to fail: after a long while, the pods remain active, still in an unknown state.
How can I safely add more apps and let the cluster autoscale?
The reason seems to be that, since the pods declare no resource requests, the scheduler places them on any available node, which can exhaust a node's resources and put the Docker daemon into an inconsistent state.
The solution is to specify resource requests and limits for every container: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
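A minimal sketch of what that looks like for one of the deployments; the image name and the numbers are purely illustrative, so tune them to what each container actually needs:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: core
  namespace: bcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: core
  template:
    metadata:
      labels:
        app: core
    spec:
      containers:
      - name: core
        image: my-registry/core:latest   # placeholder image
        resources:
          requests:            # what the scheduler reserves on a node
            cpu: 250m
            memory: 512Mi
          limits:              # hard cap enforced on the node
            cpu: 500m
            memory: 1Gi

With requests set, the scheduler only places a pod on a node that still has that much unreserved CPU and memory, and the cluster autoscaler adds a node when no existing node fits. You can check requested versus allocatable resources per node with kubectl describe node <name>.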