Autoscaler not scaling up, leaving nodes in NotReady state and pods in Unknown state

10/10/2018

I am running a cluster on GKE with a single node pool. It has 3 nodes and can scale from 1 to 99 nodes. The cluster uses the nginx-ingress controller.

On this cluster, I want to deploy apps. Each app is scoped to its own namespace and consists of 3 deployments and one ingress (defining the paths used to reach the application from the internet). Each deployment runs a single replica of a container.
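For context, a minimal sketch of what one such ingress might look like (the ingress name, service names, paths and port are placeholders, not the real manifest):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: bcd-ingress              # placeholder name
  namespace: bcd
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - http:
      paths:
      - path: /actions           # placeholder paths, services and port
        backend:
          serviceName: actions
          servicePort: 80
      - path: /core
        backend:
          serviceName: core
          servicePort: 80
      - path: /nlu
        backend:
          serviceName: nlu
          servicePort: 80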

Deploying a couple of apps works fine, but deploying many apps (requiring the node pool to scale up) breaks everything:

All pods start showing warnings (including those that were successfully deployed earlier):

kubectl get pods --namespace bcd
NAME                       READY     STATUS    RESTARTS   AGE
actions-664b7d79f5-7qdkw   1/1       Unknown   1          35m
actions-664b7d79f5-v8s2m   1/1       Running   1          18m
core-85cb74f89b-ns49z      1/1       Unknown   1          35m
core-85cb74f89b-qqzfp      1/1       Running   1          18m
nlu-77899ddbf-8pd7k        1/1       Running   1          27m

All nodes become NotReady:

kubectl get nodes
NAME                                              STATUS     ROLES     AGE       VERSION
gke-clients-projects-default-pool-f9af73d4-gzwr   NotReady   <none>    42m       v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-p5l2   NotReady   <none>    21m       v1.9.7-gke.6
gke-clients-projects-default-pool-f9af73d4-wnxc   NotReady   <none>    37m       v1.9.7-gke.6

Deleting the namespace to remove all resources from the cluster also seems to fail: even after a long while, the pods remain but are stuck in an Unknown state.

How can I safely add more apps and let the cluster autoscale?

-- znat
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

10/11/2018

The reason seems to be that, since the scheduler does not know the resources each pod needs, it schedules pods onto any available node, potentially exhausting a node's resources and putting the Docker daemon in an inconsistent state.

The solution is to specify resource requests and limits: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container
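For example, each container in the deployments can declare requests and limits. Below is a minimal sketch of one deployment; the image and the numbers are placeholders and should be sized to each app's actual usage:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: actions
  namespace: bcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: actions
  template:
    metadata:
      labels:
        app: actions
    spec:
      containers:
      - name: actions
        image: example.io/actions:latest   # placeholder image
        resources:
          requests:
            cpu: 100m                      # placeholder values; tune per app
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

With requests set, the scheduler only places a pod on a node that still has enough unreserved capacity, and the cluster autoscaler adds a node when no existing node fits; limits keep a single container from exhausting a node.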

-- znat
Source: StackOverflow