As we all know, AWS has a per-node limit on how many pod IPs can be allocated (it depends on the instance type's ENI capacity), and Kubernetes doesn't take this into account while scheduling. Pods get scheduled onto nodes where no more pod IPs can be allocated and get stuck in the ContainerCreating state, with events like the following:
Normal Scheduled 114s default-scheduler Successfully assigned default/whoami-deployment-9f9c86c4f-r4flx to ip-192-168-15-248.ec2.internal
Warning FailedCreatePodSandBox 111s kubelet, ip-192-168-15-248.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8d4b5f98f9b600ad9ec486f994fa2f9223d5224842df7f78802616f014b52970" network for pod "whoami-deployment-9f9c86c4f-r4flx": NetworkPlugin cni failed to set up pod "whoami-deployment-9f9c86c4f-r4flx_default" network: add cmd: failed to assign an IP address to container
Normal SandboxChanged 86s (x12 over 109s) kubelet, ip-192-168-15-248.ec2.internal Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 61s (x4 over 76s) kubelet, ip-192-168-15-248.ec2.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e2a3c54ba7d9a33a45248f7c276f4a2d5b0c8ba6c3deb5184392156b35638553" network for pod "whoami-deployment-9f9c86c4f-r4flx": NetworkPlugin cni failed to set up pod "whoami-deployment-9f9c86c4f-r4flx_default" network: add cmd: failed to assign an IP address to container
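For context, with the AWS VPC CNI the per-node limit comes out to roughly #ENIs * (IPs per ENI - 1) + 2 pods; a t3.medium, for example, tops out at 3 * (6 - 1) + 2 = 17 pods. As a rough check (the node name here is just the one from the events above), that limit can be compared with what the node actually advertises as allocatable:

kubectl get node ip-192-168-15-248.ec2.internal -o jsonpath='{.status.allocatable.pods}'

If the allocatable pod count is higher than the ENI-based limit, the scheduler will overcommit pod IPs exactly as shown above.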
So I tried to work around the issue by tainting the full nodes with key=value:NoSchedule, so that the default scheduler wouldn't schedule pods onto nodes that had already reached their pod IP limit, and I deleted all the pods that were stuck in ContainerCreating. I hoped the scheduler would stop placing new pods on the tainted nodes, and that part worked. But since the pods were now unschedulable, I also expected cluster-autoscaler to scale up the ASG so my pods could run on new nodes, and that is what didn't happen.
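For reference, the taint was applied roughly like this (node name taken from the events above; the key/value pair is just a placeholder):

kubectl taint nodes ip-192-168-15-248.ec2.internal key=value:NoSchedule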
When I describe the pod I see:
Warning FailedScheduling 40s (x5 over 58s) default-scheduler 0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.
Normal NotTriggerScaleUp 5s (x6 over 56s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had taints that the pod didn't tolerate
When I look at the cluster-autoscaler logs I see:
I1108 16:30:47.521026 1 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"whoami-deployment-9f9c86c4f-x5h4d", UID:"158cc806-0245-11ea-a67a-0efb4254edc4", APIVersion:"v1", ResourceVersion:"2483839", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had taints that the pod didn't tolerate
Next, I tried an alternative way to mark my nodes unschedulable: I removed the NoSchedule taint above and patched the nodes with:
kubectl patch nodes node1.internal -p '{"spec": {"unschedulable": true}}'
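As far as I understand, this patch is equivalent to cordoning the node:

kubectl cordon node1.internal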
And these are the logs I see in cluster-autoscaler:
I1109 10:47:50.894680 1 static_autoscaler.go:138] Starting main loop
W1109 10:47:50.894719 1 static_autoscaler.go:562] Cluster has no ready nodes.
I1109 10:47:50.901157 1 event.go:209] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"7c949105-0153-11ea-9a39-12e5fc698b6e", APIVersion:"v1", ResourceVersion:"2629645", FieldPath:""}): type: 'Warning' reason: 'ClusterUnhealthy' Cluster has no ready nodes.
So my idea for overcoming the issue made no sense to the autoscaler. How should I overcome this?
Kubernetes version: 1.14
Cluster Autoscaler: 1.14.6
Let me know if you guys need more details.