I am currently trying to understand how and why pods are (or are not) scheduled on the controller with kube-aws 0.9.6 (Kubernetes 1.6.2). After installing a clean stack, querying the kube-system namespace shows the following:
kubectl --kubeconfig=kubeconfig --namespace kube-system get pod
NAME                                                              READY   STATUS    RESTARTS   AGE
heapster-v1.3.0-690018220-tvr45                                   0/2     Pending   0          1h
kube-apiserver-ip-10-0-0-17.eu-west-1.compute.internal            1/1     Running   0          3h
kube-controller-manager-ip-10-0-0-17.eu-west-1.compute.internal   1/1     Running   0          3h
kube-dns-1455470676-tlrlf                                         0/3     Pending   0          3h
kube-dns-autoscaler-1106974149-xvdw5                              0/1     Pending   0          1h
kube-proxy-ip-10-0-0-17.eu-west-1.compute.internal                1/1     Running   0          3h
kube-scheduler-ip-10-0-0-17.eu-west-1.compute.internal            1/1     Running   0          1h
kubernetes-dashboard-v1.5.1-50n8s                                 1/1     Running   0          7s
As you can see, some of the pods are running and some are pending. The pending pods all report the following:
No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (1).
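(For reference, that message shows up in the pod's events; using the kube-dns pod name from the listing above, it can be seen with something like:
kubectl --kubeconfig=kubeconfig --namespace kube-system describe pod kube-dns-1455470676-tlrlf)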
Looking first at the node, I see the following taint:
Taints: node.alpha.kubernetes.io/role=master:NoSchedule
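(This comes from describing the controller node; the node name below is inferred from the pod names above, so treat it as an example:
kubectl --kubeconfig=kubeconfig describe node ip-10-0-0-17.eu-west-1.compute.internal | grep Taints)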
That is fine: the controller node is not schedulable. Next, I wanted to understand why some pods are scheduled while others aren't. Looking first at the kube-apiserver pod, we see:
tolerations:
- effect: NoExecute
  operator: Exists
First of all, this toleration does not appear anywhere in the controller user data, so I wonder where it comes from. But even if it is there, it makes no sense to me that a NoExecute toleration would satisfy a NoSchedule taint.
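As far as I can tell, a toleration that would actually tolerate this taint would have to match both its key and its effect, something along these lines (a sketch, not taken from the actual manifests):
tolerations:
- key: node.alpha.kubernetes.io/role
  operator: Exists
  effect: NoSchedule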
Then, if we look at the other pods that are in the Pending state, we see the following:
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
Here it is perfectly clear why they cannot be scheduled and stay pending: this toleration does not satisfy the taint.
From this point on, no matter what I do (short of satisfying the NoSchedule taint), nothing changes.
Adding a toleration with the NoExecute effect to any of the pending pods does not bring them up, which is correct, because it still does not satisfy the taint.
I can't find any justification for the api-server, controller-manager, proxy, and scheduler to be running rather than pending (I can't see anything special in the user data either).
Can anyone please explain to me what is going on?
Thanks
The tolerations and taints should be defined in the YAML of the deployed objects (e.g. the scheduler, controller, etc.). I wouldn't expect them to be in the UserData of the instance.
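You can check what each pod actually carries by pulling its spec, e.g. with the kube-dns pod name from your output:
kubectl --kubeconfig=kubeconfig --namespace kube-system get pod kube-dns-1455470676-tlrlf -o yaml | grep -A 4 tolerations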
Do you have any nodes in your cluster other than the master? It seems like the other add-ons (DNS, etc.) would run on worker nodes in the cluster, whereas the core components (scheduler, etc.) are set to run on the master.
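As a quick check, and, if you really do want those add-ons on the master, as a possible workaround, something along these lines should work (the node name is copied from your pod listing, so adjust it):
# list all nodes to see whether any schedulable workers exist
kubectl --kubeconfig=kubeconfig get nodes
# remove the NoSchedule taint from the master (the trailing "-" deletes the taint)
kubectl --kubeconfig=kubeconfig taint nodes ip-10-0-0-17.eu-west-1.compute.internal node.alpha.kubernetes.io/role:NoSchedule-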