My cluster is currently down and I cannot launch new pods on it. I attempted to upgrade from 1.9.1 to 1.9.3 with kops and to add the PersistentVolumeClaimResize admission controller. As the rolling update proceeded, I noticed the new nodes were not coming online properly (even though the rolling update reported that they were), so I aborted it. I have found that the pods are complaining about:
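For context, the admission-controller change was made in the kops cluster spec. The snippet below is only a rough sketch of the kind of edit involved (made via `kops edit cluster`); the plugin list shown is illustrative, not the exact one from my cluster:

```yaml
# Hypothetical excerpt of the kops cluster spec (kops edit cluster).
# The intent was simply to append PersistentVolumeClaimResize to the
# admission controllers passed to kube-apiserver.
spec:
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount                # needs to remain in the list
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - NodeRestriction
    - ResourceQuota
    - PersistentVolumeClaimResize   # the newly added plugin
```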
open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
The kube-apiserver log is showing:
I0524 14:27:43.871432 1 rbac.go:116] RBAC DENY: user "system:kube-proxy" groups ["system:authenticated"] cannot "get" resource "nodes" named "ip-10-23-2-5.ec2.internal" cluster-wide
I0524 14:27:43.873562 1 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "list" resource "nodes" cluster-wide
I0524 14:27:43.873783 1 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "list" resource "services" cluster-wide
I0524 14:27:43.887303 1 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "replicasets.extensions" cluster-wide
I0524 14:27:43.887569 1 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "persistentvolumeclaims" cluster-wide
I0524 14:27:43.949818 1 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "list" resource "pods" cluster-wide
I0524 14:27:43.956233 1 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "statefulsets.apps" cluster-wide
I0524 14:27:43.958076 1 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "services" cluster-wide
I0524 14:27:43.958564 1 rbac.go:116] RBAC DENY: user "system:kube-scheduler" groups ["system:authenticated"] cannot "list" resource "nodes" cluster-wide
I0524 14:27:43.972226 1 rbac.go:116] RBAC DENY: user "kubelet" groups ["system:nodes" "system:authenticated"] cannot "create" resource "nodes" cluster-wide
Please help
Finally resolved this issue. The errors in the apiserver log are misleading; they persist because certain pods do not have a service account with the proper permissions associated with them.
The fundamental problem was that the rolling upgrade left one master 'Ready' even though its apiserver was running without the ServiceAccount admission controller. New pods were being routed through that apiserver and, with no service account token mounted, never came up. I resolved the issue by correcting the admissionControl setting across all masters.
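For anyone hitting the same symptom, this is roughly what the corrected spec looks like; treat it as a sketch rather than the exact configuration, since the full plugin list depends on your cluster and kops version. The important part is that ServiceAccount is present alongside PersistentVolumeClaimResize:

```yaml
# Corrected kops cluster spec excerpt (kops edit cluster). Plugin names
# other than ServiceAccount and PersistentVolumeClaimResize are
# illustrative 1.9-era defaults, not a definitive list.
spec:
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount                # without this, pods get no token mounted
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - NodeRestriction
    - ResourceQuota
    - PersistentVolumeClaimResize
```

Applying this typically means running `kops update cluster --yes` and then rolling (or otherwise restarting) each master so every kube-apiserver picks up the full list; once ServiceAccount is active on all masters, new pods get their tokens again and the RBAC DENY lines in the log can be ignored.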