Rancher: kube-system pods stuck on ContainerCreating

8/28/2020

I'm trying to spin up a cluster with one node (a VM), but several pods, including ones in kube-system, are stuck in ContainerCreating:

> kubectl get pods,svc -owide --all-namespaces
NAMESPACE       NAME                                          READY   STATUS              RESTARTS   AGE     IP            NODE            NOMINATED NODE   READINESS GATES
cattle-system   pod/cattle-cluster-agent-7db88c6b68-bz5dp     0/1     ContainerCreating   0          7m13s   <none>        hdn-dev-app66   <none>           <none>
cattle-system   pod/cattle-node-agent-ccntw                   1/1     Running             0          7m13s   10.105.1.76   hdn-dev-app66   <none>           <none>
cattle-system   pod/kube-api-auth-9kdpw                       1/1     Running             0          7m13s   10.105.1.76   hdn-dev-app66   <none>           <none>
ingress-nginx   pod/default-http-backend-598b7d7dbd-rwvhm     0/1     ContainerCreating   0          7m29s   <none>        hdn-dev-app66   <none>           <none>
ingress-nginx   pod/nginx-ingress-controller-62vhq            1/1     Running             0          7m29s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/coredns-849545576b-w87zr                  0/1     ContainerCreating   0          7m39s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/coredns-autoscaler-5dcd676cbd-pj54d       0/1     ContainerCreating   0          7m38s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/kube-flannel-d9m6q                        2/2     Running             0          7m43s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/metrics-server-697746ff48-q7cpx           0/1     ContainerCreating   0          7m33s   <none>        hdn-dev-app66   <none>           <none>
kube-system     pod/rke-coredns-addon-deploy-job-npjll        0/1     Completed           0          7m40s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-ingress-controller-deploy-job-b9rs4   0/1     Completed           0          7m30s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-metrics-addon-deploy-job-5rpbj        0/1     Completed           0          7m35s   10.105.1.76   hdn-dev-app66   <none>           <none>
kube-system     pod/rke-network-plugin-deploy-job-lvk2q       0/1     Completed           0          7m50s   10.105.1.76   hdn-dev-app66   <none>           <none>

NAMESPACE       NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
default         service/kubernetes             ClusterIP   10.43.0.1      <none>        443/TCP                  8m19s   <none>
ingress-nginx   service/default-http-backend   ClusterIP   10.43.144.25   <none>        80/TCP                   7m29s   app=default-http-backend
kube-system     service/kube-dns               ClusterIP   10.43.0.10     <none>        53/UDP,53/TCP,9153/TCP   7m39s   k8s-app=kube-dns
kube-system     service/metrics-server         ClusterIP   10.43.251.47   <none>        443/TCP                  7m34s   k8s-app=metrics-server

When I run kubectl describe on the failing pods, I get this:

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "345460c8f6399a0cf20956d8ea24d52f5a684ae47c3e8ec247f83d66d56b2baa" network for pod "cattle-cluster-agent-7db88c6b68-bz5dp": networkPlugin cni failed to set up pod "cattle-cluster-agent-7db88c6b68-bz5dp_cattle-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope, failed to clean up sandbox container "345460c8f6399a0cf20956d8ea24d52f5a684ae47c3e8ec247f83d66d56b2baa" network for pod "cattle-cluster-agent-7db88c6b68-bz5dp": networkPlugin cni failed to teardown pod "cattle-cluster-agent-7db88c6b68-bz5dp_cattle-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope]

I tried re-registering the node once more, but no luck. Any thoughts?

-- JackTheKnife
kubernetes
rancher
rancher-rke
rke

2 Answers

8/28/2020

The error says the request is unauthorized, so you need to grant RBAC permissions to make it work.

Try adding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:calico-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico-node
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
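
One way to check whether a binding like this takes effect is to apply it and then ask the API server whether the identity from the error message may read the Calico resource. This is a minimal sketch, assuming the manifest above was saved as calico-node-rbac.yaml (a hypothetical file name) and that a ClusterRole named calico-node already exists in the cluster. The helper function names are illustrative; only the kubectl subcommands and flags are real.

```shell
#!/bin/sh
# Hypothetical helpers wrapping real kubectl subcommands.

# Apply the ClusterRoleBinding manifest whose path is given as $1.
apply_binding() {
    kubectl apply -f "$1"
}

# Ask the API server whether user "system:node" in group "system:nodes"
# may get the resource named in the error message.
# kubectl prints "yes" or "no".
can_node_read_clusterinfo() {
    kubectl auth can-i get clusterinformations.crd.projectcalico.org \
        --as=system:node --as-group=system:nodes
}
```

Running `apply_binding calico-node-rbac.yaml` and then `can_node_read_clusterinfo` should print yes if the binding grants the missing permission. If it still prints no, the calico-node ClusterRole may be absent or may not cover clusterinformations.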
-- Saurabh Nigam
Source: StackOverflow

8/28/2020

I fixed the problem by following the article at https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cleaning-cluster-nodes/ on how to clean up and recycle a broken node.
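
The thread does not say what the cleanup actually removed. One plausible reading, which is an assumption and not something confirmed here, is that the node still carried Calico CNI configuration from an earlier install: the pod list shows kube-flannel as the network plugin, yet the sandbox error comes from a Calico plugin. A rough sketch for checking that on the node before wiping it, assuming the usual CNI config directory /etc/cni/net.d (the function name is hypothetical):

```shell
#!/bin/sh
# List CNI config files in a directory that mention Calico.
# $1: directory to scan; defaults to /etc/cni/net.d (assumed default path).
find_calico_cni_configs() {
    dir="${1:-/etc/cni/net.d}"
    # -l prints only the names of matching files, -i ignores case,
    # -s suppresses errors about missing or unreadable files.
    grep -lis "calico" "$dir"/* 2>/dev/null
}
```

Calling `find_calico_cni_configs` with no argument scans the default directory. Any file it lists on a flannel-only cluster is likely a leftover, and removing such leftovers is part of what the cleanup procedure in the linked article covers.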

-- JackTheKnife
Source: StackOverflow