I'm trying to install kubernetes on an Ubuntu 16.04 VM, followed instructions at https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/, using weave as my pod network add-on.
I'm seeing similar issue as coredns pods have CrashLoopBackOff or Error state, but I didn't see a solution there, and the versions I'm using are different:
kubeadm 1.11.4-00
kubectl 1.11.4-00
kubelet 1.11.4-00
kubernetes-cni 0.6.0-00
Docker version 1.13.1-cs8, build 91ca5f2
weave script 2.5.0
weave 2.5.0
I'm running behind a corporate firewall, so I set my proxy variables, then ran kubeadm init as follows:
# echo $http_proxy
http://135.28.13.11:8080
# echo $https_proxy
http://135.28.13.11:8080
# echo $no_proxy
127.0.0.1,135.21.27.139,135.0.0.0/8,10.96.0.0/12,10.32.0.0/12
# kubeadm init --pod-network-cidr=10.32.0.0/12
# kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
# kubectl taint nodes --all node-role.kubernetes.io/master-
Both coredns pods stay in CrashLoopBackOff
# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
default hostnames-674b556c4-2b5h2 1/1 Running 0 5h 10.32.0.6 mtpnjvzonap001 <none>
default hostnames-674b556c4-4bzdj 1/1 Running 0 5h 10.32.0.5 mtpnjvzonap001 <none>
default hostnames-674b556c4-64gx5 1/1 Running 0 5h 10.32.0.4 mtpnjvzonap001 <none>
kube-system coredns-78fcdf6894-s7rvx 0/1 CrashLoopBackOff 18 1h 10.32.0.7 mtpnjvzonap001 <none>
kube-system coredns-78fcdf6894-vxwgv 0/1 CrashLoopBackOff 80 6h 10.32.0.2 mtpnjvzonap001 <none>
kube-system etcd-mtpnjvzonap001 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-apiserver-mtpnjvzonap001 1/1 Running 0 1h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-controller-manager-mtpnjvzonap001 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-proxy-2c4tx 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-scheduler-mtpnjvzonap001 1/1 Running 0 1h 135.21.27.139 mtpnjvzonap001 <none>
kube-system weave-net-bpx22 2/2 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
coredns pods have this entry in their log
E1114 20:59:13.848196 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
This suggests to me that coredns cannot access apiserver pod using its cluster IP:
# kubectl describe svc/kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.96.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 135.21.27.139:6443
Session Affinity: None
Events: <none>
I also went through the troubleshooting steps at https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/
So in short, I created the hostnames service which had a cluster IP in 10.96.0.0/12 space (as expected), and it works, but for some reason, pods cannot access the apiserver's cluster IP of 10.96.0.1, though from the node I can access 10.96.0.1:
# wget --no-check-certificate https://10.96.0.1/hello
--2018-11-14 21:44:25-- https://10.96.0.1/hello
Connecting to 10.96.0.1:443... connected.
WARNING: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 403 Forbidden
2018-11-14 21:44:25 ERROR 403: Forbidden.
Some other things I checked, based on advice from others who reported a similar problem:
# sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 1
# sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
# iptables-save | egrep ':INPUT|:OUTPUT|:POSTROUTING|:FORWARD'
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [11:692]
:POSTROUTING ACCEPT [11:692]
:INPUT ACCEPT [1697:364811]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [1652:363693]
# ls -l /usr/sbin/conntrack
-rwxr-xr-x 1 root root 65632 Jan 24 2016 /usr/sbin/conntrack
# systemctl status firewalld
● firewalld.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
I checked the log for kube-proxy, did not see any errors. I also tried deleting coredns pods, apiserver pod; they are recreated (as expected), but the problem remains.
Here's a copy of the log from the weave container
# kubectl logs -n kube-system weave-net-bpx22 weave
DEBU: 2018/11/14 15:56:10.909921 [kube-peers] Checking peer "aa:53:be:75:71:f7" against list &{[]}
Peer not in list; removing persisted data
INFO: 2018/11/14 15:56:11.041807 Command line options: map[name:aa:53:be:75:71:f7 nickname:mtpnjvzonap001 ipalloc-init:consensus=1 ipalloc-range:10.32.0.0/12 db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 metrics-addr:0.0.0.0:6782 conn-limit:100 datapath:datapath no-dns:true port:6783]
INFO: 2018/11/14 15:56:11.042230 weave 2.5.0
INFO: 2018/11/14 15:56:11.198348 Bridge type is bridged_fastdp
INFO: 2018/11/14 15:56:11.198372 Communication between peers is unencrypted.
INFO: 2018/11/14 15:56:11.203206 Our name is aa:53:be:75:71:f7(mtpnjvzonap001)
INFO: 2018/11/14 15:56:11.203249 Launch detected - using supplied peer list: [135.21.27.139]
INFO: 2018/11/14 15:56:11.216398 Checking for pre-existing addresses on weave bridge
INFO: 2018/11/14 15:56:11.229313 [allocator aa:53:be:75:71:f7] No valid persisted data
INFO: 2018/11/14 15:56:11.233391 [allocator aa:53:be:75:71:f7] Initialising via deferred consensus
INFO: 2018/11/14 15:56:11.233443 Sniffing traffic on datapath (via ODP)
INFO: 2018/11/14 15:56:11.234120 ->[135.21.27.139:6783] attempting connection
INFO: 2018/11/14 15:56:11.234302 ->[135.21.27.139:49182] connection accepted
INFO: 2018/11/14 15:56:11.234818 ->[135.21.27.139:6783|aa:53:be:75:71:f7(mtpnjvzonap001)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/11/14 15:56:11.234843 ->[135.21.27.139:49182|aa:53:be:75:71:f7(mtpnjvzonap001)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/11/14 15:56:11.236010 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2018/11/14 15:56:11.236424 Listening for metrics requests on 0.0.0.0:6782
INFO: 2018/11/14 15:56:11.990529 [kube-peers] Added myself to peer list &{[{aa:53:be:75:71:f7 mtpnjvzonap001}]}
DEBU: 2018/11/14 15:56:11.995901 [kube-peers] Nodes that have disappeared: map[]
10.32.0.1
135.21.27.139
DEBU: 2018/11/14 15:56:12.075738 registering for updates for node delete events
INFO: 2018/11/14 15:56:41.279799 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/14 20:52:47.025412 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/15 01:46:32.842792 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/15 09:06:03.624359 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 172.217.9.147:443: i/o timeout
INFO: 2018/11/15 14:34:02.070893 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 172.217.9.147:443: i/o timeout
Here are the events for the 2 coredns pods
# kubectl get events -n kube-system --field-selector involvedObject.name=coredns-78fcdf6894-6f9q6
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
41m 20h 245 coredns-78fcdf6894-6f9q6.1568eab25f0acb02 Pod spec.containers{coredns} Normal Killing kubelet, mtpnjvzonap001 Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
26m 20h 248 coredns-78fcdf6894-6f9q6.1568ea920f72ddd4 Pod spec.containers{coredns} Normal Pulled kubelet, mtpnjvzonap001 Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
5m 20h 1256 coredns-78fcdf6894-6f9q6.1568eaa1fd9216d2 Pod spec.containers{coredns} Warning Unhealthy kubelet, mtpnjvzonap001 Liveness probe failed: HTTP probe failed with statuscode: 503
1m 19h 2963 coredns-78fcdf6894-6f9q6.1568eb75f2b1af3e Pod spec.containers{coredns} Warning BackOff kubelet, mtpnjvzonap001 Back-off restarting failed container
# kubectl get events -n kube-system --field-selector involvedObject.name=coredns-78fcdf6894-skjwz
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
6m 20h 1259 coredns-78fcdf6894-skjwz.1568eaa181fbeffe Pod spec.containers{coredns} Warning Unhealthy kubelet, mtpnjvzonap001 Liveness probe failed: HTTP probe failed with statuscode: 503
1m 19h 2969 coredns-78fcdf6894-skjwz.1568eb7578188f24 Pod spec.containers{coredns} Warning BackOff kubelet, mtpnjvzonap001 Back-off restarting failed container
#
Any help or further troubleshooting steps are welcome
I had the same problem and needed to allow several ports in my firewall: 22, 53, 6443, 6783, 6784, 8285.
I copied the rules from an existing healthy cluster. Probably only 6443, shown above as the target port for the coredns service, is required for this error and the others are for other services I run in my cluster.
With Ubuntu this was uncomplicated firewall
ufw allow 22/tcp # allowed for ssh, included in case you had firewall disabled altogether
ufw allow 6443
ufw allow 53
ufw allow 8285
ufw allow 6783
ufw allow 6784