I've been trying to set up a Kubernetes cluster for a few months now, but I've had no luck so far.
I'm trying to set it up on 4 bare-metal PCs running CoreOS. I've just clean-installed everything again, but I run into the same problem as before. I'm following this tutorial. I think I've configured everything correctly, but I'm not 100% sure. When I reboot any of the machines, the kubelet and flanneld services are running, but checking them with systemctl status shows the following errors:
kubelet error: Process: 1246 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/run/kubelet-pod.uuid (code=exited, status=254)
flanneld error: Process: 1057 ExecStartPre=/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper.uuid (code=exited, status=254)
If I restart both services, they work, or at least appear to - I get no errors.
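(As a possible workaround - a sketch only, not from the tutorial: since both services come up fine on a manual restart, a systemd drop-in that re-adds the failing cleanup step with a leading "-" would make systemd ignore its exit status on boot. The drop-in path below is my assumption about the stock CoreOS units; note that the empty ExecStartPre= line also clears any other pre-start steps the unit defines, which would then need to be re-added.)

```ini
# /etc/systemd/system/kubelet.service.d/10-tolerate-rkt-rm.conf
# Hypothetical drop-in; the same pattern would apply to flanneld with
# its /var/lib/coreos/flannel-wrapper.uuid file.
[Service]
# The empty assignment clears the inherited ExecStartPre list; the "-"
# prefix re-adds the rkt cleanup step with its exit status ignored.
ExecStartPre=
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/run/kubelet-pod.uuid
```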
Everything else seems to work fine, so the only remaining problem (I think) is the kube-proxy service on all nodes.
If I run kubectl get pods, I see all pods running:
$ kubectl get pods
NAME                                   READY     STATUS    RESTARTS   AGE
kube-apiserver-kubernetes-4            1/1       Running   4          6m
kube-controller-manager-kubernetes-4   1/1       Running   6          6m
kube-proxy-kubernetes-1                1/1       Running   4          18h
kube-proxy-kubernetes-2                1/1       Running   5          26m
kube-proxy-kubernetes-3                1/1       Running   4          19m
kube-proxy-kubernetes-4                1/1       Running   4          18h
kube-scheduler-kubernetes-4            1/1       Running   6          18h
The answer to this question suggests checking whether kubectl get node returns the same names that are registered with the kubelet. As far as I can tell from the logs, the nodes are registered correctly, and this is the output of kubectl get node:
$ kubectl get node
NAME           STATUS                     AGE       VERSION
kubernetes-1   Ready                      18h       v1.6.1+coreos.0
kubernetes-2   Ready                      36m       v1.6.1+coreos.0
kubernetes-3   Ready                      29m       v1.6.1+coreos.0
kubernetes-4   Ready,SchedulingDisabled   18h       v1.6.1+coreos.0
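For reference, that name check can be scripted; a small sketch that compares expected node names against the kubectl get node output above, pasted in as sample data (on the master you would feed in kubectl get node --no-headers directly):

```shell
# "nodes" is the pasted output of "kubectl get node --no-headers" shown above.
nodes='kubernetes-1 Ready 18h v1.6.1+coreos.0
kubernetes-2 Ready 36m v1.6.1+coreos.0
kubernetes-3 Ready 29m v1.6.1+coreos.0
kubernetes-4 Ready,SchedulingDisabled 18h v1.6.1+coreos.0'
# Compare each expected node name against column 1 of the output.
for expected in kubernetes-1 kubernetes-2 kubernetes-3 kubernetes-4; do
  if echo "$nodes" | awk '{print $1}' | grep -qx "$expected"; then
    echo "$expected: registered"
  else
    echo "$expected: MISSING"
  fi
done
```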
The tutorial I've used (linked above) suggests using --hostname-override, but with it I couldn't get node info on the master node (kubernetes-4) when I curled it locally. So I removed the flag, and now I can get node info normally.
Someone suggested it might be a flannel problem and that I should check the flannel ports. Running netstat -lntu gives the following output:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN
tcp        0      0 MASTER_IP:2379          0.0.0.0:*               LISTEN
tcp        0      0 MASTER_IP:2380          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN
tcp6       0      0 :::4194                 :::*                    LISTEN
tcp6       0      0 :::10250                :::*                    LISTEN
tcp6       0      0 :::10251                :::*                    LISTEN
tcp6       0      0 :::10252                :::*                    LISTEN
tcp6       0      0 :::10255                :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 :::443                  :::*                    LISTEN
udp        0      0 0.0.0.0:8472            0.0.0.0:*
So I assume the ports are fine?
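For what it's worth, the port check can be made mechanical; a sketch that greps a saved netstat listing (a few sample lines from the output above inlined; on a node you would pipe netstat -lntu in directly) for the ports this stack needs - UDP 8472 for flannel's vxlan backend and TCP 2379/2380 for etcd clients and peers:

```shell
# Sample lines from the netstat output above.
netstat_out='tcp        0      0 127.0.0.1:2379          0.0.0.0:*               LISTEN
tcp        0      0 MASTER_IP:2380          0.0.0.0:*               LISTEN
udp        0      0 0.0.0.0:8472            0.0.0.0:*'
# 2379 = etcd client, 2380 = etcd peer, 8472 = flannel vxlan (udp).
for port in 2379 2380 8472; do
  if echo "$netstat_out" | grep -q ":$port "; then
    echo "port $port: listening"
  else
    echo "port $port: NOT LISTENING"
  fi
done
```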
Also, etcd2 works; etcdctl cluster-health shows that all nodes are healthy.
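(For context: etcdctl cluster-health prints one line per member and a final verdict line, and the verdict line is the one worth scripting against. A sketch below; the sample output is abridged and the member ID is made up - real IDs differ.)

```shell
# Abridged, hypothetical "etcdctl cluster-health" output.
health='member 8e9e05c52164694d is healthy: got healthy result from http://MASTER_IP:2379
cluster is healthy'
# The last line is "cluster is healthy" only when every member checks out.
if [ "$(echo "$health" | tail -n 1)" = "cluster is healthy" ]; then
  echo "etcd: OK"
else
  echo "etcd: DEGRADED"
fi
```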
This is the part of the cloud-config that starts etcd2 on reboot; besides that, I only store SSH keys and node usernames/passwords/groups in it:
#cloud-config
coreos:
  etcd2:
    name: "kubernetes-4"
    initial-advertise-peer-urls: "http://NODE_IP:2380"
    listen-peer-urls: "http://NODE_IP:2380"
    listen-client-urls: "http://NODE_IP,http://127.0.0.1:2379"
    advertise-client-urls: "http://NODE_IP:2379"
    initial-cluster-token: "etcd-cluster-1"
    initial-cluster: "kubernetes-4=http://MASTER_IP:2380,kubernetes-1=http://WORKER_1_IP:2380,kubernetes-2=http://WORKER_2_IP:2380,kubernetes-3=http://WORKER_3_IP:2380"
    initial-cluster-state: "new"
  units:
    - name: etcd2.service
      command: start
This is the content of the /etc/flannel/options.env file:
FLANNELD_IFACE=NODE_IP
FLANNELD_ETCD_ENDPOINTS=http://MASTER_IP:2379,http://WORKER_1_IP:2379,http://WORKER_2_IP:2379,http://WORKER_3_IP:2379
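(The endpoint list flannel uses is a comma-separated variable, so a small loop can split it and probe each member's /health endpoint. A sketch below with the same IP placeholders as above; the actual curl is left as a comment since it only works on a node.)

```shell
# Same comma-separated format as in /etc/flannel/options.env.
FLANNELD_ETCD_ENDPOINTS="http://MASTER_IP:2379,http://WORKER_1_IP:2379,http://WORKER_2_IP:2379,http://WORKER_3_IP:2379"
# Split on commas into an array, then probe each endpoint.
IFS=',' read -ra endpoints <<< "$FLANNELD_ETCD_ENDPOINTS"
for ep in "${endpoints[@]}"; do
  echo "probe: $ep/health"        # on a node: curl -fsS "$ep/health"
done
```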
The same endpoints are listed under --etcd-servers in the kube-apiserver.yaml file.
Any ideas or suggestions as to what the problem could be? If any details are missing, let me know and I'll add them to the post.
Edit: I forgot to include the kube-proxy logs.
Master node kube-proxy log:
$ kubectl logs kube-proxy-kubernetes-4
I0615 07:47:45.250631 1 server.go:225] Using iptables Proxier.
W0615 07:47:45.286923 1 server.go:469] Failed to retrieve node info: Get http://127.0.0.1:8080/api/v1/nodes/kubernetes-4: dial tcp 127.0.0.1:8080: getsockopt: connection refused
W0615 07:47:45.303576 1 proxier.go:304] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0615 07:47:45.303593 1 proxier.go:309] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0615 07:47:45.303646 1 server.go:249] Tearing down userspace rules.
E0615 07:47:45.357276 1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: Get http://127.0.0.1:8080/api/v1/endpoints?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
E0615 07:47:45.357278 1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: Get http://127.0.0.1:8080/api/v1/services?resourceVersion=0: dial tcp 127.0.0.1:8080: getsockopt: connection refused
Worker node kube-proxy log:
$ kubectl logs kube-proxy-kubernetes-1
I0615 07:47:33.667025 1 server.go:225] Using iptables Proxier.
W0615 07:47:33.697387 1 server.go:469] Failed to retrieve node info: Get https://MASTER_IP/api/v1/nodes/kubernetes-1: dial tcp MASTER_IP:443: getsockopt: connection refused
W0615 07:47:33.712718 1 proxier.go:304] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
W0615 07:47:33.712734 1 proxier.go:309] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0615 07:47:33.712773 1 server.go:249] Tearing down userspace rules.
E0615 07:47:33.787122 1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:49: Failed to list *api.Endpoints: Get https://MASTER_IP/api/v1/endpoints?resourceVersion=0: dial tcp MASTER_IP:443: getsockopt: connection refused
E0615 07:47:33.787144 1 reflector.go:201] k8s.io/kubernetes/pkg/proxy/config/api.go:46: Failed to list *api.Service: Get https://MASTER_IP/api/v1/services?resourceVersion=0: dial tcp MASTER_IP:443: getsockopt: connection refused
Did you try the scripts here? They are condensed versions of the tutorial you used, for various platforms. The scripts worked perfectly for me on bare metal with k8s v1.6.4. I have a tweaked script with better encryption.
kube-apiserver isn't running, which explains the error dial tcp 127.0.0.1:8080: getsockopt: connection refused. When I was debugging kube-apiserver, this is what I would do on the node:
1. Remove the manifest /etc/kubernetes/manifests/kube-apiserver.yaml so the kubelet stops running it as a static pod.
2. Manually run a hyperkube container. Depending on your config, you will have to mount additional volumes (i.e. -v) to expose files to the container. Update the image version to the one you use.
docker run --net=host -it -v /etc/kubernetes/ssl:/etc/kubernetes/ssl quay.io/coreos/hyperkube:v1.6.2_coreos.0
The above command launches a shell in the hyperkube container. Now launch kube-apiserver with the flags from your kube-apiserver.yaml manifest. It should look similar to this example:
/hyperkube apiserver \
  --bind-address=0.0.0.0 \
  --etcd-cafile=/etc/kubernetes/ssl/apiserver/ca.pem \
  --etcd-certfile=/etc/kubernetes/ssl/apiserver/client.pem \
  --etcd-keyfile=/etc/kubernetes/ssl/apiserver/client-key.pem \
  --etcd-servers=https://10.246.40.20:2379,https://10.246.40.21:2379,https://10.246.40.22:2379 \
  ...
In any case, I suggest that you tear down the cluster and try the scripts first. It might just work out of the box.