I have deployed 3 CoreOS Vagrant VMs following this guide, modified as described here.
The VMs are healthy and running, the K8s controller/worker nodes are fine, and I can deploy Pods, ReplicaSets, etc.
However, DNS does not seem to work and, when I look at the state of the flannel
pods, they are positively unhealthy:
$ kubectl get po --all-namespaces
NAMESPACE     NAME                                   READY   STATUS             RESTARTS   AGE
apps          frontend-cluster-q4gvm                 1/1     Running            1          1h
apps          frontend-cluster-tl5ts                 1/1     Running            0          1h
apps          frontend-cluster-xgktz                 1/1     Running            1          1h
kube-system   kube-apiserver-172.17.8.101            1/1     Running            2          32d
kube-system   kube-controller-manager-172.17.8.101   1/1     Running            2          32d
kube-system   kube-flannel-ds-6csjl                  0/1     CrashLoopBackOff   46         31d
kube-system   kube-flannel-ds-f8czg                  0/1     CrashLoopBackOff   48         31d
kube-system   kube-flannel-ds-qbtlc                  0/1     CrashLoopBackOff   52         31d
kube-system   kube-proxy-172.17.8.101                1/1     Running            2          32d
kube-system   kube-proxy-172.17.8.102                1/1     Running            0          6m
kube-system   kube-proxy-172.17.8.103                1/1     Running            0          2m
kube-system   kube-scheduler-172.17.8.101            1/1     Running            2          32d
Further, when I try to deploy kube-dns, those pods fail too, with the same failure mode:
$ kubectl logs kube-flannel-ds-f8czg -n kube-system
I0608 23:03:32.526331 1 main.go:475] Determining IP address of default interface
I0608 23:03:32.528108 1 main.go:488] Using interface with name eth0 and address 10.0.2.15
I0608 23:03:32.528135 1 main.go:505] Defaulting external address to interface address (10.0.2.15)
E0608 23:04:02.627348 1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-f8czg': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-f8czg: dial tcp 10.3.0.1:443: i/o timeout
So it appears that the controller service, exposed at the 10.3.0.1 ClusterIP, is not reachable from other pods:
$ kubectl get svc --all-namespaces
NAMESPACE   NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
apps        frontend     ClusterIP   10.3.0.170   <none>        80/TCP,443/TCP   1h
default     kubernetes   ClusterIP   10.3.0.1     <none>        443/TCP          32d
My guess was that the problem lay with either Flannel's etcd configuration or the kube-proxy YAML, so I added the following to all the nodes:
core@core-01 ~ $ etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.1.80.0-24
/flannel/network/subnets/10.1.76.0-24
/flannel/network/subnets/10.3.0.0-16
/flannel/network/subnets/10.1.34.0-24
core@core-01 ~ $ etcdctl get /flannel/network/subnets/10.3.0.0-16
{"PublicIP": "172.17.8.101"}
and restarted flanneld:
core@core-01 ~ $ sudo systemctl restart flanneld
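(For completeness, I wrote that extra subnet entry by hand, roughly like this — etcd v2 syntax; the key prefix just mirrors the /flannel/network prefix the other subnet keys use:)

# Manually write a subnet entry for the service range, pointing at the controller node
core@core-01 ~ $ etcdctl set /flannel/network/subnets/10.3.0.0-16 '{"PublicIP": "172.17.8.101"}'
{"PublicIP": "172.17.8.101"}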
However, that does not appear to do any good; from within a running Pod:
# This is expected (no client certs):
root@frontend-cluster-q4gvm:/opt/simple# curl -k https://172.17.8.101/api/v1/pods
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
# But this one just times out:
root@frontend-cluster-q4gvm:/opt/simple# curl -k https://10.3.0.1/api/v1/pods
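Since the ClusterIP is virtual and relies entirely on the iptables rules kube-proxy programs on each node, one check I can run on the hosts is whether any rule for 10.3.0.1 exists at all (a diagnostic sketch, assuming kube-proxy runs in its default iptables mode):

# kube-proxy in iptables mode installs a KUBE-SERVICES chain in the nat table;
# the default/kubernetes service should show up there with destination 10.3.0.1.
core@core-02 ~ $ sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.3.0.1
# No output here would mean kube-proxy never programmed the rule, which would
# explain the i/o timeout from inside the pods.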
Then I looked into kube-proxy.yaml and suspected that the --master configuration (for the worker nodes) was, somehow, not correct:
core@core-02 /etc/kubernetes/manifests $ cat kube-proxy.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: quay.io/coreos/hyperkube:v1.10.1_coreos.0
    command:
    - /hyperkube
    - proxy
    # >>>>> Should it be like this?
    - --master=https://172.17.8.101
    # >>>>> or like this?
    - --master=http://127.0.0.1:8080
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ssl-certs-host
      readOnly: true
  volumes:
  - hostPath:
      path: /usr/share/ca-certificates
    name: ssl-certs-host
The 127.0.0.1:8080 configuration would appear to work (at best) only on the controller node, and would surely lead nowhere on the other nodes. However, modifying --master as indicated above and restarting the pods does not do any good either.
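In case it matters, here is the shape of worker-side manifest I think would be needed if missing client credentials are part of the problem — just a sketch on my part; the --kubeconfig path, the worker-kubeconfig.yaml file and the extra mount are my assumptions (modelled on how the guide lays out worker TLS assets), not something I have verified:

apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: quay.io/coreos/hyperkube:v1.10.1_coreos.0
    command:
    - /hyperkube
    - proxy
    # Workers talk to the controller over HTTPS...
    - --master=https://172.17.8.101
    # ...and present client credentials so the apiserver does not return 401.
    # (Path is an assumption; it must match wherever the worker's kubeconfig
    # and the certs it references actually live on the host.)
    - --kubeconfig=/etc/kubernetes/worker-kubeconfig.yaml
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ssl-certs-host
      readOnly: true
    - mountPath: /etc/kubernetes
      name: etc-kubernetes
      readOnly: true
  volumes:
  - hostPath:
      path: /usr/share/ca-certificates
    name: ssl-certs-host
  - hostPath:
      path: /etc/kubernetes
    name: etc-kubernetes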
Bottom line: how do I make the controller API reachable on 10.3.0.1? And how can I enable KubeDNS? (I tried the instructions from the "Hard Way" guide, but got exactly the same failure mode as above.)
Many thanks in advance!
Update
This is the file with the flanneld options:
$ cat /etc/flannel/options.env
FLANNELD_IFACE=172.17.8.101
FLANNELD_ETCD_ENDPOINTS=http://172.17.8.102:2379,http://172.17.8.103:2379,http://172.17.8.101:2379
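A sanity check that can be run on each node (a sketch; /run/flannel/subnet.env is flanneld's default subnet-file location, and the FLANNELD_* variables above simply map onto the corresponding --iface and --etcd-endpoints flags):

# flanneld records the lease it acquired in its subnet file; the FLANNEL_SUBNET
# value should line up with one of the /flannel/network/subnets/... keys shown earlier.
core@core-01 ~ $ cat /run/flannel/subnet.env
# (expected to contain FLANNEL_NETWORK, FLANNEL_SUBNET, FLANNEL_MTU and FLANNEL_IPMASQ)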
I have now removed the flannel daemon set:
kc delete ds kube-flannel-ds -n kube-system
and deployed Kube DNS, following these instructions: the service is defined here, and the deployment here:
$ kc -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.3.0.10    <none>        53/UDP,53/TCP   4d
$ kc get po -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
kube-apiserver-172.17.8.101            1/1     Running   5          36d
kube-controller-manager-172.17.8.101   1/1     Running   5          36d
kube-dns-7868b65c7b-ntc95              3/4     Running   2          3m
kube-proxy-172.17.8.101                1/1     Running   5          36d
kube-proxy-172.17.8.102                1/1     Running   3          4d
kube-proxy-172.17.8.103                1/1     Running   2          4d
kube-scheduler-172.17.8.101            1/1     Running   5          36d
However, I'm still getting the timeout error (actually, a bunch of them):
E0613 19:02:27.193691 1 sync.go:105] Error getting ConfigMap kube-system:kube-dns err: Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/kube-dns: dial tcp 10.3.0.1:443: i/o timeout
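One more data point I can collect (a diagnostic sketch, not a conclusion) is whether the ClusterIP at least has the right endpoint behind it:

# The default "kubernetes" service should list the apiserver as its endpoint;
# if this shows 172.17.8.101:443, the service definition is fine and the
# problem is purely in reaching the ClusterIP from the pod network.
$ kc get endpoints kubernetes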
Update #2
On a similarly set-up system, I have the following flanneld configuration:
core@core-02 ~ $ etcdctl get /flannel/network/config
{ "Network": "10.1.0.0/16" }
core@core-02 ~ $ etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.1.5.0-24
/flannel/network/subnets/10.1.66.0-24
/flannel/network/subnets/10.1.6.0-24
core@core-02 ~ $ etcdctl get /flannel/network/subnets/10.1.66.0-24
{"PublicIP":"172.17.8.102"}
(and similarly for the others, pointing to 101 and 103). Should there be something in the config for the 10.3.0.0/16 subnet? Also, should there be an entry (pointing to 172.17.8.101) for the Controller API at 10.3.0.1? Something along the lines of:
/flannel/network/subnets/10.3.0.0-24
{"PublicIP":"172.17.8.101"}
Does anyone know where to find good flanneld documentation (CoreOS docs are truly insufficient and feel somewhat "abandoned")? Or something else to use that actually works?
Thanks!