Flannel pod failure (and DNS) for Kubernetes on CoreOS VMs

6/8/2018

I have deployed a 3-node Kubernetes cluster on CoreOS Vagrant VMs, following this guide, modified as described here.

The VMs are healthy and running; the K8s controller/worker nodes are fine; and I can deploy Pods, ReplicaSets, etc.

However, DNS does not seem to work and, when I look at the state of the flannel pods, they are positively unhealthy:

$ kubectl get po --all-namespaces                          
NAMESPACE     NAME                                   READY     STATUS             RESTARTS   AGE
apps          frontend-cluster-q4gvm                 1/1       Running            1          1h
apps          frontend-cluster-tl5ts                 1/1       Running            0          1h
apps          frontend-cluster-xgktz                 1/1       Running            1          1h
kube-system   kube-apiserver-172.17.8.101            1/1       Running            2          32d
kube-system   kube-controller-manager-172.17.8.101   1/1       Running            2          32d
kube-system   kube-flannel-ds-6csjl                  0/1       CrashLoopBackOff   46         31d
kube-system   kube-flannel-ds-f8czg                  0/1       CrashLoopBackOff   48         31d
kube-system   kube-flannel-ds-qbtlc                  0/1       CrashLoopBackOff   52         31d
kube-system   kube-proxy-172.17.8.101                1/1       Running            2          32d
kube-system   kube-proxy-172.17.8.102                1/1       Running            0          6m
kube-system   kube-proxy-172.17.8.103                1/1       Running            0          2m
kube-system   kube-scheduler-172.17.8.101            1/1       Running            2          32d

Further, when I try to deploy KubeDNS, those pods fail too, with the same failure mode:

$ kubectl logs kube-flannel-ds-f8czg -n kube-system        
I0608 23:03:32.526331       1 main.go:475] Determining IP address of default interface
I0608 23:03:32.528108       1 main.go:488] Using interface with name eth0 and address 10.0.2.15
I0608 23:03:32.528135       1 main.go:505] Defaulting external address to interface address (10.0.2.15)
E0608 23:04:02.627348       1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-f8czg': Get https://10.3.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-f8czg: dial tcp 10.3.0.1:443: i/o timeout
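
One check for this kind of timeout (assuming kube-proxy runs in its default iptables mode): the cluster IP should have a NAT rule on every node, which can be inspected along these lines:

core@core-01 ~ $ sudo iptables -t nat -nL KUBE-SERVICES | grep 10.3.0.1
# A healthy kube-proxy installs a KUBE-SVC-... rule matching 10.3.0.1/32 tcp dpt:443;
# if nothing shows up, kube-proxy never synced the service rules.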

So it appears that the controller API service, exposed at the 10.3.0.1 cluster IP, is not reachable from other pods:

$ kubectl get svc --all-namespaces 
NAMESPACE   NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
apps        frontend     ClusterIP   10.3.0.170   <none>        80/TCP,443/TCP   1h
default     kubernetes   ClusterIP   10.3.0.1     <none>        443/TCP          32d
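
The cluster IP is purely virtual; whether it actually maps to the API server can be confirmed from the endpoints object (a sketch, run from wherever kubectl is configured):

$ kubectl get endpoints kubernetes
# The ENDPOINTS column should show the controller's real address,
# e.g. 172.17.8.101:443; an empty list would be a separate problem.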

My guesses were around either Flannel's etcd configuration or the kube-proxy YAML; so I added the following subnet entry (visible from all the nodes):

core@core-01 ~ $ etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.1.80.0-24
/flannel/network/subnets/10.1.76.0-24
/flannel/network/subnets/10.3.0.0-16
/flannel/network/subnets/10.1.34.0-24

core@core-01 ~ $ etcdctl get /flannel/network/subnets/10.3.0.0-16
{"PublicIP": "172.17.8.101"}

and restarted flanneld:

core@core-01 ~ $ sudo systemctl restart flanneld
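
Whether the restart actually gave flanneld a lease can be checked from its subnet file (the default path, assuming no --subnet-file override):

core@core-01 ~ $ cat /run/flannel/subnet.env
# Expect FLANNEL_NETWORK=10.1.0.0/16 and a FLANNEL_SUBNET within it;
# a stale or missing file would mean flanneld never (re)registered.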

However, that does not appear to do any good; from within a running Pod:

# This is expected (no client certs):
root@frontend-cluster-q4gvm:/opt/simple# curl -k https://172.17.8.101/api/v1/pods
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
# But this one just times out:
root@frontend-cluster-q4gvm:/opt/simple# curl -k https://10.3.0.1/api/v1/pods
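
As a cross-check, probing the same cluster IP from the node itself (where kube-proxy's rules apply on the host network) can separate a flannel problem from a kube-proxy one:

core@core-01 ~ $ curl -k https://10.3.0.1/api/v1/pods
# A 401 here would mean the NAT to the apiserver works at least on the host;
# a timeout here as well points at kube-proxy.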

Then I looked into kube-proxy.yaml and suspected that the --master configuration (for the worker nodes) was somehow not correct:

core@core-02 /etc/kubernetes/manifests $ cat kube-proxy.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: quay.io/coreos/hyperkube:v1.10.1_coreos.0
    command:
    - /hyperkube
    - proxy
    # >>>>> Should it be like this?
    - --master=https://172.17.8.101
    # >>>>> or like this?
    - --master=http://127.0.0.1:8080

    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ssl-certs-host
      readOnly: true
  volumes:
  - hostPath:
      path: /usr/share/ca-certificates
    name: ssl-certs-host

The 127.0.0.1:8080 configuration would appear to work (at best) only on the controller node, but would surely lead nowhere on the worker nodes.
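
A quick sanity check of that reasoning from a worker node (any HTTP response, even a 401/403, means the endpoint is reachable):

core@core-02 ~ $ curl -sk https://172.17.8.101/version   # controller over HTTPS: should answer
core@core-02 ~ $ curl -s http://127.0.0.1:8080/version   # local insecure port: nothing listens here on a worker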

Modifying --master as indicated above and restarting the pods, however, does not do any good either.

Bottom line: how do I make the controller API reachable at 10.3.0.1? And how can I enable KubeDNS? (I tried the instructions in the "Hard Way" guide, but got exactly the same failure mode as above.)

Many thanks in advance!

Update

This is the file with the flanneld options:

$ cat /etc/flannel/options.env 
FLANNELD_IFACE=172.17.8.101
FLANNELD_ETCD_ENDPOINTS=http://172.17.8.102:2379,http://172.17.8.103:2379,http://172.17.8.101:2379
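
For reference, this file only takes effect through the systemd drop-in that the guide sets up to symlink it into flanneld's expected location; the drop-in is along these lines (path and content as in the CoreOS guide):

$ cat /etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf
[Service]
ExecStartPre=/usr/bin/ln -sf /etc/flannel/options.env /run/flannel/options.env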

I have now removed the flannel DaemonSet (kc below is an alias for kubectl):

kc delete ds kube-flannel-ds -n kube-system

and deployed KubeDNS, following these instructions: the service is defined here, and the deployment here:

$ kc -n kube-system get svc
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.3.0.10    <none>        53/UDP,53/TCP   4d


$ kc get po -n kube-system  
NAME                                   READY     STATUS    RESTARTS   AGE
kube-apiserver-172.17.8.101            1/1       Running   5          36d
kube-controller-manager-172.17.8.101   1/1       Running   5          36d
kube-dns-7868b65c7b-ntc95              3/4       Running   2          3m
kube-proxy-172.17.8.101                1/1       Running   5          36d
kube-proxy-172.17.8.102                1/1       Running   3          4d
kube-proxy-172.17.8.103                1/1       Running   2          4d
kube-scheduler-172.17.8.101            1/1       Running   5          36d

However, I'm still getting the timeout error (actually, a bunch of them):

E0613 19:02:27.193691       1 sync.go:105] Error getting ConfigMap kube-system:kube-dns err: Get https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps/kube-dns: dial tcp 10.3.0.1:443: i/o timeout
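
Since kube-dns hits the same wall that flannel did, the kube-proxy logs on each node might show whether the service rules were ever synced (a sketch):

$ kc -n kube-system logs kube-proxy-172.17.8.102 | tail -n 20
# Errors here about reaching the apiserver or syncing iptables rules would
# point at kube-proxy rather than the overlay network.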

Update #2

On a similarly set-up system, I have the following flanneld configuration:

core@core-02 ~ $ etcdctl get /flannel/network/config
{ "Network": "10.1.0.0/16" }
core@core-02 ~ $ etcdctl ls /flannel/network/subnets
/flannel/network/subnets/10.1.5.0-24
/flannel/network/subnets/10.1.66.0-24
/flannel/network/subnets/10.1.6.0-24
core@core-02 ~ $ etcdctl get /flannel/network/subnets/10.1.66.0-24
{"PublicIP":"172.17.8.102"}

(and similarly for the others, pointing to .101 and .103). Should there be something in the config for the 10.3.0.0/16 subnet? Also, should there be an entry (pointing to 172.17.8.101) for the controller API at 10.3.0.1? Something along the lines of:

/flannel/network/subnets/10.3.0.0-24
{"PublicIP":"172.17.8.101"}

Does anyone know where to find good flanneld documentation (the CoreOS docs are truly insufficient and feel somewhat "abandoned")? Or something else to use that actually works?

Thanks!

-- Marco Massenzio