x509: certificate issues on flannel and coredns pods

11/26/2018

Appreciate any help in getting to the root cause of this failure. I brought up a Kubernetes 1.11.0 cluster with one master and two nodes (master and one node on the same machine) on CentOS 7.5 VMs.

$ uname -a
Linux master.home 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/centos-release
CentOS Linux release 7.5.1804 (Core)

$ docker version
Client:
 Version:      17.09.1-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:23:40 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.09.1-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:25:03 2017
 OS/Arch:      linux/amd64
 Experimental: false

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}

Node interface IP addresses look good to me:

$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:5b:19:5f brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.111/24 brd 192.168.1.255 scope global dynamic enp0s3
       valid_lft 82102sec preferred_lft 82102sec
    inet6 fe80::a00:27ff:fe5b:195f/64 scope link tentative dadfailed 
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:38:a6:c7:bd brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:38ff:fea6:c7bd/64 scope link 
       valid_lft forever preferred_lft forever
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 26:1e:c3:e9:a3:db brd ff:ff:ff:ff:ff:ff
    inet 10.150.69.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::241e:c3ff:fee9:a3db/64 scope link 
       valid_lft forever preferred_lft forever
12: vethc0ae215@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 9a:1c:9d:21:18:57 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::981c:9dff:fe21:1857/64 scope link 
       valid_lft forever preferred_lft forever

etcd cluster is healthy:

$ etcdctl --endpoints=https://192.168.1.111:2379 --cert-file=/var/lib/kubernetes/kubernetes.pem --key-file=/var/lib/kubernetes/kubernetes-key.pem cluster-health
member ca38fd8eb3e17372 is healthy: got healthy result from https://192.168.1.111:2379
cluster is healthy
$ etcdctl --endpoints=https://192.168.1.111:2379 --cert-file=/var/lib/kubernetes/kubernetes.pem --key-file=/var/lib/kubernetes/kubernetes-key.pem get /atomic.io/network/config Network
{ "Network": "10.150.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}
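To double-check the per-node /24 leases flannel has carved out of that /16, the subnet keys under the same prefix can be listed (this assumes the `/atomic.io/network` prefix from my flanneld config and the etcdctl v2 API shown above):

```shell
# List the subnet leases flannel allocated to each node
etcdctl --endpoints=https://192.168.1.111:2379 \
  --cert-file=/var/lib/kubernetes/kubernetes.pem \
  --key-file=/var/lib/kubernetes/kubernetes-key.pem \
  ls /atomic.io/network/subnets
```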

Also updated iptables:

$ iptables --version
iptables v1.6.2

Overview from kubectl:

$ kubectl get all --all-namespaces
NAMESPACE     NAME                              READY     STATUS             RESTARTS   AGE
kube-system   pod/coredns-55f86bf584-9vz6k      1/1       Running            11         39m
kube-system   pod/coredns-55f86bf584-z4nvv      1/1       Running            11         39m
kube-system   pod/kube-flannel-ds-amd64-kw972   0/1       CrashLoopBackOff   6          10m
kube-system   pod/kube-flannel-ds-amd64-rhv2c   0/1       CrashLoopBackOff   6          10m

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
default       service/kubernetes   ClusterIP   10.32.0.1    <none>        443/TCP         2h
kube-system   service/kube-dns     ClusterIP   10.32.0.10   <none>        53/UDP,53/TCP   39m

NAMESPACE     NAME                                     DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
kube-system   daemonset.apps/kube-flannel-ds-amd64     2         2         0         2            0           beta.kubernetes.io/arch=amd64     10m
kube-system   daemonset.apps/kube-flannel-ds-arm       0         0         0         0            0           beta.kubernetes.io/arch=arm       10m
kube-system   daemonset.apps/kube-flannel-ds-arm64     0         0         0         0            0           beta.kubernetes.io/arch=arm64     10m
kube-system   daemonset.apps/kube-flannel-ds-ppc64le   0         0         0         0            0           beta.kubernetes.io/arch=ppc64le   10m
kube-system   daemonset.apps/kube-flannel-ds-s390x     0         0         0         0            0           beta.kubernetes.io/arch=s390x     10m

NAMESPACE     NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   2         2         2            2           39m

NAMESPACE     NAME                                 DESIRED   CURRENT   READY     AGE
kube-system   replicaset.apps/coredns-55f86bf584   2         2         2         39m

Used this manifest for flannel, where I changed the default `"Network": "10.244.0.0/16"` to `"Network": "10.150.0.0/16"`:

$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

And this one for CoreDNS:

kubectl apply -f https://storage.googleapis.com/kubernetes-the-hard-way/coredns.yaml
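For DNS to work end to end, the ClusterIP in that CoreDNS manifest has to match what the kubelets hand to pods. A quick consistency check (the kubelet config paths below are the ones kubernetes-the-hard-way uses; adjust if yours differ):

```shell
# The kube-dns service IP and the kubelet's clusterDNS setting must agree
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}{"\n"}'
grep -r "clusterDNS\|cluster-dns" /var/lib/kubelet/ /etc/systemd/system/kubelet.service 2>/dev/null
```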

Not sure why I see complaints about x509 in the logs of the respective pods:

$ kubectl logs kube-flannel-ds-amd64-kw972 -n kube-system
I1126 14:51:38.415251       1 main.go:475] Determining IP address of default interface
I1126 14:51:38.417393       1 main.go:488] Using interface with name enp0s3 and address 192.168.1.111
I1126 14:51:38.417535       1 main.go:505] Defaulting external address to interface address (192.168.1.111)
E1126 14:51:38.427865       1 main.go:232] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-kw972': Get https://10.32.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-kw972: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1

$ kubectl logs coredns-55f86bf584-z4nvv -n kube-system
E1126 14:50:51.845470       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.32.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1
E1126 14:50:51.850446       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.32.0.1:443/api/v1/services?limit=500&resourceVersion=0: x509: certificate is valid for 192.168.1.111, 127.0.0.1, not 10.32.0.1

Here 192.168.1.111 is my master node and 10.32.0.1 is the `kubernetes` service ClusterIP.
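From the error message, the API server's serving certificate apparently doesn't list the service VIP in its Subject Alternative Names, so any TLS client that dials https://10.32.0.1 rejects it. A quick standalone sketch of what a certificate covering all three addresses looks like (throwaway /tmp paths, self-signed for illustration only; `-addext` needs OpenSSL 1.1.1+):

```shell
# Generate a throwaway self-signed cert whose SANs include the service VIP,
# then print the SAN extension to confirm 10.32.0.1 is covered.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo-key.pem -out /tmp/demo.pem \
  -subj "/CN=kubernetes" \
  -addext "subjectAltName=IP:192.168.1.111,IP:127.0.0.1,IP:10.32.0.1"
openssl x509 -in /tmp/demo.pem -noout -ext subjectAltName
```

Presumably the real fix is regenerating the API server certificate with 10.32.0.1 included in its hostname list, the way kubernetes-the-hard-way does during provisioning.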

I did not use kubeadm to bring up this cluster; I did most of the bootstrapping by following https://github.com/kelseyhightower/kubernetes-the-hard-way

Also not sure if SNAT is set right:

$ sudo conntrack -L -d 10.32.0.1
tcp      6 17 TIME_WAIT src=192.168.1.111 dst=10.32.0.1 sport=37862 dport=443 src=192.168.1.111 dst=192.168.1.111 sport=6443 dport=37862 [ASSURED] mark=0 use=1
conntrack v1.4.4 (conntrack-tools): 1 flow entries have been shown.
$ sudo iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  udp  -- !10.150.0.0/16        10.32.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             10.32.0.10           /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-MARK-MASQ  tcp  -- !10.150.0.0/16        10.32.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             10.32.0.10           /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-MARK-MASQ  tcp  -- !10.150.0.0/16        10.32.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             10.32.0.1            /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
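The conntrack entry above actually suggests the DNAT itself works (10.32.0.1:443 is rewritten to the API server at 192.168.1.111:6443), which is consistent with the x509 error: the connection goes through, but the certificate presented at the far end only names the node IP. One way to see the SANs of the certificate a live endpoint actually serves (assumes openssl is installed and the cluster is up):

```shell
# Connect to the service VIP and dump the SANs of the presented certificate
echo | openssl s_client -connect 10.32.0.1:443 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
```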

Flannel config:

$ cat /etc/sysconfig/flanneld 
# Flanneld configuration options  

# etcd url location.  Point this to the server where etcd runs
FLANNEL_ETCD_ENDPOINTS="https://192.168.1.111:2379"

# etcd config key.  This is the configuration key that flannel queries
# For address range assignment
FLANNEL_ETCD_PREFIX="/atomic.io/network"

# Any additional options that you want to pass
FLANNEL_OPTIONS="-v=9 --etcd-certfile=/var/lib/kubernetes/kubernetes.pem --etcd-keyfile=/var/lib/kubernetes/kubernetes-key.pem --remote-cafile=/var/lib/kubernetes/ca.pem"

Edit 1: Updated the title to better reflect the underlying concern. My goal is to ensure DNS works as expected in my k8s cluster. Tested nslookup with busybox image 1.28:

$ kubectl exec -ti busybox -- nslookup kubernetes
Server:    10.32.0.10
Address 1: 10.32.0.10

nslookup: can't resolve 'kubernetes'
command terminated with exit code 1
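When nslookup fails like this, the first things I plan to check are the resolver config the kubelet injected into the pod and whether the kube-dns service actually has CoreDNS endpoints behind it (both commands assume the busybox pod is still running):

```shell
# Confirm the pod was given the kube-dns ClusterIP (10.32.0.10) as nameserver
kubectl exec -ti busybox -- cat /etc/resolv.conf
# Confirm the kube-dns service has ready CoreDNS endpoints
kubectl get endpoints kube-dns -n kube-system
```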

Update: The x509 error is gone and CoreDNS is up and running after upgrading Docker to 18.06.1-ce and editing the kubelet.service file to use `--container-runtime-endpoint=unix:///var/run/docker/containerd/docker-containerd.sock`. One step closer, but not there yet.

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                          READY     STATUS        RESTARTS   AGE
default       busybox                       1/1       Terminating   1          1h
kube-system   coredns-55f86bf584-n84nw      1/1       Running       0          10m
kube-system   coredns-55f86bf584-zl88b      1/1       Running       0          10m
$ kubectl logs coredns-55f86bf584-n84nw -n kube-system
.:53
2018/11/26 18:49:48 [INFO] CoreDNS-1.2.2
2018/11/26 18:49:48 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/11/26 18:49:48 [INFO] plugin/reload: Running configuration MD5 = 2e2180a5eeb3ebf92a5100ab081a6381
-- papu