kube-dns keeps restarting with kubenetes on coreos

3/6/2017

I have Kubernetes installed on Container Linux by CoreOS alpha (1353.1.0) using hyperkube v1.5.5_coreos.0 using my fork of coreos-kubernetes install scripts at https://github.com/kfirufk/coreos-kubernetes.

I have two ContainerOS machines.

  • coreos-2.tux-in.com resolved as 192.168.1.2 as controller
  • coreos-3.tux-in.com resolved as 192.168.1.3 as worker

kubectl get pods --all-namespaces returns

NAMESPACE       NAME                                       READY     STATUS    RESTARTS   AGE
ceph            ceph-mds-2743106415-rkww4                  0/1       Pending   0          1d
ceph            ceph-mon-check-3856521781-bd6k5            1/1       Running   0          1d
kube-lego       kube-lego-3323932148-g2tf4                 1/1       Running   0          1d
kube-system     calico-node-xq6j7                          2/2       Running   0          1d
kube-system     calico-node-xzpp2                          2/2       Running   4560       1d
kube-system     calico-policy-controller-610849172-b7xjr   1/1       Running   0          1d
kube-system     heapster-v1.3.0-beta.0-2754576759-v1f50    2/2       Running   0          1d
kube-system     kube-apiserver-192.168.1.2                 1/1       Running   0          1d
kube-system     kube-controller-manager-192.168.1.2        1/1       Running   1          1d
kube-system     kube-dns-3675956729-r7hhf                  3/4       Running   3924       1d
kube-system     kube-dns-autoscaler-505723555-l2pph        1/1       Running   0          1d
kube-system     kube-proxy-192.168.1.2                     1/1       Running   0          1d
kube-system     kube-proxy-192.168.1.3                     1/1       Running   0          1d
kube-system     kube-scheduler-192.168.1.2                 1/1       Running   1          1d
kube-system     kubernetes-dashboard-3697905830-vdz23      1/1       Running   1246       1d
kube-system     monitoring-grafana-4013973156-m2r2v        1/1       Running   0          1d
kube-system     monitoring-influxdb-651061958-2mdtf        1/1       Running   0          1d
nginx-ingress   default-http-backend-150165654-s4z04       1/1       Running   2          1d

so I can see that kube-dns-782804071-h78rf keeps restarting.

kubectl describe pod kube-dns-3675956729-r7hhf --namespace=kube-system returns:

Name:       kube-dns-3675956729-r7hhf
Namespace:  kube-system
Node:       192.168.1.2/192.168.1.2
Start Time: Sat, 11 Mar 2017 17:54:14 +0000
Labels:     k8s-app=kube-dns
        pod-template-hash=3675956729
Status:     Running
IP:     10.2.67.243
Controllers:    ReplicaSet/kube-dns-3675956729
Containers:
  kubedns:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:kubedns
    Image:      gcr.io/google_containers/kubedns-amd64:1.9
    Image ID:       rkt://sha512-c7b7c9c4393bea5f9dc5bcbe1acf1036c2aca36ac14b5e17fd3c675a396c4219
    Ports:      10053/UDP, 10053/TCP, 10055/TCP
    Args:
      --domain=cluster.local.
      --dns-port=10053
      --config-map=kube-dns
      --v=2
    Limits:
      memory:   170Mi
    Requests:
      cpu:      100m
      memory:       70Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      False
    Restart Count:  981
    Liveness:       http-get http://:8080/healthz-kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:      http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:
      PROMETHEUS_PORT:  10055
  dnsmasq:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq
    Image:      gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
    Image ID:       rkt://sha512-8c5f8b40f6813bb676ce04cd545c55add0dc8af5a3be642320244b74ea03f872
    Ports:      53/UDP, 53/TCP
    Args:
      --cache-size=1000
      --no-resolv
      --server=127.0.0.1#10053
      --log-facility=-
    Requests:
      cpu:      150m
      memory:       10Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Liveness:       http-get http://:8080/healthz-dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
  dnsmasq-metrics:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq-metrics
    Image:      gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
    Image ID:       rkt://sha512-ceb3b6af1cd67389358be14af36b5e8fb6925e78ca137b28b93e0d8af134585b
    Port:       10054/TCP
    Args:
      --v=2
      --logtostderr
    Requests:
      memory:       10Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Liveness:       http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
  healthz:
    Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:healthz
    Image:      gcr.io/google_containers/exechealthz-amd64:v1.2.0
    Image ID:       rkt://sha512-3a85b0533dfba81b5083a93c7e091377123dac0942f46883a4c10c25cf0ad177
    Port:       8080/TCP
    Args:
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
      --url=/healthz-dnsmasq
      --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
      --url=/healthz-kubedns
      --port=8080
      --quiet
    Limits:
      memory:   50Mi
    Requests:
      cpu:      10m
      memory:       50Mi
    State:      Running
      Started:      Sun, 12 Mar 2017 17:47:41 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Mar 2017 17:46:28 +0000
      Finished:     Sun, 12 Mar 2017 17:47:02 +0000
    Ready:      True
    Restart Count:  981
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True 
  Ready     False 
  PodScheduled  True 
Volumes:
  default-token-zqbdp:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-zqbdp
QoS Class:  Burstable
Tolerations:    CriticalAddonsOnly=:Exists
No events.

which shows that kubedns-amd64:1.9 is in Ready: false

this is my kude-dns-de.yaml file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
spec:
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
    spec:
      containers:
      - name: kubedns
        image: gcr.io/google_containers/kubedns-amd64:1.9
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        livenessProbe:
          httpGet:
            path: /healthz-kubedns
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          timeoutSeconds: 5
        args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-map=kube-dns
        # This should be set to v=2 only after the new image (cut from 1.5) has
        # been released, otherwise we will flood the logs.
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
      - name: dnsmasq
        image: gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
        livenessProbe:
          httpGet:
            path: /healthz-dnsmasq
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --cache-size=1000
        - --no-resolv
        - --server=127.0.0.1#10053
        - --log-facility=-
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
        resources:
          requests:
            cpu: 150m
            memory: 10Mi
      - name: dnsmasq-metrics
        image: gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
        livenessProbe:
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        args:
        - --v=2
        - --logtostderr
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: 10Mi
      - name: healthz
        image: gcr.io/google_containers/exechealthz-amd64:v1.2.0
        resources:
          limits:
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
        args:
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
        - --url=/healthz-dnsmasq
        - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
        - --url=/healthz-kubedns
        - --port=8080
        - --quiet
        ports:
        - containerPort: 8080
          protocol: TCP
      dnsPolicy: Default

and this is my kube-dns-svc.yaml:

apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.3.0.10
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP

any information regarding the issue would be greatly appreciated!

update

rkt list --full 2> /dev/null | grep kubedns shows:

744a4579-0849-4fae-b1f5-cb05d40f3734    kubedns             gcr.io/google_containers/kubedns-amd64:1.9      sha512-c7b7c9c4393b running 2017-03-22 22:14:55.801 +0000 UTC   2017-03-22 22:14:56.814 +0000 UTC

journalctl -m _MACHINE_ID=744a45790849b1f5cb05d40f3734 provides:

Mar 22 22:17:58 kube-dns-3675956729-sthcv kubedns[8]: E0322 22:17:58.619254       8 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.3.0.1:443: connect: network is unreachable

I tried to add - --proxy-mode=userspace to /etc/kubernetes/manifests/kube-proxy.yaml but the results are the same.

kubectl get svc --all-namespaces provides:

NAMESPACE       NAME                   CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
ceph            ceph-mon               None         <none>        6789/TCP        1h
default         kubernetes             10.3.0.1     <none>        443/TCP         1h
kube-system     heapster               10.3.0.2     <none>        80/TCP          1h
kube-system     kube-dns               10.3.0.10    <none>        53/UDP,53/TCP   1h
kube-system     kubernetes-dashboard   10.3.0.116   <none>        80/TCP          1h
kube-system     monitoring-grafana     10.3.0.187   <none>        80/TCP          1h
kube-system     monitoring-influxdb    10.3.0.214   <none>        8086/TCP        1h
nginx-ingress   default-http-backend   10.3.0.233   <none>        80/TCP          1h

kubectl get cs provides:

NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health": "true"}

my kube-proxy.yaml has the following content:

apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
  annotations:
    rkt.alpha.kubernetes.io/stage1-name-override: coreos.com/rkt/stage1-fly
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
    command:
    - /hyperkube
    - proxy
    - --cluster-cidr=10.2.0.0/16
    - --kubeconfig=/etc/kubernetes/controller-kubeconfig.yaml
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: "ssl-certs"
    - mountPath: /etc/kubernetes/controller-kubeconfig.yaml
      name: "kubeconfig"
      readOnly: true
    - mountPath: /etc/kubernetes/ssl
      name: "etc-kube-ssl"
      readOnly: true
    - mountPath: /var/run/dbus
      name: dbus
      readOnly: false
  volumes:
  - hostPath:
      path: "/usr/share/ca-certificates"
    name: "ssl-certs"
  - hostPath:
      path: "/etc/kubernetes/controller-kubeconfig.yaml"
    name: "kubeconfig"
  - hostPath:
      path: "/etc/kubernetes/ssl"
    name: "etc-kube-ssl"
  - hostPath:
      path: /var/run/dbus
    name: dbus

this is all the valuable information I could find. any ideas? :)

update 2

output of iptables-save on the controller ContainerOS at http://pastebin.com/2GApCj0n

update 3

I ran curl on the controller node

# curl https://10.3.0.1 --insecure
Unauthorized

means it can access it properly, i didn't add enough parameters for it to be authorized right ?

update 4

thanks to @jaxxstorm I removed calico manifests, updated their quay/cni and quay/node versions and reinstalled them.

now kubedns keeps restarting, but I think that now calico works. because for the first time it tries to install kubedns on the worker node and not on the controller node, and also when I rkt enter the kubedns pod and try to wget https://10.3.0.1 I get:

# wget https://10.3.0.1
Connecting to 10.3.0.1 (10.3.0.1:443)
wget: can't execute 'ssl_helper': No such file or directory
wget: error getting response: Connection reset by peer

which clearly shows that there is some kind of response. which is good right ?

now kubectl get pods --all-namespaces shows:

kube-system     kube-dns-3675956729-ljz2w                  4/4       Running             88         42m

so.. 4/4 ready but it keeps restarting.

kubectl describe pod kube-dns-3675956729-ljz2w --namespace=kube-system output at http://pastebin.com/Z70U331G

so it can't connect to http://10.2.47.19:8081/readiness, i'm guessing this is the ip of kubedns since it uses port 8081. don't know how to continue investigating this issue further.

thanks for everything!

-- ufk
coreos
kube-dns
kubectl
kubelet
kubernetes

2 Answers

3/13/2017

kube-dns has a readiness probe that tries resolving trough the Service IP of kube-dns. Is it possible that there is a problem with your Service network?

Check out the answer and solution here: kubernetes service IPs not reachable

-- Janos Lenart
Source: StackOverflow

3/24/2017

Lots of great debugging info here, thanks!

This is the clincher:

# curl https://10.3.0.1 --insecure
Unauthorized

You got an unauthorized response because you didn't pass a client cert, but that's fine, it's not what we're after. This proves that kube-proxy is working as expected and is accessible. Your rkt logs:

Mar 22 22:17:58 kube-dns-3675956729-sthcv kubedns[8]: E0322 22:17:58.619254       8 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.3.0.1:443: connect: network is unreachable

Are indicating that the containers that there's network connectivity issues inside the containers, which indicates to me that you haven't configured container networking/CNI.

Please have a read through this document: https://coreos.com/rkt/docs/latest/networking/overview.html

You may also have to reconfigure calico, there's some more information that here: http://docs.projectcalico.org/master/getting-started/rkt/

-- jaxxstorm
Source: StackOverflow