kube-dns pods failing readiness healthcheck after passing initial kops cluster validation

9/24/2019

Following a kops cluster build on AWS, our kube-system pods initially come up, but after 5-10 minutes the kubedns containers in the kube-dns pods start failing the readiness healthcheck:

kube-dns-55c9b74794-cmn5n                                           2/3     Running   0          10m
kube-dns-55c9b74794-qb2jb                                           2/3     Running   0          10m
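
For reference, which container is failing readiness can be confirmed from the pod's container statuses with something like the following (pod name taken from the output above):

kubectl -n kube-system get pod kube-dns-55c9b74794-cmn5n \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'

This should show kubedns with ready=false while dnsmasq and sidecar remain ready, matching the 2/3 READY count above.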

We have an existing cluster running with the same config in the same AWS account and VPC that is not affected; the issue only impacts new clusters.

  1. k8s version 1.11.7
  2. kops version 1.12.1
  3. kube-dns image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10
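
The image versions actually deployed can be double-checked against the list above with a quick kubectl query, for example:

kubectl -n kube-system get deployment kube-dns \
  -o jsonpath='{.spec.template.spec.containers[*].image}'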

Pod events are as follows:

Events:
  Type     Reason     Age                    From                                                 Message
  ----     ------     ----                   ----                                                 -------
  Normal   Scheduled  22m                    default-scheduler                                    Successfully assigned kube-system/kube-dns-67964b9cfb-rdsks to ip-10-16-19-163.eu-west-2.compute.internal
  Normal   Pulling    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  pulling image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
  Normal   Pulled     22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Successfully pulled image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
  Normal   Created    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Created container
  Normal   Started    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Started container
  Normal   Pulling    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  pulling image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
  Normal   Pulled     22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Successfully pulled image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
  Normal   Created    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Created container
  Normal   Started    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Started container
  Normal   Pulling    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  pulling image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
  Normal   Started    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Started container
  Normal   Pulled     22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Successfully pulled image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
  Normal   Created    22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Created container
  Warning  Unhealthy  22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60150->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60176->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  22m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60198->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60216->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60234->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60254->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60276->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60300->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  21m                    kubelet, ip-10-16-19-163.eu-west-2.compute.internal  Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60326->100.115.188.194:8081: read: connection reset by peer
  Warning  Unhealthy  2m35s (x111 over 20m)  kubelet, ip-10-16-19-163.eu-west-2.compute.internal  (combined from similar events): Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:63086->100.115.188.194:8081: read: connection reset by peer
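
Since the probe is issued by the kubelet on the node, the failing request can be reproduced by hand from the affected node itself (node and pod IPs taken from the events above) to check whether the connection reset affects any client on the node or only the kubelet; a rough sketch, assuming shell access to the node and that curl or wget is available there:

# on ip-10-16-19-163.eu-west-2.compute.internal
curl -v http://100.115.188.194:8081/readiness
# or, if curl is not installed on the node image:
wget -qO- http://100.115.188.194:8081/readiness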

kube-dns Deployment YAML:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-addon: kube-dns.addons.k8s.io
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: kube-dns
  namespace: kube-system
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/port: "10055"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - --config-dir=/kube-dns-config
        - --dns-port=10053
        - --domain=cluster.local.
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /kube-dns-config
          name: kube-dns-config
      - args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --dns-forward-max=150
        - --no-negcache
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/in6.arpa/127.0.0.1#10053
        image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/k8s/dns/dnsmasq-nanny
          name: kube-dns-config
      - args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
        image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-dns
      serviceAccountName: kube-dns
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-dns
          optional: true
        name: kube-dns-config
status:
  conditions:
  - lastTransitionTime: "2019-09-24T13:22:06Z"
    lastUpdateTime: "2019-09-24T13:22:06Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 7
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 2
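
The Available=False / MinimumReplicasUnavailable condition in the status above can also be followed from the command line while the pods flap, e.g.:

kubectl -n kube-system rollout status deployment/kube-dns
kubectl -n kube-system get deployment kube-dns -o wide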

This issue impacts every new cluster build in all of our AWS accounts; in each account, however, there are existing clusters with the same config that are running without any problems.

The pods are able to connect to the readiness endpoint (curl is not installed in the kubedns container, so wget is used instead):

kubectl -n kube-system exec -it kube-dns-55c9b74794-cmn5n -c kubedns -- wget http://100.125.236.68:8081/readiness
Connecting to 100.125.236.68:8081 (100.125.236.68:8081)
readiness            100% |*******************************|     
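
Given that the endpoint responds fine pod-to-pod but the kubelet's node-to-pod probes are reset, it may also be worth comparing the kubedns and sidecar container logs around the time of a failed probe; a minimal sketch (pod name as above):

kubectl -n kube-system logs kube-dns-55c9b74794-cmn5n -c kubedns --tail=50
kubectl -n kube-system logs kube-dns-55c9b74794-cmn5n -c sidecar --tail=50
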
-- 6869dan
amazon-web-services
kops
kubernetes
