Following a kops cluster build on AWS, our kube-system pods initially come up, but after 5-10 minutes the kubedns containers in the kube-dns pods start failing their readiness health check:
kube-dns-55c9b74794-cmn5n 2/3 Running 0 10m
kube-dns-55c9b74794-qb2jb 2/3 Running 0 10m
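(For context, that listing comes from something along the lines of the command below; the k8s-app=kube-dns label selector is taken from the deployment spec further down.)
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide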
We have an existing cluster running with the same config in the same AWS account and VPC that is not affected; the issue only impacts newly built clusters.
The pod events are as follows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned kube-system/kube-dns-67964b9cfb-rdsks to ip-10-16-19-163.eu-west-2.compute.internal
Normal Pulling 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal pulling image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
Normal Pulled 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Successfully pulled image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10"
Normal Created 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Created container
Normal Started 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Started container
Normal Pulling 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal pulling image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
Normal Pulled 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Successfully pulled image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10"
Normal Created 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Created container
Normal Started 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Started container
Normal Pulling 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal pulling image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
Normal Started 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Started container
Normal Pulled 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Successfully pulled image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10"
Normal Created 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Created container
Warning Unhealthy 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60150->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60176->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 22m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60198->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60216->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60234->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60254->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60276->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60300->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 21m kubelet, ip-10-16-19-163.eu-west-2.compute.internal Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:60326->100.115.188.194:8081: read: connection reset by peer
Warning Unhealthy 2m35s (x111 over 20m) kubelet, ip-10-16-19-163.eu-west-2.compute.internal (combined from similar events): Readiness probe failed: Get http://100.115.188.194:8081/readiness: read tcp 10.16.19.163:63086->100.115.188.194:8081: read: connection reset by peer
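The failing probe is the kubelet on the node (10.16.19.163) doing an HTTP GET against the pod IP (100.115.188.194) on port 8081. The same request can be reproduced by hand from that node to check whether the reset affects all node-to-pod traffic or only the kubelet (this assumes SSH access to the node and curl being available on the host image):
# after SSHing to ip-10-16-19-163.eu-west-2.compute.internal
curl -v http://100.115.188.194:8081/readiness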
kube-dns deployment YAML:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-addon: kube-dns.addons.k8s.io
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: kube-dns
  namespace: kube-system
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/port: "10055"
        prometheus.io/scrape: "true"
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - --config-dir=/kube-dns-config
        - --dns-port=10053
        - --domain=cluster.local.
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /kube-dns-config
          name: kube-dns-config
      - args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --dns-forward-max=150
        - --no-negcache
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/in6.arpa/127.0.0.1#10053
        image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/k8s/dns/dnsmasq-nanny
          name: kube-dns-config
      - args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
        image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.10
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-dns
      serviceAccountName: kube-dns
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-dns
          optional: true
        name: kube-dns-config
status:
  conditions:
  - lastTransitionTime: "2019-09-24T13:22:06Z"
    lastUpdateTime: "2019-09-24T13:22:06Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 7
  replicas: 3
  unavailableReplicas: 3
  updatedReplicas: 2
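The relevant piece of that spec is the kubedns container's readinessProbe: an HTTP GET to /readiness on port 8081, executed by the kubelet from the node against the pod IP every 10 seconds, which is exactly the request that is being reset in the events above. If useful, just that block can be pulled out of the live object with something like the following (the jsonpath filter syntax is assumed to be supported by your kubectl version):
kubectl -n kube-system get deployment kube-dns \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="kubedns")].readinessProbe}'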
This issue affects every new cluster build in all of our AWS accounts; in each account, however, there are existing clusters with the same config that run without any problems.
The pods themselves can reach the readiness endpoint (curl is not installed in the kubedns container, so wget is used instead):
kubectl -n kube-system exec -it kube-dns-55c9b74794-cmn5n -c kubedns -- wget http://100.125.236.68:8081/readiness
Connecting to 100.125.236.68:8081 (100.125.236.68:8081)
readiness 100% |*******************************|
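To narrow down whether the reset only happens for node-to-pod traffic, the same endpoint can also be hit from a throwaway pod elsewhere in the cluster (the pod IP is the one from the wget test above; busybox and the kubectl run flags below are just one way of doing this):
kubectl run dns-readiness-test --rm -i --restart=Never --image=busybox -- \
  wget -qO- http://100.125.236.68:8081/readiness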