We have 2 clusters on GKE: dev and production. I tried to run this command on the dev cluster:
gcloud beta container clusters update "dev" --update-addons=NodeLocalDNS=ENABLED
Everything went great: the node-local-dns pods are running and all works. The next morning I ran the same command on the production cluster, but there node-local-dns fails to run, and I noticed that neither __PILLAR__LOCAL__DNS__ nor __PILLAR__DNS__SERVER__ in the YAML gets replaced with a proper IP. I tried to change those variables in the config YAML myself, but GKE keeps reverting them to the raw placeholder values.
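For reference, one way to check whether the placeholders were substituted is to grep the addon's ConfigMap (assuming GKE follows the upstream NodeLocal DNSCache layout, where the Corefile lives in a node-local-dns ConfigMap in kube-system):
$ kubectl get configmap -n kube-system node-local-dns -o yaml | grep -E 'PILLAR__(LOCAL__DNS|DNS__SERVER)'
On a healthy cluster these two entries show real IP addresses; on the broken one they are still the raw placeholder strings.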
The only difference between the clusters is that dev runs on 1.15.9-gke.24 and production on 1.15.11-gke.1.
Apparently the 1.15.11-gke.1 version has a bug. I first reproduced it on 1.15.11-gke.1 and can confirm that the node-local-dns Pods fall into the CrashLoopBackOff state:
NAME                   READY   STATUS             RESTARTS   AGE
node-local-dns-28xxt   0/1     CrashLoopBackOff   5          5m9s
node-local-dns-msn9s   0/1     CrashLoopBackOff   6          8m17s
node-local-dns-z2jlz   0/1     CrashLoopBackOff   6          10m
When I checked the logs:
$ kubectl logs -n kube-system node-local-dns-msn9s
2020/04/07 21:01:52 [FATAL] Error parsing flags - Invalid localip specified - "__PILLAR__LOCAL__DNS__", Exiting
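The placeholder in that error comes straight from the container's -localip argument. A quick way to confirm this (the field path is assumed to match the upstream DaemonSet layout):
$ kubectl get daemonset -n kube-system node-local-dns -o jsonpath='{.spec.template.spec.containers[0].args}'
On the affected version the output still contains the literal __PILLAR__LOCAL__DNS__ string, which the binary then rejects on startup.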
Upgrading to 1.15.11-gke.3 helped. First you need to upgrade your master and then your node pool.
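For illustration, the upgrade can be done with gcloud roughly like this; the cluster name, node pool name, and zone below are placeholders, so substitute your own:
gcloud container clusters upgrade "production" --master --cluster-version 1.15.11-gke.3 --zone europe-west1-b
gcloud container clusters upgrade "production" --node-pool default-pool --cluster-version 1.15.11-gke.3 --zone europe-west1-b
On this version everything runs nice and smoothly: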
$ kubectl get daemonsets -n kube-system node-local-dns
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                               AGE
node-local-dns   3         3         3       3            3           addon.gke.io/node-local-dns-ds-ready=true   44m
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME                   READY   STATUS    RESTARTS   AGE
node-local-dns-8pjr5   1/1     Running   0          11m
node-local-dns-tmx75   1/1     Running   0          19m
node-local-dns-zcjzt   1/1     Running   0          19m
As for manually fixing this particular DaemonSet's YAML, I wouldn't recommend it: you can be sure that GKE's auto-repair and auto-upgrade features will overwrite it back sooner or later anyway.
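You can also see why manual edits don't stick. Assuming GKE marks the addon with the usual addonmanager.kubernetes.io/mode label, check it with:
$ kubectl get daemonset -n kube-system node-local-dns -o yaml | grep addonmanager
If the value is Reconcile, the addon manager periodically resets the object to its bundled manifest, discarding any manual changes.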
I hope it was helpful.