Enabling NodeLocalDNS fails

4/7/2020

We have two clusters on GKE: dev and production. I ran this command on the dev cluster:

gcloud beta container clusters update "dev" --update-addons=NodeLocalDNS=ENABLED 

Everything went great: the node-local-dns pods are running and everything works. The next morning I ran the same command on the production cluster, and there node-local-dns fails to start. I noticed that the __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__SERVER__ placeholders in the YAML have not been replaced with proper IPs. I tried setting those values in the config YAML myself, but GKE keeps overwriting my changes and restoring the placeholders.

The only difference between the clusters is that dev runs 1.15.9-gke.24 and production runs 1.15.11-gke.1.

-- Kikun
google-kubernetes-engine
kube-dns
kubernetes

1 Answer

4/7/2020

Apparently version 1.15.11-gke.1 has a bug.

I reproduced the issue on 1.15.11-gke.1 and can confirm that the node-local-dns Pods go into the CrashLoopBackOff state:

node-local-dns-28xxt                                        0/1     CrashLoopBackOff   5          5m9s
node-local-dns-msn9s                                        0/1     CrashLoopBackOff   6          8m17s
node-local-dns-z2jlz                                        0/1     CrashLoopBackOff   6          10m

When I checked the logs:

$ kubectl logs -n kube-system node-local-dns-msn9s
2020/04/07 21:01:52 [FATAL] Error parsing flags - Invalid localip specified - "__PILLAR__LOCAL__DNS__", Exiting
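
To check whether the DaemonSet spec itself still carries the unrendered placeholders, something like the following could help. It is only a sketch: the function name find_unrendered_pillar is made up for illustration, and it assumes the placeholders look exactly like the one in the log above (plus its DNS-server counterpart mentioned in the question).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads text on stdin and prints (with line numbers)
# every line that still contains one of the two unrendered placeholders.
# Returns 0 if at least one placeholder is found, non-zero otherwise.
find_unrendered_pillar() {
  grep -n -E '__PILLAR__(LOCAL__DNS|DNS__SERVER)__'
}

# Usage against a live cluster (needs kubectl access to the cluster):
#   kubectl get daemonset node-local-dns -n kube-system -o yaml | find_unrendered_pillar
```

Any output means the template was deployed without substitution, which matches the "Invalid localip specified" error.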

Solution:

Upgrading to 1.15.11-gke.3 fixed it. Upgrade the master first and then the node pool. On this version everything runs smoothly:

$ kubectl get daemonsets -n kube-system node-local-dns 
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                               AGE
node-local-dns   3         3         3       3            3           addon.gke.io/node-local-dns-ds-ready=true   44m

$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME                   READY   STATUS    RESTARTS   AGE
node-local-dns-8pjr5   1/1     Running   0          11m
node-local-dns-tmx75   1/1     Running   0          19m
node-local-dns-zcjzt   1/1     Running   0          19m
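
The upgrade order described above (master first, then the node pool) could be scripted roughly like this. Treat it as a sketch: the cluster name, zone and node pool name are assumptions you need to replace with your own, and only the target version comes from this answer. By default it only prints the gcloud commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Assumed values, override via environment variables to match your cluster.
CLUSTER="${CLUSTER:-production}"        # cluster name from the question
ZONE="${ZONE:-us-central1-a}"           # assumption: zonal cluster in this zone
NODE_POOL="${NODE_POOL:-default-pool}"  # assumption: default node pool name
VERSION="${VERSION:-1.15.11-gke.3}"     # version that fixed the issue here

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

upgrade_cluster() {
  # Step 1: upgrade the master (control plane).
  run gcloud container clusters upgrade "$CLUSTER" --zone "$ZONE" \
    --master --cluster-version "$VERSION"
  # Step 2: upgrade the node pool to the same version.
  run gcloud container clusters upgrade "$CLUSTER" --zone "$ZONE" \
    --node-pool "$NODE_POOL" --cluster-version "$VERSION"
}

upgrade_cluster
```

Once the node pool upgrade finishes, the two kubectl commands above can be used to confirm that the node-local-dns Pods are Running.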

As for manually fixing this particular DaemonSet YAML, I wouldn't recommend it: you can be sure that GKE's auto-repair and auto-upgrade features will overwrite your changes sooner or later anyway.

I hope it was helpful.

-- mario
Source: StackOverflow