GKE node creation fails because of incomplete network setup

10/14/2020

I'm upgrading a test cluster in GKE and initially had issues with node pool creation, with one of the pools showing a red exclamation mark. Upgrades from 1.15 of both that pool and another, seemingly healthy pool kept failing, so I deleted both pools and created new ones.

Unfortunately, the nodes that are created are never added to the pool. After looking at the machines in Compute Engine, it looks like a network configuration issue. The cbr0 bridge is never created and neither are the veth devices, leaving the box without internet connectivity, which seems to be why the Kubernetes node setup never completes.
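For anyone wanting to check the same thing, something along these lines on the node (via SSH to the VM) shows whether the bridge and veth devices exist (a sketch):

    # check for the kubenet bridge and the per-pod veth devices
    ip link show cbr0        # missing on the stuck nodes
    ip -o link | grep veth   # no output either; a healthy node has one veth per running pod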

I also found that the output of:

kubectl get pods --all-namespaces

does not show a kube-proxy pod, as it does in the working cluster.
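For comparison, a more targeted check than the full listing (a sketch; the grep just filters the output):

    # kube-proxy runs per node in kube-system; the working cluster shows one pod per node
    kubectl get pods -n kube-system -o wide | grep kube-proxy
    # on the broken cluster this returns nothing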

Has anyone experienced this before, and how can it be fixed? I have a pretty much identical cluster working fine; everything is created from the GUI with defaults, and both have a VPC network added (the VPN is up, no issues observed).

I came across https://github.com/kubernetes/kubernetes/issues/21401 and checked:

 /var/lib/docker/network/files/local-kv.db

It did not contain cbr0, and removing the file and restarting the Docker daemon did not change anything.
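For completeness, what I tried on the node was roughly the following (a sketch; the grep is just one way to inspect the binary store, and the restart command assumes Docker is managed by systemd, as it is on CoreOS):

    # look for a cbr0 entry in the libnetwork key-value store
    sudo grep -a cbr0 /var/lib/docker/network/files/local-kv.db
    # remove the store and restart Docker (did not change anything in my case)
    sudo rm /var/lib/docker/network/files/local-kv.db
    sudo systemctl restart docker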

I also have another cluster in the same project that works without issues. It looks like the Google-managed master may be the cause of the issue, but I have no idea how to go about further troubleshooting. Any help is greatly appreciated, thank you.

Update: other tests:

  • create a new cluster in the same project with the default network - no problem
  • create a new cluster in the same project with the VPC network - same problem (a gcloud sketch of this test follows the list)
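Everything was created from the GUI with defaults; the gcloud equivalent of these two tests is roughly the following (a sketch - the cluster names, zone, and network/subnet names are placeholders, not my real values):

    # default network: nodes register fine
    gcloud container clusters create test-default --zone europe-west1-b --num-nodes 3
    # custom VPC network and subnet: nodes never register
    gcloud container clusters create test-vpc --zone europe-west1-b --num-nodes 3 \
        --network my-vpc --subnetwork my-subnet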

ip a output on a node that is 'stuck':

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:30:0f:eb brd ff:ff:ff:ff:ff:ff
    inet 10.48.15.235/32 scope global dynamic eth0
       valid_lft 2323sec preferred_lft 2323sec
    inet6 fe80::4001:aff:fe30:feb/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:2d:e7:41:51 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

The /usr/bin/toolbox command does not work (Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)), so it is hard to troubleshoot further (the OS used is CoreOS).
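Without toolbox, about all that can be run directly on the host is something like this (a sketch; it assumes curl and getent are present on the image):

    # can the node resolve and reach the container registry at all?
    cat /etc/resolv.conf
    getent hosts gcr.io || echo "DNS resolution failed"
    curl -m 10 -sI https://gcr.io/v2/ || echo "cannot reach gcr.io"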

So it seems to be caused by the custom network. Given that it's a public cluster, I would expect the network to be set up with internet connectivity. I wonder whether this is a bug or a wrong setup. Does anyone know if the setup should create cbr0 and the other interfaces regardless? I'm still investigating further myself...

So creating a new cluster in the end shows:

Status details: All cluster resources were brought up, but: only 0 nodes out of 3 have registered; this is likely due to Nodes failing to start correctly; try re-creating the cluster or contact support if that doesn't work.

I have no commercial support, so if nothing comes out of this I might use the public bug tracker.

-- Vincent Gerris
docker
google-kubernetes-engine
kubernetes

1 Answer

10/14/2020

The problem is resolved. First I found that it seemed to be a DNS problem: after adding 8.8.8.8 to /etc/resolv.conf, internet routing worked. For some reason, shortly afterwards the node pool suddenly started to work and show up in the GUI. After checking the node machines, the network interfaces were there, and it also works when scaling up the machines in the node pool.
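The workaround itself was essentially this (a sketch; 8.8.8.8 is Google's public DNS, and how resolv.conf is managed can differ per image):

    # append a public resolver as a node-local, temporary workaround
    echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf
    # verify name resolution and outbound connectivity
    curl -m 10 -sI https://gcr.io/v2/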

It's unclear whether my single action caused this or whether there was an issue in the Google Kubernetes environment; I actually suspect the latter. Thanks everyone for the help, this issue is resolved for now.

-- Vincent Gerris
Source: StackOverflow