I'm upgrading a test cluster in GKE and initially had issues with node pool creation, with one of the pools showing a red exclamation mark. Upgrades from 1.15 of both that pool and another, seemingly healthy pool kept failing, so I deleted both pools and created new ones.
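(For reference, the CLI equivalent of what I did through the GUI is roughly the following; the cluster name, pool names and zone are placeholders, not the real ones.)

gcloud container node-pools delete old-pool --cluster test-cluster --zone europe-west1-b
gcloud container node-pools create new-pool --cluster test-cluster --zone europe-west1-b --num-nodes 3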
Unfortunately, the nodes that are created are never added to the pool. After looking at the machines in Compute Engine, it looks like a network configuration issue. The cbr0 bridge is never created, and neither are the veth devices, leaving the box without internet connectivity, which seems to be why the Kubernetes node setup never completes.
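What I'm using to confirm that on one of the stuck nodes is roughly this (the instance name and zone are placeholders, and I'm assuming the kubelet runs as a systemd unit on these images):

gcloud compute ssh gke-test-cluster-node --zone europe-west1-b   # ssh to a stuck node
ip link show cbr0                                    # the bridge is missing on the stuck nodes
ip route                                             # and there is no pod-CIDR route via cbr0
sudo journalctl -u kubelet --no-pager | tail -n 50   # kubelet errors during node setup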
I also found that the output of:
kubectl get pods --all-namespaces
does not show a kube-proxy pod, as it does in the working cluster.
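A quick way to compare the two clusters (with kubectl pointed at the respective cluster context) is something like:

kubectl get pods -n kube-system -o wide | grep kube-proxy   # empty on the broken cluster, one pod per node on the working one
kubectl get nodes                                           # the new nodes never show up here either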
Has anyone experienced this before, and how can it be fixed? I have a practically identical cluster that works fine; everything is created from the GUI with defaults, and both clusters have a VPC network added (the VPN is up, no issues observed).
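The only structured comparison I can think of is diffing the cluster descriptions (cluster names and zone are placeholders):

gcloud container clusters describe broken-cluster --zone europe-west1-b > broken.yaml
gcloud container clusters describe working-cluster --zone europe-west1-b > working.yaml
diff broken.yaml working.yaml   # compare network, subnetwork and node pool settings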
I came across https://github.com/kubernetes/kubernetes/issues/21401 and checked:
/var/lib/docker/network/files/local-kv.db
It did not contain cbr0, and removing the file and restarting the Docker daemon did not change anything.
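Concretely, what I did on the node was roughly this:

sudo systemctl stop docker
sudo rm /var/lib/docker/network/files/local-kv.db
sudo systemctl start docker
ip link show cbr0    # still not created afterwards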
I also have another cluster in the same project that works without issues. It looks like the Google-managed master may be the cause, but I have no idea how to troubleshoot that further. Any help is greatly appreciated, thank you.
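Since the master is Google-managed and not directly accessible, the closest I can get to it is the cluster operations log (the operation ID and zone are placeholders):

gcloud container operations list --zone europe-west1-b
gcloud container operations describe operation-1234567890 --zone europe-west1-b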
Update: other tests
ip a output on a node that is 'stuck':

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:30:0f:eb brd ff:ff:ff:ff:ff:ff
    inet 10.48.15.235/32 scope global dynamic eth0
       valid_lft 2323sec preferred_lft 2323sec
    inet6 fe80::4001:aff:fe30:feb/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:2d:e7:41:51 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
The /usr/bin/toolbox command does not work (Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)), so it's hard to troubleshoot further (the OS used is CoreOS).
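Without toolbox, I'm limited to basic checks directly on the host, along these lines:

curl -m 5 -sI https://gcr.io/v2/        # times out, same as the toolbox image pull
ping -c 1 8.8.8.8                       # basic outbound connectivity
curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/hostname   # metadata server reachability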
So it seems to be caused by the custom network. My expectation is that, given it's a public cluster, the network should be set up with internet connectivity. I wonder if this is a bug or a misconfiguration. Does anyone know whether the setup should create cbr0 and the other interfaces regardless? I'm still investigating myself...
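To rule out the custom network itself, I'm comparing its firewall rules and routes against the network of the working cluster (the network name is a placeholder):

gcloud compute firewall-rules list --filter="network:my-custom-network"
gcloud compute routes list --filter="network:my-custom-network"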
Creating a new cluster in the end shows the following status details:

All cluster resources were brought up, but: only 0 nodes out of 3 have registered; this is likely due to Nodes failing to start correctly; try re-creating the cluster or contact support if that doesn't work.

I have no commercial support, so if nothing comes out of this I might use the public bug tracker.
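The only other node-side signal I can think of is the serial console output of the instances that fail to register (the instance name and zone are placeholders):

gcloud compute instances get-serial-port-output gke-test-cluster-node-abcd --zone europe-west1-b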
The problem is resolved. First I found that it looked like a DNS problem: after adding 8.8.8.8 to /etc/resolv.conf on a node, internet routing worked. For some reason, shortly afterwards the node pool suddenly started to work and show up in the GUI. After checking the node machines, the network interfaces were there, and it also works when scaling up the node pool.
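The DNS workaround on the node was roughly:

echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf
ping -c 1 www.google.com   # name resolution and outbound routing now work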
It's unclear whether my single action caused this or whether there was an issue in the Google Kubernetes environment; I actually suspect the latter. Thanks, everyone, for the help; this issue is resolved for now.