scaling GKE K8s Cluster breaks networking

4/27/2020

Folks, when trying to increase a GKE cluster from 1 to 3 nodes running in separate zones (us-central1-a, b, c), the following seems apparent:

Pods scheduled on the new nodes cannot access resources on the internet, i.e. they are not able to connect to the Stripe APIs, etc. (potentially kube-dns related; I have not tested traffic that leaves without a DNS lookup).
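
Roughly the check I plan to run next (the busybox image and the raw IP target are arbitrary choices on my part, and I would verify the test pod actually lands on one of the new nodes):

kubectl run nettest --image=busybox:1.28 --restart=Never --rm -it -- sh
# then, inside the pod:
nslookup api.stripe.com          # exercises kube-dns
wget -qO- -T 5 http://1.1.1.1    # exercises egress without a DNS lookup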

Similarly, I am not able to route between pods in K8s as expected, i.e. it seems cross-zone calls could be failing. When testing with OpenVPN, I am unable to connect to pods scheduled on the new nodes.
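
And for the pod-to-pod path, something along these lines (pod names, IP, and port are placeholders, and it assumes the source image has wget):

kubectl get pods -o wide    # note each pod's IP and the node it landed on
kubectl exec -it <pod-on-old-node> -- wget -qO- -T 5 http://<pod-ip-on-new-node>:<port>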

A separate issue I noticed is that the metrics server seems wonky: kubectl top nodes shows unknown for the new nodes.
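
For the metrics issue, the most I have checked so far is roughly this (the k8s-app=metrics-server label is what I believe GKE uses in kube-system; treat it as an assumption):

kubectl top nodes
kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=50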

At the time of writing, the master is on k8s version 1.15.11-gke.9.

The settings I am paying attention to:

VPC-native (alias IP) - disabled
Intranode visibility - disabled

gcloud container clusters describe cluster-1 --zone us-central1-a

clusterIpv4Cidr: 10.8.0.0/14
createTime: '2017-10-14T23:44:43+00:00'
currentMasterVersion: 1.15.11-gke.9
currentNodeCount: 1
currentNodeVersion: 1.15.11-gke.9
endpoint: 35.192.211.67
initialClusterVersion: 1.7.8
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/skilful-frame-180217/zones/us-central1-a/instanceGroupManagers/gke-cluster-1-default-pool-ff24932a-grp
ipAllocationPolicy: {}
labelFingerprint: a9dc16a7
legacyAbac:
  enabled: true
location: us-central1-a
locations:
- us-central1-a
loggingService: none

....

masterAuthorizedNetworksConfig: {}
monitoringService: none
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-2
  ...
nodeIpv4CidrSize: 24
nodePools:
- autoscaling: {}
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    ...
  initialNodeCount: 1
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  status: RUNNING
  version: 1.15.11-gke.9
servicesIpv4Cidr: 10.11.240.0/20
status: RUNNING
subnetwork: default
zone: us-central1-a

The next troubleshooting step is creating a new pool and migrating to it. Maybe the answer is staring me right in the face... could it be nodeIpv4CidrSize being a /24?
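
For reference, roughly what I have in mind for the pool migration (the pool name is a placeholder; machine type and size mirror my current pool):

gcloud container node-pools create new-pool \
  --cluster cluster-1 --zone us-central1-a \
  --machine-type n1-standard-2 --num-nodes 3
# cordon and drain the old pool so workloads reschedule onto the new one
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data   # flag name as of 1.15
done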

Thanks!

-- Cmag
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

4/28/2020
  • In your question, the description of your cluster has the following network policy:
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
  • I deployed a cluster as similar to yours as I could:
gcloud beta container --project "PROJECT_NAME" clusters create "cluster-1" \
--zone "us-central1-a" \
--no-enable-basic-auth \
--cluster-version "1.15.11-gke.9" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" \
--no-enable-ip-alias \
--network "projects/owilliam/global/networks/default" \
--subnetwork "projects/owilliam/regions/us-central1/subnetworks/default" \
--enable-network-policy \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair
  • After that I got the same configuration as yours; I'll point out two parts:
addonsConfig:
  networkPolicyConfig: {}
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
networkPolicy:
  enabled: true
  provider: CALICO
...
  • In the comments you mention "in the UI, it says network policy is disabled... is there a command to drop Calico?". I then gave you the command, for which you got an error stating that the Network Policy add-on is not enabled.

Which is weird, because the policy is applied but not enabled. I DISABLED it on my cluster and look:

addonsConfig:
  networkPolicyConfig:
    disabled: true
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
nodeConfig:
...
  • NetworkPolicyConfig went from {} to disabled: true, and the NetworkPolicy section above nodeConfig is now gone. So I suggest you enable and disable it again to see if that updates the proper resources and fixes your network policy issue. Here is what we will do:

  • If your cluster is not in production, I'd suggest you resize it back to 1 node, make the changes, and then scale up again; the update will be quicker. If it is in production, leave it as it is, but the update might take longer depending on your pod disruption budget. (default-pool is the name of my node pool; I'll resize it in my example.)

$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 1
Do you want to continue (Y/n)?  y
Resizing cluster-1...done.
  • Then enable the network policy add-on itself (this does not activate it, only makes it available):
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=ENABLED
Updating cluster-1...done.                                                                                                                                                      
  • And then enable (activate) the network policy itself:
$ gcloud container clusters update cluster-1 --enable-network-policy
Do you want to continue (Y/n)?  y
Updating cluster-1...done.                                                                                                                                                      
  • Now let's undo it:
$ gcloud container clusters update cluster-1 --no-enable-network-policy
Do you want to continue (Y/n)?  y
Updating cluster-1...done.    
  • After disabling it, wait until the pool is ready and run the last command:
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=DISABLED
Updating cluster-1...done.
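  • Optionally, before scaling back up, you can confirm that the Calico components were actually removed. The calico-node DaemonSet name and the k8s-app=calico-node label below are what GKE normally uses; treat them as assumptions:
$ kubectl -n kube-system get daemonset calico-node
$ kubectl -n kube-system get pods -l k8s-app=calico-node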
  • Scale it back to 3 if you had downscaled:
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 3
Do you want to continue (Y/n)?  y
Resizing cluster-1...done.
  • Finally, check the cluster description again to see if it matches the right configuration, and test the communication between the pods; a quick way to do that is sketched below.
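
A quick way to do that final check (the --format projection is just to narrow the output, and the pod names/IPs are placeholders taken from kubectl get pods -o wide; the exec assumes the image has ping):

$ gcloud container clusters describe cluster-1 --zone us-central1-a \
    --format="yaml(addonsConfig.networkPolicyConfig,networkPolicy)"
$ kubectl get pods -o wide
$ kubectl exec -it <pod-on-node-a> -- ping -c 3 <pod-ip-on-node-b>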

Here is the reference for this configuration: Creating a Cluster Network Policy

If you still have the issue after that, update your question with the latest cluster description and we will dig further.

-- willrof
Source: StackOverflow