Folks, when scaling a GKE cluster from 1 to 3 nodes, with the nodes running in separate zones (us-central1-a, b, c), the following symptoms appear:
Pods scheduled on the new nodes cannot reach resources on the internet, e.g. they are unable to connect to the Stripe APIs (this may be kube-dns related; I have not yet tested traffic that leaves the cluster without a DNS lookup).
Similarly, pod-to-pod routing does not work as expected, so it looks like cross-zone calls may be failing: when testing with OpenVPN, I am unable to connect to pods scheduled on the new nodes.
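To separate a DNS failure from a general egress failure, I can exec into a throwaway pod on one of the new nodes and test a hostname and a raw IP independently (rough sketch; the pod name, image and targets below are arbitrary):

# Start a throwaway busybox pod (add a nodeSelector/node name to pin it to one of the new nodes).
kubectl run netcheck --image=busybox:1.32 --restart=Never --command -- sleep 3600

# Confirm which node it landed on.
kubectl get pod netcheck -o wide

# DNS path: goes through kube-dns.
kubectl exec netcheck -- nslookup api.stripe.com

# Raw-IP path: egress without any DNS lookup.
kubectl exec netcheck -- ping -c 3 8.8.8.8

# Clean up.
kubectl delete pod netcheck

If the raw-IP test works but the lookup fails, the problem is between the new nodes and kube-dns rather than general internet egress.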
A separate issue I noticed is that the metrics server seems wonky: kubectl top nodes shows unknown for the new nodes.
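On the metrics side, I can check whether the metrics API is still registered and whether metrics-server can reach the kubelets on the new nodes; if it cannot, top reports unknown, which would line up with a node-to-node connectivity problem (sketch; k8s-app=metrics-server is the label GKE normally uses, adjust if yours differs):

# Is the metrics API service registered and Available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Is metrics-server running, and on which node?
kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide

# Scrape errors against the new nodes' kubelets usually show up in its logs.
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=50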
At the time of writing, the master Kubernetes version is 1.15.11-gke.9.
The settings I am paying attention to:
VPC-native (alias IP) - disabled
Intranode visibility - disabled
gcloud container clusters describe cluster-1 --zone us-central1-a
clusterIpv4Cidr: 10.8.0.0/14
createTime: '2017-10-14T23:44:43+00:00'
currentMasterVersion: 1.15.11-gke.9
currentNodeCount: 1
currentNodeVersion: 1.15.11-gke.9
endpoint: 35.192.211.67
initialClusterVersion: 1.7.8
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/skilful-frame-180217/zones/us-central1-a/instanceGroupManagers/gke-cluster-1-default-pool-ff24932a-grp
ipAllocationPolicy: {}
labelFingerprint: a9dc16a7
legacyAbac:
  enabled: true
location: us-central1-a
locations:
- us-central1-a
loggingService: none
....
masterAuthorizedNetworksConfig: {}
monitoringService: none
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-2
  ...
nodeIpv4CidrSize: 24
nodePools:
- autoscaling: {}
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    ...
  initialNodeCount: 1
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  status: RUNNING
  version: 1.15.11-gke.9
servicesIpv4Cidr: 10.11.240.0/20
status: RUNNING
subnetwork: default
zone: us-central1-a
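Given that VPC-native (alias IP) is disabled, this is a routes-based cluster, so cross-node pod traffic depends on one Compute Engine route per node covering that node's /24 pod range (nodeIpv4CidrSize: 24) out of clusterIpv4Cidr 10.8.0.0/14. A sanity check I can run is to confirm a route exists for each node (sketch; the description filter assumes the usual k8s-node-route tag set by the Kubernetes route controller):

# Expect one route per node, with destRange inside 10.8.0.0/14.
gcloud compute routes list \
    --filter="network~default AND description~k8s-node-route" \
    --format="table(name, destRange, nextHopInstance)"

# Cross-check against the pod CIDRs Kubernetes assigned to each node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'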
My next troubleshooting step is creating a new node pool and migrating to it. Maybe the answer is staring me right in the face... could it be nodeIpv4CidrSize being a /24?
Thanks!
In your cluster description, the networkPolicy section shows the Calico provider but no enabled flag:

name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO

To compare, I created a cluster from scratch with --enable-network-policy:
gcloud beta container --project "PROJECT_NAME" clusters create "cluster-1" \
--zone "us-central1-a" \
--no-enable-basic-auth \
--cluster-version "1.15.11-gke.9" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" \
--no-enable-ip-alias \
--network "projects/owilliam/global/networks/default" \
--subnetwork "projects/owilliam/regions/us-central1/subnetworks/default" \
--enable-network-policy \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair
Its description shows both the addon config entry and the network policy:

addonsConfig:
  networkPolicyConfig: {}
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
networkPolicy:
  enabled: true
  provider: CALICO
...
Note that under addonsConfig, networkPolicyConfig is {}, meaning the Network Policy addon is not enabled. Which is weird, because the network policy is applied but the addon is not enabled. I disabled it on my cluster and look:
addonsConfig:
  networkPolicyConfig:
    disabled: true
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
nodeConfig:
...
networkPolicyConfig went from {} to disabled: true, and the networkPolicy section that was above nodeConfig is now gone. So I suggest you enable and disable it again to see whether it updates the proper resources and fixes your network policy issue. Here is what we will do:
If your cluster is not in production, I'd suggest resizing it back to 1 node, making the changes, and then scaling up again; the update will be quicker. If it is in production, leave it as it is, but the update might take longer depending on your pod disruption budgets. (default-pool is the name of my node pool; I'll resize it in my example.)
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 1
Do you want to continue (Y/n)? y
Resizing cluster-1...done.
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=ENABLED
Updating cluster-1...done.
$ gcloud container clusters update cluster-1 --enable-network-policy
Do you want to continue (Y/n)? y
Updating cluster-1...done.
$ gcloud container clusters update cluster-1 --no-enable-network-policy
Do you want to continue (Y/n)? y
Updating cluster-1...done.
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=DISABLED
Updating cluster-1...done.
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 3
Do you want to continue (Y/n)? y
Resizing cluster-1...done.
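Once the resize finishes, you can confirm that the addon and the network policy fields are back in sync, and that the Calico DaemonSet got scheduled on every node; a missing calico-node pod on the new nodes would match the broken pod-to-pod traffic. This is only a sketch: the field paths come from the describe output above, and the calico-node name/label is the one GKE's network policy addon normally uses.

# Addon vs. policy state after the toggle:
$ gcloud container clusters describe cluster-1 \
      --zone us-central1-a \
      --format="yaml(addonsConfig.networkPolicyConfig, networkPolicy)"

# Calico should run as a DaemonSet in kube-system with one pod per node:
$ kubectl -n kube-system get daemonset calico-node
$ kubectl -n kube-system get pods -o wide -l k8s-app=calico-node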
Here is the reference for this configuration: Creating a Cluster Network Policy
If you still have the issue after that, update your question with the latest cluster description and we will dig further.