runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

12/27/2019

Problem you have encountered:

"runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"

What you expected to happen:

  • Upgrade should work
  • Roll back should work
  • Resize it back to 2 and all services should come up

Steps to reproduce:

Running GKE
Master version: 1.14.8-gke.12
Node version: 1.14.8-gke.2
Machine type: n1-standard-8

Everything was running perfectly before this upgrade issue; then:

1) gcloud beta container node-pools update k-cpu-pool-v1 --cluster=k --workload-metadata-from-node=GKE_METADATA_SERVER --zone=us-central1-a # fails with 2nd node

2) gcloud beta container node-pools rollback k-cpu-pool-v1 --cluster=k3 --zone=us-central1-a # also fails with 2nd node and many deployments won't come up
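Not part of the original report, but a rough sketch of how the state left behind by the failed update/rollback could be inspected (the cluster and pool names are taken from the commands above; `<node-name>` is a placeholder for whichever node stays NotReady):

    # List recent cluster/node-pool operations and see which ones ended in an error
    gcloud container operations list --zone=us-central1-a

    # Inspect the node pool's current status and version after the failed update
    gcloud container node-pools describe k-cpu-pool-v1 --cluster=k --zone=us-central1-a

    # The "cni config uninitialized" message normally shows up in the node's
    # conditions and events
    kubectl get nodes
    kubectl describe node <node-name>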

I am trying to "Enable metadata server" per the instructions at
https://medium.com/@louisvernon/mapping-kubernetes-service-accounts-to-gcp-iams-using-workload-identity-b53496d543e0
but I am blocked by the failure of the previous deployment.
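For reference, the linked article essentially binds a Kubernetes service account to a GCP service account once the metadata server is enabled. A minimal sketch of that binding, where every name below is a placeholder rather than something taken from this question:

    # Allow the Kubernetes service account (KSA) to impersonate the GCP service account (GSA)
    gcloud iam service-accounts add-iam-policy-binding \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]" \
        GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

    # Point the KSA at the GSA via the Workload Identity annotation
    kubectl annotate serviceaccount KSA_NAME --namespace NAMESPACE \
        iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com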

Other information (workarounds you have tried, documentation consulted, etc):

I tried searching the Google forums but found nothing. It looks like a GKE issue where rollback
also fails after a failed upgrade, so it is a double issue. Do the master and nodes need to be
upgraded to the same version?

It doesn't seem to be this issue, because one node comes up but the second one does not in GKE: https://stackoverflow.com/questions/52675934/network-plugin-is-not-ready-cni-config-uninitialized
-- Kenney He
google-kubernetes-engine
rollback

1 Answer

1/2/2020

I've tried to recreate your problem:

  1. create cluster and pool:

    gcloud container clusters create test-cluster --zone us-central1-a --cluster-version 1.14.8-gke.12 --node-version 1.14.8-gke.2 --num-nodes=2
    
    WARNING: Currently VPC-native is not the default mode during cluster creation. In the future, this will become the default mode and can be disabled using `--no-enable-ip-alias` flag. Use `--[no-]enable-ip-alias` flag to suppress this warning.
    WARNING: Newly created clusters and node-pools will have node auto-upgrade enabled by default. This can be disabled using the `--no-enable-autoupgrade` flag.
    WARNING: Starting in 1.12, default node pools in new clusters will have their legacy Compute Engine instance metadata endpoints disabled by default. To create a cluster with legacy instance metadata endpoints disabled in the default node pool, run `clusters create` with the flag `--metadata disable-legacy-endpoints=true`.
    WARNING: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s). 
    This will enable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
    Creating cluster test-cluster in us-central1-a... Cluster is being health-checked (master is healthy)...done.              
    Created [https://container.googleapis.com/v1/projects/test-prj/zones/us-central1-a/clusters/test-cluster].
    To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-a/test-cluster?project=test-prj
    
    NAME          LOCATION       MASTER_VERSION  MASTER_IP     MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
    test-cluster  us-central1-a  1.14.8-gke.12   XX.XX.75.247  n1-standard-1  1.14.8-gke.2  2          RUNNING
    
  2. enable Workload Identity (beta) via the UI (a rough CLI equivalent is sketched after this list)

Workload Identity Enabled

  3. scale up to 3 nodes

    gcloud container clusters resize test-cluster --node-pool default-pool --num-nodes=3 --zone=us-central1-a
    
    Pool [default-pool] for [test-cluster] will be resized to 3.
    Do you want to continue (Y/n)?  y
    Resizing test-cluster...done.                                                                                              
    Updated [https://container.googleapis.com/v1/projects/test-prj/zones/us-central1-a/clusters/test-cluster].
    
  4. upgrade nodes

    gcloud beta container node-pools update default-pool --cluster=test-cluster --workload-metadata-from-node=GKE_METADATA_SERVER --zone=us-central1-a
    
    Updating node pool default-pool... Done with 3 out of 3 nodes (100.0%): 3 succeeded...done.                                       
    Updated [https://container.googleapis.com/v1beta1/projects/test-prj/zones/us-central1-a/clusters/test-cluster/nodePools/default-pool].
    
  5. scale down to 2 nodes

    gcloud container clusters resize test-cluster --node-pool default-pool --num-nodes=2 --zone=us-central1-a
    
    Pool [default-pool] for [test-cluster] will be resized to 2.
    Do you want to continue (Y/n)?  y
    Resizing test-cluster...done.                                                                                              
    Updated [https://container.googleapis.com/v1/projects/test-prj/zones/us-central1-a/clusters/test-cluster].
    
  6. disable Workload Identity (beta) via the UI:
     6.1. First go to Kubernetes clusters, click on your cluster -> under Node pools click on default-pool, then Edit node pool -> go to Security and uncheck Enable GKE Metadata Server (beta).
     6.2. After that go to Kubernetes clusters, click on your cluster -> click Edit and set Workload Identity (beta) to Disabled.
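
The UI steps 2 and 6.2 above also have rough gcloud equivalents. This is only a sketch: the flag names below (`--workload-pool`, `--disable-workload-identity`) are the current forms and may differ in older gcloud releases, `test-prj` is simply the project ID shown in the output above, and step 6.1 (turning the GKE metadata server off on the pool) was done via the UI in the answer.

    # Roughly equivalent to step 2: enable Workload Identity on the cluster
    gcloud container clusters update test-cluster --zone=us-central1-a \
        --workload-pool=test-prj.svc.id.goog

    # Roughly equivalent to step 6.2: disable Workload Identity again
    gcloud container clusters update test-cluster --zone=us-central1-a \
        --disable-workload-identity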

I've checked all these commands on my test cluster and found no errors or network issues. After that I repeated steps 2-5 and then ran a rollback:

gcloud beta container node-pools rollback default-pool --cluster=test-cluster --zone=us-central1-a  

Node Pool: [default-pool], of Cluster: [test-cluster] will be 
rolled back to previous configuration. This operation is long-running 
and will block other operations on the cluster (including delete) 
until it has run to completion.

Do you want to continue (Y/n)?  y

Rolling back default-pool... Done with 1 out of 2 nodes (50.0%): 1 being processed, 1 succeeded...done.                           
Updated [https://container.googleapis.com/v1beta1/projects/test-prj/zones/us-central1-a/clusters/test-cluster/nodePools/default-pool].
operationId: operation-1577965484794-e4b2b2a6
projectId: test-prj
zone: us-central1-a
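
As a side note (not part of the original answer), the operationId printed above can be used to re-check how the rollback finished:

gcloud container operations describe operation-1577965484794-e4b2b2a6 --zone=us-central1-a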

There were no errors or network problems this time either, and I was then able to disable Workload Identity (beta) via the UI as described in step 6.

It looks like everything works fine, so there is likely some problem specific to your configuration.

-- Serhii Rohoza
Source: StackOverflow